Master in Sound and Music Computing
Universitat Pompeu Fabra
Generating Abstract Rhythm Streams
Justin Bosma
Supervisor: Sergi Jordà
Co-Supervisor: Behzad Haki
August 2025
Contents
1 Introduction
1.1 Motivation
1.2 Outline
1.3 Supplementary Material
2 Related Works
2.1 Architectures
2.2 Transformers
2.2.1 Input Representation of Transformer Model
2.2.2 Positional Encoding
2.3 Variational Autoencoders
2.3.1 Loss Criteria
2.4 Transformer Based Models for Music Generation
2.5 Non-Tokenized Transformer Inputs and Representation
2.6 Utilizing the Latent Space for Generative Purposes
2.7 Rhythmic Features
2.8 GrooveTransformer
2.8.1 Overview
2.8.2 Architecture
2.8.3 Data
2.8.4 Streams/Representation
2.8.5 Groove
3 Design
3.1 Streams
3.2 Instrument Overview
3.2.1 Encoder
3.2.2 Latent Space
3.2.3 Decoder
4 Datasets
4.1 Annotated Candombe Recordings
4.2 LAKH
4.3 GrooveMIDI
4.4 El Bongosero
4.5 TapTamDrum
4.6 Dataset Pre-processing
5 Model
5.1 Architecture
5.2 Data and Training
5.3 Validation
5.3.1 Density
5.3.2 Output Stream Quality
5.3.3 Relationships between Feature Values
6 Discussion
6.1 Data
6.2 Features
7 Conclusion
7.1 Data Changes
7.2 Feature Changes
7.3 Future Implementation
List of Figures
Bibliography
A First Appendix
B Second Appendix
Dedication
I dedicate this work to my mother, Linnae. Without your constant support, I would
never have been able to be the person I am today. I am so grateful to have a mother
as wonderful as you.
Chapter 1. Introduction
melody creation, harmonic additions to existing melodies, the generation of rhythms
related to the piece of music, and various tools to aid in the mixing and mastering
activities. Although these systems have helped in the creative process, they tend to
be very specific in their implementation, such as rhythm generation systems with a
mapping of their outputs to specific drums like kick, snare, and hi-hat.
For our system, currently known as Triple Streams, we aim to create a real-time
system that focuses on the creation of abstract rhythmic streams. Our intent is
to push the user to think more methodically about how to apply these rhythmic
streams to enhance their performance without sacrificing creative choices.
To achieve this goal, we pursue three main objectives:
1. Dataset Creation: Build a collection of diverse and interesting rhythmic
sequences drawn from both drum-based and non-drum sources, moving beyond
one-to-one mappings between instruments and output streams.
2. Model Design: Develop a generative model with real-time control features,
enabling performers to manipulate outputs dynamically and treat the system
as a playable instrument.
3. Evaluation: Ensure that the system produces diverse, musically meaningful
rhythmic streams that relate coherently to the user's input.
1.2 Outline
• Chapter 2: Related Works – Overview of prior work that informs our
architecture, data representation, and control features. We examine relevant
machine learning models, rhythmic feature representations, and the
GrooveTransformer system that serves as our foundation.
• Chapter 3: Instrument Design – High-level overview of the system's design,
partitioned into encoder, latent space, and decoder components. We also
describe the system's inputs, outputs, and interactive controls.
• Chapter 4: Data Curation – Description of datasets used for training and
validation, preprocessing methods, challenges encountered, and the rationale
behind design choices.
• Chapter 5: Model Overview and Validation – Presentation of the model's
architecture, training process, and evaluation of rhythmic outputs, including
the effect of control features on system performance.
• Chapter 6: Discussion – Reflection on the system's current state, identifying
strengths, limitations, and possible improvements in data, control features,
and output quality.
• Chapter 7: Conclusion – Summary of key findings, open challenges, and
planned future work, including refinements to datasets, control mechanisms,
and deployment.
1.3 Supplementary Material
Throughout this thesis, we will use supplementary material to provide additional
insights and explanations of the text. All download links, processed data, figures,
code, and other resources used in this thesis can be found in the following GitHub
repository:
https://github.com/justinbosma/TripleStreams
Chapter 2
Related Works
Over the past decade, numerous contributions have been made to the intersection of
audio, music, and machine learning, and each contribution can be seen as a stepping
stone to our current work. However, in this section we will focus on the works most
closely related to our current research, and how they directly impact our implementation.
We will begin with a brief overview of the Machine Learning models employed in our
system. Following this, we will examine some of the more recent generative models
that have had a strong influence on our decisions. Finally, we will give an overview
of the prior work of Behzad Haki, specifically the GrooveTransformer, allowing us
to show where our current research develops from.
2.1 Architectures
Due to the sequential nature of symbolic notation, and specifically rhythm, we
employ a transformer model for the modeling of temporal dependencies. To allow
us to create variations on rhythms, we utilize a variational autoencoder (VAE) to
encode the rhythms into a latent space, where we can interpolate between similar
rhythms to obtain slight variations of the generations based on the input. Below, we
will give an overview of both architectures.
2.2 Transformers
First described by Vaswani et al. in their 2017 paper "Attention Is All You Need",
the transformer is a deep learning model that has become fundamental to the area of
Natural Language Processing (NLP) [3]. This model differs from Recurrent Neural
Networks (RNN) in that it allows parallel processing of input sequences, making
training and inference more efficient.
The two important aspects of transformer models are positional encoding and
self-attention. In the example of text, positional encoding attaches to each word
a value that encodes its position in the sentence, since the model does not read
the words one after another. This provides information about the token's place,
allowing the model to consider sequential information. With self-attention, each
word is compared against the other words in parallel and assigned a weight for
each of them. This allows the model to "learn" the grammar based on how words are
used in everyday life.
2.2.1 Input Representation of Transformer Model
At the bottom-most encoder, the input is first broken down into tokens, which
would be words and subwords in NLP. Each token is assigned a unique index from
the model's vocabulary. After assigning the unique index, each token is converted
into a dense embedding vector of size d by looking up its row in an embedding
matrix of size V × d, where V is the vocabulary size and d is the embedding dimension.
2.2.2 Positional Encoding
To keep track of the order of the sequence, Transformers use a positional encoding
to maintain the information about the relative or absolute position of the token.
This can be achieved by using the following equation:
\[ PE_{(pos,\,2i)} = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right) \]
where pos is the position and i is the dimension. Each dimension i of the positional
encoding is related to a sinusoid [3].
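A minimal Python sketch of this encoding is given here for illustration. It assumes the standard formulation from [3], in which even dimensions use the sine shown above and odd dimensions use the matching cosine (the cosine term is not written out in the text). The function name and the example sizes are hypothetical and are not taken from the project's code.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for k in range(0, d_model, 2):
            # k plays the role of 2i, so the exponent 2i / d_model is k / d_model.
            angle = pos / (10000 ** (k / d_model))
            row[k] = math.sin(angle)          # even dimension 2i
            if k + 1 < d_model:
                row[k + 1] = math.cos(angle)  # odd dimension 2i + 1, as in [3]
        table.append(row)
    return table


# Example sizes are assumptions: 32 positions, embedding dimension 8.
pe = positional_encoding(num_positions=32, d_model=8)
```

Each row of the table would be added to the embedding of the token at that position before the first encoder layer.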
2.3 Variational Autoencoders
Variational autoencoders are self-supervised deep learning models that allow us to
isolate important information, known as latent variables, via an encoder. The latent
variables are then used by a decoder to reconstruct our input. The input is not
encoded directly, but is encoded as a probabilistic distribution, from which the
numerical representation is sampled [4]. One of the most important aspects of
Variational Autoencoders is the latent space, which is the space consisting of all
of the latent variables from our dataset. The data is compressed into a
lower-dimensional latent space, where only the meaningful information extracted by
the autoencoder is represented. This encoding of information into a latent space
allows us not only to recreate our original input but also to generate new data.
2.3.1 Loss Criteria
For training the VAE, we use the reconstruction loss to reconstruct the input from
the encoded space. This can be written formally as:
\[ L_{\text{recon}}(x, \hat{x}) = \operatorname{difference}(x, \hat{x}) \]
where we attempt to minimize the difference between the input \(x\) and its
reconstruction \(\hat{x}\).
The reconstruction loss will differ depending on the type of data that we are using.
For example, with tokenized language, we could calculate the cross-entropy loss
between the predicted and original sequences [5]:
\[ L_{\text{recon}}(x, \hat{x}) = -\sum_{t} x_t \log(\hat{x}_t) \]
Because we are interested in generating new samples from our input data, we cannot
rely only on a reconstruction loss. We need to use an additional regularization term
to ensure that we can sample from anywhere in the latent space between our original
data points. To achieve this, we use the Kullback-Leibler (KL) Divergence, where we
ensure the latent space follows a standard normal distribution. The KL Divergence
is formally defined as:
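\[ L_{\text{KL}} = D_{\text{KL}}\left( \mathcal{N}(\mu(x), \sigma^2(x)) \,\|\, \mathcal{N}(0, 1) \right) \]
From here, our total loss can be defined as follows:
\[ L_{\text{total}} = L_{\text{recon}} + L_{\text{KL}} \]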
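As an illustration of how the two terms combine, the sketch below computes the total loss for a single example in plain Python. It assumes a cross-entropy reconstruction term and an encoder that outputs the mean and log-variance of a diagonal Gaussian, for which the KL divergence to the standard normal has the closed form -1/2 Σ (1 + log σ² − µ² − σ²). All function names and the example values are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-9):
    """Reconstruction term: -sum_t x_t * log(x_hat_t)."""
    return -sum(x * math.log(max(p, eps)) for x, p in zip(target, predicted))


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.

    log_var holds log(sigma^2); this parameterization is an assumption.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(target, predicted, mu, log_var):
    """Total loss: reconstruction term plus KL regularization term."""
    return cross_entropy(target, predicted) + kl_to_standard_normal(mu, log_var)


# Toy example: a one-hot target over three classes and a 2-dimensional latent.
example_loss = vae_loss(
    target=[0.0, 1.0, 0.0],
    predicted=[0.1, 0.8, 0.1],
    mu=[0.0, 0.0],
    log_var=[0.0, 0.0],
)
```

With µ = 0 and log σ² = 0 the latent distribution already matches the standard normal, so the KL term vanishes and only the reconstruction term remains.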
Having a latent space that is continuous, where nearby points yield similar samples
when decoded, and complete, where all the points in the latent space should have
meaningful content when decoded, is an important aspect of our implementation.
This allows us to interpolate between two data points in our space.
2.4 Transformer Based Models for Music Generation
The Music Transformer, developed by Huang et al. in 2019, incorporates a relative
positional attention mechanism, which allows for the creation of songs up to one
minute in length. They show that their new implementation outperforms LSTM-
based models like Performance RNN in maintaining motifs, repetition, and structure
over long spans. It is also capable of continuing an existing melody or harmonizing
with a given melody [6].
The Pop Music Transformer by Huang and Yang is an extension of the Music
Transformer that is more capable of capturing the rhythmic structure of pop music [7].
The implementation focuses on enhancing the representation of input/output
sequences. This was made possible by the use of REMI (REvamped MIDI-derived
events) tokenization, explicitly encoding bars, beat positions, tempo changes, and
chords alongside typical note events.
2.5 Non-Tokenized Transformer Inputs and Representation
In their paper preceding the implementation of GrooVAE, Gillick et al. suggest the
use of an enhanced piano roll representation of drums [8]. This representation uses
a binary piano roll relative to a fixed grid, setting onsets to '1' and rests to '0'. They
also incorporate velocity and offsets, the micro-time deviations in relation
to the grid, into their representation. This allows for a simple representation of
drums that can be used as input to their system. In his work on GrooveTransformer,
Behzad Haki uses this HVO (hit, velocity, offset) representation for the input and
output of the system.
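To make the HVO idea concrete, the following sketch converts a list of onsets for a single voice into hit, velocity, and offset rows on a sixteenth-note grid. It is an illustration under stated assumptions, not the representation code used in the cited systems: the function name, the normalization of velocity to the range 0 to 1, the offset range of [-0.5, 0.5) grid steps, and the rule of keeping the louder of two onsets on the same step are all hypothetical choices.

```python
def onsets_to_hvo(onsets, steps_per_beat=4, n_steps=32):
    """Quantize onsets of one voice to a fixed grid.

    onsets: list of (time_in_beats, midi_velocity) pairs with non-negative times.
    Returns three lists of length n_steps: hits, velocities, offsets.
    """
    hits = [0] * n_steps
    velocities = [0.0] * n_steps
    offsets = [0.0] * n_steps
    for time_in_beats, midi_velocity in onsets:
        grid_position = time_in_beats * steps_per_beat
        step = int(grid_position + 0.5)  # nearest grid step
        if step >= n_steps:
            continue  # outside the 2-bar window
        velocity = midi_velocity / 127.0
        if hits[step] and velocity <= velocities[step]:
            continue  # assumption: keep the louder onset on a shared step
        hits[step] = 1
        velocities[step] = velocity
        offsets[step] = grid_position - step  # micro-timing deviation
    return hits, velocities, offsets


# Three onsets in 4/4: on the beat, slightly late, and exactly on a sixteenth.
hits, velocities, offsets = onsets_to_hvo([(0.0, 127), (1.02, 64), (2.5, 90)])
```

Stacking these three rows for every voice gives one matrix per rhythm, with a hit, a velocity, and an offset column for each instrument.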
2.6 Utilizing the Latent Space for Generative Purposes
Over the past decade, there have been numerous real-time models for use in
music and sound generation. We will look at some of these that relate to our
project and show how these models can help in our research.
MusicVAE is a recurrent variational autoencoder model developed for the creation of
symbolic music of varying lengths. The authors note that the application of VAEs
to sequential data has been limited, and that they struggle to model sequences with
long-term structure. In an attempt to solve this issue, they employ a
hierarchical decoder, which outputs embeddings for subsequences and then uses the
embeddings to generate new subsequences [9]. With this extension, MusicVAE can
generate loops by sampling points in the latent space. It also allows one to interpolate
between two loops, by moving between points in the latent space, which allows one
to mix between the loops for more musical control. In his GrooveTransformer
implementation, Behzad Haki utilizes this technique of interpolating between two loops
in the latent space, and it is something we will include in our new implementation.
GLSR-VAE is another approach to loop generation using recurrent VAEs. It employs
regularization techniques to improve the distributions of the embeddings in the latent
space. This allows for a smoother transition between loops in the space, giving more
control and similarity, while still allowing interesting variation [10]. An important
application of this work to our own is the observation that a dimension in the latent
space can be associated with the density of a generated sequence, and moving in
one direction can increase the density of the sequence. This will be utilized when
we implement our control features for our final system.
Generative Adversarial Networks (GANs) have also been utilized in the creation of
musical loops. One example of this is MidiNet (Yang et al.), where a convolutional
GAN is used to create musical sequences [11]. They propose a conditional
mechanism to use available prior knowledge, allowing the model to generate melodies
from scratch, following input via chord sequences, or by conditioning on the prior
melodies.
Other approaches to creating rhythmic spaces like the one mentioned above include
RhythmVAE by Tokui and R-VAE by Vigliensoni et al. [12] [13]. Both of these
use VAEs to map drum patterns into 2-dimensional latent space encodings. While
moving through this space, the user can generate random patterns, slightly modify
patterns, and interpolate between various patterns to create interesting rhythmic
sequences.
2.7 Rhythmic Features
For our implementation of the Triple Streams rhythm generating system, we wish
to create various control features to allow the performer to have more control over
the generated streams. Gómez-Marín et al. have recorded a general set of rhythmic
features, describing various similarity metrics for 1-bar drum patterns, which have
been collected and tested according to human ratings [14]. The features mentioned
in the paper include the density of onsets of the total rhythm, as well as the
density of onsets in low-, mid-, and high-frequency bins. They also define a means of
looking at the syncopation on both a global level and in the individual frequency bins.
We can utilize this idea of looking at the features of individual frequency
bins and apply it to our individual streams. From their work, we will implement
features based on onset density, rhythm similarity, and syncopation. Furthermore,
we can use the features described in their paper for validation of our generated
rhythmic streams, allowing us to ensure that we are decoding a diverse set of
rhythms.
2.8 GrooveTransformer
The GrooveTransformer, a controllable drum accompaniment generation system,
was developed by Behzad Haki, with input from Catalan musician Raül Refree. It
is a real-time rhythm generation system that outputs drum rhythms based on a
user's input, intended for use in live performance. Our new implementation will be
heavily based on this work, and we look to build on it for our implementation
of the Triple Streams system.
Now, we will give a brief overview of the system as a whole, the model architecture,
data sources, and the representation of the streams.
2.8.1 Overview
The current implementation of GrooveTransformer was created by Behzad Haki for
their doctoral research at Universitat Pompeu Fabra. The system is based on years
of experience working in the area of rhythm generation and representation, and in its
current form, it is a redesign based on the feedback from the musician Raül Refree
[5]. The focus of the redesign is to allow for a balance between performance and
composition, which is attained by addressing the dualism of autonomy and control
of the generated rhythmic streams. Initially, the system could only be updated by
changing the input groove or changing the sampling parameters. This was modified
to give the performer more control over the system, by the inclusion of two
additional predetermined patterns that allowed the user to interpolate between the two
additional patterns and the input groove. The addition of a VAE into the model's
architecture allowed for the interpolation between these three rhythms, enabling the
performer to modify the output in real-time [5]. In addition, two new controls were
included in the new system implementation. The first of these, five individual mute
controls, allow for the muting of the instrument groupings kick, snare, hats, toms,
and cymbals. These are enabled by learning an embedding with the same dimension as
the latent space, which is added to the latent vector before decoding. The other control,
a genre mute/selector, is created by including genre features in the training process:
the genre information is added to the encoder and, similarly to the instrument mutes,
sent to the decoder.
2.8.2 Architecture
The model architecture of the current implementation of the GrooveTransformer
consists of a Transformer VAE, where a 2-bar rhythmic pattern is encoded into a
latent distribution, and is decoded into a 9-voice drum pattern. The model consists
of three parallel decoders, one each for hits (onsets), velocities, and offsets (microtimings).
2.8.3 Data
GrooveMIDI was chosen as the dataset for the GrooveTransformer system, as it has
a large number of multi-drum MIDI files that can be used for training the model.
In addition, the MIDI files have an associated genre tag, allowing for the creation of
genre filtering in the system. Due to the genre imbalance of the dataset, additional
private data was used to balance the dataset [5].
2.8.4 Streams/Representation
GrooveTransformer uses an HVO matrix to represent the hits, velocities, and offsets
of both the input and output rhythms. The matrices are of size T × 3M, where T
is the number of time steps in the representation, counted in sixteenth notes, and M
is the number of instruments. In the final GrooveTransformer implementation, these
HVO matrices are of size 32 × 27. We get 32 time steps from 2-bar rhythm sequences
composed of sixteenth notes in 4/4 time, and 27 from multiplying the 9 instruments by 3
for the hits, velocities, and offsets. For our TripleStreams implementation, we will use
Chapter 3. Design
Expanding on the ideas in the GrooveTransformer, we want to allow the user to
move beyond the stored latent variables in a linear direction. This gives the user
a simple way to explore the space, hopefully yielding interesting results while still
maintaining some relation to the two rhythms.
Looking at figure 6, we can see the user's input rhythm encoded in the top node
labeled ZG, and the two saved data points labeled ZA and ZB. The diagram on the
right of this figure shows how we can interpolate between the two data points, but
also allow the user to move in a linear direction beyond the path between the two
points.
Figure 6: Representation of the Latent Space for Real-Time Interpolation
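The movement along the line through the two saved points can be sketched as a linear interpolation whose parameter is allowed to leave the range [0, 1]. This is a minimal illustration only: the function name, the parameter t, and the two-dimensional example vectors are assumptions, and how the input encoding ZG is blended in is not shown.

```python
def move_along_line(z_a, z_b, t):
    """Return the point z_a + t * (z_b - z_a).

    t = 0 gives z_a and t = 1 gives z_b; values of t below 0 or above 1
    continue in the same linear direction beyond the two saved points.
    """
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]


# Hypothetical 2-dimensional latent vectors for the two saved points.
z_a = [0.0, 2.0]
z_b = [1.0, 4.0]

midpoint = move_along_line(z_a, z_b, 0.5)     # between the saved points
beyond_b = move_along_line(z_a, z_b, 1.5)     # past z_b
before_a = move_along_line(z_a, z_b, -0.5)    # past z_a, in the other direction
```

Because a single scalar controls the position, the same control can cover both interpolation and the exploration beyond the stored points.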
3.2.3 Decoder
Our final subsection is the decoder. In this region, rhythms that have been
encoded in the latent space are decoded into a flattened stream. We then expand this
flattened stream into our three outputs based on the density values we have selected.
Density
After decoding the latent variable, we receive our flattened output of the three
streams. To keep the design as simple as possible, we decided to include only one set
of controls on this end. We designate these controls as 'Density', which controls the
number of onsets in each expanded stream. The number of onsets in each stream is
bounded by the total number of hits in the flattened output. If the flattened output
has 12 onsets, the maximum number of onsets each unflattened stream can have is
12. This allows the user to quickly modify the individual streams according to their
needs during a performance.
To calculate the density of each stream, we employ the Jaccard Similarity between
each individual expanded stream and the original flattened output. We define this
formally as
\[ \frac{|\text{Stream}_i \cap \text{Flattened}|}{|\text{Stream}_i \cup \text{Flattened}|} \]
Looking at figure 7, we can see the expansion of the flattened output into the three
streams based on the density feature control values. Setting our density feature
control to 70% for the top stream allows us to have 7 of the 10 onsets found in the
flattened output. Both velocity and offset values will be retained from the flattened
stream.
Figure 7: Expanding the output into the three streams based on the Density Feature
Control Values
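The 70% example can be reproduced with a short sketch. The Jaccard computation follows the definition given for the density; the way onsets are chosen when a stream is expanded is not specified here, so the rule of keeping the loudest onsets, the function names, and the example data are all assumptions made for illustration.

```python
def jaccard(stream_steps, flattened_steps):
    """Jaccard similarity between two sets of onset positions."""
    a, b = set(stream_steps), set(flattened_steps)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def expand_stream(flattened, density):
    """Keep a fraction of the flattened onsets for one output stream.

    flattened: dict mapping grid step -> (velocity, offset).
    Assumption: the loudest onsets are kept first. Velocity and offset
    values are copied unchanged from the flattened output.
    """
    n_keep = round(density * len(flattened))
    loudest = sorted(flattened, key=lambda s: flattened[s][0], reverse=True)[:n_keep]
    return {step: flattened[step] for step in sorted(loudest)}


# Ten onsets with distinct velocities (0.1 ... 1.0) on every third grid step.
flattened = {step: (0.1 * (i + 1), 0.0) for i, step in enumerate(range(0, 30, 3))}
top_stream = expand_stream(flattened, density=0.7)
top_density = jaccard(top_stream, flattened)
```

Since every expanded stream is a subset of the flattened output, the Jaccard similarity reduces to the share of flattened onsets that the stream keeps, here 7 of 10.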
Chapter 4
Datasets
A central focus of this research is exploring different ways to extract rhythmic
information from diverse sources, moving beyond a drum-centric perspective in which
each generated stream corresponds to a specific drum. Instead, we investigate more
abstract methods for constructing rhythmic streams. This includes drawing on
datasets that are not strictly percussive as well as developing strategies for
partitioning and recombining data to reconstruct rhythms.
During the data curation process, we applied several criteria. First, the dataset
needed to carry a suitable license; those under the Creative Commons Attribution
4.0 International (CC BY 4.0) license were considered acceptable. Second, because
our method requires accurate tempo and time signature information to compute
offset values from MIDI files, only datasets that retained this metadata were usable.
This restriction excluded some otherwise valuable collections. Third, we prioritized
datasets with a broad range of velocities and offsets in order to provide richer
training data for the model. Based on these criteria, we selected a set of annotated
Candombe recordings, the LAKH MIDI dataset, GrooveMIDI, El Bongosero, and
TapTamDrum.
The following sections detail our rationale for choosing these datasets, the specific
processing steps used to reduce them into four streams, and the challenges
encountered. We then describe the conversion of the curated material into the HVO
representation used for training, along with the construction of a new dataset created
specifically for this project.
All data manipulation was implemented in Python and is available in the project's
GitHub repository. For MIDI-based datasets, we relied extensively on PrettyMIDI
for reading, analyzing, modifying, and writing MIDI files [15]. All datasets used fall
under the Creative Commons Attribution 4.0 International license, making them
suitable for our purposes.
4.1 Annotated Candombe Recordings
The dataset of Candombe recordings was created by Luis Jure, Martín Rocamora,
Simone Tarsitani, and Martin Clayton. It consists of multiple live recordings of
Candombe artists recorded in a studio in Uruguay in 2018. Specifically, the
recordings feature one player on the piano drum, one on the chico, and two on
repique [16]. The dataset also contains annotations of onsets and velocities. Offset
values were not included, but were calculated by Satyajeet Prabhu and Anmol Mishra.
The offset values are only available for the tracks with one repique, so we omit the
recordings with two repique players.
In the tradition of Candombe music, each drum has a specific role in the
performance. The piano, which is the largest of the drums, has the lowest frequency
and provides a melodic foundation for the other drums. The repique is the second
largest of the three drums and is used to add variation and improvisation to the
performance. The 'llamada', or call and response, is performed on these drums and
is central to Candombe. Finally, the chico is the smallest of the three and is used to
provide the stable rhythm of the performance. The Candombe rhythm comes from
the interaction of these three drums and their functions, creating a very complex
rhythmic structure.
Though we have stated we wish to move away from mapping one drum to one
output stream, we found it was best to have each drum represent one stream, our
fourth stream being a flattened combination of the three drums. This allows for the
complex rhythmic structure to be maintained in our model training.
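One way to build such a flattened fourth stream is sketched below. The rule used when several drums play on the same grid step is not stated in the text, so keeping the loudest hit (together with its offset) is an assumption, as are the function name and the toy data.

```python
def flatten_streams(streams):
    """Merge several streams into one flattened stream.

    streams: list of dicts mapping grid step -> (velocity, offset).
    Assumption: when several streams hit on the same step, the loudest
    hit and its offset are kept.
    """
    flattened = {}
    for stream in streams:
        for step, (velocity, offset) in stream.items():
            if step not in flattened or velocity > flattened[step][0]:
                flattened[step] = (velocity, offset)
    return dict(sorted(flattened.items()))


# Toy data for the three Candombe drums (values are illustrative only).
piano = {0: (0.9, 0.0), 8: (0.8, 0.02)}
repique = {4: (0.6, -0.01), 8: (0.5, 0.0)}
chico = {0: (0.4, 0.01), 2: (0.5, 0.0)}

groove = flatten_streams([piano, repique, chico])
```

The flattened stream contains every grid step on which at least one drum plays, so each individual drum stream is a subset of it.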
The data preprocessing code for our Candombe dataset can be found in the
'candombe.py' file in our repository.
4.2 LAKH
The LAKH dataset consists of 176,581 unique MIDI files, including 45,129 files that
have been matched to songs in the Million Song Dataset [17]. For our purposes,
the LAKH dataset offers an assortment of instruments that we can use for creating
various rhythmic streams, allowing different instruments to fill the role of the
'Groove', e.g. using bass guitar as the main groove, and using the piano, electric
guitar, and drum instruments as the related streams.
Due to the size of the LAKH dataset, we need to be precise with how we choose our
combinations of instruments. First, to condense the large number of instruments into
their families, like guitar, electric guitar, lead guitar into 'Guitar', we match on all
the associated MIDI program numbers and merge these into one instrument with the
name we will be using. We then look at the potential instruments to use and choose
those that we assume to be better suited for our task. For example, a combination
like "Piano", "Guitar", "Drums", and "Bass" seems like a classic example that would
have a good underlying rhythmic structure that we can utilize. However, certain
things like "Strings" or "Synth Pads" tend to have long sustained notes and may
not have the most interesting onset and micro-timing information for our research,
so these are excluded. Our original subset of instruments to use consists of "Piano",
"Percussion", "Guitar", "Bass", "Brass", "Ethnic", "Percussive", "Sound Effects",
"Drums", and "Synth Effects". To find which combinations of these instruments
are most represented in our dataset, we create a small subset of 4,200 MIDI files,
plot a histogram of the top ten combinations of four of the instruments we have
selected, and choose these for creating our rhythmic streams. In figure 8 we can see
that "Bass", "Drums", "Guitar", "Piano", "Brass", "Percussion", and "Percussive"
have the highest numbers, so we will use these for our rhythmic stream creation.
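The family grouping and the combination count can be sketched as follows. General MIDI assigns each block of eight consecutive program numbers (0 to 127) to one instrument family, which is what the lookup relies on. The function names and the toy file list are hypothetical, the "Percussion" label is assumed to correspond to General MIDI's chromatic percussion block, and drum tracks are assumed to be flagged separately (as PrettyMIDI does with an instrument's is_drum attribute).

```python
from collections import Counter
from itertools import combinations

# General MIDI families, one per block of eight program numbers (0-indexed).
GM_FAMILIES = [
    "Piano", "Percussion", "Organ", "Guitar", "Bass", "Strings", "Ensemble",
    "Brass", "Reed", "Pipe", "Synth Lead", "Synth Pads", "Synth Effects",
    "Ethnic", "Percussive", "Sound Effects",
]


def family_of(program, is_drum=False):
    """Map a MIDI program number to its instrument family name."""
    if is_drum:
        return "Drums"
    return GM_FAMILIES[program // 8]


def count_combinations(files, allowed, size=4):
    """Count how often each combination of `size` allowed families co-occurs.

    files: list of files, each a list of (program, is_drum) pairs.
    """
    counts = Counter()
    for instruments in files:
        families = sorted({family_of(p, d) for p, d in instruments} & allowed)
        counts.update(combinations(families, size))
    return counts


# Toy stand-in for a subset of MIDI files.
files = [
    [(0, False), (25, False), (33, False), (0, True)],
    [(4, False), (30, False), (35, False), (0, True), (48, False)],
    [(0, False), (56, False)],
]
allowed = {"Piano", "Guitar", "Bass", "Drums", "Brass"}
top = count_combinations(files, allowed).most_common(10)
```

Calling most_common(10) on the counter yields the data for a top-ten histogram like the one described above.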
Figure 8: Top 10 instrument combinations from the LAKH dataset.
Because the LAKH MIDI dataset consists of entire songs, it was decided to only use
a portion of each MIDI file. There are many 2-bar loops that are repeated and these
become redundant in our training. To accomplish this, we sort the HVO sequences
of each MIDI file by the number of total hits. We then select the second half of the
sorted list and take eight evenly spread HVO sequences. This will also help reduce
instances where there are little to no onsets in the HVO sequences.
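The selection step can be sketched as below. It assumes that the sequences are sorted in ascending order of hit count, so that the second half of the list is the denser half, and that "evenly spread" means evenly spaced indices within that half. The function name and the stand-in data are hypothetical.

```python
def select_evenly_spread(sequences, count_hits, n_keep=8):
    """Pick n_keep evenly spaced sequences from the denser half.

    sequences: list of 2-bar sequences from one MIDI file.
    count_hits: function returning the total number of hits of a sequence.
    """
    ranked = sorted(sequences, key=count_hits)   # ascending by hit count
    upper_half = ranked[len(ranked) // 2:]       # assumption: the denser half
    if len(upper_half) <= n_keep:
        return upper_half
    step = (len(upper_half) - 1) / (n_keep - 1)
    indices = [round(i * step) for i in range(n_keep)]
    return [upper_half[i] for i in indices]


# Stand-in data: each "sequence" is represented only by its hit count.
hit_counts = list(range(40))
selected = select_evenly_spread(hit_counts, count_hits=lambda s: s)
```

Taking the selection from the denser half is what removes the nearly empty sequences, while the even spacing avoids picking eight near-identical repetitions of the same loop.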
The code for our LAKH MIDI preprocessing, as well as the code for selecting the
eight HVO sequences, can be found in lakh_midi.py and lmd_best_eight.py.
4.3 GrooveMIDI
The GrooveMIDI dataset contains more than 13.6 hours of aligned MIDI of
human-performed drumming and offers us a great deal of material to use for training [18].
Behzad Haki's GrooveTransformer system was trained on this dataset to create its
output drum streams. For our purposes, we are looking for other ways we can merge
these various drum tracks into three or four streams. In "Drum Rhythm Spaces"
(Gómez-Marín et al. 2020), it is noted that humans perceive the roles of various
drums in three frequency bands, i.e., low, mid, and high. We use this as a way to
separate the MIDI recordings into three separate streams, using a flattened version
of the three as the groove [19]. Our hope is that this will maintain some of the
'functions' of the drums, e.g., the kick drum as a pulse in western music, the snare
drum to reinforce the meter in rock, etc. The three partitions we create include
the low partition, including kick drum and low tom, the mid partition, including
snare drums, mid toms, and high toms, and the high partition, including hi-hats,
ride cymbals, and crash cymbals.
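The three-band partition can be written as a lookup from drum pitch to band. The sketch assumes a reduced nine-voice drum mapping with General MIDI note numbers (36 kick, 38 snare, 42 closed hi-hat, 46 open hi-hat, 43 low tom, 47 mid tom, 50 high tom, 49 crash, 51 ride); the exact pitches present in the processed files, the names used here, and the handling of unmapped pitches are assumptions.

```python
LOW, MID, HIGH = "low", "mid", "high"

# Assumed reduced General MIDI drum pitches for each frequency band.
PITCH_TO_BAND = {
    36: LOW,   # kick drum
    43: LOW,   # low tom
    38: MID,   # snare drum
    47: MID,   # mid tom
    50: MID,   # high tom
    42: HIGH,  # closed hi-hat
    46: HIGH,  # open hi-hat
    51: HIGH,  # ride cymbal
    49: HIGH,  # crash cymbal
}


def partition_by_band(notes):
    """Split drum notes into low, mid, and high streams.

    notes: list of (pitch, start_in_beats, velocity) tuples.
    Notes with a pitch outside the mapping are skipped.
    """
    bands = {LOW: [], MID: [], HIGH: []}
    for note in notes:
        band = PITCH_TO_BAND.get(note[0])
        if band is not None:
            bands[band].append(note)
    return bands


notes = [(36, 0.0, 100), (42, 0.0, 80), (38, 1.0, 110),
         (43, 1.5, 90), (51, 2.0, 70), (60, 2.5, 64)]
bands = partition_by_band(notes)
```

The same pattern covers the other groupings described in this section by swapping the lookup table.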
For our second grouping, we partition the recordings into velocity bins. Loudness
and its MIDI representation, velocity, can be used to emphasize certain portions of
the rhythm. These emphasized moments could be accents to reinforce the meter, or
interesting events in a musical piece. Our intention was to partition these emphasized
portions, where one stream would consist of onsets with high velocity that would
stand out against the other sounds, one stream would be very subtle and could be
used for minor changes, and one would sit in between. From a performance perspective,
these could be very useful when mapped to parameters such as synth timbre and
envelopes, or used to control effects like delay.
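A velocity-based split of this kind can be sketched with two thresholds. The bin edges are not given in the text, so the values used here (equal thirds of the MIDI velocity range) are placeholders, and the function and band names are hypothetical.

```python
def velocity_band(velocity, soft_max=42, medium_max=84):
    """Assign a MIDI velocity (0-127) to one of three bins.

    The thresholds are placeholder assumptions, not the values used
    for the dataset.
    """
    if velocity <= soft_max:
        return "soft"
    if velocity <= medium_max:
        return "medium"
    return "loud"


def partition_by_velocity(notes):
    """Split notes given as (pitch, start_in_beats, velocity) into three streams."""
    streams = {"soft": [], "medium": [], "loud": []}
    for note in notes:
        streams[velocity_band(note[2])].append(note)
    return streams


velocity_streams = partition_by_velocity(
    [(38, 0.0, 20), (38, 0.5, 60), (38, 1.0, 120), (36, 1.0, 85)]
)
```

Ghost notes would land in the soft stream and accents in the loud stream, which matches the intended use of each stream as a separate control source.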
The third grouping that we use is rooted in the traditions of western rock, pop,
and electronic music. We focus on four instrument families, kick drum, snare, toms,
and hi-hats, and create new MIDI files based on these groupings. This allows us to
abstract the roles of the instruments slightly, not mapping each individual drum to
a stream, but focusing on how they are used in these styles of music.
Our last two groupings were chosen as potentially interesting groupings that would
also yield more sparse streams. One of these is a collection of the cymbal
instruments, where one stream is the open hi-hat, one is the closed hi-hat, one is the crash
cymbal, and the last is the ride cymbal. Certain styles of music, such as jazz, have
a tendency to incorporate cymbals into the rhythm, and we assume this could be
an interesting grouping. In a similar fashion, we make a grouping out of the three
toms and the ride cymbal.
GrooveMIDI is a large dataset, and these five groupings allow us to diversify the
HVO sequences that we create from it. After processing our dataset, we have 82,687
2-bar rhythms split into five different groups. We can see in figure 9 the velocity,
functional, and pitch (frequency) groupings, each representing roughly a quarter of
the dataset, which is expected. The other two have less representation because they
are drum specific, and certain recordings may not have all the individual drums to
create the grouping.
Figure 9: Percentages of each GrooveMIDI grouping of total dataset.
The code for our preprocessing of the GrooveMIDI dataset can be found in the file
named groove_midi.py.
4.4 El Bongosero
El Bongosero is a large-scale symbolic dataset created by Behzad Haki et al.
consisting of 6,035 crowd-sourced improvised drum performances by 3,184 participants
of varying levels of experience [20]. The participants were asked to select a genre
of a backing track from the GrooveMIDI dataset and improvise on a set of digital
bongos over a 2-bar loop. Once finished, they could listen to their recording and
overdub additional hits. The participants were then asked to specify their level of
experience out of a maximum of 5 points, as well as rate their performance out of 5
points.
For our dataset, we choose to use the events from the left hand, right hand, a
combination of both left and right hands, and the flattened GrooveMIDI track. The
inclusion of the 'both hands' event allows us to extract interesting rhythms that are
directly related to the two other rhythmic streams by sharing overlapping onsets,
and can aid in the training of the model in regards to Rhythmic Similarity, Accent
Similarity, and the Density features. To better curate our dataset, we include only
sessions with an expert ranking of 4 or higher and a user rating of 3 or higher.
With our restrictions in place, the El Bongosero dataset is reduced to 2243 total
recorded sessions. Of these remaining recorded sessions, 1917 contain events where
the left hand plays, 1863 where the right hand plays, and 1221 where both hands
play simultaneously. This can be seen in figure 10. Given the number of hits for each
of our selections, we feel that our choice of events from the El Bongosero dataset is
justified and will add interesting diversity to our final dataset for model training.
Figure 10: Recorded Session Counts for Left Hand, Right Hand, Both Hands, and
total sessions for our filtered El Bongosero dataset.
The preprocessing code for our El Bongosero dataset can be found in the file
bongosero.py.
4.5 TapTamDrum
TapTamDrum is another symbolic dataset created by Behzad Haki et al., where four
expert drummers are tasked with improvising on two drum pads over 2-bar loops
chosen from GrooveMIDI [21]. The sessions last approximately one hour, and each
expert drummer plays a variety of genres. After each session, the performers are
asked to rate their session out of five points.
We follow a similar method with the El Bongosero dataset, using the groupings left
hand, right hand, both hands, and the flattened GrooveMIDI track. To add variety
to our training data, we create two datasets from TapTamDrum, allowing the 'both
hands' category to be when both hands play simultaneously (Intersection) or when
at least one hand is playing (Union). Additionally, we restrict our dataset to entries
that have a user rating of 4 or higher. After restricting our dataset, we have 1116
sequences in each dataset, totaling 2232 TapTamDrum HVO sequences we can use
for training.
Our code for the TapTamDrum dataset can be found in the file tap_tam_drum.py.
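The two 'both hands' variants can be illustrated on binary hit vectors. The sketch below is an assumption about the representation (one 0/1 entry per grid step for each hand) and is not taken from tap_tam_drum.py; the function name `both_hands` is hypothetical.

```python
def both_hands(left, right, mode="intersection"):
    """Combine two binary hit vectors, one entry per grid step.

    intersection: a hit only where both hands play on the same step.
    union: a hit wherever at least one hand plays.
    """
    if len(left) != len(right):
        raise ValueError("hit vectors must have the same length")
    if mode == "intersection":
        return [int(bool(l) and bool(r)) for l, r in zip(left, right)]
    if mode == "union":
        return [int(bool(l) or bool(r)) for l, r in zip(left, right)]
    raise ValueError(f"unknown mode: {mode}")


left = [1, 0, 1, 0, 0, 1, 0, 0]
right = [1, 1, 0, 0, 0, 1, 0, 1]

inter = both_hands(left, right, "intersection")
union = both_hands(left, right, "union")
print(inter)
print(union)
```

The Intersection stream is always a subset of each hand's onsets, while the Union stream is a superset, so the two datasets give the third stream noticeably different densities for the same performance.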
4.6 Dataset Pre-processing
The datasets described above are represented in various ways. LAKH and GrooveMIDI
are both MIDI datasets, and require the additional usage of PrettyMIDI to open,
merge the instruments, and rewrite to MIDI files. The Candombe annotations were
stored in CSV files, and only required the use of Pandas for extracting the data and
saving into HVO representations. Because they included the onsets, velocities, and
offsets, we only needed to transfer this information to new HVO objects. Both El
Bongosero and TapTamDrum were already in HVO representation, only requiring
the transfer of the hit, velocity, and offset values into new HVO objects.
For the data preprocessing task, the MIDI datasets, LAKH and GrooveMIDI, took
the longest to create. First, we had to group the various instruments into their larger
family, such as electric piano and acoustic piano into 'Piano'. After we designated
the groupings, we saved each individual instrument family for each file, to be loaded
into HVO sequences via the HVO MIDI loader. It was observed that additional
notes were written to the newly created MIDI files, which was caused by very short
durations in the original MIDI file that were smaller than the time intervals
supported by PrettyMIDI. To overcome this, we wrote a small method to remove any

the total velocities changed. This implies that the model struggles to create streams
with changing accents for values between the maximum and minimum. Looking
at figure 16 we see that our Accent Similarity feature is not working as well as we
had expected. We would expect this to have a similar shape as the graph from the
Rhythmic Similarity, growing monotonically. For the original model, there is some
growth, but the rate of change is very slow until it reaches the feature values of 8
and 9, where it increases much more rapidly. In the flex model, we see no change in
our Hamming Distance and we can assume that reducing the number of tokens is
not a valid way to improve our model. The variance also increases between values
of [−7,7]. This shows us our current implementation is not a reliable control in its
current state.
(a) Hamming Distance for various Accent Similarity values using original model
(b) Hamming Distance for various Accent Similarity values using flex model
Figure 16: Hamming Distance for various Accent Similarity values for original and
flex models
Another issue can be observed in figure 17. When looking at stream 1 and stream 2
for the maximum distance value, we can see there are many values that are less than
those of stream 3 (the darker shading represents higher velocities). This implies that
there is a relationship between the density values and the Accent Similarity, and we
shall explore these unintended relationships later in this section.
(a) Accent Similarity Minimum (b) Accent Similarity Half (c) Accent Similarity Max
Figure 17: Output Streams of Three Accent Similarity values
5.3.1 Density
In Section 3.2.3, we explain how the density feature is calculated for each of the
output streams. Our intent was to have a control that could dictate the number of
onsets in each stream, relative to the flattened output. For many of the outputs,
our expectations are met by the number of onsets in each stream. We can see in
figure 18 that setting stream 1 to two, stream 2 to five, and stream 3 to nine gives us
a stream with very few onsets, and the others add more based upon the values
specified. The same holds for the image on the right. Stream 1 and stream 3 are
set to the maximum value and stream 2 is set to one. Both images are using a
Rhythmic Similarity distance set to zero, so should mirror the groove displayed on
the bottom.
(a) Density Values set to 2, 5, and 9 out of 9.
(b) Density Values set to 9, 1, and 9 out of 9.
Figure 18: Various Density Values of Output Streams
However, when we decide to maximize all the values, we can see there are some
issues in the decoding. The pattern on the left of figure 19 has a density value of
seven for stream 1, nine for stream 2, and nine for stream 3, out of a maximum of
nine. All other controls are set to zero. The image on the right has all density values
of the streams set to nine, the maximum value. We would assume that setting each
stream's density value to the maximum in this scenario would contain all the onsets
of the input groove, and each stream would be the same as the other. But both
stream 1 and stream 3 have fewer onsets than stream 2, as well as the input. This
may be an issue with training, or with how we calculate the onsets on the decoding
side, and more investigation needs to happen.
(a) Density Values set to 7, 9, and 9 out of 9.
(b) Density Values set to 9, 9, and 9 out of 9.
Figure 19: Issues with Densities set to Maximum Values
5.3.2 Output Stream Quality
To judge the general quality of the output streams, we will look at three criteria.
The first is that the onsets create an interesting pattern. By interesting, we mean a
pattern that has rhythmic qualities, but has variations in the positions of the onsets.
An example of uninteresting results would be having many patterns generated for
various grooves, but all the generated onsets appear on the down beat, or where the
same pattern is repeated each measure.
Our second criterion is to look at the velocities of the generated streams. We expect
to see variation in the velocities, accenting certain onsets, and giving the rhythm
multidimensionality.
For our third criterion, we are concerned with the offset values of the generated onsets.
This shows our model is capable of recreating these offsets, and can output rhythms
that consist of triplets that aren't aligned to the grid.
In figure 20 we have four different piano rolls displaying the output of four different
input grooves. The values of the feature controls have been randomly selected. The
piano rolls of figures a, b, and c all show interesting variation of the onsets. We
have instances where little to no repetition happens in the 2-bar loop, such as for
figures a and b. In figure c, we see the pattern repeats itself for each bar, but this
may be an issue with the input, which also includes three repeating notes for each
bar. In figure d, we see the least amount of diversity. The second bar is a copy of
the first for each of the three streams. Furthermore, Stream 2 contains onsets on
all but one position in each bar. Our assumption is the input groove's repetitive
pattern, and a high density value for stream 2, is what causes this lack of diversity.
Looking at the velocity values of these four examples creates some concern. For each
of the streams in each of the patterns, velocity values do not exceed a value of 0.7.
The input grooves all contain various velocities between minimum and maximum
values, so we would expect to see more varying values for our output streams. This
has been a common occurrence in all of our attempts, and not specific to these four
examples. Changing the Accent Similarity value will cause velocity values to be
raised, but this differs from our expectations. The Accent Similarity should shift
the velocity values. If we have a large difference, onsets that have a large velocity
in the input should have smaller velocities in the output streams, and onsets that
have small velocities in the input stream should have large velocity values in the
output streams.
(a) (b)
(c) (d)
Figure 20: Four Examples of Generated Streams for Four Different Inputs
Finally, when we look at the offset values of our outputs, we see much less variation
than we had hoped for. The only example that has offset values less than or greater
than 0 is figure b. The other three examples only contain onsets snapped to the grid.
Each of the input grooves has varying levels of offsets, so we cannot say that this is
an issue with the inputs. Furthermore, after many attempts, we still find that most
of our outputs contain little to no variation in the offset values. Our assumption is
that this is caused by a lack of offset diversity in the LAKH MIDI training set, which
is something we will discuss later in this paper.
5.3.3 Relationships between Feature Values
One of the unintended relationships we have noticed in our system is between the
Rhythmic Similarity control and the Accent Similarity control. In figure 21 we look
at how changing the Rhythmic Similarity distance affects the Accent Similarity and
vice versa.
(a) Original Model - Rhythmic Similarity held constant while increasing Accent
Similarity
(b) Flex Model - Rhythmic Similarity held constant while increasing Accent
Similarity
Figure 21: Issues with Densities set to Maximum Values
If we set the Accent Similarity distance to the maximum value, and our Rhythmic
Similarity distance to something low, we see we have onsets in the same positions
as the input groove, and velocity values that are higher than the lower velocities
in the input. However, as we increase the Rhythmic Similarity distance we notice
the velocity values begin to decrease in our output. This can be seen in figure 22.
The image on the left shows our output with a maximum distance for the Accent
Similarity and a distance of 1/3 the maximum for Rhythmic Similarity. Our velocity
values have a maximum of 0.754 and a minimum of 0.180. On the right we can see
what happens when we raise our Rhythmic Similarity to the maximum value. Here
our velocities have a maximum value of 0.528 and a minimum value of 0.088. We
would expect to maintain the same level of velocity for our output, and to have high
velocity when the same position in the input either contains an onset with a low
velocity or no onset at all. This issue may be due to the similar ways we calculate
these two features and we shall discuss this later in the paper.
(a) Density Values set to 2, 5, and 9 out of 9.
(b) Density Values set to 9, 1, and 9 out of 9.
Figure 22: Velocity values decreasing as we raise the Rhythmic Similarity Distance
A second unexpected relationship was found between the Rhythmic Similarity
feature and the Density features for each of the output streams. In figure 23 we see
three different scenarios when we have a very active input with many onsets.
(a) Rhythmic Similarity distance value of 8 out of 32
(b) Rhythmic Similarity distance value of 19 out of 32
(c) Rhythmic Similarity distance value of 32 out of 32
Figure 23: Relationship between Rhythmic Similarity and Density for active input
When the Rhythmic Similarity distance value is low, we allow our output onsets to
happen at the same position as the input onsets. If we set our density values to
large amounts, such as the maximum, we see very active outputs. This is expected
behavior. More density implies more onsets. However, as we increase the amount
of distance for our Rhythmic Similarity feature, we see the onsets in the output
streams decreasing. Knowing how the features are calculated, this makes perfect
sense. The Rhythmic Similarity measures the distance between the input and the
flattened output. A maximum distance between the two would only allow onsets
in our output to occur at locations where there are no onsets in the input. For
an input containing many onsets, a maximum distance for the Rhythmic Similarity
would contain very few onsets. Because of this, the density of each of the streams
would be bounded by the number of onsets in the flattened output stream. If there
are only four locations for our onsets to occur, a maximum density value would
allow four onsets in that stream. However, this can be very confusing from a user's
perspective. In most situations, increasing the density will increase the number of
onsets in the output, and it would be safe to assume that many users will have this
assumption.
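The bound described above can be made concrete with a small worked example. The sketch assumes the Rhythmic Similarity distance is a Hamming distance between binary onset vectors on a 32-step grid (consistent with the "out of 32" values in figure 23); the helper names are hypothetical and this is not the thesis implementation.

```python
def hamming_distance(a, b):
    """Number of grid steps where the two binary onset vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_distance_output(input_hits):
    """The only flattened output at maximum distance: every step is flipped,
    so output onsets can sit only where the input is silent."""
    return [1 - h for h in input_hits]


# A very active 32-step input: onsets everywhere except four steps.
input_hits = [1] * 32
for silent_step in (3, 11, 19, 27):
    input_hits[silent_step] = 0

flattened = max_distance_output(input_hits)
distance = hamming_distance(input_hits, flattened)
available_onsets = sum(flattened)
print(distance, available_onsets)
```

With 28 onsets in the input, a distance of 32 leaves only four positions for output onsets, so even a maximum density setting cannot place more than four onsets in any one stream.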
Chapter 6
Discussion
6.1 Data
In our discussion of the model we have seen that our system does not quite meet
our expectations. Specifically, we have noticed issues around the rhythmic structure
of our outputs being too repetitive, unintended relationships between our features
for controlling the rhythm outputs, and a lack of variation in our training data. In
this section, we will look at the specific problems of our current model and how we
can improve it in the future.
One of the issues we raised is the lack of variation in our output rhythms. As we
have seen, we often have outputs that do not contain much diversity of velocity
values or of offset values. We have also noticed a tendency to create outputs that
have repetitive patterns in the sequence and lack variation in the placement of the
onsets. One of the biggest factors creating this issue is the datasets we have used,
specifically the LAKH MIDI dataset. Although the dataset allowed us interesting
groupings from the multiple instruments used, the patterns in the MIDI files are
lacking in the variation of velocity and offsets that we were hoping to find in our
output.
In figure 24 we have three examples of the LAKH MIDI training data.
(a) Example 1 of LAKH MIDI Training Data
(b) Example 2 of LAKH MIDI Training Data
(c) Example 3 of LAKH MIDI Training Data
Figure 24: Three Examples of LAKH MIDI Training Data
We can see there is very little variation in the velocity values, offset values, and
the onset positions. Comparing these piano rolls with those generated from our
other datasets (figure 25), we can find much more diversity in the hit, velocity, and
offset values.
(a) Example of Candombe Training Data
(b) Example of El Bongosero Training Data
(c) Example of TapTamDrum Training Data
Figure 25: Training Data from Candombe, El Bongosero, and TapTamDrum
Datasets
One of the issues with the LAKH dataset is that it contains full songs. When we split
these full songs into 2-bar HVO sequences, we find that many of the sequences do not
have the qualities we have mentioned, such as diversity in the velocities or offsets.
Initially, we filtered the full songs from the original LAKH MIDI dataset by the
number of velocity changes, but never did this for the 2-bar split HVO sequences.
To rectify this, we can remove 2-bar HVO sequences that do not have a certain
number of velocity or offset changes. In figure 26 we can see the number of velocity
changes for a small subset of our LAKH MIDI 2-bar splits used for training. Out of
267,713 HVO sequences, 174,496 have only one velocity value. This will have a huge
impact on training our model. However, if we restrict the HVO sequences to ones
with six or more velocity changes, we still have 17,002 remaining sequences. Our full
LAKH MIDI dataset contains sixty partitions of the 2-bar HVO sequences, which
would provide more than enough data to meet our criteria to train our model.
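The proposed filter over 2-bar splits can be sketched as follows. This is an illustration under stated assumptions: each sequence is represented here as a plain `(hits, velocities)` pair of per-step lists rather than an HVO object, and a "velocity change" is taken to mean a difference between the velocities of two consecutive onsets, which may not match the exact definition used for figure 26.

```python
def velocity_changes(hits, velocities, tol=1e-6):
    """Count how often the velocity differs between consecutive onsets."""
    onset_vels = [v for h, v in zip(hits, velocities) if h]
    return sum(abs(b - a) > tol for a, b in zip(onset_vels, onset_vels[1:]))


def filter_sequences(sequences, min_changes=6):
    """Keep only sequences with at least `min_changes` velocity changes."""
    return [
        (hits, vels) for hits, vels in sequences
        if velocity_changes(hits, vels) >= min_changes
    ]


# A flat sequence (single velocity value) and a varied one, 8 steps each.
flat = ([1] * 8, [0.8] * 8)
varied = ([1] * 8, [0.2, 0.9, 0.4, 0.7, 0.3, 0.8, 0.5, 0.6])

flat_changes = velocity_changes(*flat)
varied_changes = velocity_changes(*varied)
kept = filter_sequences([flat, varied])
print(flat_changes, varied_changes, len(kept))
```

An analogous count over offset values would let the same filter enforce offset diversity; the threshold of six is the value discussed in the text and would need tuning against how much data remains.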
Figure 26: Velocity Changes in 2-bar splits of a Subset of LAKH MIDI
6.2 Features
Another area of concern that we have mentioned is the relation between the control
features of our system. Specifically, the relation between the Rhythmic Similarity
and the Accent Similarity controls and the relation between the Rhythmic Similarity
and the Density controls.
Our intention with the Accent Similarity control was to allow the user to inject a
feeling of syncopation into the output, in relation to the input groove. However, we
discovered that it is difficult to find a reasonable way to calculate and implement
syncopation to be used as a control in our system. Furthermore, the similarities
between the way we calculate the Rhythmic Similarity and Accent Similarity controls
have made it very difficult to isolate how they impact the output of the system.
For future work, we will remove the Accent Similarity, and focus on ensuring the
Rhythmic Similarity works exactly as we intend. Once we are certain that there
are no issues with our feature on the encoding side, we can begin to add additional
features and be more aware of their impact on the output.
Another realization is the need for the generation of the output to be invariant to
the feature controls. The controls should affect how the onsets are distributed and
arranged, but not the number of onsets in the flattened output. Moving the Rhyth-
Bibliography
[1] Jiang, H. H. et al. AI art and its impact on artists. In Rossi, F., Das, S., Davis,
J., Firth-Butterfield, K. & John, A. (eds.) Proceedings of the 2023 AAAI/ACM
Conference on AI, Ethics, and Society, AIES 2023, Montréal, QC, Canada,
August 8-10, 2023, 363–374 (ACM, 2023). URL https://doi.org/10.1145/3600211.3604681.
[2] Nugroho, Y. Y. T. & Manggala, P. P. M. D. The use of ai in creating music
compositions: A case study on suno application. In Proceedings of the 7th Celt
International Conference (CIC 2024), 177–189 (Atlantis Press, 2024). URL
https://doi.org/10.2991/978-2-38476-348-1_13.
[3] Vaswani, A. et al. Attention is all you need (2023). URL
https://arxiv.org/abs/1706.03762. 1706.03762.
[4] Kingma, D. P. & Welling, M. Auto-encoding variational bayes (2022). URL
https://arxiv.org/abs/1312.6114. 1312.6114.
[5] Haki, B. Design, development, and deployment of real-time drum accompaniment
systems (2025).
[6] Huang, C. A. et al. An improved relative self-attention mechanism for
transformer with application to music generation. CoRR abs/1809.04281 (2018).
URL http://arxiv.org/abs/1809.04281. 1809.04281.
[7] Huang, Y. & Yang, Y. Pop music transformer: Generating music with rhythm
and harmony. CoRR abs/2002.00212 (2020). URL
https://arxiv.org/abs/2002.00212. 2002.00212.
[8] Gillick, J., Roberts, A., Engel, J. H., Eck, D. & Bamman, D. Learning to groove
with inverse sequence transformations. CoRR abs/1905.06118 (2019). URL
http://arxiv.org/abs/1905.06118. 1905.06118.
[9] Roberts, A., Engel, J. H., Raffel, C., Hawthorne, C. & Eck, D. A hierarchical
latent vector model for learning long-term structure in music. In Dy, J. G. &
Krause, A. (eds.) Proceedings of the 35th International Conference on Machine
Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15,
2018, vol. 80 of Proceedings of Machine Learning Research, 4361–4370 (PMLR,
2018). URL http://proceedings.mlr.press/v80/roberts18a.html.
[10] Hadjeres, G., Nielsen, F. & Pachet, F. GLSR-VAE: geodesic latent space
regularization for variational autoencoder architectures. In 2017 IEEE Symposium
Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November
27 - Dec. 1, 2017, 1–7 (IEEE, 2017). URL
https://doi.org/10.1109/SSCI.2017.8280895.
[11] Yang, L., Chou, S. & Yang, Y. Midinet: A convolutional generative adversarial
network for symbolic-domain music generation using 1d and 2d conditions.
CoRR abs/1703.10847 (2017). URL http://arxiv.org/abs/1703.10847.
1703.10847.
[12] Tokui, N. Towards democratizing music production with ai-design of
variational autoencoder-based rhythm generator as a DAW plugin. CoRR
abs/2004.01525 (2020). URL https://arxiv.org/abs/2004.01525. 2004.01525.
[13] Vigliensoni, M. L. M. E., G. & Fiebrink, R. R-vae: Live latent space drum
rhythm generation from minimal-size datasets. Journal of Creative Music
Systems 1(1) (2022). URL https://doi.org/10.5920/jcms.902.
[14] Gómez-Marín, D., Jordà, S. & Herrera, P. Drum rhythm spaces: From global
models to style-specific maps. In Aramaki, M., Davies, M. E. P.,
Kronland-Martinet, R. & Ystad, S. (eds.) Music Technology with Swing - 13th
International Symposium, CMMR 2017, Matosinhos, Portugal, September 25-28, 2017,
Revised Selected Papers, vol. 11265 of Lecture Notes in Computer Science,
123–134 (Springer, 2017). URL https://doi.org/10.1007/978-3-030-01692-0_9.
[15] Raffel, C. & Ellis, D. P. W. Intuitive analysis, creation and manipulation of
MIDI data with pretty_midi. In Proceedings of the 15th International Conference
on Music Information Retrieval Late Breaking and Demo Papers (2014).
[16] Jure, L., Rocamora, M., Tarsitani, S. & Clayton, M. Iemp uruguayan candombe
(2025). URL osf.io/wfx7k.
[17] Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications
to Audio-to-MIDI Alignment and Matching. Ph.D. thesis, Columbia University,
USA (2016). URL https://doi.org/10.7916/D8N58MHV.
[18] Gillick, J., Roberts, A., Engel, J., Eck, D. & Bamman, D. Learning to groove
with inverse sequence transformations. In International Conference on Machine
Learning (ICML) (2019).
[19] Gómez-Marín, D., Jordà, S. & Herrera, P. Drum rhythm spaces: From
polyphonic similarity to generative maps. Journal of New Music Research 49,
438–456 (2020).
[20] Evans, N., Haki, B., Gómez-Marín, D. & Jordà, S. El bongosero: A
crowd-sourced symbolic dataset of improvised hand percussion rhythms paired with
drum patterns. In Kaneshiro, B. et al. (eds.) Proceedings of the 25th
International Society for Music Information Retrieval Conference, ISMIR 2024, San
Francisco, California, USA and Online, November 10-14, 2024, 540–546 (2024).
URL https://doi.org/10.5281/zenodo.14877393.
[21] Haki, B., Kotowski, B., Lee, C. L. I. & Jordà, S. Taptamdrum: A dataset
for dualized drum patterns. In Sarti, A. et al. (eds.) Proceedings of the 24th
International Society for Music Information Retrieval Conference, ISMIR 2023,
Milan, Italy, November 5-9, 2023, 114–120 (2023). URL
https://doi.org/10.5281/zenodo.10265237.
Appendix A
First Appendix
During our discussion around the validation of the model, we chose not to discuss
the KL Divergence loss during training and validation. In figure 27 we can see that
our original model is properly reducing the loss during training and validation.
(a) KL Divergence Training Loss (b) KL Divergence Test Loss
Figure 27: Kullback-Leibler Divergence for Training and Validation
In the Related Works section we introduced the transformer model. Because this
thesis is focused on the design aspect of our system, and transformer models have
been used substantially in the past few years, we chose not to go into too much
detail with our description. For more information on transformer models, one can read
Attention is All You Need (Vaswani et al. 2017) [3]. We can see the architecture of
the transformer model in figure 28.
Figure 28: Transformer model architecture [3]
We also chose not to go into too much detail with the Variational Autoencoder model,
for the same reasons as we gave for the transformer model. We include a figure of
the VAE architecture in figure 29. In his doctoral dissertation, Behzad Haki gives an
in-depth explanation of the VAE architecture and we refer the reader to his paper
[5].
Figure 29: Variational Autoencoder Architecture [5]
Appendix B
Second Appendix