Master in Sound and Music Computing
Universitat Pompeu Fabra

Freesound Loop Generator

Ada Salvador Avalos

Supervisor: Lonce Wyse
Co-Supervisors: Dmitry Bogdanov, Pablo Alonso

August 2025

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Structure of the Thesis
2 State of the Art
  2.1 Generative Models for Latent Space Manipulation
    2.1.1 Variational Autoencoders
    2.1.2 Transformer Models
    2.1.3 Alternative Approaches: Generative Adversarial Networks (GANs), Diffusion and Hybrid Models
3 Methods
  3.1 Dataset
    3.1.1 Data Preprocessing
  3.2 Exploratory Dataset Analysis
    3.2.1 Style Encoding
    3.2.2 Style Activation Distribution Analysis
    3.2.3 Style Classification
  3.3 Realtime Audio Variational AutoEncoder (RAVE)
    3.3.1 Architecture and Training Strategy
    3.3.2 Latent Space Compression Method
    3.3.3 Reconstruction from Compressed Latents
    3.3.4 Implementation of the Training Procedure
  3.4 Real-Time Neural Audio Synthesis and Morphing System
    3.4.1 Playback Manipulation
    3.4.2 Encoding
    3.4.3 Latent Interpolation
    3.4.4 Real-time Synthesis
    3.4.5 Decoding
  3.5 Evaluation Metrics
    3.5.1 Subjective Evaluation
  3.6 Experiments
    3.6.1 Conditioned Morpher Transformer Model
    3.6.2 Data Preprocessing
    3.6.3 Model Architecture
    3.6.4 Training Procedure
    3.6.5 Audio Morphing Inference System
    3.6.6 Evaluation Metrics
    3.6.7 Results
4 Results
  4.1 5-point Likert Scales
    4.1.1 Agreement Scale Responses (Q1–Q3)
    4.1.2 Intensity Scale Responses (Q4–Q5, Q8–Q9)
  4.2 Identification of Musical Aspects of each Dimension
  4.3 Perceived Control: Uniform vs. Per-Dimension Blending
5 Discussion
  5.1 Conclusion
  5.2 Future Work
List of Figures
List of Tables
Bibliography
A Questionnaire for Subjective Evaluation
  A.1 Linear Interpolation in Latent Space Using a Uniform Ratio Across All Dimensions
  A.2 Per-Dimension Ratio Control in Latent Space
  A.3 Exploring Dimensionality in Latent Space
B Model Architecture
C Materials for Reproducibility

Acknowledgement

First, I would like to express my gratitude to my supervisor and co-supervisors
for their guidance and advice throughout this project, for patiently answering my
questions, and for helping me overcome doubts that opened the way to new perspectives.
I am truly grateful for their availability and willingness to support me.

I would also like to thank my friends for their patience and understanding, for always
listening, and for offering support even when I wasn't around due to all the work
involved. I am truly grateful for their encouragement and presence throughout this
journey.

Finally, I am especially grateful to my family for their endless support, care, and
belief in me. Without them, I could never have made it this far.

Abstract

This project investigates the synthesis of new audio loops using neural networks,
focusing on creative sound generation through latent space manipulation. Using the
FreeSound Loop Dataset, audio samples were preprocessed through tempo normalization
as well as beat and downbeat alignment to ensure rhythmic consistency and
structural coherence, enabling musically relevant synthesis.

The system is built around a neural autoencoder, specifically the RAVE model,
which compresses audio loops into compact latent representations and reconstructs
them with high fidelity. New loops are generated by interpolating between two
encoded examples, producing smooth transitions and hybrid sounds that blend
characteristics from both sources. Genre classification guides latent space traversal,
supporting stylistic selection and controlled blending of sounds. Furthermore,
real-time manipulation of individual latent dimensions expands the system's potential for
interactive audio applications, such as live performances or dynamic sound design.

Subjective evaluation demonstrated that interpolated loops are perceptually
coherent and musically meaningful. Listeners reported that latent space trajectories
provide expressive and controllable tools for creative composition, and that varying
the number of latent dimensions influences musical richness without compromising
perceptual continuity.

By combining deep generative models, musically informed preprocessing, and
user-controllable synthesis techniques, this work presents a flexible framework for creative
loop-based audio generation, bridging high-quality synthesis with practical usability.

Keywords: Audio Generation; Neural Audio Synthesis; Latent Space Interpolation;
Interactive Sound Design; Loop-Based Music

Chapter 2
State of the Art

The field of deep learning has witnessed rapid advancements, particularly in
generative models. Traditional methods, such as variational autoencoders (VAEs) [3],
have demonstrated significant capabilities in modeling complex data distributions
through the learning of latent spaces. More recently, transformer-based
architectures [7] have redefined how latent spaces can be explored and manipulated, offering
good scalability and performance across various domains.

This chapter examines state-of-the-art methodologies relevant to latent space
manipulation and interpolation, with a particular focus on morphing techniques across
various generative models. Section 2.1 places special emphasis on variational
autoencoders (VAEs), particularly the RAVE model [12], and transformer architectures,
alongside generative adversarial networks (GANs) [4], diffusion models [6] [5], and
hybrid approaches. The analysis of how these models synthesize audio provides
insights into how architectural differences influence the quality and coherence of the
generated output.

2.1 Generative Models for Latent Space Manipulation

Morphing techniques enable smooth transitions between different data points in the
latent space, allowing for interpolation, attribute editing, and controlled
transformations. Different classes of generative models offer unique approaches to morphing.

2.1.1 Variational Autoencoders

Variational autoencoders (VAEs) [3] are generative models designed to learn a
probabilistic mapping from observed data to a structured latent space. Unlike
deterministic autoencoders, VAEs introduce stochasticity through a probabilistic
encoder-decoder framework, where the encoder maps input data to a distribution over latent
variables rather than a single point. This stochastic nature promotes robustness
and smoothness in the learned latent space, enabling meaningful interpolations and
traversals.

The latent space of a VAE is typically modeled as a Gaussian distribution, where the
encoder outputs parameters defining a mean and variance for each input. During
training, latent variables are sampled from these distributions, and the decoder
reconstructs the original input from the sampled points. The model is optimized
by maximizing a variational lower bound, which includes both a reconstruction loss
and a regularization term enforcing the latent space to approximate a predefined
prior distribution, usually a standard Gaussian.
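
The sampling step and the two terms of the variational lower bound can be sketched as follows. This is a minimal NumPy illustration, assuming a diagonal Gaussian posterior, a standard Gaussian prior, and a squared-error reconstruction term; the function names are hypothetical and no particular VAE implementation is implied.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def negative_elbo(x, x_hat, mu, log_var):
    """Squared-error reconstruction term plus KL regularizer, averaged over the batch."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))

# Toy encoder outputs for a batch of 4 inputs and 8 latent dimensions.
rng = np.random.default_rng(0)
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
z = reparameterize(mu, log_var, rng)
```

When the posterior already equals the prior (zero mean, unit variance), the KL term vanishes and only the reconstruction term contributes to the loss.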

Early research demonstrated that VAEs could morph between distinct data points
by linearly interpolating between their corresponding latent representations. Due to
the continuous and structured nature of the latent space, transitions between points
produce coherent, gradual transformations.
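
Linear interpolation between two latent codes can be sketched in a few lines. The vectors below are placeholders standing in for encoder outputs, and the helper names are hypothetical.

```python
import numpy as np

def lerp_latents(z_a, z_b, alpha):
    """Blend two latent vectors; alpha=0 returns z_a, alpha=1 returns z_b."""
    return (1.0 - alpha) * z_a + alpha * z_b

def morph_path(z_a, z_b, steps):
    """Evenly spaced latent points on the straight line from z_a to z_b."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([lerp_latents(z_a, z_b, a) for a in alphas])

# Placeholder latent codes; in practice these would come from an encoder.
z_a = np.zeros(4)
z_b = np.ones(4)
path = morph_path(z_a, z_b, steps=5)
```

Decoding each row of `path` in order would yield the gradual transformation described above.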

However, the standard VAE framework often produces entangled latent spaces,
where different generative factors are not well separated. This lack of
disentanglement makes morphing operations less interpretable and can cause undesired blending
of unrelated features during interpolation. Several approaches have been proposed
to address these limitations by improving the structure and interpretability of the
latent space. For example, hierarchical VAEs [13] introduce a multi-level latent
space architecture, where higher-level latent variables capture global structure
while lower-level variables capture fine details. This hierarchical organization allows
smoother and more coherent interpolations by modeling complex data distributions
more effectively. Furthermore, structured latent space models, such as Beta-VAEs
[14], modify the training objective to encourage disentanglement [15]. By
introducing a hyperparameter β to control the trade-off between reconstruction fidelity and
latent regularization, Beta-VAEs promote the separation of independent generative
factors. Such disentangled representations enable more interpretable and controlled
morphing operations.

While VAEs are well suited for generating smooth transitions within a structured
latent space, balancing reconstruction quality with meaningful latent representation
remains a fundamental challenge. Enforcing smoothness through strong
regularization can degrade reconstruction quality, leading to oversmoothed outputs that
lack detail. Conversely, relaxing the regularization can enhance fidelity but at the
expense of losing coherence in interpolations.

The RAVE model [12] overcomes several traditional limitations of VAEs by
combining adversarial regularization and a multi-resolution spectral reconstruction loss.
This enables high-fidelity and real-time audio generation, making RAVE
particularly well suited for interactive applications such as live audio loop creation and
manipulation.

2.1.2 Transformer Models

Transformer models [7] have revolutionized sequence modeling by employing
self-attention mechanisms to process input data in a parallelized manner, bypassing
the need for recurrent or convolutional structures. They learn contextual
relationships over long distances, making them particularly effective for tasks that require
preserving coherence across extended sequences. The self-attention mechanism
computes dependencies between all elements of an input sequence simultaneously,
allowing transformers to capture complex patterns and long-range interactions that
traditional architectures struggle to model. This capacity for understanding global
dependencies is especially valuable for morphing tasks, where smooth transitions
between diverse inputs are essential.
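
The way self-attention relates every element of a sequence to every other element can be illustrated with a single-head, scaled dot-product sketch. The projection matrices are random placeholders rather than learned weights, and the shapes are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (time, features) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise similarity of all time steps
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

# Placeholder sequence of 6 frames with 16 features, projected to 8 dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Because `weights` has one row per time step and one column per time step, every output frame can draw on the whole sequence at once, which is the property that supports long-range coherence.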

For example, LoopNet [16] employs attention mechanisms to capture temporal
dependencies within musical loops. It learns to recognize patterns and structures over
multiple time scales, facilitating the generation of coherent and continuous audio
loops. Wave-U-Net [17], originally designed as a convolutional model for audio
source separation, has been adapted with transformer architectures to enhance its
capacity for modeling long-range dependencies. This integration provides improved
control over generating structured audio outputs across extended sequences.
Similarly, MelNet [18], a model designed for generating high-fidelity audio spectrograms
using hierarchical architectures, supports the generation of structured outputs that
retain fine-grained details over extended durations.

Transformer-XL [19], a variant of the standard transformer, addresses the challenge
of generating long sequences by incorporating recurrence mechanisms. In typical
transformers, the attention mechanism is limited to a fixed-length context window,
which restricts the ability to capture long-range dependencies. Transformer-XL
overcomes this limitation by introducing a memory mechanism that allows the model
to retain and reuse hidden states from previous segments of input data across
different training steps. This enables the model to handle much longer sequences
efficiently, making it well suited for tasks like language modeling, where maintaining
long-term dependencies is crucial for coherent output. This ability to persist
memory over long distances is particularly useful when generating continuous data, such
as long audio sequences, where consistency and context preservation are essential.

In addition, AudioLM [20] is a hierarchical model designed for high-quality audio
generation. Unlike traditional models that directly generate audio waveforms or
spectrograms, AudioLM leverages a two-level hierarchical approach. It learns to
represent audio data at both a low (fine-grained) level and a high (abstract) level.
The low-level representations capture more granular features of audio, such as
acoustic properties, while the high-level representations capture broader patterns
like musical structures or speech context. The hierarchical nature of AudioLM
enables it to model both local and global dependencies within the data. This makes
it particularly powerful for tasks that require contextual accuracy, such as
generating speech or music, where understanding both the micro level (like phonemes or
notes) and the macro level (like sentence or melody structure) is critical. By learning at
multiple levels of abstraction, AudioLM can generate realistic audio that maintains
coherence over long durations.

Transformers offer a powerful framework for modeling complex sequential data,
particularly due to their ability to capture long-range temporal dependencies through
self-attention mechanisms. This makes them well suited for tasks requiring
structural coherence, such as audio loop synthesis and morphing. Although most
transformer-based systems in the literature focus on generation or translation tasks, there is
growing interest in exploring their potential for latent space traversal and style
conditioning. However, established architectures like VAEs currently dominate this
space in terms of coherence and fidelity.

2.1.3 Alternative Approaches: Generative Adversarial Networks (GANs), Diffusion and Hybrid Models

Generative Adversarial Networks

Generative adversarial networks (GANs) [4] have become a foundational tool for
generating realistic samples by training a generator against a discriminator in a
competitive framework. The generator aims to produce data indistinguishable from
the real dataset, while the discriminator attempts to discern between real and
generated samples. The learned latent space in GANs has been extensively studied for
its capability to support linear and nonlinear morphing techniques. Such techniques
include style transfer, attribute-based editing, and smooth transitions between
different data samples.

For example, conditional GANs (cGANs) [21] introduce additional conditioning
variables to guide the latent space exploration, giving users more control over the
generated outputs. These conditioning variables can include attributes such as tempo,
key, or genre, which can be specified to influence the generated audio or visual
content.

Early GAN architectures often suffered from training instability and mode collapse.
However, advancements such as StyleGAN [22] and StyleGAN2 [23] introduced
significant improvements. These models proposed a style-based generator architecture
where a mapping network transforms input vectors into intermediate latent codes.
This approach enabled hierarchical disentanglement of features and fine-grained
control over generated samples, enhancing the model's ability to perform smooth and
coherent morphing. Furthermore, StyleGAN3 [24] improved upon
its predecessors by addressing aliasing issues and ensuring better spatial consistency.
This version introduced Fourier features and an improved generator architecture,
allowing for smoother transitions and enhanced morphing quality by preserving
geometric consistency across interpolations.

Recent work by Hung et al. [25] explored loop morphing using StyleGAN,
StyleGAN2, and UNAGAN [26], showing that smooth transitions can be achieved across
different domains, including visual and auditory data. Their approach demonstrated
the effectiveness of GAN-based models in learning coherent, seamless
transformations between different latent representations. In particular, StyleGAN's mapping
network allows for non-linear control over attributes, which has been effectively
exploited in various morphing tasks. The ability to traverse the latent space in
controlled directions enables transitions between attributes, styles, or even domains.

For audio-based generation and morphing, several GAN architectures have been
employed successfully. UNAGAN demonstrated the feasibility of applying GANs to
sequence-to-sequence generation, showcasing robust morphing capabilities in speech
synthesis and music generation. Other notable models include Parallel WaveGAN [27]
and MelGAN [28], which have proven effective for high-quality audio waveform
generation. GANSynth [29], in particular, introduced a GAN-based approach to audio
synthesis that operates directly in the frequency domain, achieving high-quality, fast
generation of musical audio.

Even with recent advances such as pSp (pixel2style2pixel) [30] and e4e (Encoder
for Editing) [31], which enhance latent interpretability and enable more
meaningful manipulations in StyleGAN-generated images, GANs remain less flexible than
VAEs for structured latent interpolations. Their application to audio is still largely
unexplored, making them less suitable for real-time, controllable loop morphing.

Diffusion Models

Denoising diffusion probabilistic models (DDPMs) have recently gained
popularity for their high-quality generative capabilities, offering a different approach to
generative modeling compared to variational autoencoders (VAEs) and generative
adversarial networks (GANs). Unlike these models, diffusion models use an iterative
process to gradually transform noise into meaningful samples through a sequence of
denoising steps.

The foundational work by Sohl-Dickstein et al. [6] introduced the idea of using a
Markov chain to progressively add noise to data, effectively destroying its structure
over several time steps. By learning to reverse this process, a model can generate
realistic samples from pure noise. This formulation laid the groundwork for more
advanced diffusion models by defining a structured generative process governed by
probabilistic transitions. Further refinement came with the work of Ho et al. [5],
which introduced a simpler and more efficient training objective related to denoising
score matching. This approach improved training stability and generation
quality by optimizing the model to predict the noise added to the data at each time
step (equivalently, to recover the original data from its corrupted version), rather
than directly modeling the entire reverse process.
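
The forward noising process has a closed form, which the sketch below uses to corrupt a clean signal at an arbitrary time step. The linear noise schedule, the number of steps, and all function names are illustrative assumptions; the loss is the mean squared error between the added and predicted noise, the form in which the simplified objective of Ho et al. is usually written.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def simple_loss(eps, eps_pred):
    """Mean squared error between the true noise and a model's noise prediction."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = np.ones(8)                       # placeholder clean signal
x_t, eps = q_sample(x0, 999, alpha_bar, rng)
```

At the last step almost none of the original signal remains in `x_t`, which is why generation can start from pure noise.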

Recent diffusion-based models, such as Imagen [32], which employs a cascaded
diffusion process conditioned on text descriptions, and Stable Diffusion [33], which
introduces a latent diffusion approach wherein the generative process operates in
a compressed latent space rather than directly in pixel space, have demonstrated
highly efficient and scalable generation. These approaches not only achieve
state-of-the-art photorealistic image synthesis but also support controlled interpolation and
morphing through latent conditioning, text prompts, or attribute guidance.

The iterative nature of the denoising process allows for fine control over generated
samples, making diffusion models particularly effective for morphing tasks. During
interpolation, they can produce smooth and coherent transitions by operating at various
noise levels. This approach prevents abrupt or unrealistic transformations,
providing a structured mechanism for generating continuous morphs. Furthermore, the
flexibility of diffusion models allows them to be guided by external inputs (e.g., text
prompts or conditioning vectors), enabling controlled and interpretable
interpolations.

However, diffusion models are computationally demanding, often requiring hundreds
of iterative steps to generate a single sample. Although latent diffusion variants
reduce this burden to some extent, real-time or interactive applications, such as
loop morphing in performance contexts, remain impractical due to latency and
resource requirements.

Hybrid Approaches

Combining different generative model architectures leverages the strengths of each
to create models that are more capable of handling complex tasks, especially when
these tasks require both high-fidelity generation and coherent temporal structure.
Hybrid approaches typically involve combining two or more model types, such as
GANs, RNNs, transformers, and diffusion models, to improve overall performance.

One hybrid approach is to combine GANs, such as StyleGAN or StyleGAN2, with
RNNs or transformers to address different aspects of generative tasks. GANs are
typically effective for modeling high-quality structures such as images or audio textures.
RNNs and transformers, on the other hand, excel at capturing temporal
dependencies and sequential structure, making them ideal for tasks like modeling musical
rhythms, progressions, or coherence over time. For example, in "Style-conditioned
Music Generation with Transformer-GANs", Wang et al. [34] present
a music generation algorithm that creates compositions from scratch based on
specified target styles. It incorporates a style-conditioned linear transformer to model
MIDI event sequences and a style-conditioned patch discriminator within a GAN
framework to enhance the modeling of music sequences.

Some models, such as SpecDiff-GAN [35] and FastDiff 2 [36], suggest using GANs to
generate an initial structure or coarse representation of the data, followed by diffusion
models to refine it and add intricate details. This method leverages the GAN's ability
to capture the overall data distribution and the diffusion model's strength in modeling
complex, high-frequency components, resulting in high-fidelity outputs.

These hybrid methods demonstrate that integrating complementary architectures
can yield good results in generating complex audio loops, especially when aiming
to balance stylistic diversity with coherent temporal evolution. However, models
that combine elements of VAEs, GANs, and diffusion architectures remain complex
to train, resource-intensive, and often harder to interpret. The added architectural
overhead can also limit adaptability in interactive music-making scenarios, where
efficiency and transparent control are crucial.

Chapter 3
Methods

This chapter outlines the methodology employed in this work. Section 3.1
begins with a discussion of the chosen dataset, followed by the preprocessing steps
described in Subsection 3.1.1. Section 3.2 presents an exploratory data analysis,
emphasizing the style distribution of the subset used to train the model.
Subsequently, Section 3.3 describes the selected model for real-time audio synthesis and
its training implementation, while Section 3.4 covers its inference process, associated
applications, and modulation capabilities.

Finally, Section 3.5 outlines the evaluation metrics employed, and Section 3.6
concludes with an experimental study of a custom-built transformer model, including
its dedicated preprocessing pipeline, model architecture, training procedure,
inference process, evaluation metrics, and comparative results against the core system
model.

3.1 Dataset

The FreeSound Loop Dataset [37] was selected for its rich combination of
high-quality audio loops and detailed metadata annotations, making it well suited for
music analysis and generation tasks. The dataset comprises 9,455 loops collected
from the Freesound platform [38], each annotated with tempo (BPM), musical key,

3.2 Exploratory Dataset Analysis

The dataset used for training consisted of 5 hours of audio, corresponding to 1,125
audio files drawn from the preprocessed corpus described in Subsection 3.1.1. To
avoid ordering bias, the files were randomly shuffled prior to selection. This subset
was chosen to provide a representative sample of the larger dataset while maintaining
computational feasibility during model training.

To better understand the selected subset, an analysis was conducted focusing on the
distribution of music styles and the classification of audio files according to these
styles. This analysis made it possible to assess how balanced the subset was across
different sub-genres and to identify potential biases that could have influenced model
performance.

3.2.1 Style Encoding

The MAEST (Music Audio Efficient Spectrogram Transformer) model [40] was
employed to estimate the genre distribution and to classify the files automatically. It is
a transformer-based model designed for music tagging. Unlike conventional CNN-based
approaches, MAEST is optimized for short audio segments and supports multi-label
classification over hundreds of genres.

In the pipeline, the audio signal was first converted into a log-mel spectrogram,
using an FFT size of 1024, a hop length of 320, and 80 mel frequency bins. The
resulting spectrogram was then resized to match the input dimensions required by
the MAEST architecture.

The pre-trained checkpoint discogs-maest-5s-pw-129e was employed, which
performs genre predictions on 5-second audio segments and outputs probabilities across
400 musical styles. To obtain meaningful activation probabilities, the auxiliary
function predict_labels() was used; it applies a sigmoid activation and averages the
predictions across the time dimension.
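
The two operations attributed to predict_labels() above, a sigmoid followed by averaging over time, can be sketched generically. This is not the MAEST implementation: the helper name, the (segments, styles) layout of the logits, and the toy input are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Map raw logits to independent per-style probabilities (multi-label setting)."""
    return 1.0 / (1.0 + np.exp(-x))

def activations_from_logits(logits):
    """Average per-segment probabilities over time.

    logits: array of shape (n_segments, n_styles), an assumed layout.
    Returns one activation vector of shape (n_styles,) per audio file.
    """
    return sigmoid(logits).mean(axis=0)

# Toy input: 3 segments, 400 styles, all logits zero.
logits = np.zeros((3, 400))
style_vector = activations_from_logits(logits)
```

Because the sigmoid treats each style independently, the resulting 400 values do not need to sum to one, which suits multi-label tagging.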

The full 400-dimensional activation vector was used as the style representation to
analyze the distribution and classification of genres within the subset.

3.2.2 Style Activation Distribution Analysis

Median and IQR Analysis

To see how musical styles are represented in the dataset, the activation probabilities
of the 400 style categories were analyzed across all audio files. Each audio file
was associated with a 400-dimensional style probability tensor, representing the
likelihood of each style being present. The median activation probability and
the interquartile range (IQR) were computed for each style across the dataset.
Figure 1 displays the top 20 styles ranked by median activation, with error bars
indicating the IQR.

Figure 1: Median Activation of style probability vectors.

Electronic subgenres dominate the top ranks, with Electronic---Abstract,
Electronic---Experimental and Electronic---Glitch showing the highest median activations. This
suggests a strong presence of electronic textures and characteristics in the dataset.

The IQR values further reveal the variability in style activation. Some styles, such as
Electronic---Glitch, show high median values but also large IQRs, indicating that they
are often prominent but with substantial variation in strength across the dataset.
In contrast, styles like Rock---Goregrind exhibit a low median combined with a wide
IQR, implying a more sporadic and inconsistent activation pattern. Meanwhile,
styles such as Electronic---Industrial present both low median and low IQR values,
suggesting a small but consistent presence across the dataset.
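
The per-style median and IQR, and the ranking used for a top-20 plot, can be computed as sketched below. The activation matrix here is random placeholder data with the same shape as the one described (1,125 files by 400 styles); it is not the thesis data, and the helper names are hypothetical.

```python
import numpy as np

def median_and_iqr(activations):
    """Per-style median and interquartile range over files (rows)."""
    q1, med, q3 = np.percentile(activations, [25, 50, 75], axis=0)
    return med, q3 - q1

def top_styles(activations, k=20):
    """Indices of the k styles with the highest median, plus their median and IQR."""
    med, iqr = median_and_iqr(activations)
    order = np.argsort(med)[::-1][:k]
    return order, med[order], iqr[order]

# Placeholder activations: 1,125 files x 400 styles of uniform random values.
rng = np.random.default_rng(0)
activations = rng.random((1125, 400))
order, med, iqr = top_styles(activations, k=20)
```

A style with a high median and a large IQR is often active but to varying degrees, whereas a low median with a low IQR indicates a weak but steady presence.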

PCA Projection

To further explore the structure of the style representations in the dataset,
Principal Component Analysis (PCA) was applied to the 400-dimensional style
activation vectors. Each style representation was projected into a 2D space defined
by the first two principal components, which capture the most significant variance
in the style probability distributions.

Figure 2 visualizes this projection, with points color-coded by their predicted parent
genre. Parent genres were determined by averaging activation scores within
genre-specific subsets of the 400 style categories and assigning each file the genre with the
highest mean activation.

Figure 2: PCA projection of 400-dimensional style probabilities across audio files.
The first two principal components account for 36.14% and 19.76% of the total variance,
respectively. Colors indicate predicted parent genre.

The PCA projection reveals overlapping clusters but also some degree of
genre-specific separation. Notably:

• Electronic and Non-Music files are more widely spread, forming diffuse
  clusters, likely due to the diversity of styles within these categories.
• Stage & Screen and Reggae show tighter groupings, suggesting more
  consistent style activations across the files.
• Rock and Hip Hop appear more dispersed, though they exhibit localized
  tendencies, potentially reflecting hybrid or overlapping style characteristics.

This projection supports the idea that while many genres share stylistic similarities
(as evidenced by cluster overlap), the model captures enough stylistic variation to
organize files by genre in low-dimensional space.
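
A two-component PCA projection and the corresponding explained-variance ratios can be obtained from the SVD of the centered data, as in the sketch below. The input is random placeholder data rather than the style activations analyzed here, and the function name is hypothetical.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project rows of x onto the leading principal components.

    Returns the projected coordinates and the fraction of total variance
    explained by each retained component.
    """
    x_centered = x - x.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(x_centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return x_centered @ vt[:n_components].T, explained[:n_components]

# Placeholder data: 100 files x 400 style activations.
rng = np.random.default_rng(0)
x = rng.random((100, 400))
coords, ratio = pca_project(x)
```

Each row of `coords` gives the 2D position of one file, and `ratio` holds the share of variance captured by each axis, the quantity reported in a caption such as that of Figure 2.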

3.2.3 Style Classification

Each audio sample was represented with a style probability vector, with each element
corresponding to a sub-genre. These sub-genres were then grouped into broader
parent genres, as listed in Table 1. Classification was performed by calculating
the mean probability across sub-genres within each parent genre, ensuring fairness
across genres with differing numbers of styles.
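
The mean-based grouping can be sketched as follows. The sketch assumes that each style label carries its parent genre as a prefix separated by "---" (for example "Rock---Goregrind"); the label list, the probabilities, and the function names are illustrative only.

```python
import numpy as np

def parent_genre_scores(probs, labels, sep="---"):
    """Mean activation per parent genre, so large genres gain no advantage from size."""
    parents = [label.split(sep)[0] for label in labels]
    scores = {}
    for parent in sorted(set(parents)):
        idx = [i for i, p in enumerate(parents) if p == parent]
        scores[parent] = float(np.mean(probs[idx]))
    return scores

def classify(probs, labels):
    """Assign the parent genre with the highest mean sub-genre activation."""
    scores = parent_genre_scores(probs, labels)
    return max(scores, key=scores.get)

# Toy example with three styles (assumed label format).
labels = ["Electronic---Glitch", "Electronic---Abstract", "Rock---Goregrind"]
probs = np.array([0.2, 0.4, 0.5])
scores = parent_genre_scores(probs, labels)
genre = classify(probs, labels)
```

In this toy case, summing the activations would favor Electronic (0.6 against 0.5) simply because it has more sub-genres, whereas averaging selects Rock (0.5 against 0.3).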

Parent Genre              Sub-Genres
Electronic                106
Rock                      91
Latin                     35
Folk, World, & Country    27
Hip Hop                   26
Jazz                      25
Pop                       16
Funk / Soul               15
Classical                 13
Non-Music                 13
Blues                     12
Reggae                    11
Stage & Screen            4
Brass & Military          3
Children's Music          3

Table 1: Number of sub-genres (styles) associated with each parent genre.

Based on the classification procedure described above, Table 2 shows the distribution
of the 1,125 classified files across parent genres. Genres not listed had no assigned
samples and were removed.

Parent Genre         Number of files (%)
Electronic           195 (17.3%)
Rock                 107 (9.5%)
Hip Hop              127 (11.3%)
Funk / Soul          2 (0.2%)
Non-Music            590 (52.4%)
Reggae               47 (4.2%)
Stage & Screen       29 (2.6%)
Brass & Military     28 (2.5%)

Table 2: Number of files classified into each parent genre.

3.3 Realtime Audio Variational AutoEncoder (RAVE)

For the purpose of real-time audio manipulation through latent space control, the
Realtime Audio Variational AutoEncoder (RAVE) [12] was selected for its high
synthesis quality and low latency.

The subsections that follow provide a structured description of the RAVE model.
First, the overall architecture and the two-stage training procedure are presented.
This is followed by a discussion of the latent space compression strategies for effective
manipulation and of the reconstruction approach from reduced latent representations.
Finally, the specific training implementation used in this work is described.

3.3.1 Architecture and Training Strategy

The Realtime Audio Variational AutoEncoder (RAVE) is a deep generative model
that consists of an encoder that maps input audio spectrograms into a compact
latent vector, capturing perceptually relevant features, and a decoder that
reconstructs audio from this latent representation. To achieve efficient real-time
performance, RAVE employs a multi-band synthesis strategy, where the waveform is
decomposed into several sub-bands that are predicted in parallel and then recombined.
This architecture not only reduces computational complexity but also enables
high-quality, low-latency audio generation. The latent space is designed to be structured
and compressible, allowing for flexible audio manipulations such as interpolation
between samples and real-time modulation of individual latent dimensions.
RAVE is trained with a two-stage training procedure:
•Stage 1: Representation Learning
In this phase, the model is trained as a variational autoencoder (VAE), although it differs from standard VAE implementations by using the multiscale spectral distance [41]. This spectral distance is crucial for audio applications, as it avoids penalizing irrelevant phase variations, which a raw waveform L2 loss would penalize. The goal is to learn a latent representation that captures the perceptually relevant features of audio while being robust to phase differences.
•Stage 2: Adversarial Fine-Tuning
Once the encoder has learned a meaningful representation, it is frozen to preserve the latent space structure. The decoder continues training with a combination of three loss components:
– Adversarial loss: To fool the discriminator and improve the synthesis realism.
– Continued spectral distance loss: To maintain reconstruction fidelity.
– Feature matching loss [28]: To match discriminator feature maps between real and generated audio.
This multi-objective approach ensures that synthesis quality improves without compromising the stability and meaningfulness of the latent representation.
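The phase argument made for Stage 1 can be illustrated with a small NumPy sketch. It is not RAVE's implementation: the function names, the set of FFT sizes, the hop size, and the exact combination of a relative linear term and a log-magnitude term are assumptions chosen to mirror one common formulation of a multiscale spectral distance.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window, returned as (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multiscale_spectral_distance(x, y, scales=(2048, 1024, 512, 256), eps=1e-7):
    """Sum over FFT sizes of a relative linear term and a log-magnitude term.

    The scales and the 75% overlap are illustrative choices, not values from the text.
    """
    total = 0.0
    for n_fft in scales:
        sx = stft_magnitude(x, n_fft, n_fft // 4)
        sy = stft_magnitude(y, n_fft, n_fft // 4)
        linear = np.linalg.norm(sx - sy) / (np.linalg.norm(sx) + eps)
        log = np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
        total += linear + log
    return total

# A sine and its phase-inverted copy: large waveform L2 error, near-zero spectral distance.
sr = 16000
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 440 * t)
waveform_mse = np.mean((sine - (-sine)) ** 2)
spectral = multiscale_spectral_distance(sine, -sine)
```

Because only magnitudes enter the distance, the phase-inverted signal is treated as identical to the original, whereas the waveform error is maximal. This is the behavior the text attributes to the spectral loss.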
3.3.2 Latent Space Compression Method
RAVE addresses the challenge of identifying the most informative dimensions in the learned latent space to enable more effective manipulation and analysis. The goal is to find the minimal subset of dimensions in the latent vector $z$ that retains enough information for high-fidelity reconstruction.
The method uses post-training Singular Value Decomposition (SVD) to distinguish between informative and uninformative latent dimensions. However, applying SVD directly to samples $Z \in \mathbb{R}^{b \times d}$, where $b$ is the number of audio samples in the batch and $d$ is the latent dimensionality, would be problematic due to the high variance in collapsed dimensions that have converged to the prior.
To address this, a modified matrix $Z' \in \mathbb{R}^{b \times d}$ is constructed where each row represents the mode (most likely value) of the posterior distribution for sample $i$:
$$Z'_i = \arg\max_{z} \, q_\phi(z \mid x) \tag{3.9}$$
Where $z \sim q_\phi(z \mid x)$ denotes a latent vector $z$ sampled from the posterior distribution $q_\phi(z \mid x)$ of the encoder given an input audio $x$, parameterized by $\phi$. For Gaussian posteriors, the mode corresponds to the mean of the distribution.
The matrix $Z'$ is centered by subtracting the mean across samples, ensuring that collapsed dimensions (which have constant values) become zero after centering. SVD is applied to the centered matrix $Z'$:
$$Z' = U \Sigma V^{T} \tag{3.10}$$
Where $U \in \mathbb{R}^{b \times b}$ contains the left singular vectors, representing directions in sample space, $\Sigma \in \mathbb{R}^{b \times d}$ is a diagonal matrix of singular values, and $V \in \mathbb{R}^{d \times d}$ contains the right singular vectors, representing the directions in the latent space.
A fidelity parameter $f \in [0,1]$ determines the minimal number of dimensions $r$ to retain:
$$\frac{\sum_{i=1}^{r} \Sigma_{ii}}{\sum_{i=1}^{d} \Sigma_{ii}} \ge f \tag{3.11}$$
Where $\Sigma_{ii}$ refers to the i-th singular value in $\Sigma$, and indicates the contribution of each latent dimension to the reconstruction variance.
This allows any latent vector $z$ to be projected to a compact representation $z_r \in \mathbb{R}^{r}$, with $r$ the number of retained dimensions, containing only the most informative dimensions.
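The compression procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not RAVE's code: the function names and the synthetic test data are hypothetical, and the input is assumed to already contain the posterior means, one row per audio sample.

```python
import numpy as np

def fit_latent_compression(z_mean, fidelity=0.99):
    """Return (mean, V, r) for posterior means z_mean of shape (b, d).

    r is the smallest number of singular directions whose cumulative
    singular-value share reaches the requested fidelity.
    """
    mu = z_mean.mean(axis=0, keepdims=True)
    centered = z_mean - mu                      # collapsed (constant) dims become zero
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s) / np.sum(s)
    r = int(np.searchsorted(ratio, fidelity)) + 1
    return mu, vt.T, min(r, len(s))

def compress(z, mu, v, r):
    """Project latents onto the r most informative directions."""
    return ((z - mu) @ v)[:, :r]

# Synthetic latents: three informative dimensions, five collapsed ones.
rng = np.random.default_rng(0)
z = np.full((200, 8), 0.5)
z[:, :3] = rng.standard_normal((200, 3)) * np.array([5.0, 3.0, 0.5])
mu, v, r = fit_latent_compression(z, fidelity=0.99)
z_small = compress(z, mu, v, r)
```

In this toy setting the collapsed dimensions contribute no singular-value mass after centering, so only the three informative directions are retained at a fidelity of 0.99.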
3.3.3 Reconstruction from Compressed Latents
For reconstruction, the compressed latent $z_r$ is concatenated with random noise $\epsilon \sim \mathcal{N}(0, I)$ for the uninformative dimensions, forming a full latent vector $\tilde{z} = [z_r; \epsilon] V^{T}$, which is then passed through the decoder to generate audio. Here, $\epsilon$ is drawn from a multivariate normal distribution with zero mean and identity covariance, and $[z_r; \epsilon]$ indicates concatenation along the latent dimension.
The paper demonstrates that with $f = 0.99$, the latent dimensionality can be reduced from 128 to 24 dimensions for string music and 16 for speech while maintaining reconstruction quality.
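A minimal sketch of the reconstruction step, assuming an orthogonal matrix of right singular vectors is available: the compressed latent is padded with standard normal noise and rotated back with the transpose of V. The function name is hypothetical, and adding back the mean removed during centering is left out for brevity.

```python
import numpy as np

def expand_latent(z_small, v, rng):
    """Pad compressed latents (b, r) with N(0, I) noise and rotate back to (b, d)."""
    b, r = z_small.shape
    d = v.shape[0]
    eps = rng.standard_normal((b, d - r))          # noise for the uninformative dimensions
    return np.concatenate([z_small, eps], axis=1) @ v.T

# Illustrative orthogonal V obtained from a QR decomposition (not a trained model).
rng = np.random.default_rng(0)
v, _ = np.linalg.qr(rng.standard_normal((8, 8)))
z_small = rng.standard_normal((4, 3))
z_full = expand_latent(z_small, v, rng)
```

Because V is orthogonal, projecting the expanded latent back onto the first directions returns the compressed latent unchanged; only the discarded directions are filled with noise.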
3.3.4 Implementation of the Training Procedure
RAVE was trained for a total of 2 million steps following the two-stage procedure described in Subsection 3.3.1. The first 1 million steps corresponded to representation learning, while the second 1 million steps corresponded to adversarial fine-tuning. The v2 architecture was employed, which is an improved continuous model optimized for faster and higher quality generation.
Several additional configurations were applied during training. The causal setting forced the model to rely exclusively on past waveform samples, which is essential in real-time scenarios as it reduces the perceived latency, although at the expense of reconstruction quality. The noise setting introduced a noise synthesizer in the decoder, which improves modeling of sounds containing significant noisy components. Single-channel audio input was also used to ensure consistent processing across the dataset.
After training, the model was exported as a TorchScript file for deployment. The --streaming option was enabled, which activates cached convolutions and ensures compatibility with real-time audio processing.
3.6 Experiments
3.6.1 Conditioned Morpher Transformer Model
In addition to utilizing the RAVE model, a custom transformer model was developed to perform audio morphing while incorporating conditioning on style and BPM (beats per minute) as the quantitative representation of musical tempo.
3.6.2 Data Preprocessing
The audio preprocessing pipeline consisted of four main stages: audio normalization, tempo adjustment, genre classification, and neural audio coding. Raw audio files from the FreeSound Loop Dataset were processed to create suitable standardized representations for the model.
Audio Normalization and Standardization
All input audio files were standardized to a consistent format with the following specifications:
•Sample rate: $f_s$ = 44.1 kHz
•Duration: $T$ = 5 seconds
•Channels: Mono (stereo files were converted by averaging channels)
•Target samples: $N = T \times f_s$ = 220,500 samples
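The specifications above translate into a short standardization routine. The sketch below is illustrative only: the function name and the (channels, samples) array layout are assumptions, and the audio is assumed to have already been resampled to 44.1 kHz.

```python
import numpy as np

SAMPLE_RATE = 44_100                           # f_s in Hz
DURATION_S = 5                                 # T in seconds
TARGET_SAMPLES = SAMPLE_RATE * DURATION_S      # N = T x f_s = 220,500

def standardize(audio):
    """Convert to mono by channel averaging, then truncate or zero-pad to N samples.

    audio: array of shape (samples,) or (channels, samples), already at 44.1 kHz.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=0)             # stereo -> mono
    if len(audio) >= TARGET_SAMPLES:
        return audio[:TARGET_SAMPLES]
    return np.pad(audio, (0, TARGET_SAMPLES - len(audio)))

stereo = np.ones((2, 300_000), dtype=np.float32)
stereo[1] = 3.0
mono = standardize(stereo)
```

Zero-padding short files is one possible choice; looping the material to fill the five seconds would be an alternative for loop-based audio.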
Tempo Normalization
To ensure rhythmic consistency across the dataset, an optional tempo normalization procedure targeting $\mathrm{BPM}_{\text{target}} = 120$ was implemented, as described in Subsection 3.1.1. However, this normalization was not applied during model training.
Genre Classification and Style Encoding
The style embedding was extracted using the MAEST model [40], as described in Subsection 3.2.1 for classification purposes. Specifically, the pre-trained checkpoint discogs-maest-5s-pw-129e was used to obtain the full 400-dimensional activation vectors for the whole dataset.
Neural Audio Coding
To compress and represent audio in a discrete latent space, the Descript Audio Codec (DAC) [42] was employed, a state-of-the-art neural audio codec designed for high-fidelity audio reconstruction. DAC encodes audio signals into compact sequences of discrete tokens through a fully convolutional encoder-decoder architecture, followed by Residual Vector Quantization for efficient discretization.
The 44 kHz DAC model with GPU-based encoding was used to ensure fast and uniform token generation across the dataset.
3.6.3 Model Architecture
The model employs a transformer-based encoder-decoder architecture with separate FiLM conditioning layers for style and BPM [43]. Within the encoder, audio tokens are processed through multiple codebook embeddings, with FiLM layers modulating features according to style and BPM. Representations from both source and target inputs are then linearly interpolated. Finally, the decoder generates output logits using cross-attention to the interpolated representations, incorporating the same dual conditioning mechanisms.
Figure 3: Overview of the model. Source ($s$) and target ($t$) inputs ($x$) are encoded ($E$) into latent space representations and decoded ($D$) into output ($\hat{y}$). Source ($z_s$) and target ($z_t$) latent representations are interpolated using $\alpha \in [0,1]$ to produce $z_{\text{morph}}$. Conditioning vectors include $c_s = [b_s, \sigma_s]$, $c_t = [b_t, \sigma_t]$, and $c_{\text{decoder}} = [b_{\text{decoder}}, \sigma_{\text{decoder}}]$, with $c_{\text{decoder}} \in \{c_{\text{custom}}, c_t\}$, where $\sigma$ and $b$ represent the style and BPM vectors, respectively.
A detailed diagram is provided in Figure 8 in Appendix B.
Input Stage
The model takes two audio sequences for morphing, corresponding to the source and target domains. Each domain consists of audio codes $x \in \mathbb{Z}_{+}^{B \times K \times L}$ from $K = 9$ codebooks, a style vector $\sigma \in \mathbb{R}_{+}^{B \times 400}$, and a BPM scalar $b \in \mathbb{R}_{+}^{B \times 1}$, where $B$ is the batch size and $L$ is the maximum sequence length.
Embedding Stage
Audio tokens from each codebook are processed through separate embedding layers (mapping from $V = 1024$, the vocabulary size, to $d_{\text{model}} = 64$ dimensions), then mean-pooled across codebooks and enhanced with positional encoding. The conditioning information (style and BPM) is processed through separate embedding networks:
•Style conditioning: Linear projection from 400 to 64 dimensions
•BPM conditioning: MLP with Linear → LayerNorm → ReLU → Linear → LayerNorm, which transforms the scalar BPM to a 64-dimensional embedding.
Feature-wise Linear Modulation (FiLM)
The model employs separate FiLM layers for style and BPM conditioning, allowing independent modulation of features. Given input embedded features $x_{\text{emb}} \in \mathbb{R}^{B \times L \times d_{\text{model}}}$, style condition $c_\sigma \in \mathbb{R}^{B \times d_{\text{model}}}$ and BPM condition $c_b \in \mathbb{R}^{B \times d_{\text{model}}}$, each FiLM layer computes:
$$\mathrm{FiLM}(x_{\text{emb}}, c) = \gamma(c) \odot x_{\text{emb}} + \beta(c) \tag{3.14}$$
Where $\gamma(c)$ and $\beta(c)$ are learned MLP projections (Linear → ReLU → Linear) of the conditioning vector, and $\odot$ denotes element-wise multiplication.
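The combined FiLM operation applies both style and BPM modulations additively:
$$x_{\text{out}} = x_{\text{emb}} + \left(\mathrm{FiLM}_\sigma(x_{\text{emb}}, c_\sigma) - x_{\text{emb}}\right) + \left(\mathrm{FiLM}_b(x_{\text{emb}}, c_b) - x_{\text{emb}}\right) \tag{3.15}$$
This preserves both conditioning signals while maintaining the original feature structure.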
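The additive combination in Equation 3.15 can be checked numerically. In the sketch below the scale and shift arrays stand in for the outputs of the learned projections; the function names and the random inputs are illustrative, not taken from the model code.

```python
import numpy as np

def film(x, gamma, beta):
    """FiLM(x, c) = gamma(c) * x + beta(c); gamma and beta (B, d) broadcast over L."""
    return gamma[:, None, :] * x + beta[:, None, :]

def dual_film(x, gamma_style, beta_style, gamma_bpm, beta_bpm):
    """Add the style and BPM modulation residuals to the input features."""
    style_delta = film(x, gamma_style, beta_style) - x
    bpm_delta = film(x, gamma_bpm, beta_bpm) - x
    return x + style_delta + bpm_delta

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 4))                       # (B, L, d_model)
g_s, b_s, g_b, b_b = (rng.standard_normal((2, 4)) for _ in range(4))
out = dual_film(x, g_s, b_s, g_b, b_b)

# Expanding the sum gives a single equivalent affine map.
closed_form = (g_s + g_b - 1.0)[:, None, :] * x + (b_s + b_b)[:, None, :]
```

Expanding the equation shows that the dual operation is itself one affine modulation with scale equal to the sum of both scales minus one and shift equal to the sum of both shifts. With both scales at one and both shifts at zero the features pass through unchanged.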
Encoder Stage
Six FiLM-conditioned transformer blocks encode each sequence. Each block contains masked multi-head self-attention (8 heads, $d_k = 16$; masked to ignore padding positions), followed by Add & Norm, dual FiLM modulation (as described in Subsection 3.6.3), a feed-forward network (Linear → ReLU → Linear), another Add & Norm, and a second dual FiLM layer. The output then passes through a latent projector (Linear → LayerNorm → ReLU → Linear), producing a latent representation of shape $[B, L, d_{\text{model}}]$.
Latent Interpolation
Source and target latents are interpolated in the latent space:
$$z_{\text{morph}} = (1 - \alpha) \cdot z_s + \alpha \cdot z_t \tag{3.16}$$
Where $\alpha \in [0,1]$ controls the morphing ratio, and $z_s$ and $z_t$ denote the source and target latents, respectively.
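As a small illustration of Equation 3.16, the sketch below interpolates two latent arrays with NumPy. The function name is hypothetical. Passing one ratio per latent dimension instead of a scalar is shown only as an illustration of broadcasting; the equation itself uses a single ratio.

```python
import numpy as np

def morph(z_s, z_t, alpha):
    """Linear interpolation: alpha = 0 returns the source, alpha = 1 the target."""
    alpha = np.asarray(alpha, dtype=float)
    return (1.0 - alpha) * z_s + alpha * z_t

z_s = np.zeros((2, 3))
z_t = np.ones((2, 3))
midpoint = morph(z_s, z_t, 0.5)                            # uniform ratio
per_dim = morph(z_s, z_t, np.array([0.0, 0.5, 1.0]))       # one ratio per dimension
```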
Decoder Stage
Six FiLM-conditioned transformer blocks decode the morphed latent using target conditioning. Each decoder block follows a structured pipeline:
1. Masked multi-head self-attention is applied first, followed by Add & Norm and the first dual FiLM conditioning layer, which modulates features using both style and BPM conditions.
2. The block performs multi-head cross-attention to the encoded memory, applies another Add & Norm operation, and introduces a second dual FiLM conditioning stage.
3. Finally, the features pass through a feed-forward network and a third Add & Norm step, and are processed by a third dual FiLM conditioning layer, ensuring comprehensive style and BPM modulation throughout the decoding.
The decoder input is zero-initialized with a shape $[B, L_t, d_{\text{model}}]$, where $L_t$ is the target sequence length, and enhanced with positional encoding before passing through the FiLM-conditioned layers, which attend to the morphed latent representation. Nine separate heads, one per codebook, compute logits independently, each of shape $[B, L_t, V]$.
3.6.4 Training Procedure
Dataset Construction and Data Loading
A paired dataset was constructed from DAC-encoded audio loops. Each training sample consisted of a source-target pair $(x_s, x_t)$, where $x_s$ and $x_t$ were different encoded audio loops, which enables the model to learn morphing transformations between different musical sequences.
The dataset loader was implemented with robust handling of variable-length sequences through dynamic padding and sequence length tracking. For each batch, sequences were padded to the maximum length within the batch, and attention masks were computed to ensure proper handling of padded positions during training.
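Each sample contained:
•Source and target DAC codes: $x_s, x_t \in \mathbb{Z}_{+}^{K \times L}$
•Style probability vectors: $\sigma_s, \sigma_t \in \mathbb{R}^{400}$
•BPM values: $b_s, b_t \in \mathbb{R}$
•Actual sequence lengths: $l_s, l_t \in \mathbb{N}$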
Curriculum Learning Strategy
To achieve more stability during training, a curriculum learning strategy was implemented to progressively increase the complexity of morphing ratios over time:
$$\alpha_{\text{curriculum}}(e) = \begin{cases} \{0.0, 1.0\} & \text{if } e/E < 0.3 \\ \{0.0, 0.25, 0.75, 1.0\} & \text{if } 0.3 \le e/E < 0.6 \\ \mathcal{U}(0,1) & \text{if } e/E \ge 0.6 \end{cases} \tag{3.17}$$
where $e$ denotes the current epoch, $E$ the total number of epochs, and $\mathcal{U}(0,1)$ is the uniform distribution over $[0,1]$.
This curriculum begins with extreme values of $\alpha$ (i.e., pure source or target reconstruction), then introduces intermediate ratios, and finally explores the full interpolation space.
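The three curriculum phases can be written as a small sampling function. This is a sketch of Equation 3.17 using only the standard library: the function name is hypothetical, and drawing uniformly from the discrete sets in the first two phases is an assumption, since the text does not state how values are picked from them.

```python
import random

def sample_alpha(epoch, total_epochs, rng=random):
    """Sample a morphing ratio according to training progress e / E."""
    progress = epoch / total_epochs
    if progress < 0.3:
        return rng.choice([0.0, 1.0])                     # endpoints only
    if progress < 0.6:
        return rng.choice([0.0, 0.25, 0.75, 1.0])         # adds intermediate ratios
    return rng.uniform(0.0, 1.0)                          # full interpolation range

rng = random.Random(0)
early = sample_alpha(10, 300, rng)      # e/E about 0.03, first phase
middle = sample_alpha(120, 300, rng)    # e/E = 0.4, second phase
late = sample_alpha(250, 300, rng)      # e/E about 0.83, third phase
```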
Loss Function Design
Training was conducted using a simplified morphing loss function that directly supervises the model to reconstruct the target sequence, regardless of the morphing ratio:
$$\mathcal{L}_{\text{morph}} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\text{CE}}(\hat{y}_k, x_{t,k}) \tag{3.18}$$
Where $K = 9$ denotes the number of codebooks, $\hat{y}_k$ represents the predicted logits for codebook $k$, and $x_{t,k}$ are the target sequence tokens for codebook $k$. The cross-entropy loss $\mathcal{L}_{\text{CE}}$ is computed with masking to handle variable-length sequences:
$$\mathcal{L}_{\text{CE}}(\hat{y}_k, x_{t,k}) = -\frac{1}{|M|} \sum_{\tau \in M} \log p\left(x_{t,k}^{\tau} \mid \hat{y}_k^{\tau}\right) \tag{3.19}$$
where $M$ represents the set of valid (non-padded) time steps $\tau$ based on the target sequence length.
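Equations 3.18 and 3.19 can be sketched with NumPy as follows. This is an illustrative re-implementation, not the training code: the function names are hypothetical, and the valid set of time steps is taken jointly over the batch from the target lengths.

```python
import numpy as np

def masked_cross_entropy(logits, targets, lengths):
    """Cross-entropy averaged over valid time steps only.

    logits: (B, L, V) floats; targets: (B, L) ints; lengths: (B,) valid steps per item.
    """
    _, seq_len, _ = logits.shape
    shifted = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    mask = np.arange(seq_len)[None, :] < lengths[:, None]          # True on valid steps
    return -(picked * mask).sum() / mask.sum()

def morph_loss(logits_per_codebook, targets_per_codebook, lengths):
    """Average the masked cross-entropy over the K codebooks."""
    losses = [masked_cross_entropy(lg, tg, lengths)
              for lg, tg in zip(logits_per_codebook, targets_per_codebook)]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
B, L, V, K = 2, 6, 1024, 9
lengths = np.array([6, 3])                                         # second item is padded
targets = [rng.integers(0, V, size=(B, L)) for _ in range(K)]
uniform_logits = [np.zeros((B, L, V)) for _ in range(K)]
uniform_loss = morph_loss(uniform_logits, targets, lengths)
```

With all-zero logits every token is equally likely, so the loss equals the logarithm of the vocabulary size, which gives a convenient sanity check for an untrained model.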
The model learns to decode the interpolated representation, as described in Subsection 3.6.3, into coherent audio sequences through consistent supervision against the target sequence. The morphing behavior emerges implicitly from the latent space interpolation and the FiLM-based conditioning mechanism.
Training Configuration
The model was trained on 2010 audio files using the AdamW optimizer and the following hyperparameters:
Parameter                      Value
Learning rate                  $1 \times 10^{-4}$
Weight decay                   $1 \times 10^{-5}$
Batch size                     4
Gradient accumulation steps    8
Effective batch size           32
Maximum epochs                 300
Gradient clipping              0.5
Table 3: Training hyperparameters
The OneCycleLR scheduler was employed with a peak learning rate reached at 10% of total training steps. Gradient accumulation was applied to simulate larger batch sizes (4 × 8 = 32 samples per parameter update) while maintaining memory efficiency.
Training Procedure
The model was trained using mini-batches of paired source-target examples, following the curriculum learning strategy for interpolation ratios (see Subsection 3.6.4). For each batch, source and target sequences were independently encoded, interpolated in the latent space according to the sampled ratio $\alpha$, and decoded under target conditioning. A morphing loss $\mathcal{L}_{\text{morph}}$ (see Subsection 3.6.4) was then computed against target tokens, averaged across codebooks, with masking to handle variable-length sequences. Optimization used gradient accumulation over $N_{\text{accum}}$ steps and clipping for stability. The full training loop is summarized in Algorithm 1.
Algorithm 1 Training Loop
for each batch (x_s, x_t) in training set do
    Sample interpolation factor: α ∼ curriculum(e)
    Encode source: z_s ← Encoder(x_s, σ_s, b_s)
    Encode target: z_t ← Encoder(x_t, σ_t, b_t)
    Interpolate representations: z_morph ← (1 − α) z_s + α z_t
    Use target conditioning directly: c_decoder ← (σ_t, b_t)
    Decode output: ŷ ← Decoder(z_morph, c_decoder)
    Compute loss: L ← L_morph(ŷ, x_t)
    Normalize loss: L ← L / N_accum and backpropagate
    if Step % N_accum = 0 then
        Clip gradient by norm (max = 0.5), update parameters, reset gradients
    end if
end for
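The accumulation and clipping logic of the training loop can be isolated in a toy example. The sketch below replaces the transformer with a two-parameter least-squares model and plain gradient descent, so everything except the accumulation count (8) and the clipping threshold (0.5) is an illustrative assumption.

```python
import numpy as np

def clip_by_norm(grad, max_norm=0.5):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

def train(batches, w, lr=0.1, n_accum=8, max_norm=0.5):
    """Gradient descent with accumulation over n_accum mini-batches."""
    accumulated = np.zeros_like(w)
    for step, (x, y) in enumerate(batches, start=1):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)    # gradient of the mean squared error
        accumulated += grad / n_accum              # same effect as scaling the loss by 1/N_accum
        if step % n_accum == 0:
            w = w - lr * clip_by_norm(accumulated, max_norm)
            accumulated = np.zeros_like(w)         # reset gradients
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
batches = []
for _ in range(2400):                              # 300 parameter updates of 8 x 4 samples
    x = rng.standard_normal((4, 2))
    batches.append((x, x @ w_true))
w_fit = train(batches, np.zeros(2))
```

Dividing each mini-batch gradient by the accumulation count makes the accumulated value the average over the effective batch, so the update has the same scale as a single step on 32 samples.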
Validation and Early Stopping
The model was evaluated across multiple fixed morphing ratios $\alpha \in \{0.0, 0.25, 0.5, 0.75, 1.0\}$ to assess reconstruction quality at the endpoints $\alpha = 0.0$ and $\alpha = 1.0$, interpolation smoothness across intermediate values, and morphing effectiveness at the midpoint $\alpha = 0.5$.
Early stopping was triggered if validation loss did not improve for 15 consecutive epochs, preventing overfitting while ensuring convergence.
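The early stopping rule can be expressed as a small helper. The class below is a hypothetical sketch: only the patience of 15 epochs comes from the text, and treating any strict decrease of the validation loss as an improvement is an assumption.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=15)
history = [1.0, 0.9] + [0.95] * 15           # improvement stalls after the second epoch
stop_flags = [stopper.step(loss) for loss in history]
```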
Model Checkpointing
During training, the best model based on validation loss was saved, along with periodic checkpoints every 10 epochs, a final model state at the end of training, and curriculum epoch metadata for resuming training.
This strategy supported training resumption and helped model selection based on morphing performance across different interpolation ratios.
3.6.5 Audio Morphing Inference System
The inference system implements a neural audio morphing framework that operates on Descript Audio Codec (DAC) representations. It enables controlled interpolation between audio loops using the developed Conditioned Morpher Transformer model, allowing control over musical style, BPM, and structural characteristics.
Input Processing Pipeline
The inference system applies the same preprocessing pipeline used for training data preparation to ensure consistency between training and inference phases.
Audio Preprocessing Given source and target audio files, the system applies preprocessing to ensure consistent format and temporal alignment, including resampling and truncating or padding.
DAC Encoding The preprocessed audio is encoded using the Descript Audio Codec to obtain quantized representations suitable for the model.
Style and Tempo Feature Extraction The inference system applies the same style and tempo feature extraction methods described in Subsection 3.6.2. Musical style features are extracted using the MAEST model to generate 400-dimensional style activation tensors, while tempo information is obtained using Essentia's RhythmExtractor2013 algorithm. These features are computed for both source and target audio files to provide the conditioning information required for the morphing process.
Neural Morphing Architecture
Conditioned Morpher Transformer model (CMT) At inference, the model takes source and target DAC codes, style and BPM tensors, and a morphing ratio. Optional custom style and BPM can also be provided, which are applied directly without interpolation. The CMT model generates logits for each codebook independently.
4.1.1 Agreement Scale Responses (Q1–Q3)
Responses to the Agreement Scale, which assessed perceptual balance, coherence, and usability of the morphs, are presented in Table 5.
Question  Main response distribution                                 Median  Mode
Q1        5.8% Neutral/Unsure, 59.3% Agree, 34.9% Strongly Agree      4.0     4
Q2        7.0% Neutral/Unsure, 47.7% Agree, 45.3% Strongly Agree      4.0     4
Q3        62.8% Agree, 37.2% Strongly Agree                           4.0     4
Table 5: Agreement Scale Responses (Q1–Q3).
Participants expressed high levels of agreement across all three questions. The midpoint of the morph (50% blend) was generally perceived as a balanced combination of both original loops (Q1; 59.3% agreed, 34.9% strongly agreed, median = 4.0). Interpolated transitions were rated as musically coherent, with changes in rhythm, timbre, and structure perceived as making musical sense (Q2; 47.7% agreed, 45.3% strongly agreed, median = 4.0). Finally, the morphs were considered musically usable for applications such as seamless transitions, genre blending or timbre transformations (Q3; 62.8% agreed, 37.2% strongly agreed, median = 4.0).
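The reported medians and modes can be recovered from the percentage distributions alone. The helper below is a hypothetical sketch that assumes the usual 1 to 5 coding of the scale (Neutral/Unsure = 3, Agree = 4, Strongly Agree = 5); it ignores the case where the cumulative share is exactly 50%, where the median would fall between two scores.

```python
def likert_summary(distribution):
    """Return (median, mode) for a {score: percentage} distribution."""
    total = sum(distribution.values())
    cumulative = 0.0
    median = None
    for score in sorted(distribution):
        cumulative += distribution[score]
        if median is None and cumulative >= total / 2:
            median = score                      # first score reaching half of the responses
    mode = max(distribution, key=distribution.get)
    return median, mode

q1 = likert_summary({3: 5.8, 4: 59.3, 5: 34.9})
q2 = likert_summary({3: 7.0, 4: 47.7, 5: 45.3})
q3 = likert_summary({4: 62.8, 5: 37.2})
```

For all three questions more than half of the responses are reached at score 4, which matches the median and mode of 4 reported in Table 5.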
The distribution of participant responses to the Agreement Scale questions is visualized in Figure 4.
Figure 4: Percentage distributions of Likert-scale responses to Agreement Scale questions (Q1–Q3)
4.1.2 Intensity Scale Responses (Q4–Q5, Q8–Q9)
The Intensity Scale, which measured how strongly participants perceived specific qualities or changes related to the morphing process, is presented in Table 6.
Question  Main response distribution              Median  Mode
Q4        64.0% Strongly, 36.0% Very strongly     4.0     4
Q5        68.6% Strongly, 31.4% Very strongly     4.0     4
Q8        52.3% Not at all, 47.7% Slightly        1.0     1
Q9        58.1% Strongly, 41.9% Very strongly     4.0     4
Table 6: Intensity Scale responses (Q4–Q5, Q8–Q9).
Participants consistently perceived the morphing as gradual (Q4; 64.0% strongly, 36.0% very strongly, median = 4.0). They also reported that the morph appeared to target specific musical features such as rhythm, timbre and texture (Q5; 68.6% strongly, 31.4% very strongly, median = 4.0). In contrast, changes in the number of latent dimensions were judged to have little impact on the smoothness or continuity of the morph (Q8; 52.3% not at all, 47.7% slightly, median = 1.0). However, variations in latent dimensions were perceived to strongly affect the expressiveness of the morph (Q9; 58.1% strongly, 41.9% very strongly, median = 4.0).
The distribution of participant responses to the Intensity Scale questions is visualized in Figure 5.
Figure 5: Percentage distributions of Likert-scale responses to Intensity Scale questions (Q4–Q5, Q8–Q9)
4.2 Identification of Musical Aspects for each Dimension
Question 6 assessed which musical aspects (such as rhythm, timbre, and structure) participants perceived as most affected during the morphing transitions. The distribution of responses across the five dimensions is reported in Table 7.
Musical Feature          Dim 1        Dim 2        Dim 3        Dim 4        Dim 5
Rhythm                   20.5% (84)   40.7% (81)   48.8% (78)   0.8% (1)     –
Timbre                   20.8% (85)   12.6% (25)   50.6% (81)   57.9% (77)   –
Pitch/Harmony            20.5% (84)   0.5% (1)     –            39.8% (53)   –
Structure/Arrangement    19.6% (80)   17.1% (34)   –            –            –
Texture/Layering         18.6% (76)   29.1% (58)   –            1.5% (2)     –
No noticeable change     –            –            0.6% (1)     –            100% (85)
Table 7: Distribution of participant responses across different dimensions and musical features, showing both percentages and counts.
Overall, responses revealed distinct tendencies depending on the latent dimension. For Dimension 1, participants reported changes distributed fairly evenly across all features (≈20% each). Dimension 2 emphasized rhythm (40.7%) and texture/layering (29.1%). Dimension 3 was dominated by rhythm (48.8%) and timbre (50.6%). Dimension 4 showed strong emphasis on timbre (57.9%) and pitch/harmony (39.8%). Finally, Dimension 5 was consistently perceived as having no noticeable change (100.0%).
The distribution of responses per dimension is visualized in Figure 6.
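Since Question 6 allowed multiple choices, the percentages in Table 7 appear to be shares of all selections made for a given dimension rather than shares of participants. This reading is an inference from the counts, not something stated in the text. The sketch below recomputes the shares from the counts in Table 7 under that assumption.

```python
counts = {
    "Dim 1": {"Rhythm": 84, "Timbre": 85, "Pitch/Harmony": 84,
              "Structure/Arrangement": 80, "Texture/Layering": 76},
    "Dim 2": {"Rhythm": 81, "Timbre": 25, "Pitch/Harmony": 1,
              "Structure/Arrangement": 34, "Texture/Layering": 58},
    "Dim 3": {"Rhythm": 78, "Timbre": 81, "No noticeable change": 1},
    "Dim 4": {"Rhythm": 1, "Timbre": 77, "Pitch/Harmony": 53, "Texture/Layering": 2},
    "Dim 5": {"No noticeable change": 85},
}

def selection_shares(dim_counts):
    """Percentage of all selections in one dimension, rounded to one decimal."""
    total = sum(dim_counts.values())
    return {feature: round(100 * n / total, 1) for feature, n in dim_counts.items()}

shares = {dim: selection_shares(c) for dim, c in counts.items()}
totals = {dim: sum(c.values()) for dim, c in counts.items()}
```

Under this reading the number of selections differs strongly between dimensions (409 for Dimension 1 against 85 for Dimension 5), which is worth keeping in mind when comparing percentages across columns.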
Figure 6: Participant-identified musical features most affected across dimensions (Q6).
4.3 Perceived Control: Uniform vs. Per-Dimension Blending
Question 7 evaluated whether participants perceived per-dimension blending as providing greater precision and control compared to uniform blending.
The distribution of responses is shown in Figure 7.
Figure 7: Participant responses on perceived precision and control of per-dimension blending (Q7).
Participants unanimously reported that using a per-dimension ratio felt more precise and controlled than using a uniform ratio.
Chapter 5
Discussion
The findings from the subjective evaluation provide insight into the perceptual and musical validity of latent space interpolation for loop-based audio synthesis.
With respect to perceptual blend and musical coherence (RQ1), participants consistently rated the interpolated loops as perceptually meaningful and musically coherent. The midpoint morphs were perceived as balanced combinations of both source loops, and transitions were judged musically coherent, with changes in different musical aspects perceived as intentional and musically logical. Furthermore, analysis across latent dimensions revealed that specific dimensions corresponded to identifiable musical features, reinforcing the interpretability and musical validity of latent space morphing.
In terms of usability and control (RQ2), listeners reported that the interpolations were musically usable within compositional or performance contexts. The gradual transformations were reliably perceived and appeared to convey stylistic characteristics associated with different genres. Per-dimension blending was consistently judged as more precise and controllable than uniform blending. These observations suggest that latent space trajectories can function as expressive tools for style fusion and creative manipulation.
Regarding latent space fidelity (RQ3), varying the fidelity of the latent representations was found to affect expressiveness more than perceptual continuity. Although transitions remained smooth even at lower fidelities, higher fidelity settings were consistently associated with greater richness and musical detail, highlighting a trade-off between compactness and perceptual richness in generative loop synthesis.
Taken together, these results demonstrate that latent interpolation not only yields perceptually coherent blends but also affords practical usability in creative audio contexts. At the same time, they reveal that model fidelity can significantly shape the qualitative character of the outputs, making it an important parameter for both the design and the evaluation of generative systems.
5.1 Conclusion
This thesis developed a generative audio system for creative loop manipulation, demonstrating how latent space interpolation can be used to synthesize, transform, and blend audio loops. The system enables perceptually coherent and musically meaningful transitions, supporting applications such as mashups, stylistic blending and audio morphing.
The evaluation confirmed that latent space interpolation offers perceptually coherent and musically meaningful transformations, showing its value as a tool for creative loop manipulation. Rather than focusing solely on synthesis quality, the system emphasizes controllability and expressiveness, aligning with the needs of musicians and producers in loop-based workflows.
An additional exploratory experiment with a Transformer model conditioned on style and tempo (CMT) further highlighted the challenges of balancing controllability with synthesis quality. While style conditioning showed potential, tempo conditioning was less effective, and objective evaluation revealed that its acoustic and perceptual quality, as well as its ability to represent musical structure, did not match those of the RAVE-based system.
Overall, this work confirms that neural generative models, when combined with latent space manipulation strategies, offer a practical and expressive approach to creative loop synthesis and transformation.
5.2 Future Work
Future directions include the development of a plugin implementation of the system, which would make the tool accessible in common digital audio workstation (DAW) environments. This would allow musicians and producers to explore latent space interpolation directly within their workflows.
In parallel, improvements to the Conditioned Morpher Transformer model will be pursued, in particular refining the conditioning mechanisms for tempo and investigating alternative strategies to enhance both controllability and competitive performance. A more comprehensive evaluation, including subjective listening studies, will also be necessary to assess the perceptual impact of conditioning and to better understand its implications for creative applications.
List of Figures
1 Median Activation of style probability vectors.
2 PCA projection of 400-dimensional style probabilities across audio files. The first principal components account for 36.14% and 19.76% of the total variance, respectively. Colors indicate predicted parent genre.
3 Overview of the model. Source ($s$) and target ($t$) inputs ($x$) are encoded ($E$) into latent space representations and decoded ($D$) into output ($\hat{y}$). Source ($z_s$) and target ($z_t$) latent representations are interpolated using $\alpha \in [0,1]$ to produce $z_{\text{morph}}$. Conditioning vectors include $c_s = [b_s, \sigma_s]$, $c_t = [b_t, \sigma_t]$, and $c_{\text{decoder}} = [b_{\text{decoder}}, \sigma_{\text{decoder}}]$, with $c_{\text{decoder}} \in \{c_{\text{custom}}, c_t\}$, where $\sigma$ and $b$ represent the style and BPM vectors, respectively.
4 Percentage distributions of Likert-scale responses to Agreement Scale questions (Q1–Q3)
5 Percentage distributions of Likert-scale responses to Intensity Scale questions (Q4–Q5, Q8–Q9)
6 Participant-identified musical features most affected across dimensions (Q6).
7 Participant responses on perceived precision and control of per-dimension blending (Q7).
8 Model architecture
List of Tables
1 Number of sub-genres (styles) associated with each parent genre.
2 Number of files classified into each parent genre.
3 Training hyperparameters
4 FAD Evaluation Results: Comparison between RAVE model and the CMT model
5 Agreement Scale Responses (Q1–Q3).
6 Intensity Scale responses (Q4–Q5, Q8–Q9).
7 Distribution of participant responses across different dimensions and musical features, showing both percentages and counts.
Appendix A. Questionnaire for Subjective Evaluation
1. Does the midpoint of the morph (50% blend) sound like a perceptually balanced combination of both original loops?
Scale: Strongly Disagree, Disagree, Neutral / Unsure, Agree, Strongly Agree
2. Do the interpolated transitions feel musically coherent (e.g., do changes in rhythm, timbre, and structure make musical sense and feel intentional)?
Scale: Strongly Disagree, Disagree, Neutral / Unsure, Agree, Strongly Agree
3. Does the interpolation between the two loops sound musically usable (for example, could it be used as a seamless transition, genre blend, or timbre transformation within a musical piece)?
Scale: Strongly Disagree, Disagree, Neutral / Unsure, Agree, Strongly Agree
4. Can you clearly perceive a gradual transformation in sound as the morph progresses from one loop (in a specific genre) to another?
Scale: Not at all, Slightly, Moderately, Strongly, Very Strongly
A.2 Per-Dimension Ratio Control in Latent Space
In the second part of the demonstration video, titled "Per-Dimension Ratio Control in Latent Space", participants observed how each dimension was individually morphed within the Max/MSP environment. These dimensions were identifiable as the five separate wires coming out of the encoder objects after linear interpolation, leading into the decoder.
Participants were then asked to answer Questions 5-7 regarding the use of per-dimension ratio control:
5. Do the changes you hear during the morph seem to target specific musical features (e.g., rhythm, timbre, texture)?
Scale: Not at all, Slightly, Moderately, Strongly, Very Strongly
6. Can you identify which musical aspects (e.g., rhythm, timbre, structure) were most affected in the transition?
Multiple choice for each dimension:
•Rhythm: e.g., changes in beat, tempo, or groove
•Timbre: e.g., the "color" or tone quality (bright, dark, buzzy, etc.)
•Pitch / Harmony: e.g., melody shape, harmonic feel
•Structure / Arrangement: e.g., buildup, breakdown, or change in loop form
•Texture / Layering: e.g., thickness, number of layers or instruments
•No noticeable change
7. Compared to uniform blending, did per-dimension blending feel more precise and controlled?
Scale: Less controlled, About the same, More controlled
A.3 Exploring Dimensionality in Latent Space
In the final part of the demonstration video, titled "Exploring Dimensionality in Latent Space", participants observed how changing the number of dimensions affects the decoded audio.
Participants were then asked to answer Questions 8 and 9 based on their observations:
8. When fewer or more latent dimensions are used (i.e., different fidelity settings), how strongly do you notice an effect on the smoothness or continuity of the audio morphing between sounds?
Scale: Not at all, Slightly, Moderately, Strongly, Very Strongly
9. When fewer or more latent dimensions are used, how strongly do you notice an effect on the musical expressiveness or richness of the morphing?
Scale: Not at all, Slightly, Moderately, Strongly, Very Strongly
Appendix B
Model Architecture
This appendix presents a detailed diagram of the model architecture. The transformer-based encoder-decoder structure, the FiLM-based conditioning mechanism for style and BPM, and the feature interpolation process, including the layers in each block, are illustrated. Figure 8 provides a schematic representation of the architecture, showing the flow of data from input to output and the application of dual conditioning.
Figure 8: Model architecture
Appendix C
Materials for Reproducibility
All code, demonstration videos, and materials necessary to reproduce the core RAVE-based morphing system are available in the project repository: https://github.com/AdaSalvadorAvalos/freesound-loop-generator.
This repository includes:
•All scripts used for the ML pipeline
•A pre-trained model checkpoint
•The interactive Max/MSP patches
•The demonstration video used in the subjective evaluation
•A notebook for analyzing the evaluation results
Additionally, a separate repository for the experimental approach is available at https://github.com/AdaSalvadorAvalos/conditioned-morpher-transformer-model, which contains:
•All scripts used for the ML pipeline
•A pre-trained model checkpoint
•Objective evaluation comparing the RAVE-based model with the developed model
•The web interface