Master in Sound and Music Computing
Universitat Pompeu Fabra
Semantic Control Over Neurally Synthesized Audio via Latent Disentanglement
Jed Padoa
Supervisor: Lonce Wyse
Co-Supervisor: Frederic Font
August 2025
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Structure of the Report
2 Background
2.1 Audio Representations
2.2 Latent representations
2.2.1 Interpretability
2.3 Audio embeddings
2.3.1 Contrastive Learning
3 VAEs, GANs, and RAVE
3.1 Variational Autoencoders
3.1.1 Marginal Likelihood
3.1.2 KL-Divergence
3.1.3 Reparameterization Trick
3.1.4 Conditioning VAEs
3.2 Generative Adversarial Networks
3.2.1 Architecture
3.2.2 Training
3.3 RAVE
3.3.1 Architecture
3.3.2 Training
4 Experiment
4.1 Experiment design
4.1.1 Dataset
4.1.2 Embedding and Attribute Computation
4.1.3 Audio Representation
4.1.4 Model Architecture
5 Results
5.1 Attribute Computation
5.1.1 Speed
5.1.2 Surface Material Attributes
5.1.3 Plots
5.2 Semantic Fader Model
5.2.1 Reconstruction Fidelity
5.2.2 Quantitative Evaluation
5.2.3 Perceptual Evaluation
6 Discussion
6.1 Challenges Faced
6.1.1 Posterior Collapse
6.1.2 Inconsistent Footstep Tempo
6.2 Conclusions
6.3 Future Work
List of Figures
List of Tables
Bibliography
Abstract
Advances in deep generative models have made it possible to synthesize high-fidelity audio, yet giving users precise, continuous control over the semantic qualities of the generated sound remains challenging. This thesis tackles the problem by combining variational auto-encoders (VAEs) with a latent disentanglement strategy inspired by Fader Networks [1][2]. In this model the encoder learns a compressed latent space invariant to desired control attributes, allowing for precise control over said attributes before the latent representation is passed to the decoder via a "fader"-like mechanism. Attributes are computed via a learned linear regression coefficient trained in a supervised manner on continuous attribute labels derived from a synthetic footstep sound effects dataset. During training, adversarial and reconstruction losses encourage orthogonality between the latent codes and control attributes, ensuring that adjusting one attribute adjusts only the desired content while leaving the rest unchanged. This thesis details the state of the art and core concepts behind the methods used before diving into the details of the implementation. Finally, the results will be presented and discussed.
Acknowledgement
I would like to express my sincere gratitude to my supervisors Lonce Wyse and Frederic Font as well as the many teachers and colleagues at the MTG who provided me with wisdom and guidance throughout my time here.
Chapter 1
Introduction
1.1 Motivation
Recent advancements in generative AI have led to a remarkable surge in creative capabilities, enabling the generation of highly realistic images, text, and audio. These models, trained on vast datasets, can produce novel content that is often indistinguishable from human-created works. In the audio domain, this has resulted in models that can synthesize music in various styles, generate realistic human speech, and create a wide array of sound effects. This progress has opened up exciting possibilities for content creation, music production, and accessibility tools, demonstrating a powerful ability to learn and replicate complex patterns from data.
Despite these impressive capabilities, a significant challenge remains: the lack of fine-grained, intuitive control over the generated output. While users can often guide generation with text prompts or high-level descriptions, controlling specific, continuous attributes, particularly semantic ones like the emotional intensity of a voice or the constituent materials of a sound effect, is often difficult or impossible. The generation process can feel unintuitive, and minor changes to the input can lead to unpredictable or undesirable changes in the output. This gap between the models' generative power and our ability to direct it with precision highlights a crucial area of improvement in generative audio systems.
Chapter 2. Background
it difficult to objectively assess whether a model has learned a useful structure. Even if a latent space is interpretable, navigating it to create meaningful transformations is not straightforward. The most obvious approach, linear interpolation between two points in the latent space, often fails to produce a perceptually smooth transition, due to uncertainty as to whether or not the path moves through realistic outputs. For audio, this can result in a series of muffled, noisy, or perceptually implausible sounds during a transformation. This problem highlights that the geometry of the latent space is rarely simple or understandable. To generate a sequence of valid intermediate samples in the latent space of a VAE, for example, one must find a path that remains within high-probability regions of the learned distribution.
Two common approaches to latent space disentanglement for generative models include β-VAE and InfoGAN. The β-VAE introduces a hyperparameter β to the standard VAE objective, encouraging the latent distribution to closely match a factorized prior and thus disentangle independent factors of variation. By penalizing the Kullback-Leibler (KL) divergence more heavily, β-VAEs can learn latent spaces where single dimensions correspond to interpretable attributes. However, if the β term is too large the increased interpretability may come at the cost of reduced reconstruction fidelity [4]. InfoGAN, on the other hand, augments the GAN objective with an information-theoretic regularizer to maximize mutual information between a subset of the latent space and the data, enabling the recovery of interpretable, disentangled factors in an unsupervised manner [5].
Building on these techniques, FactorVAE explicitly encourages independence among latent dimensions by penalizing the "total correlation," a statistical measure of redundancy among random variables. FactorVAE achieves a superior trade-off between disentanglement and reconstruction quality compared to β-VAE, but requires an auxiliary discriminator to estimate total correlation, introducing additional complexity to training [6]. Further innovations such as β-TCVAE and DIP-VAE propose refined regularizations of posteriors, addressing some of the stability and complexity issues present in earlier models. Recent work also questions the reliability of disentanglement metrics, highlighting that even high-scoring models may not guarantee meaningful or consistent control over generative factors, particularly in complex domains like audio [5]. Thus, while current approaches like β-VAE, InfoGAN, and FactorVAE have significantly advanced the disentanglement of latent spaces, open challenges remain in balancing interpretability, generative quality, and reliable evaluation.
Fader Networks
Fader Networks aim for controllable generation by learning a latent space where dimensions correspond to semantically meaningful attributes. The concept involves training a generative model such that traversing a specific latent space direction or dimension results in a continuous, predictable change in a particular output attribute (e.g., brightness, style, timbre) while others remain unaffected [1]. This is achieved through an adversarial approach whereby a discriminator network attempts to predict certain features by inspecting the latent space itself. The goal of the model's encoder is to learn a representation invariant to these features, meaning the encoder ignores them, in order to fool the discriminator. The true feature values are then appended to the latent space, and decoded to produce outputs. This adversarial aspect allows the user to pinpoint exactly which dimension corresponds to the desired control attributes, allowing for controllable generation with regard to the chosen features [1]. F-RAVE, an extension of the generative audio model RAVE, builds upon this concept by learning a mapping between continuous, human-understandable audio descriptors (e.g., spectral centroid for brightness, RMS for loudness) and the latent space of a generative model [2]. This enables finer, more intuitive control over generated sound by directly manipulating these descriptors. The idea behind Fader Networks builds on the broader goal of creating a well-organized latent space where different data aspects are encoded in a disentangled way. Fader Networks offer a promising approach to making generative audio models more controllable and user-friendly by providing intuitive ways to manipulate specific sound characteristics. Their development highlights the increasing demand for more interpretable and controllable generative models in audio, moving beyond realistic sound generation to precise artistic expression and manipulation.
2.3 Audio embeddings
Audio embedding is a technique in machine learning that transforms complex audio signals into compact, low-dimensional numerical vector representations [7]. Raw audio, whether as a time-domain waveform or a high-resolution spectrogram, is inherently high-dimensional and difficult for many algorithms to process directly. The primary goal of an embedding is to distill the perceptually and structurally important characteristics of a sound (such as its timbral qualities, pitch contours, and rhythmic patterns) into a dense and meaningful representation [8]. This not only makes the data more computationally tractable but also creates a structured feature space where sounds with similar acoustic properties are positioned closely together, enabling tasks like similarity-based retrieval, classification, and manipulation for creative applications. These embeddings are most commonly learned using deep neural networks, particularly models with an autoencoder architecture. In this framework, an encoder network learns to compress an input audio segment into a compact latent vector (the embedding itself) while a corresponding decoder network learns to reconstruct the original audio from only this vector.
Other popular approaches learn embeddings through different objectives. Self-supervised models like Wav2Vec 2.0 operate directly on raw waveforms, learning rich features by predicting masked or missing portions of the audio signal. Another common strategy involves transfer learning, where models like VGGish first convert audio into a spectrogram (a visual representation) and then process this image with a convolutional neural network originally designed for computer vision tasks [9]. Finally, multimodal models such as CLAP use contrastive learning to create a shared embedding space between audio and text, learning to align sounds with their corresponding descriptions [10].
2.3.1 Contrastive Learning
Contrastive learning is a machine learning paradigm in which a model learns to distinguish between similar and dissimilar data points. The core idea is to learn an embedding space where "positive pairs" (similar items) are brought closer together, while "negative pairs" (dissimilar items) are pushed further apart. For any given data point, referred to as the "anchor," a positive sample is a related item (e.g., an augmented version of the same image), and negative samples are all other items in a training batch [8]. By optimizing a contrastive loss function, the model learns to produce representations that cluster semantically similar items without needing explicit, human-provided labels for every single class.
OpenAI's CLIP (Contrastive Language-Image Pre-training) model applies this principle to bridge the gap between vision and text. CLIP utilizes a dual-encoder architecture: one encoder processes images, and another processes text descriptions. During its pre-training phase, the model is fed a massive dataset of (image, text) pairs collected from the internet. For each pair, the image and its corresponding text form a positive pair, while the image and the text from all other pairs in the batch are treated as negative pairs. The model's objective is to maximize the cosine similarity of the embeddings of the correct image-text pairs while minimizing it for all incorrect pairs. This process effectively aligns the two modalities into a single, shared embedding space [11].
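As an illustration of the objective described above, the following NumPy sketch computes a symmetric contrastive loss over a batch of paired embeddings. It is not CLIP's actual implementation: the function name `clip_style_loss`, the fixed temperature of 0.07, and the random stand-in embeddings are all assumptions made for this example.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    Row i of image_emb and row i of text_emb are assumed to be a positive pair;
    every other row in the batch acts as a negative.
    """
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) scaled cosine similarities

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # the correct match sits on the diagonal

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))                             # hypothetical batch of embeddings
aligned = clip_style_loss(emb, emb)                        # each pair perfectly matched
shuffled = clip_style_loss(emb, np.roll(emb, 1, axis=0))   # each pair deliberately mismatched
```

The loss is small when matched pairs are the most similar entries in their row and column, and large when the pairing is scrambled, which is the behaviour the training objective rewards.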
This same contrastive methodology has been successfully extended to the audio domain with models like CLAP (Contrastive Language-Audio Pre-training). CLAP learns a joint embedding space that aligns sounds with their corresponding text descriptions. By training on pairs of audio clips and their textual metadata, CLAP can perform zero-shot audio classification, identifying sounds based on natural language queries (e.g., "a dog barking" or "a car horn"). This cross-modal capability is crucial for applications like audio retrieval, where users can search vast sound libraries using descriptive text rather than other audio examples [10].
Chapter 3
VAEs, GANs, and RAVE
This chapter will introduce and explain Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and the RAVE model.
3.1 Variational Autoencoders
Variational Autoencoders (VAEs) constitute a class of generative neural networks that transcend conventional autoencoder limitations by incorporating probabilistic modeling principles. Unlike traditional autoencoders that produce deterministic latent representations, VAEs encode input data as probability distributions within a continuous latent space, enabling the generation of novel data samples that maintain structural fidelity to the training distribution. This probabilistic framework, originally formulated by Kingma and Welling in 2013, revolutionized generative modeling by providing a mathematically principled approach to generation while somewhat preserving the interpretability and controllability of the learned representations [12].
VAEs consist of an encoder network that maps input observations to distributional parameters (mean and variance) in a latent space, a stochastic sampling mechanism facilitated by the reparameterization trick, and a decoder network that reconstructs data from sampled latent vectors. The model optimizes a composite objective function that balances reconstruction fidelity against regularization constraints, specifically minimizing the Kullback-Leibler divergence between the learned posterior distribution and a specified prior distribution. This dual optimization creates a structured latent manifold where semantically similar inputs cluster together, facilitating meaningful interpolation and controlled generation through latent space manipulation [12].
VAEs demonstrate exceptional utility in research applications requiring both data compression and generation capabilities, offering superior training stability compared to adversarial approaches while maintaining theoretical grounding in variational inference principles. The structured latent space facilitates rigorous analysis of model behavior while the generative capabilities enable comprehensive evaluation across diverse scenarios.
3.1.1 Marginal Likelihood
The mathematical foundation of VAEs rests on the intractable marginal likelihood problem. Given observed data $x$ and latent variables $z$, we seek to maximize the marginal likelihood $p_\theta(x) = \int p_\theta(x|z)\,p_\theta(z)\,dz$, where $\theta$ represents the parameters of our generative model. Since this integral is computationally intractable for machine learning models, VAEs introduce an approximate posterior $q_\phi(z|x)$ parameterized by $\phi$, typically implemented as a neural network encoder. The key insight lies in deriving a tractable lower bound on the log marginal likelihood through Jensen's inequality, yielding the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right)$$
This formula decomposes the objective into two components: a reconstruction term that ensures the decoder can accurately reconstruct inputs from latent representations, and a regularization term that constrains the approximate posterior to align with the specified prior distribution [12].
3.1.2 KL-Divergence
The Kullback-Leibler divergence serves as a fundamental measure that quantifies the dissimilarity between two probability distributions, functioning as a crucial regularization mechanism in VAEs that keeps the approximate posterior close to the prior and ensures meaningful latent space structure [12]. Mathematically defined as $D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$ for continuous distributions, the KL divergence is asymmetric and always non-negative, reaching zero only when the distributions are identical. In the VAE framework, $D_{KL}(q_\phi(z|x) \,\|\, p(z))$ specifically measures how the learned approximate posterior deviates from the specified prior distribution, typically a standard multivariate Gaussian $\mathcal{N}(0, I)$. When the posterior is a diagonal Gaussian with mean $\mu$ and variance $\sigma^2$ and the prior is $\mathcal{N}(0, I)$, this divergence admits a closed-form solution: $D_{KL} = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\right)$, where $J$ represents the latent dimensionality [12]. This regularization term prevents the encoder from learning arbitrarily complex posterior distributions that would overfit to the training data, instead encouraging the latent space to maintain continuity and structure essential for interpolation and generation. The balance between reconstruction accuracy and KL regularization creates a natural trade-off that shapes the learned representations: higher KL weights promote more structured but potentially less expressive latent spaces and, at the extreme, risk posterior collapse where latent variables become uninformative, while lower weights weaken the regularization and risk an irregular latent space.
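The closed form for a diagonal Gaussian posterior against a standard normal prior, $D_{KL} = -\frac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$, can be checked numerically with a few lines of standard-library Python. This is a minimal sketch; the function name `gaussian_kl` and the log-variance parameterization are assumptions chosen for the example, not details taken from the thesis code.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

# A posterior identical to the prior has zero divergence.
kl_zero = gaussian_kl([0.0, 0.0], [0.0, 0.0])

# Shifting the mean by 1 in one dimension (unit variance) costs 0.5 nats.
kl_shifted = gaussian_kl([1.0, 0.0], [0.0, 0.0])
```

The two cases illustrate the properties stated above: the divergence vanishes only when the distributions coincide and is positive otherwise.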
3.1.3 Reparameterization Trick
The critical innovation enabling end-to-end optimization lies in the reparameterization trick, which transforms the stochastic sampling operation into a differentiable function. Rather than directly sampling from $q_\phi(z|x)$, VAEs reparameterize the latent variable as
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$ represents auxiliary noise independent of the model parameters, and $\odot$ denotes element-wise multiplication. This transformation allows gradients to flow through the sampling process, enabling standard backpropagation while maintaining the stochastic nature essential for regularization and generation [12]. The mathematical elegance of this approach provides VAEs with superior training stability compared to adversarial methods while preserving the theoretical guarantees of variational inference, making them particularly suitable for research applications requiring both interpretable latent representations and reliable generative capabilities.
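The sampling form $z = \mu + \sigma \odot \epsilon$ can be sketched in plain Python as follows. The example only illustrates how the randomness is isolated in $\epsilon$ so that $z$ is a deterministic function of $\mu$ and $\sigma$; it does not perform automatic differentiation. The helper name `reparameterize` and the log-variance input are assumptions for illustration.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps, with eps ~ N(0, 1) drawn independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)                                     # seeded for repeatability
mu = [2.0, -1.0]
log_var = [math.log(0.25), math.log(4.0)]                  # sigma = 0.5 and 2.0

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
mean0 = sum(s[0] for s in samples) / len(samples)          # should be close to mu[0]
mean1 = sum(s[1] for s in samples) / len(samples)          # should be close to mu[1]
```

In an autodiff framework the same expression lets gradients reach the encoder outputs through `mu` and `log_var`, because the noise draw does not depend on any model parameter.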
Figure 1: A visual depiction of a typical VAE architecture. The encoder takes the input data and compresses it into a latent representation, typically via a series of convolutional layers. The decoder then decompresses this representation and seeks to produce a perceptually identical output.
3.1.4 Conditioning VAEs
Conditional Variational Autoencoders (cVAEs) extend standard VAEs by introducing side information or labels (often denoted as $y$) into both the encoder and decoder networks, enabling controllable and structured generation. As described in a detailed academic overview of conditioning in VAEs [13], the conditioning variable can represent a wide range of semantic concepts such as class, style, or any structured language-based attribute you wish the model to learn and generate.
There are two principal approaches to incorporating conditioning: assuming the latent variable $z$ and the conditioning variable $y$ are independent, or making them dependent.
In the independent case, the prior of $z$ and $y$ factorizes as $p(z, y) = p(z)p(y)$, which promotes disentangled representations. This is desirable for tasks where you want $z$ to capture "everything except" the information in $y$, making the generative model more controllable (e.g. Fader Networks) [13].
Alternatively, the dependent case factorizes the joint prior as $p(z, y) = p(z|y)p(y)$, using a conditional prior on $z$. This often results in a more expressive model but with less independent control over each source of variation.
The cVAE's objective function encourages the learned posterior $q(z|x, y)$ to approximate the specified (potentially conditional) prior, while the decoder $p(x|z, y)$ reconstructs the data from both $z$ and $y$.
Tuning the regularization term (i.e., the weight on the KL divergence) is critical: if it's too low, the latent variable may "leak" information about $y$, making the conditioning ineffective; if it's too high, the model's generation quality degrades due to the forced independence or matching to a too-simple prior.
The discussion also highlights that the independent cVAE formulation is particularly suitable for controllable generation (like swapping semantic attributes), while the dependent variant is often more powerful for conditional data modeling. Ultimately, the chosen conditioning strategy depends on the desired trade-off between interpretability/controllability and the quality of the generation [13].
3.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs) introduce a method of deep learning that pits two neural networks against each other. The approach mimics an adversarial game between a counterfeiter and a detective, with one network generating synthetic data while the other tries to detect it [14].
The framework operates on a simple principle. The generator network creates synthetic samples that resemble real training data, and the discriminator network evaluates these samples and determines their authenticity. Through this competition, both networks improve their performance [14]. The generator becomes better at creating realistic data, while the discriminator becomes more skilled at detection.
This adversarial training process produces remarkable results. GANs can generate images, text, audio, and other data types that humans often cannot distinguish from real examples. The applications span numerous fields, from computer vision to natural language processing. The mathematical foundation builds on game theory. The generator and discriminator engage in a minimax game, where each network tries to optimize its objective function. The generator minimizes the discriminator's ability to classify its outputs as fake. The discriminator maximizes its classification accuracy between real and generated samples.
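For reference, the game described above can be written down using the original GAN formulation from [14], in which the discriminator outputs a probability that its input is real. The sketch below evaluates the two losses for scalar discriminator outputs; it is illustrative only, the function names are hypothetical, and it should not be read as the loss used by RAVE or by the model in this thesis.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative log-likelihood of classifying a real sample as real and a fake as fake."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Minimax generator objective: minimize log(1 - D(G(z)))."""
    return math.log(1.0 - d_fake)

# When the discriminator cannot tell real from fake, it outputs 0.5 for both.
d_at_equilibrium = discriminator_loss(0.5, 0.5)   # equals 2 * log(2)

# The generator's loss decreases as the discriminator is fooled more strongly.
g_fooling = generator_loss(0.9)
g_caught = generator_loss(0.1)
```

The opposing signs of the `log(1 - d_fake)` term in the two functions are what make the objectives coupled: an improvement for one network is a penalty for the other.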
3.2.1 Architecture
The GAN architecture for audio generation employs the same fundamental adversarial principle as image generation but adapts to the unique challenges of temporal data. The generator network typically receives random noise as input and transforms it into coherent audio sequences, though this differs in the RAVE approach as will be discussed later.
The discriminator architecture often mirrors the generator design but in reverse. Raw waveform discriminators use one-dimensional convolutions with stride to reduce temporal dimensions while extracting increasingly abstract features as data propagates through neural network layers. The network processes audio sequences through multiple layers, each detecting patterns at different time scales. After producing a compressed, high-level feature representation through this process, the final layer completes the classification task by producing a single probability that indicates whether the input represents real or generated audio.
3.2.2 Training
During each training step, the two networks optimise separate but coupled loss functions that form a minimax game. The discriminator loss is defined as:
Chapter 4. Experiment
4.1.1 Dataset
To facilitate the training of a controllable generative system, a specialized dataset was created consisting of 4 hours of synthesized footstep audio clips. This dataset was purpose-built to provide clean, continuous attribute labels essential to concept-aware vectors for generalizable attribute detection in footstep audio samples [15]. Each audio clip is associated with four specific floating-point values that define its perceptual characteristics: 'speed', 'grass', 'wood', and 'concrete'. 'Grass', 'wood', and 'concrete' refer to the degree to which each respective material is present on the ground upon which the footsteps land. For the purposes of this experiment, we passed two attributes to the model in the hopes of achieving disentanglement: "grassiness" and "speed". By using synthesized audio, we ensure that these attribute values represent a ground truth, allowing us to confidently train the latent discriminator without worrying about overly noisy attribute data.
The dataset's structure was systematically designed to cover a wide range of attribute combinations, allowing the model to observe how changes in each parameter independently affect the resulting sound. The continuous nature of the labels is particularly crucial, as it enables the use of regression-based techniques to identify a coefficient vector capable of producing an attribute score when combined with an audio embedding vector. With 2401 unique samples, the dataset provides sufficient density and variety to train a model that can robustly map these semantic concepts to specific, controllable dimensions in the latent space, forming the foundation of the "semantic fader" control system investigated in this experiment.
4.1.2 Embedding and Attribute Computation
The core of this project's approach to attribute control lies in translating abstract, human-understandable concepts like "speed" or "grassiness" into concrete mathematical directions within a model's latent space. To achieve this, we compute concept vectors for each desired semantic attribute. This method is inspired by the foundational work on Concept Activation Vectors (CAV) by Kim et al. [16]. While their work originally used linear classifiers to distinguish the presence or absence of a concept for interpretability, this project adapts the core idea for a generative purpose. Instead of a binary classification, we treat the task as a continuous regression problem, which is better suited to the smoothly varying attributes in our synthesized audio dataset.
The specific technique employed here is what we term a Regression Concept Vector (RCV). For each attribute (e.g., speed, grassiness), a separate linear regression model is trained. This model learns to predict the continuous ground-truth value of the attribute (e.g., a float from 0.0 to 1.0) directly from the high-dimensional latent space embeddings produced by the CLAP model. The key insight is that the learned coefficient vector of this linear regressor represents the direction in the latent space of the CLAP model that corresponds most strongly to an increase in that specific attribute's value. This vector and its ability to predict semantic attributes provide the basis for the disentanglement of the downstream VAE's latent representation.
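The idea of reading a concept direction off a linear regressor can be sketched with ordinary least squares in NumPy. Everything below is a stand-in: the random `embeddings` array takes the place of real CLAP embeddings, the sizes are arbitrary, the labels are generated synthetically from a known direction, and the names `rcv` and `attribute_score` are hypothetical. The thesis does not specify which regression solver was used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, emb_dim = 500, 32                           # stand-in sizes, not the thesis values
embeddings = rng.normal(size=(n_clips, emb_dim))     # placeholder for CLAP embeddings

# Synthetic ground truth: the attribute grows along one hidden direction, plus label noise.
true_direction = rng.normal(size=emb_dim)
true_direction /= np.linalg.norm(true_direction)
labels = embeddings @ true_direction + 0.05 * rng.normal(size=n_clips)

# Fit labels ~ embeddings @ w + b by ordinary least squares (bias via a column of ones).
design = np.hstack([embeddings, np.ones((n_clips, 1))])
solution, *_ = np.linalg.lstsq(design, labels, rcond=None)
rcv, bias = solution[:-1], solution[-1]              # rcv is the regression concept vector

def attribute_score(embedding):
    """Scalar attribute estimate for one embedding vector."""
    return float(embedding @ rcv + bias)

# Cosine similarity between the recovered vector and the hidden direction.
cosine = float(rcv @ true_direction / np.linalg.norm(rcv))
```

On this synthetic data the fitted coefficient vector points almost exactly along the hidden direction, which is the property that lets the coefficient vector be interpreted as "the direction of increasing attribute value".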
This regression-based approach to steering generative models aligns with a broader research effort into creating more disentangled and controllable representations. By framing concept-to-vector mapping as a regression task, we can directly leverage labeled data to create controls that are not just representative of whether or not an attribute is present in data but also exhibit to what degree the attribute is present. This allows for a more intuitive and powerful method of "steering" the generative process.
This RCV is not used to manipulate the CLAP space directly. Instead it serves as a crucial tool for structuring the latent space of the core generative model. The RCV's ability to produce a scalar attribute score from an audio embedding provides the quantitative basis for the adversarial disentanglement of the downstream VAE's latent representation. In the main training loop, the latent discriminator is tasked with predicting these attribute scores from the VAE encoder's output. The encoder, in turn, is trained to produce a latent code that "fools" the discriminator, thereby learning a representation that is explicitly invariant to (or disentangled from) the attribute in question. This allows the ground-truth attribute value to be appended later as a conditional input to the decoder, facilitating precise and independent control.
4.1.3 Audio Representation
The choice of audio representation as input to the VAE model is crucial and has a great impact on the model's outcomes. We decided to decompose the raw WAV files with a 16-band Pseudo-Quadrature Mirror Filter (PQMF) bank, a specialized digital signal processing tool used to efficiently decompose a single, high-bandwidth signal into multiple, lower-bandwidth sub-band signals [17].
The core of the PQMF system is an analysis filter bank, which takes the input audio and passes it through a series of parallel band-pass filters. Each filter is designed to isolate a specific frequency range. After filtering, the signal in each sub-band is downsampled. Because each sub-band now contains a much narrower range of frequencies, its sampling rate can be significantly reduced without losing information, a process known as critical sampling. This decomposition is highly efficient as it reduces the temporal resolution and computational load required to process audio data in subsequent steps [17].
Figure 3: Overview of the PQMF architecture
The main VAE model is tasked with taking in this PQMF representation and providing a faithful reconstruction. To get back to the original audio, the sub-band signals are fed into a corresponding synthesis filter bank. First, each sub-band signal is upsampled by inserting zeros between the existing samples to return it to the original sampling rate. Then, each upsampled signal is passed through a synthesis filter that mirrors the properties of its corresponding analysis filter. The outputs from all the synthesis filters are then summed together to produce a replica of the original full-bandwidth signal.
We chose this representation due to its effectiveness in making the computationally expensive task of processing raw audio more tractable by representing the audio as a collection of multiple frequency bands, and its ability to achieve a near-perfect reconstruction of the original signal [17].
Instead of forcing the VAE to learn the intricate structure of the full-bandwidth audio at once, the PQMF splits the signal into multiple lower-frequency sub-bands. This decomposition significantly reduces the temporal resolution that the encoder and decoder must process within each band, making the learning task computationally more efficient and stable. Furthermore, since the reconstruction loss is calculated directly on these perceptually relevant sub-bands, the model is optimized to minimize errors in distinct frequency ranges, which often leads to higher-fidelity audio synthesis. The near-perfect reconstruction property of the PQMF ensures that the audio quality is preserved during the encoding and decoding stages, allowing the model to focus entirely on the generative task without introducing artifacts.
4.1.4 Model Architecture
The model is a conditional Variational Autoencoder (VAE) enhanced with an adversarial component. The core of the model consists of an encoder, which compresses the multi-band audio into a lower-dimensional latent space, and a conditional decoder, which reconstructs the audio from a sample of this latent space combined with explicit attribute information.
The VAE's architecture is defined by a symmetric encoder-decoder structure operating
on a 16-band PQMF representation of the audio. The encoder is a deep
convolutional network that progressively downsamples the input through four main
blocks with temporal downsampling ratios of [4, 4, 4, 2]. This compresses the temporal
dimensionality of the input audio by a factor of 128. The decoder mirrors this
architecture, using four corresponding upsampling blocks to reconstruct the 16-band
audio from a latent vector. Importantly, the decoder is conditional: its input is a
130-dimensional vector created by concatenating the 128-dimensional latent code
with a 2-dimensional attribute vector (speed and grassiness), enabling controlled
synthesis.
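The channel arithmetic above can be checked with a short sketch. It is not taken from the project code: the helper names (`total_downsampling`, `decoder_input`) and the example values are hypothetical, and it assumes the attributes are appended to the latent code frame by frame along the channel axis.

```python
import math

RATIOS = [4, 4, 4, 2]                    # temporal downsampling ratios from the text
LATENT_DIM = 128                         # latent channels
ATTRIBUTES = ("speed", "grassiness")     # the two conditioning attributes


def total_downsampling(ratios):
    """Product of the per-block ratios: the overall temporal compression."""
    return math.prod(ratios)


def decoder_input(latent_frames, attribute_frames):
    """Concatenate latent and attribute channels frame by frame.

    latent_frames:    list of T frames, each a list of LATENT_DIM floats
    attribute_frames: list of T frames, each a list of len(ATTRIBUTES) floats
    """
    if len(latent_frames) != len(attribute_frames):
        raise ValueError("latent and attribute sequences must have equal length")
    return [z + a for z, a in zip(latent_frames, attribute_frames)]


frames = 5  # hypothetical number of latent time steps
z = [[0.0] * LATENT_DIM for _ in range(frames)]
attrs = [[0.5, -0.25] for _ in range(frames)]  # made-up attribute values
cond = decoder_input(z, attrs)
print(total_downsampling(RATIOS), len(cond), len(cond[0]))  # prints: 128 5 130
```

Each decoder input frame therefore carries 128 latent channels followed by the 2 attribute channels.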
Several key hyperparameters govern the model's training and behavior. The latent
space has a dimensionality of 128, providing a compact representation of the audio
features. The VAE loss function is balanced by a term weighting the KL divergence,
which is set to 0.2 during the training step. The model is trained using the Adam
optimizer with a learning rate of 1e-3 and a batch size of 16. Audio inputs are
processed as 6-second clips sampled at 44.1 kHz.
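For reference, these hyperparameters can be collected in a small configuration object. This is a sketch rather than the project's actual configuration format. The class and field names are hypothetical. The derived values assume that the PQMF decimates each band by the number of bands, and that clips are padded up to a whole number of latent frames.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    latent_size: int = 128
    kl_weight: float = 0.2           # weight on the KL divergence term
    learning_rate: float = 1e-3      # Adam learning rate
    batch_size: int = 16
    clip_seconds: float = 6.0
    sample_rate: int = 44100
    n_bands: int = 16                # PQMF sub-bands
    ratios: tuple = (4, 4, 4, 2)     # encoder downsampling ratios

    @property
    def clip_samples(self) -> int:
        return round(self.clip_seconds * self.sample_rate)

    @property
    def samples_per_latent_frame(self) -> int:
        # Assumption: PQMF decimation (n_bands) times encoder downsampling.
        return self.n_bands * math.prod(self.ratios)

    @property
    def latent_frames(self) -> int:
        # Assumption: clips are padded up to a whole number of latent frames.
        return math.ceil(self.clip_samples / self.samples_per_latent_frame)


cfg = TrainingConfig()
print(cfg.clip_samples, cfg.samples_per_latent_frame, cfg.latent_frames)
```

Under these assumptions a 6-second clip has 264,600 samples, which is not an exact multiple of the 2,048 samples covered by one latent frame, so some padding or cropping step is needed in practice.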
The training process uses a hybrid objective function that combines classic VAE
losses with an adversarial loss. The VAE is trained to minimize a reconstruction
loss, which is calculated as a multi-scale spectral distance between the original
and generated audio sub-bands to ensure perceptual similarity. Alongside this, a
Kullback-Leibler (KL) divergence loss regularizes the encoder, pushing the distribution
of the latent space to approximate a standard Gaussian prior distribution. This
regularization is crucial for ensuring that the latent space is smooth and continuous,
allowing for meaningful interpolation and generation of novel audio samples.
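As a minimal illustration of the KL term, the sketch below uses the closed-form divergence between a diagonal Gaussian posterior and a standard Gaussian prior, and combines it with a reconstruction loss using the weight of 0.2 quoted above. The function names are hypothetical, and the encoder is assumed to output a mean and a log-variance per latent dimension.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var)
    )


def vae_loss(reconstruction_loss, mean, log_var, kl_weight=0.2):
    """Reconstruction term plus the weighted KL regularizer."""
    return reconstruction_loss + kl_weight * kl_to_standard_normal(mean, log_var)


# A posterior equal to the prior incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # prints: 0.0
# Shifting one unit-variance dimension by 1 costs 0.5 nats.
print(kl_to_standard_normal([1.0], [0.0]))            # prints: 0.5
```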
Latent discriminator
The adversarial component introduces a unique dynamic to the training. A 'latent
discriminator' is trained concurrently to predict the audio's descriptive attributes
directly from the latent code produced by the encoder. The discriminator is rewarded
for making correct predictions and the encoder is rewarded for fooling the
discriminator. This adversarial objective forces the encoder to learn a disentangled
latent representation where specific dimensions correlate with the audio attributes.
The decoder then leverages this structured latent code along with the ground-truth
attributes to perform conditional synthesis, enabling control over the characteristics
of the generated audio.
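One common way to write such an adversarial objective is sketched below. It is an illustration, not the project's training loop. The discriminator minimizes its attribute classification loss, while the encoder/decoder objective subtracts that same loss, scaled by a weight. The weight schedule follows the description given for the discriminator loss weight (an initial value, a delay, then a linear ramp to a maximum). All function names, the ramp length, and the numeric defaults are hypothetical placeholders.

```python
def adversarial_weight(step, initial=0.0, delay=1000, ramp_steps=1000, maximum=1.0):
    """Hold `initial` until `delay` steps, then ramp linearly up to `maximum`."""
    if step <= delay:
        return initial
    progress = min(1.0, (step - delay) / ramp_steps)
    return initial + progress * (maximum - initial)


def discriminator_loss(attribute_ce):
    """The discriminator is rewarded for predicting the attribute bins correctly."""
    return attribute_ce


def encoder_decoder_loss(recon, kl, attribute_ce, kl_weight=0.2, adv_weight=1.0):
    """The encoder is rewarded when the discriminator's loss is high."""
    return recon + kl_weight * kl - adv_weight * attribute_ce


for step in (0, 1500, 5000):
    print(step, adversarial_weight(step))
```

Keeping the adversarial weight small early in training gives the autoencoder time to learn a usable reconstruction before the adversarial pressure is applied.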
The architecture of the discriminator consists of two main parts: a shared feature
extraction backbone followed by attribute-specific prediction heads. The input to
the model is the latent code 'z'. The shared feature extraction stage begins with
a series of blocks, each containing a 1D convolutional layer with a kernel size of 7,
followed by batch normalization and a LeakyReLU activation. These initial layers
process the latent sequence while maintaining its channel dimension. After these
repeating blocks, an additional convolutional block reduces the channel dimension
by half. This shared representation is then fed into a separate prediction head
for each attribute. Each head further refines the features, first with a convolutional
block that reduces the channels to 32, and then with a final convolutional layer that acts
as the classifier, mapping the 32-channel representation to an output whose channel
count equals the number of quantization bins. This final output contains
the raw logits for each quantized bin at every time step in the sequence. This
discriminator approach mirrors the one used by Devis et al. in their 2023 paper [2].
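The layer stack described above can be traced as a sequence of tensor shapes. The sketch below only does shape bookkeeping and is not the discriminator itself. Several details are assumptions not stated in the text: the number of shared blocks (3), the number of latent frames (130), stride 1 with padding 3 so that a kernel of size 7 preserves the sequence length, and the same kernel size in the heads.

```python
def conv1d_output_length(length, kernel_size=7, padding=3, stride=1):
    """Standard 1-D convolution length formula (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def discriminator_shapes(latent_channels=128, frames=130, n_shared_blocks=3,
                         attributes=("speed", "grassiness"), n_bins=16):
    """Trace (layer name, channels, frames) through the described stack."""
    trace = [("input z", latent_channels, frames)]
    channels = latent_channels
    # Shared backbone: conv + batch norm + LeakyReLU, channel count unchanged.
    for i in range(n_shared_blocks):
        frames = conv1d_output_length(frames)
        trace.append((f"shared block {i + 1}", channels, frames))
    # Extra block that halves the channel dimension.
    channels //= 2
    frames = conv1d_output_length(frames)
    trace.append(("shared reduce", channels, frames))
    # One prediction head per attribute: reduce to 32 channels, then classify.
    for name in attributes:
        head_frames = conv1d_output_length(frames)
        trace.append((f"{name} head: conv block", 32, head_frames))
        head_frames = conv1d_output_length(head_frames)
        trace.append((f"{name} head: classifier logits", n_bins, head_frames))
    return trace


for row in discriminator_shapes():
    print(row)
```

Under these assumptions every head emits one 16-way logit vector per latent time step.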
Attribute computation and prediction
The computed semantic attributes are stored as a vector whose length equals the temporal
dimensionality of the input audio's latent representation, so that they can be appended
to the latent code before it is passed to the decoder.
Before model training, quantization bins are computed on the entire dataset by
mapping the set of continuous attribute values to a discrete set, enabling the discriminator
to handle attribute computation as a classification problem. All attribute
vectors are quantized into 16 bins to match the output of the latent discriminator.
The accuracy of the discriminator is then assessed by cross-entropy loss.
Quantization is the process of mapping a large, continuous set of values into a
smaller, discrete set, often referred to as bins. In models like F-RAVE, which deal
with continuous attribute prediction, quantization enables the model to handle attribute
prediction as a classification problem by assigning continuous predictions to
a finite number of bins.
In this implementation, quantization was done by scaling the continuous feature values
to the range [−1, 1], sorting these values, and then defining bin thresholds by
selecting equidistant indices across the sorted data, effectively dividing the attribute
range into 16 bins.
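The procedure above can be sketched as follows. This is an illustrative reimplementation, not the project code. The function names are hypothetical, and the exact index rule used to pick thresholds is an assumption.

```python
import bisect
from collections import Counter


def scale_to_unit_range(values):
    """Linearly rescale values to [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def bin_thresholds(scaled_values, n_bins=16):
    """Pick n_bins - 1 thresholds at equidistant indices of the sorted data."""
    ordered = sorted(scaled_values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def quantize(value, thresholds):
    """Index of the bin that `value` falls into (0 .. len(thresholds))."""
    return bisect.bisect_right(thresholds, value)


values = list(range(32))              # made-up attribute values
scaled = scale_to_unit_range(values)
thresholds = bin_thresholds(scaled)
bins = [quantize(v, thresholds) for v in scaled]
print(Counter(bins))
```

Because the thresholds are taken at equidistant positions in the sorted data, the bins hold roughly equal numbers of samples rather than covering equal-width intervals of the attribute range.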
The latent discriminator predicted attributes via logit computation across all 16
attribute bins. Cross-entropy loss was then computed between the real and predicted
attribute values. Formally, cross-entropy loss is defined as

$$L = -\sum_{k=1}^{K} y_k \log(p_k)$$

where $K$ is the number of classes, $y_k$ is the true label, and $p_k$ is the predicted
probability for class $k$.
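A minimal numerical sketch of this loss for a single time step is given below. It assumes that the true label is one-hot, in which case the sum reduces to the negative log-probability of the true bin, and that the probabilities are obtained from the discriminator's logits with a softmax. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_bin):
    """-sum_k y_k log(p_k) with a one-hot y, i.e. -log(p_true)."""
    return -math.log(softmax(logits)[true_bin])


# With uniform logits over 16 bins the loss equals log(16), about 2.77.
print(cross_entropy([0.0] * 16, true_bin=3))
```

A discriminator that cannot recover the attribute from the latent code should stay close to this chance-level value.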
Chapter 5
Results
This chapter will exhibit the results of the experiment detailed in the previous
chapter, beginning with the performance of the attribute computation method and
moving to the perceptual results of modifying the latent axes corresponding to the
disentangled semantic attributes.
5.1 Attribute Computation
5.1.1 Speed
The RCV dedicated to speed (Fig. 4-1) achieves an R² of 0.990 on the training data
and 0.984 on the unseen test clips, indicating that the learned regression coefficient
is highly predictive of footstep tempo.
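For readers unfamiliar with the metric, the coefficient of determination can be computed as in the sketch below. The function name and the example numbers are hypothetical and are not the thesis data.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # prints: 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # prints: 0.0
```

A value of 1 means the predictions match the ground truth exactly, while 0 means they are no better than always predicting the mean.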
5.1.2 Surface Material Attributes
While performance stays solid for the "grassiness" attribute (R² = 0.988), prediction
quality drops markedly for the two other "material" dimensions. Perceptual
evaluation of the dataset suggests that this is likely due to a high level of entanglement
between "grassiness" and the other material attributes. For example, adding "woodiness"
to a clip with high "grassiness" resulted in much less of a perceptual difference
than performing the inverse operation. Due to this entanglement, only "speed" and
"grassiness" were chosen as perceptual attributes to disentangle during training.
5.1.3 Plots
Figures 4-1 to 4-4 summarize the quality with which the learned regression-concept
vectors (RCVs) recover the four continuous attributes encoded in the footstep
dataset. Each panel shows the predicted value against the ground-truth value for
the test set; the dashed red line indicates perfect prediction. The coefficient of
determination (R²) obtained from the full linear regressor is annotated in each
plot.
[Figure 4 panels: (a) Speed, (b) Woodiness, (c) Grassiness, (d) Concrete]
Figure 4: These four plots show test performance of the learned RCV for each of the
considered semantic attributes in this experiment. The "speed" and "grassiness" RCVs
perform well, while the other attributes are harder to predict.
5.2 Semantic Fader Model
This section will detail the quantitative and perceptual results related to training
and inference of the proposed VAE/GAN generative audio model on the synthesized
footsteps dataset.
Latent Size    Ratios          N PQMF Bands    Num Bins
128            [4, 4, 4, 2]    16              16

Table 1: Model configuration of the most successful training run. "Latent Size" refers to
the number of dimensions present in the model's latent representation. "Ratios" refers to
the downsampling/upsampling ratio at each convolutional layer of the encoder/decoder.
The product of these ratios gives the total sampling factor = 128. "N PQMF Bands"
refers to the number of frequency sub-bands the audio was decomposed into as input to
the model. "Num Bins" refers to the number of bins used to quantize the continuous
attributes.
5.2.1 Reconstruction Fidelity
Reconstruction fidelity was measured via a multi-scale spectral distance computed
on PQMF sub-bands. Given the multiband signals $x_{mb}, y_{mb} \in \mathbb{R}^{B \times C \times T}$, we apply
magnitude STFTs at scales $S = \{2048, 1024, 512, 256, 128\}$ with hop $s/4$ to each
band and aggregate across bands and scales. At each scale $s$, the distance couples a
relative L2 term on linear magnitudes with an L1 term on log-magnitudes (stabilized
by $\varepsilon = 10^{-7}$), encouraging both envelope matching and contrast at different loudness
levels. The formal definition is given below:
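$$L_{recon}(x, y) = \sum_{s \in S} \left( \frac{\left\| |X_s| - |Y_s| \right\|_2}{\left\| |X_s| \right\|_2} + \left\| \log(|X_s| + \varepsilon) - \log(|Y_s| + \varepsilon) \right\|_1 \right),$$

where $X_s = \mathrm{STFT}_s(x_{mb})$ and $Y_s = \mathrm{STFT}_s(y_{mb})$. The loss is computed per PQMF
band (by reshaping bands into the batch) and summed over scales, yielding the
scalar spectral_distance used for the training loss.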
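The per-scale term can be illustrated with the sketch below. It operates on STFT magnitudes that are assumed to be precomputed and flattened into plain lists. The STFT itself and the reshaping of bands into the batch are omitted. The function names are hypothetical, and the L1 term is taken as a plain sum, following the formula as written.

```python
import math

EPS = 1e-7  # stabilizer for the log-magnitude term


def scale_distance(mag_x, mag_y, eps=EPS):
    """Relative L2 on linear magnitudes plus L1 on log-magnitudes.

    mag_x, mag_y: flat lists of non-negative STFT magnitudes at one scale.
    mag_x must not be all zeros, since its norm is the denominator.
    """
    diff_l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(mag_x, mag_y)))
    norm_x = math.sqrt(sum(a * a for a in mag_x))
    log_l1 = sum(
        abs(math.log(a + eps) - math.log(b + eps)) for a, b in zip(mag_x, mag_y)
    )
    return diff_l2 / norm_x + log_l1


def multiscale_distance(mags_x, mags_y):
    """Sum the per-scale distance over all scales (dict: scale -> magnitudes)."""
    return sum(scale_distance(mags_x[s], mags_y[s]) for s in mags_x)


# Made-up magnitudes at two scales; identical inputs give zero distance.
x = {2048: [1.0, 2.0], 1024: [0.5, 0.25]}
print(multiscale_distance(x, x))  # prints: 0.0
```

The relative L2 term is dominated by the loudest bins, while the log term gives quiet components a comparable say in the loss.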
The reconstruction loss curve in Figure 5 exhibits a rapid initial decrease followed
by a slower, steady improvement, converging near a stable regime after roughly the
mid-training stage. Listening tests indicated strong preservation of overall "shape"
and amplitude envelope, but comparatively less high-frequency and fine-grained
6.3 Future Work
Challenges faced in achieving temporally consistent and coherent outputs indicate
that a different strategy may need to be applied to time-dependent semantic features.
In particular, deep-learning architectures that better account for temporal sequences,
such as RNNs, LSTMs, and Transformers, may be better suited to this aspect. A
comprehensive solution may come in the form of a hybrid VAE-RNN architecture
that combines the representational power of statistical distribution modeling with
the sequence modeling capabilities of recurrent neural networks.
List of Figures
1 A visual depiction of a typical VAE architecture. The encoder takes
the input data and compresses it into a latent representation, typically
via a series of convolutional layers. The decoder then decompresses
this representation and seeks to produce a perceptually identical output.
2 A visual depiction of a typical GAN architecture. The generator
creates fake data from a compressed latent representation, and the
discriminator performs the reverse process (downsampling) in order
to understand and discern between real and fake data.
3 Overview of the PQMF architecture
4 These four plots show test performance of the learned RCV for each
of the considered semantic attributes in this experiment. The "speed"
and "grassiness" RCVs perform well, while the other attributes are
harder to predict.
5 Reconstruction loss over roughly 750,000 training steps
6 Computed speed of original and auto-encoded audio clips
7 Computed grassiness of original and auto-encoded audio clips
8 Loss curve during initial training with GAN reading latent codes.
List of Tables
1 Model configuration of the most successful training run. "Latent Size"
refers to the number of dimensions present in the model's latent
representation. "Ratios" refers to the downsampling/upsampling ratio
at each convolutional layer of the encoder/decoder. The product of
these ratios gives the total sampling factor = 128. "N PQMF Bands"
refers to the amount of frequency sub-bands the audio was decomposed
into as input to the model. "Num Bins" refers to the amount
of bins used to quantize the continuous attributes.
2 Parameters related to discriminator loss weight control. "Initial Lambda"
refers to the initial value of the loss term weight. "Lambda Delay"
refers to the amount of steps before the ramp up began. "Max
Lambda" refers to the maximum weight term that was applied after
the linear ramp up.
Bibliography
[1] Lample, G. et al. Fader networks: Manipulating images by sliding attributes
(2018). URL http://arxiv.org/abs/1706.00409.
[2] Devis, N., Demerlé, N., Nabi, S., Genova, D. & Esling, P. Continuous
descriptor-based control for deep audio synthesis (2023). URL
http://arxiv.org/abs/2302.13542.
[3] Caillon, A. & Esling, P. Rave: A variational autoencoder for fast and high-
quality neural audio synthesis (2021). URL http://arxiv.org/abs/2111.05011.
[4] Burgess, C. P. et al. Understanding disentangling in β-vae (2018). URL
http://arxiv.org/abs/1804.03599.
[5] Yang, X., Bi, W., Sun, Y., Cheng, Y. & Yan, J. Towards better understanding
of disentangled representations via mutual information (2020). URL
http://arxiv.org/abs/1911.10922.
[6] Kim, H. & Mnih, A. Disentangling by factorising (2019). URL
http://arxiv.org/abs/1802.05983.
[7] van den Oord, A., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation
learning (2018). URL http://arxiv.org/abs/1711.00937.
[8] van den Oord, A. et al. Wavenet: A generative model for raw audio (2016).
URL http://arxiv.org/abs/1609.03499.
[9] Hershey, S. et al. Cnn architectures for large-scale audio classification (2017).
URL http://arxiv.org/abs/1609.09430.
[10] Wu, Y. et al. Large-scale contrastive language-audio pretraining with feature
fusion and keyword-to-caption augmentation (2024). URL
http://arxiv.org/abs/2211.06687.
[11] Radford, A. et al. Learning transferable visual models from natural language
supervision (2021). URL http://arxiv.org/abs/2103.00020.
[12] Kingma, D. P. & Welling, M. An introduction to variational autoencoders
(2019). URL http://arxiv.org/abs/1906.02691,
http://dx.doi.org/10.1561/2200000056.
[13] Beckham, C. A deep dive into conditional variational autoencoders.
https://beckham.nz/2023/04/27/conditional-vaes.html.
[14] Goodfellow, I. J. et al. Generative adversarial networks (2014). URL
http://arxiv.org/abs/1406.2661.
[15] Wyse, L. & Kellock, P. Embedding interactive sounds in multimedia
applications. Tech. Rep. (1999).
[16] Kim, B. et al. Interpretability beyond feature attribution: Quantitative testing
with concept activation vectors (tcav) (2018). URL
http://arxiv.org/abs/1711.11279.
[17] Mimilakis, S. I. & Schuller, G. Investigating the potential of pseudo quadrature
mirror filter-banks in music source separation tasks (2017). URL
http://arxiv.org/abs/1706.04924.
The full repository containing the code written for this project can be found at
https://github.com/JedPadoa/SemanticFader