
Extracting Sonic Trajectories

Author: Scutari, Tito
Publisher: Zenodo
DOI: 10.5281/zenodo.17305034
Source: https://zenodo.org/records/17305034/files/Tito-Scutari_SMC_2025_Master_Thesis.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
Extracting Sonic Trajectories
Tito Scutari
Supervisor: Sergi Jorda
Co-Supervisor: Behzad Haki
2025
Contents
1 Introduction
1.1 Overview
1.2 Motivation
1.2.1 Extracting groove
1.2.2 Universal neural encoders are everywhere
1.3 Objectives
1.3.1 Tracking sonic trajectories
1.3.2 Interpretability of representations
1.3.3 Test bench for representations
1.4 Structure of the Report
1.4.1 State of the art
1.4.2 Exploratory experiments
1.4.3 First Study
1.4.4 Second Study
1.4.5 Conclusion
2 State of the art
2.1 Groove
2.2 Onset detection
2.2.1 Traditional approaches
2.2.2 Deep Learning approaches
2.3 Trajectories
2.4 Representations
2.4.1 Traditional representations
2.4.2 Latent spaces
3 Exploratory experiments
3.1 Introduction
3.2 Methodology
3.2.1 Synth
3.2.2 Test set
3.2.3 Heatmaps
3.2.4 Magnitude
3.2.5 Distance
3.2.6 Cosine similarity
3.3 Results
3.3.1 Amplitude
3.3.2 Pitch
3.3.3 Noise
3.3.4 Frequency content
3.3.5 Delay and reverb
4 Study 1: Single Modulations
4.1 Introduction
4.2 Setup
4.2.1 Dataset
4.2.2 Representations
4.2.3 Metrics and measurements
4.2.4 Procedure
4.3 Results
4.3.1 Example plot
4.3.2 Summary table
4.4 Discussion
5 Study 2: Double Modulations
5.1 Introduction
5.2 Setup
5.2.1 Dataset
5.2.2 Representations
5.2.3 Metrics and measurements
5.2.4 Procedure
5.3 Results
5.3.1 Example boxplots
5.3.2 Tables
5.4 Discussion
6 Conclusions and discussion
6.1 Overview
6.2 Representations
6.3 Metrics
6.4 First Study: Single Modulations
6.5 Second Study: Double Modulations
6.6 Future Directions
List of Figures
List of Tables
Bibliography
A Linear Analysis of Modulation Representation
A.1 Setup
A.1.1 Dataset
A.1.2 Methodology
A.1.3 Stage 1: Per-Sample Model Fitting
A.1.4 Stage 2: Generalization with an Average Model
A.2 Results
A.2.1 Per-Sample Model Performance
A.2.2 Generalization Model Performance
A.3 Discussion
Acknowledgement
I would like to express my sincere gratitude to:
Sheila, Navid and Justin for an unforgettable year
Behzad, Błażej, Lonce and Martin and the UPF faculty for all the inspirations and insights
Milo, Doma, Fernando, Satya, Vivek and all the classmates for the best nerdy vibes
the memory of Giorgio, for all the support in the past 10 years
the memory of M° Renato, for teaching me the love of music
and all the friends and family back in Italy, for making me what I am
Chapter 1. Introduction
• First study: systematic evaluation of single modulations.
• Second study: systematic evaluation of double modulations.
• Conclusion: summary of findings, limitations, and possible future directions.
1.4.1 State of the art
This chapter reviews the main concepts and previous work relevant to this project. It starts with the idea of groove in music, both from a theoretical and perceptual perspective. It then covers traditional and deep learning approaches to onset detection, discusses the notion of sonic trajectories for sound analysis, and surveys various audio representations, both classic signal-processing methods and modern neural latent spaces, that can be used for trajectory extraction.
1.4.2 Exploratory experiments
This section describes initial experiments using synthesized audio signals with known modulations. The goal is to see if different representations can capture these modulations in a meaningful way. These experiments help motivate the choice of representations and metrics, and offer an early look at the strengths and weaknesses of each approach before developing the main algorithm.
1.4.3 First Study
This chapter presents a systematic evaluation of single modulations. A dataset of synthetic sounds was generated with one parameter modulated at a time: amplitude, frequency, filter cutoff, or oscillator shape. For each case, trajectories were extracted and compared with ground-truth signals through correlation measures. The goal is to establish a baseline, showing the sensitivity and stability of different representations under simple and isolated conditions.

1.4.4 Second Study
This chapter extends the evaluation to double modulations. Two parameters vary simultaneously, creating interactions between trajectories. The analysis focuses on how metrics respond when modulations overlap or interfere, and whether one can still be reliably tracked in the presence of another. The aim is to test robustness and generalization of representations under more complex and realistic scenarios.
1.4.5 Conclusion
The final chapter summarizes the main findings of the project, discussing what worked, what didn't, and why. It reflects on the usability of different representations, the usefulness of the developed metrics and procedures, and the overall potential of this approach for audio analysis and creative applications. It also suggests directions for further research, including improvements to the testing framework and application to real-world audio.
Chapter 2
State of the art
This chapter illustrates the current state of the art of the topics pertinent to this work. It is divided into two groups of sections. The first group reviews the concept of groove from a more theoretical point of view, covering its psychoacoustical and cognitive aspects (Section 2.1), and then goes over past and current technical approaches to onset detection (Section 2.2), which is the most relevant MIR topic related to groove. The second group gives an overview of the idea of sonic trajectories (Section 2.3), one of the perspectives used for music analysis, especially for the more experimental genres, where most of the structures are built on timbral aspects of sound. It then presents more technical representations and algorithms (Section 2.4) used to analyze sound and music, especially from a timbral perspective, that can be useful for a trajectory-based interpretation of sound.
2.1 Groove
In the musicology discourse, groove is an important concept, yet elusive and multifaceted, depending on genres, eras, and contexts. It is closely related to the ideas of rhythm, musical gesture, dance and immersion, and is deeply correlated with motor areas of the brain, as explained by Etani [1]. Defining groove is not a simple task and several definitions have been proposed. An interesting general one that should be taken into account was proposed by Duman et al. in 2024 [2]: “Groove is a participatory experience (related to immersion, movement, positive affect, and social connection) resulting from subtle interaction of specific music- (such as time- and pitch-related features), performance-, and/or individual-related factors.” It is thus clear that groove is an incredibly wide term. For our purposes it has been narrowed down to computable rhythmic audio features, which in the literature [3] are mainly related to percussive elements and, more generally, to events that provide a perceptual quantization of time. In the field of MIR those events are mainly treated with various onset detection techniques.
2.2 Onset detection
Onset detection refers to the process of locating the beginning of a musical note or sound, often associated with the transient phase where the signal exhibits rapid changes. It is essential for applications such as automatic music transcription, beat tracking, and synchronization in music production. The task is particularly challenging in polyphonic music, where multiple instruments play simultaneously, leading to overlapping signals.
2.2.1 Traditional approaches
Traditional digital signal processing methods for onset detection rely on analyzing various signal properties to detect abrupt changes indicative of onsets [4]. These can be categorized into several sub-approaches depending on the audio representation used:
Time domain
A common technique is energy-based detection, where the signal's energy is calculated over short windows, and onsets are identified when energy exceeds a predefined threshold [4]. For example, detecting sudden increases in amplitude is straightforward but can lead to false positives in noisy or amplitude-modulated signals.
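As a minimal sketch of the energy-based idea described above (not the method of any specific toolkit), the following computes short-time energy with NumPy and reports the frames where it first crosses a threshold. The function name `energy_onsets`, the window and hop sizes, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def energy_onsets(signal, sr, win=1024, hop=512, threshold=0.1):
    """Return onset times (s) where short-time energy rises above `threshold`.

    All parameter values are illustrative assumptions, not taken from the thesis.
    """
    n_frames = 1 + max(0, len(signal) - win) // hop
    # Mean squared amplitude of each analysis window.
    energy = np.array([
        np.mean(signal[i * hop : i * hop + win] ** 2) for i in range(n_frames)
    ])
    above = energy > threshold
    # An onset is a frame above the threshold whose predecessor was below it.
    rising = above & ~np.concatenate(([False], above[:-1]))
    return np.flatnonzero(rising) * hop / sr

sr = 8000
t = np.arange(sr) / sr
# Silence for 0.5 s, then a 440 Hz tone: a single onset is expected near 0.5 s.
x = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
onsets = energy_onsets(x, sr)
```

The reported time is the start of the first window whose energy exceeds the threshold, so it can precede the true onset by up to one window length. A fixed threshold like this is also what makes the approach fragile on amplitude-modulated material.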
Frequency domain
These methods transform the signal into the frequency domain using techniques such as the fast Fourier transform (FFT). Metrics such as spectral flux [4], which measures the rate of change in spectral energy between consecutive frames, are widely used. Other metrics include the spectral centroid, which tracks the center of mass of the spectrum and can highlight frequency shifts at onsets. These methods are more robust than time-domain approaches but require more computational resources due to FFT processing. Superflux belongs to this category of onset detection algorithms and is the most widely used, as it is a standard go-to for Librosa and Essentia, the most widespread MIR toolkits. Other advanced methods of this kind also take into account phase changes of the components, detecting onsets when simultaneous changes in phase occur.
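To make the spectral flux definition concrete, here is a small NumPy-only sketch. It computes a plain half-wave rectified flux, not Superflux and not the Librosa or Essentia implementations. The function name and the frame parameters are assumptions chosen for illustration.

```python
import numpy as np

def spectral_flux(signal, n_fft=1024, hop=512):
    """Half-wave rectified spectral flux between consecutive STFT frames.

    Frame size and hop are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    diff = np.diff(mag, axis=0)
    # Keep only increases in magnitude, then sum over frequency bins.
    return np.sum(np.maximum(diff, 0.0), axis=1)

sr = 8000
t = np.arange(sr) / sr
# Silence followed by a 440 Hz tone starting at 0.5 s.
x = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
flux = spectral_flux(x)
```

The flux stays at zero during the silence and peaks at the frame transition where the tone enters the analysis window. Picking peaks of this curve is the usual next step.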
2.2.2 Deep Learning approaches
Deep learning methods, particularly Convolutional Neural Networks (CNNs), have gained prominence in onset detection by learning patterns directly from data, often represented as spectrograms or other transforms (wavelet, CQT). They learn spatial patterns that correspond to onsets, such as sudden changes in frequency content. A notable early example is the work by Schlüter and Böck [5], which showed CNNs outperforming traditional methods on a dataset with 26,000 annotated onsets. RNNs, in particular LSTMs, are also able to outperform traditional methods, as shown by Marchi et al. [6]. Nowadays most advanced models use a combination of the two, as shown by one of the most advanced models for polyphonic strings onset detection, developed in 2023 [7]. It is important to note that although those models outperform traditional methods, the costs to develop them are very high, both from a computational point of view and in terms of data sourcing. Running them is also expensive: GPU acceleration might be needed and not all of them can run in real time on a common laptop. In the vast panorama of onset detection models, Dance Dance Convolution is the one used as of today by the system. It is based on a convolutional architecture trained on Dance Dance Revolution annotations and shows outstanding real-time performance for transient-like onset detection. This is ideal for detecting percussive sounds but falls short when onsets are softer or happen in the pitch or timbre domain.
2.3 Trajectories
Sound gestures and trajectories in sound art and electroacoustic music refer to how sounds move and evolve, capturing both performer actions and spatial dynamics. Smalley's spectromorphology [8], introduced in 1986, describes the temporal shaping of sound spectra, providing a tool to analyze these aspects. While Smalley describes them from an analytical point of view, composers have been using similar concepts, representing them with graphic notation. This notation uses visual symbols to represent music, offering flexibility beyond traditional notation. Xenakis's UPIC system translates drawings into sound, while Cardew's "Treatise" (1967) leaves interpretation open, both influencing sound gestures (Xenakis UPIC, Cardew Treatise). MIR techniques model and generate music through computational representations that collide with the ones used by musicologists and composers. Though direct research is sparse, the integration of those representations could lead to interesting results.
2.4 Representations
Sound representations are methods used to transform raw audio signals into structured formats that emphasize specific characteristics, making it easier to analyze, process, or interpret the audio data. They are essential in fields like audio processing, speech recognition, music analysis, and machine learning for audio-related tasks. Each representation highlights different aspects of sound, such as amplitude, frequency, or abstract features, depending on the intended application.
2.4.1 Traditional representations
Traditional representations focus on decomposing audio signals into interpretable components, often based on frequency or perceptual scales. These methods are well-established and widely used in signal processing.
Fourier transform
The Fourier Transform decomposes a signal into its frequency components, and for audio, the Short-Time Fourier Transform (STFT) is commonly used to capture how these frequencies evolve over time [4]. It divides the audio into short segments and computes the Fourier Transform of each, resulting in a time-frequency representation.
Constant Q Transform
The Constant Q Transform is a time-frequency representation where each frequency bin has a constant Q factor, meaning the bandwidth is proportional to the frequency. This results in a logarithmic frequency scale, aligning with human perception of pitch. Unlike STFT, which uses linear frequency spacing, CQT uses logarithmic spacing, with lower frequencies having narrower bandwidths and higher frequencies having wider bandwidths [9]. This is particularly effective for musical signals.
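The constant-Q property can be illustrated by computing the center frequencies and bandwidths of such a filter bank. The sketch below is only a numerical illustration of the definition. The starting frequency (32.70 Hz, roughly C1), the number of bins and the 12 bins per octave are assumptions, and the function name is hypothetical.

```python
import numpy as np

def cqt_bins(f_min=32.70, n_bins=84, bins_per_octave=12):
    """Center frequencies and bandwidths of a constant-Q filter bank.

    Default values are illustrative assumptions.
    """
    k = np.arange(n_bins)
    # Geometrically spaced center frequencies (logarithmic frequency axis).
    freqs = f_min * 2.0 ** (k / bins_per_octave)
    # Q = f_k / bandwidth_k is the same for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return freqs, freqs / q

freqs, bandwidths = cqt_bins()
```

Every 12 bins the center frequency doubles, and the bandwidth grows with it, so the ratio between the two stays fixed across the whole range.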
Nonnegative Matrix Factorization
NMF is a matrix decomposition technique that factors a non-negative matrix (e.g., a spectrogram) into two non-negative matrices: a dictionary matrix (basis spectra) and an activation matrix (how these bases are activated over time). This can be interpreted as representing the audio as a combination of basis spectra. This representation is widely used in the MIR field, especially for source separation.
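A compact way to see the dictionary/activation split is the classic multiplicative-update scheme for the Frobenius objective. The following is a sketch under that assumption. The text does not state which NMF algorithm is meant, and the function name, rank and iteration count are illustrative.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x time) as W @ H.

    W holds basis spectra (dictionary), H their activations over time.
    Uses multiplicative updates, an assumed choice of algorithm.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" built from two non-negative components.
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 30))
W, H = nmf(V, rank=2)
```

With a real magnitude spectrogram in place of `V`, the columns of `W` can be read as spectral templates and the rows of `H` as their time envelopes.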
2.4.2 Latent spaces
Latent space representations leverage machine learning, particularly neural networks, to create abstract, high-level representations of data [10]. These are often used for tasks requiring similarity measurements, semantic understanding, generation, or compression, and they represent a shift toward data-driven approaches.
CLAP and semantics
Using contrastive learning, it is possible to learn mappings between audio clips and their textual descriptions into a shared embedding space where related concepts, like the sound of rain and the phrase "rain falling", are close together, and unrelated ones are far apart. This cross-modal approach bridges the gap between sound and meaning, enabling tasks like searching for audio with text or understanding audio content semantically, thus building a semantically rich and meaningful latent space. The downside is that long samples are needed, and these embeddings cannot be calculated frame by frame in real time: content is very meaningful and rich, but temporal resolution in today's models is very poor. CLAP [11] is the most widespread model trained in this way, and its architecture enables cross-modal tasks like typing "dog barking" to find a sound clip or feeding an audio file to get a text description. Its contrastive learning approach also allows zero-shot classification: you can classify audio into categories never seen during training by using text labels (e.g., "happy music" vs. "sad music"). By focusing on high-level semantics, CLAP makes audio understandable in human terms, which is crucial for applications like content retrieval or audio annotation.
RAVE and live use
Some models focus on real-time audio synthesis and manipulation via latent spaces. They usually leverage a variational autoencoder (VAE), a type of neural network that learns a compressed, probabilistic representation of data, to capture the essence of audio in a way that is both efficient and flexible. The idea here is to encode audio into a compact "latent space" and then decode it back into sound, enabling not just reconstruction but also the creation of new audio. This makes them useful for generative applications, such as synthesizing new sounds or morphing existing ones, all while keeping latency low enough for live use. A well-known example is RAVE [12], whose architecture shines in real-time performance and is optimized to process audio fast enough for live applications like music production or interactive sound design. Its generative nature means that the latent space can be sampled to create entirely new sounds, or the latent variables can be tweaked to transfer styles (e.g., making a drum sound like a synth).
SoundStream and compression
Another use of latent spaces is for building neural audio codecs built for efficient compression and high-quality reconstruction. The goal is to shrink audio into a compact form for storage or streaming while keeping it sounding great when played back. The result is a very fast and complete latent space, but it is optimized for compression, not for understandability and semantics. A successful example is SoundStream [13], which is trained end-to-end with neural networks, combining compression, quantization, and perceptual quality optimization into one seamless system. It is designed to work in real time and adapt to different compression needs. SoundStream's architecture supports variable bitrates, and its real-time efficiency means it can encode and decode on the fly, perfect for streaming.
Chapter 3
Exploratory experiments
3.1 Introduction
The idea of this set of experiments is to examine different representations of synthesized signals in which some modulation occurs, and to visualize whether there is some correlation between the representation's vectors and the modulation. The representations considered are the STFT spectrum, MFCCs and CQT on the traditional side, and the DAC and Music2Latent latent spaces on the machine learning side. These representations have been chosen based on widespread adoption, novelty and performance. In particular, STFT is the most used transform for MIR-related processing; MFCC is also widely used, and the data compression it performs packs the information into very small vectors. CQT is chosen because of its psychoacoustical and musical relevance. The two neural encoders are very different: DAC is made for data compression and works on raw audio with very high time resolution, while music2latent is instead trained on complex spectrograms. Those two were chosen as examples of well-performing, recent, but also very different neural representations.
3.2 Methodology
The methodology involves visually inspecting the representations to determine if there is an apparent correlation between the modulation signal and movements in
[...]
sented with respect to distortion and filtering, but since it is still a frequency-related phenomenon it seems quite trackable by frequency domain representations.
Figure 8: Impact of fm amount on CQT magnitude
Figure 9: Impact of fm amount on music2latent similarity
Figure 10: CQT vector magnitudes
3.3.5 Delay and reverb
Delay and reverb are maybe the least interesting aspect, since there are already many ways to remove reverb, also in real time, and the difference in energy between the reverb and the direct signal clearly correlates with the amount of reverb. Nevertheless, it is interesting to see how those kinds of processing appear in the latent space and in the other representations.

Figure 11: music2latent magnitudes
Figure 12: DAC magnitudes
Chapter 4
Study 1: Single Modulations
4.1 Introduction
This first study focuses on very simple cases, in order to give an idea of what the various representations can achieve: if a metric scores low on these tests, it is unlikely to be useful on more complex tasks. The idea is straightforward. A basic synthesized sound is generated in which one single parameter is modulated with a smooth curve. Different representations (series of vectors) are then calculated, the metrics (time series) are extracted from them, and the correlation between each metric and the applied modulation is computed. If the correlation is high (close to 1.0), the metric is able to track the modulation.
Jupyter notebook available at: sonic-trajectories/blob/main/study-1.ipynb
4.2 Setup
This section describes the setup of the experiment, focusing on the dataset of synthesized sounds and then listing the representations and metrics used. All the processing applied is explained in detail in order to ensure replicability.
4.2.1 Dataset
The dataset is fully synthesized with minisynth, a tiny library developed for the occasion.
The library is available at: pypi.org/project/minisynth/
MiniSynthSubtractive
The synthesizer used is a single-oscillator subtractive synthesizer (non-resonant lowpass, 12 dB/oct) with 4 parameters:
• Base frequency (exponential mapping [50 Hz − 1000 Hz]): the base frequency of the oscillator
• Amplitude (exponential mapping [0.0 − 1.0]): the amplitude of the signal
• Filter cutoff (exponential mapping [200 Hz − 5000 Hz]): cutoff frequency of the lowpass filter
• Wave mix (linear mapping [0.0 − 1.0]): the mix between square and sawtooth waves
This architecture has been chosen because even in its very minimal design it is able to produce a good variety of sonic trajectories, directly associated with the parameters:
• Absolute pitch: the base frequency knob controls the absolute vertical placement in the spectrum, in a perceptually relevant range and mapping.
• Amplitude: this directly controls the amplitude of the signal in a perceptually relevant mapping.
• Absolute spectrum shape: the filter cutoff controls the absolute shape of the spectrum, mainly in the part that contains harmonics.
• Relative spectrum shape: the wave mix changes the balance between odd and even harmonics, changing the relative shape of the spectrum with respect to the fundamental.
Additionally, the synthesizer is implemented so that any parameter can be modulated at sample rate speed.
The sounds
The dataset consists of 1620 sounds, each 2 s long, a little less than 1 h in total. For every sound one single parameter is positively modulated with a single cycle of a cosine wave at 5 different levels of modulation amount: (0.1, 0.2, 0.3, 0.4, 0.5) of its respective range. For example, the amplitude of a sound with an amplitude base level of 0.25 and a 0.2 modulation amount will start at 0.45, decrease to 0.25 and rise back to 0.45 in parameter range terms; in actual amplitude those values are exponentially mapped (0.25 → 0.1, 0.45 → 0.23).
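The worked example above can be reproduced numerically. The sketch below builds one cosine cycle added on top of the base level, which matches the description in the text. The exponential mapping `exp_map`, with its constant `k = 2`, is purely an assumption: it was chosen because it happens to reproduce the two quoted values, and it is not taken from the minisynth source. Both function names are hypothetical.

```python
import numpy as np

def cosine_modulation(base, amount, n=1000):
    """One cosine cycle added on top of `base`, peaking at `base + amount`."""
    phase = np.linspace(0.0, 2.0 * np.pi, n)
    return base + amount * 0.5 * (1.0 + np.cos(phase))

def exp_map(p, k=2.0):
    """ASSUMED exponential mapping of a normalized value p in [0, 1] to [0, 1].

    Not taken from minisynth; k=2 is picked only because it matches the
    amplitude values quoted in the text.
    """
    return (np.exp(k * p) - 1.0) / (np.exp(k) - 1.0)

# Amplitude base level 0.25 with modulation amount 0.2, in parameter range terms.
curve = cosine_modulation(base=0.25, amount=0.2)
# Actual amplitude under the assumed mapping.
amplitude = exp_map(curve)
```

In parameter terms the curve starts at 0.45, dips to about 0.25 halfway through, and returns to 0.45. Under the assumed mapping the corresponding amplitudes are close to 0.23 and 0.10.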
The base values of the non-modulated parameters can be one of three levels, 0.25, 0.5 or 0.75 of the parameter range, while the base of the modulated parameter can be one of 0.0, 0.25 or 0.50, to account for the additive modulation.
The complete dataset consists of all the possible combinations of the aforementioned settings: 3 levels for each one of the four parameters, 4 possible modulation targets, 5 possible modulation amounts:
$$3 \times 3 \times 3 \times 3 \times 4 \times 5 = 1620$$
In this way, even with just a minimal set of points per parameter, it is easy to understand the interactions and the dependencies between the various parameters. It should be mentioned, though, that some representations (DAC in our case) work on raw audio, and such representations have a periodicity based on the ratio between the fundamental frequency of the sound and the window size. They therefore behave differently at different frequencies, and the results of the experiment could vary due to this phenomenon. Since those types of representations are quite widespread for compression tasks, further investigation of these issues could have interesting and potentially useful results.
The direct consequence of a dataset that contains all the possible combinations of a small set of possibilities is that the dataset is not actually stored anywhere and can be generated directly, possibly with just an iterator: it is thus just a tiny algorithm with ranges and mappings. That said, in the actual study intermediate steps have been saved for validation and debugging purposes.
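The "dataset as an iterator" idea can be sketched with the standard library alone. This is a hypothetical reconstruction of the enumeration described above, not the code from the study notebook. The parameter names and the dictionary layout are assumptions. Only the level values, amounts and counts come from the text.

```python
from itertools import product

# Hypothetical parameter names; the levels and amounts are those given in the text.
PARAMS = ("frequency", "amplitude", "cutoff", "wave_mix")
STATIC_LEVELS = (0.25, 0.5, 0.75)      # base levels of non-modulated parameters
MODULATED_LEVELS = (0.0, 0.25, 0.5)    # base levels of the modulated parameter
AMOUNTS = (0.1, 0.2, 0.3, 0.4, 0.5)    # modulation amounts

def iter_dataset():
    """Yield one settings dict per sound: target, amount and four base levels."""
    level_choices = product(range(3), repeat=len(PARAMS))
    for target, amount, level_idx in product(PARAMS, AMOUNTS, level_choices):
        base = {
            name: (MODULATED_LEVELS if name == target else STATIC_LEVELS)[i]
            for name, i in zip(PARAMS, level_idx)
        }
        yield {"target": target, "amount": amount, "base": base}

dataset = list(iter_dataset())
```

The iterator yields 4 × 5 × 3⁴ = 1620 settings, and for every entry the modulated parameter's base plus its amount stays within the normalized range [0, 1], which is why the modulated base uses the lower set of levels.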
4.2.2 Representations
The representations used are:
• FFT spectrogram: w = 4096, h = 512, Hann window
• MFCCs: 20 coefficients, w = 2048, h = 512, Hann window
• CQT: in dB, h = 512
• DAC: the provided compressed latents (72 dimensions) are used
• music2latent: 4 encoders shifted by 1024 to achieve 4x oversampling
4.2.3 Metrics and measurements
On every representation the following metrics are calculated:
Magnitude
Frobenius norm of vectors as a time series:
$$M_t = \lVert x_t \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_{t,i})^2}$$
Cosine similarity
Cosine similarity between subsequent vectors:
$$S_t = \frac{a_t \cdot b_t}{\lVert a_t \rVert_2 \lVert b_t \rVert_2 + \varepsilon} = \frac{\sum_{i=1}^{n} a_{t,i}\, b_{t,i}}{\sqrt{\sum_{i=1}^{n} a_{t,i}^2}\, \sqrt{\sum_{i=1}^{n} b_{t,i}^2} + \varepsilon}$$
Distances
Distances between subsequent vectors:
$$D_t = \lVert x_{t+1} - x_t \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_{t+1,i} - x_{t,i})^2}$$
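The three metrics above translate almost directly into NumPy. The sketch below assumes a representation stored as an array of shape (n_frames, n_dims), and takes the two vectors of the cosine similarity to be frames t and t+1, which is one reading of "subsequent vectors". The function names and the default value of ε are assumptions.

```python
import numpy as np

def magnitude(X):
    """L2 norm of each frame; X has shape (n_frames, n_dims)."""
    return np.linalg.norm(X, axis=1)

def cosine_similarity(X, eps=1e-8):
    """Cosine similarity between each frame and the next one."""
    a, b = X[:-1], X[1:]
    num = np.sum(a * b, axis=1)
    # eps in the denominator avoids division by zero on silent frames.
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def distance(X):
    """Euclidean distance between each frame and the next one."""
    return np.linalg.norm(X[1:] - X[:-1], axis=1)

# Three toy 2-dimensional frames.
X = np.array([[3.0, 4.0], [0.0, 5.0], [0.0, 0.0]])
M, S, D = magnitude(X), cosine_similarity(X), distance(X)
```

Note that the magnitude has one value per frame while the other two have one value per pair of consecutive frames, so they are one element shorter. The later stretching step brings all of them to the same length.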
Processing
Every metric is smoothed with a flat 0.25 s window, which preserves the sub-audio-rate changes and removes most of the noise, then normalized to the range [0, 1] and stretched to fit the modulator length (l = 1000):
1. Smoothing with a flat window of length w (corresponding to 0.25 s):
$$\tilde{m}[t] = \frac{1}{w} \sum_{k=-(w-1)/2}^{(w-1)/2} m[t+k]$$
2. Normalization to the range [0, 1]:
$$\hat{m}[t] = \frac{\tilde{m}[t] - \min(\tilde{m})}{\max(\tilde{m}) - \min(\tilde{m})}$$
3. Stretching to fit modulator length l = 1000:
$$m_{\text{resampled}}[n] = \hat{m}\left[\frac{n}{l} \cdot N\right], \quad n = 0, 1, \ldots, l-1$$
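The three processing steps can be sketched as a single function. Several details are not specified in the text and are assumptions here: the frame rate passed in, the edge padding used at the borders of the smoothing window, rounding the window length up to an odd number, the guard for a constant input, and flooring the fractional index in the stretching step. The function name is hypothetical.

```python
import numpy as np

def postprocess(m, frame_rate, smooth_s=0.25, length=1000):
    """Smooth, normalize to [0, 1] and stretch a metric time series."""
    m = np.asarray(m, dtype=float)
    # 1. Flat (moving-average) window of about `smooth_s` seconds, forced odd.
    w = max(1, int(round(smooth_s * frame_rate)))
    w += 1 - w % 2
    padded = np.pad(m, w // 2, mode="edge")  # assumed border handling
    smoothed = np.convolve(padded, np.ones(w) / w, mode="valid")
    # 2. Min-max normalization (a constant input maps to zeros).
    span = smoothed.max() - smoothed.min()
    if span > 0:
        normalized = (smoothed - smoothed.min()) / span
    else:
        normalized = np.zeros_like(smoothed)
    # 3. Index-based stretching to the modulator length.
    idx = np.floor(np.arange(length) * len(normalized) / length).astype(int)
    return normalized[idx]

# Toy metric: a ramp of 172 frames at a hypothetical 86 frames per second.
out = postprocess(np.linspace(0.0, 1.0, 172), frame_rate=86.0)
```

The output always has exactly `length` samples, so it can be compared element by element with a modulator of the same length.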
Correlation
Correlation, in the form of the Pearson correlation coefficient, is the best indicator of the accuracy of the extraction. Perfect correlation (ρ = 1) means that the output of the algorithm is linearly related to the modulator. It is defined as:
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
Polarity can be flipped, depending on the representation and on the specific feature. Since we are mainly interested in matching the movement, it makes sense to consider:
$$\lvert \operatorname{corr}(a(x), m) \rvert$$
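The absolute Pearson correlation used as the score can be written in a few lines. This is a minimal sketch, assuming both series already have the same length. Returning 0 for a constant series is an added convention, not something stated in the text, and the function name is hypothetical.

```python
import numpy as np

def abs_pearson(metric, modulator):
    """Absolute Pearson correlation between two equal-length series."""
    x = np.asarray(metric, dtype=float)
    y = np.asarray(modulator, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    # A constant series has no movement to match: score it as 0 (assumed convention).
    if denom == 0:
        return 0.0
    return abs(float(np.sum(xc * yc) / denom))

# One cosine cycle as the modulator, and a metric that follows it with flipped polarity.
modulator = 0.5 * (1.0 + np.cos(np.linspace(0.0, 2.0 * np.pi, 1000)))
score_flipped = abs_pearson(1.0 - modulator, modulator)
score_flat = abs_pearson(np.ones(1000), modulator)
```

A metric that mirrors the modulator upside down still scores close to 1, which is the point of taking the absolute value.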
4.2.4 Procedure
The procedure of the experiment is the following:
1. Generate all the possible combinations of parameters (stored in a pandas dataframe).
2. Synthesize the sounds from the parameters and calculate each representation. For each representation an h5 file is created and the indexed representations are stored. This is done because the encoding step, especially for neural representations, is the most computationally intensive.
3. For each entry in each representation (5×1620) the three metrics are calculated and smoothed, and the correlation with the cosine is computed. The result of this process is a dataframe with 15 correlation columns, one for every metric-representation pair.
4. Boxplots are generated, filtering the dataframe based on which parameter is modulated and grouping by the other variables.
4.3 Results
In order to understand the results several boxplots have been produced; the full set can be examined in the notebook (sonic-trajectories/blob/main/study-1.ipynb). In every boxplot only the data for a single representation, metric and modulation target is considered: for example, a boxplot could contain the correlations of the magnitude of MFCC for amplitude modulation. In each boxplot the correlation sits on the y axis, while the x axis holds the variable whose influence is being tested. In this way we can have a clear idea of how sensitive the representation is: for example, we can understand how much modulation is needed to be sensed, and whether the sensitivity changes across frequencies, amplitudes, etc.
4.3.1 Example plot
The following set of plots is what is used in this study to understand which metric is the best to track filter cutoff modulations and how other variables influence the tracking. Every plot is now described in detail, to explain how to read these and the ones contained in the appendix.
In all five figures the correlation for filter cutoff is plotted on the y axis. This means that we are considering only samples in which filter cutoff modulation is happening; no other modulation is applied on this subset of samples. These samples, like those for each of the other modulation targets, represent a fourth of the dataset (405 sounds) and cover all the combinations of variables previously explained while having filter cutoff modulation.
Figure 13: music2latent filter cutoff modulation against modulation amount
The plot in Figure 13 has the modulation amount on the x axis, and we can see its impact on the tracking. It is clear that the amount of modulation influences the tracking done via magnitude, but the one done with cosine similarity is not really affected. Of all the metrics, and this will be confirmed by the other plots, the distances seem the best to track filter cutoff modulation with music2latent representations.
Figure 14: music2latent filter cutoff modulation against amplitude
In the second plot (Figure 14) the amplitude is plotted on the x axis; the levels correspond to 0.25, 0.50, 0.75 of the range, exponentially mapped. Magnitude is highly sensitive to amplitude: the average correlation rises from 0.5 to 0.8. The other metrics are not particularly influenced by the variation in amplitude, and are thus more stable. Similarly to the previous plot, distances is the best performing metric.
Figure 15: music2latent filter cutoff modulation against frequency
For the third plot (Figure 15) the base frequency is plotted on the x axis; the levels correspond to 0.25, 0.50, 0.75 of the range, exponentially mapped. This is the most chaotic plot of the five: for magnitude, correlation is at its highest at low frequencies,
Chapter 5. Study 2: Double Modulations
5.2.2 Representations
The representations used are:
• FFT spectrogram: w = 4096, h = 512, Hann window
• MFCCs: 20 coefficients, w = 2048, h = 512, Hann window
• CQT: in dB, h = 512
• DAC: the provided compressed latents (72 dimensions) are used
• music2latent: 4 encoders shifted by 1024 to achieve 4x oversampling
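5.2.3 Metrics and measurements
On every representation the following metrics are calculated:
Magnitude
Frobenius norm of vectors as a time series:
$$M_t = \lVert x_t \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_{t,i})^2}$$
Cosine similarity
Cosine similarity between subsequent vectors:
$$S_t = \frac{a_t \cdot b_t}{\lVert a_t \rVert_2 \lVert b_t \rVert_2 + \varepsilon} = \frac{\sum_{i=1}^{n} a_{t,i}\, b_{t,i}}{\sqrt{\sum_{i=1}^{n} a_{t,i}^2}\, \sqrt{\sum_{i=1}^{n} b_{t,i}^2} + \varepsilon}$$
Distances
Distances between subsequent vectors:
$$D_t = \lVert x_{t+1} - x_t \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_{t+1,i} - x_{t,i})^2}$$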
P ocessing
E e y me ic is smoo hed wi h a la 0.25swindow, his ensu es ha we’ e keeping
he sub-audio a e changes and emo e mos o he noise, no malized in he ange
[0,1] and s e ched o i he modula o leng h (l= 1000)
1. Smoo hing wi h a la window o leng h w(co esponding o 0.25 s):
˜m[ ] = 1
w
(w−1)/2
X
k=−(w−1)/2
m[ +k]
2. No maliza ion o he ange [0,1]:
ˆm[ ] = ˜m[ ]−min( ˜m)
max( ˜m)−min( ˜m)
3. S e ching o i modula o leng h l= 1000:
m esampled[n] = ˆm"n
l·N#, n = 0,1, . . . , l −1
Correlation
Correlation, in the form of a Pearson correlation coefficient, is the best indicator of the accuracy of the extraction. Perfect correlation (ρ = 1) means that the output of the algorithm is linearly related to the modulator. It is defined as:
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
Polarity can be flipped, depending on the representation and on the specific feature. Since we are interested mainly in matching the movement, it makes sense to consider:
$$|\operatorname{corr}(a(x), m)|$$
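As a small illustration of why the absolute value is taken, the sketch below computes the Pearson coefficient directly from its definition and checks that a polarity-flipped, rescaled copy of a cosine modulator still scores 1. The function name and the choice to return 0 for a constant signal (where the coefficient is undefined) are assumptions of this example.

```python
import numpy as np


def abs_pearson(x, y) -> float:
    """|cov(x, y) / (sigma_x * sigma_y)| for two equal-length 1-D signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0.0:
        return 0.0  # assumption: treat a constant signal as uncorrelated
    return abs(float(np.sum(xc * yc) / denom))


t = np.linspace(0.0, 1.0, 1000, endpoint=False)
modulator = np.cos(2 * np.pi * t)
extracted = -0.5 * modulator + 3.0  # inverted polarity, different scale and offset

print(abs_pearson(extracted, modulator))
```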
5.2.4 Procedure
The procedure of the experiment is the following:
1. Generate all the possible combinations of parameters (stored in a pandas dataframe).
2. Synthesize the sounds from the parameters and calculate each representation. For each representation an h5 file is created and the indexed representations are stored. This is done because the encoding step, especially for neural representations, is the most computationally intensive.
3. For each entry in each representation (5×4374) the three metrics are calculated and smoothed, and the correlation with the cosine is computed. The result of this process is a dataframe with 60 correlation columns, one for every parameter-metric-representation group.
4. Boxplots are generated by filtering the dataframe on the modulated parameter and grouping by the modulation balance between the first and second modulation and by the modulation offset, i.e., the phase of modulation 2. Tables are also necessary to evaluate the influence of the other modulations on the one being tracked; for this task boxplots would be too many and too difficult to navigate, so only the average is used.
5.3 Resul s
In order to understand the results, boxplots have been produced. Each boxplot is built with the data points of a single representation, a single metric and a single modulation target. It is important to remember that two parameters are being modulated in each sample, and the boxplots consider a data point valid if one of the two is the one selected. Thus the boxplots are not useful for understanding how modulations interact based on the combination of parameters; for this purpose tables are presented in the next section. Boxplots are used to understand how the balance between the two modulations and the phase of the second modulation influence the tracking of the first. The boxplots have correlation on the y axis and either balance or offset on the x axis. All the plots can be examined in the notebook (sonic-trajectories/blob/main/study-2.ipynb).
5.3.1 Example boxplots
The following examples are two of the boxplots of filter cutoff tracking in music2latent. The entries used are the ones that contain filter cutoff modulation, which are half of the dataset (2187), as is the case for any specific parameter modulation. These plots do not take into account which other parameter is modulated; all cases are aggregated. Each plot is explained in detail in the following paragraphs.
Figure 20: music2latent filter cutoff modulation against second modulation offset
In Figure 20 the x axis is the offset of the second modulation, which means that we always consider the filter cutoff modulation as having phase 0 and the modulation of the other parameter as having phase 0, π/2 or π. It can be seen that the correlation does not vary a lot between the different metrics; to choose the best one, the results of study 1 should be considered. It is clear that when the second modulation is in phase or shifted by π (inverted polarity) the tracking performs better, getting close to the results of study 1. When the second modulation is shifted by π/2 the performance degrades drastically, meaning that if the modulations are not synchronous the disturbance is very high and filter cutoff becomes very hard to track with those metrics in music2latent.
Figure 21: music2latent filter cutoff modulation against modulation amount
In Figure 21 the x axis is the balance between the filter cutoff modulation and the modulation on the other parameter, aggregated. As for the first plot, in order to choose the best metric the results of study 1 should be considered. In this case the plot shows that the difference between the various degrees of balance is not really relevant, and the cutoff tracking is not really influenced by the balance between the modulation amounts.
5.3.2 Tables
In this section, we present tables for each parameter and their combinations. These tables highlight the most important results of the experiment, as they show how different parameters interact and interfere with each other's tracking. Each table includes only half of the samples, those containing one chosen parameter modulation. In the tables, each column shows the average correlation of the primary modulation when a secondary modulation (listed in that column) is also present.
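The aggregation behind such a table can be sketched in plain Python. The record layout, the function name and the numbers below are all hypothetical and only illustrate the grouping; the "all" entry is assumed here to be the mean over every sample that contains the primary modulation.

```python
from collections import defaultdict
from statistics import mean


def average_by_secondary(samples, primary):
    """Average correlation of `primary`, grouped by the other modulated parameter.

    Each sample is a dict with:
      "modulated": the two modulated parameters, e.g. ("freq", "amp")
      "corr": correlation obtained when tracking each of those parameters
    """
    groups = defaultdict(list)
    for sample in samples:
        if primary not in sample["modulated"]:
            continue  # keep only the samples that contain the primary modulation
        (secondary,) = [p for p in sample["modulated"] if p != primary]
        groups[secondary].append(sample["corr"][primary])
    table = {secondary: mean(values) for secondary, values in groups.items()}
    table["all"] = mean(v for values in groups.values() for v in values)
    return table


# Made-up values, for illustration only.
samples = [
    {"modulated": ("freq", "amp"), "corr": {"freq": 0.6, "amp": 0.8}},
    {"modulated": ("freq", "amp"), "corr": {"freq": 0.7, "amp": 0.9}},
    {"modulated": ("freq", "cutoff"), "corr": {"freq": 0.5, "cutoff": 0.4}},
    {"modulated": ("amp", "shape"), "corr": {"amp": 0.9, "shape": 0.3}},
]
freq_row = average_by_secondary(samples, "freq")
print(freq_row)
```

One such row would be computed per representation and metric to fill a table.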
For the frequency modulation tracking results (Table 2), the table shows that CQT is the best performing representation, which is aligned with the results of study 1. The best metrics are cosine similarity and distances, and the most disturbing modulation is oscillator waveshape. Modulating the waveshape means changing the harmonic content, so the content in the higher dimensions of the representation varies a lot, which explains the disturbance.

| Representation | Metric | freq amp | freq cutoff | freq shape | all freq |
|---|---|---|---|---|---|
| spectrum | magnitude | 0.63 | 0.57 | 0.59 | 0.59 |
| spectrum | distances | 0.64 | 0.60 | 0.61 | 0.61 |
| spectrum | similarity | 0.63 | 0.63 | 0.55 | 0.60 |
| dac | magnitude | 0.51 | 0.52 | 0.49 | 0.51 |
| dac | distances | 0.51 | 0.45 | 0.42 | 0.46 |
| dac | similarity | 0.54 | 0.49 | 0.43 | 0.49 |
| music2latent | magnitude | 0.55 | 0.56 | 0.55 | 0.56 |
| music2latent | distances | 0.63 | 0.57 | 0.57 | 0.59 |
| music2latent | similarity | 0.61 | 0.58 | 0.59 | 0.59 |
| cqt | magnitude | 0.64 | 0.57 | 0.53 | 0.58 |
| cqt | distances | 0.69 | 0.65 | 0.63 | 0.65 |
| cqt | similarity | 0.69 | 0.64 | 0.61 | 0.65 |
| mfcc | magnitude | 0.63 | 0.61 | 0.52 | 0.59 |
| mfcc | distances | 0.60 | 0.60 | 0.60 | 0.60 |
| mfcc | similarity | 0.62 | 0.60 | 0.59 | 0.61 |

Table 2: Frequency averages
The amplitude modulation table (Table 3) shows that CQT magnitude is the best metric to track amplitude, which is coherent with study 1. The most influential secondary modulation is frequency; this happens because of how the energy is spread across the bins and how the norm is calculated. Amplitude is overall the easiest modulation to track, and it is quite understandable that calculating the total energy per frame, which is what the norm does, works well for this no matter what the secondary modulation happens to be.
As in study 1, MFCC magnitude is the best at tracking filter cutoff frequency. This is not surprising, since MFCCs describe the overall shape of the spectrum, which is what we are changing when modulating the filter's cutoff frequency. The most impactful secondary modulation is amplitude, which is not surprising either, since it greatly impacts the first coefficients, which are often the ones that contain the information related to a soft filter like the one we are using.

| Representation | Metric | freq amp | amp cutoff | amp shape | all amp |
|---|---|---|---|---|---|
| spectrum | magnitude | 0.76 | 0.67 | 0.72 | 0.72 |
| spectrum | distances | 0.68 | 0.66 | 0.70 | 0.68 |
| spectrum | similarity | 0.72 | 0.78 | 0.80 | 0.77 |
| dac | magnitude | 0.49 | 0.46 | 0.50 | 0.48 |
| dac | distances | 0.58 | 0.73 | 0.73 | 0.68 |
| dac | similarity | 0.59 | 0.76 | 0.71 | 0.69 |
| music2latent | magnitude | 0.56 | 0.62 | 0.66 | 0.61 |
| music2latent | distances | 0.75 | 0.83 | 0.83 | 0.80 |
| music2latent | similarity | 0.75 | 0.85 | 0.83 | 0.81 |
| cqt | magnitude | 0.83 | 0.89 | 0.89 | 0.87 |
| cqt | distances | 0.59 | 0.79 | 0.75 | 0.71 |
| cqt | similarity | 0.65 | 0.76 | 0.74 | 0.71 |
| mfcc | magnitude | 0.85 | 0.80 | 0.86 | 0.84 |
| mfcc | distances | 0.45 | 0.54 | 0.53 | 0.51 |
| mfcc | similarity | 0.45 | 0.62 | 0.59 | 0.55 |

Table 3: Amplitude averages
Shape modulation (Table 5) is the one that shows the biggest difference from study 1. CQT distances, which was the best metric in study 1, performs significantly worse when the other modulations are in place, and the best performing metric is now music2latent distances, which is significantly more resistant. The most disturbing modulation is frequency, which makes sense since what is to be tracked in music2latent is the relative structure of spectral peaks, and frequency moves them around.
Summary table
Table 6 summarizes the results of study 2. The absolute best metric on average is MFCC magnitude, but that is mostly due to its performance at tracking filter cutoff. Better all-rounders are CQT magnitude and music2latent distances.
5.4 Discussion
This study shows a drastic decrease in performance at tracking specific modulations with those metrics compared to study 1. The interaction between the various modulations interferes a lot with the tracking of most single parameters, with the main exceptions being amplitude modulation across most representations and MFCCs for tracking filter cutoff.

| Representation | Metric | freq cutoff | amp cutoff | cutoff shape | all cutoff |
|---|---|---|---|---|---|
| spectrum | magnitude | 0.67 | 0.58 | 0.69 | 0.64 |
| spectrum | distances | 0.61 | 0.54 | 0.67 | 0.61 |
| spectrum | similarity | 0.71 | 0.65 | 0.78 | 0.71 |
| dac | magnitude | 0.45 | 0.38 | 0.55 | 0.46 |
| dac | distances | 0.47 | 0.58 | 0.63 | 0.56 |
| dac | similarity | 0.49 | 0.65 | 0.59 | 0.57 |
| music2latent | magnitude | 0.53 | 0.65 | 0.59 | 0.59 |
| music2latent | distances | 0.56 | 0.73 | 0.72 | 0.67 |
| music2latent | similarity | 0.59 | 0.71 | 0.69 | 0.66 |
| cqt | magnitude | 0.64 | 0.68 | 0.81 | 0.71 |
| cqt | distances | 0.51 | 0.58 | 0.69 | 0.59 |
| cqt | similarity | 0.53 | 0.54 | 0.70 | 0.59 |
| mfcc | magnitude | 0.82 | 0.74 | 0.85 | 0.80 |
| mfcc | distances | 0.43 | 0.39 | 0.59 | 0.47 |
| mfcc | similarity | 0.43 | 0.44 | 0.59 | 0.49 |

Table 4: Filter cutoff averages
Traditional representations are still a very valid way to track modulations, especially CQT, but this study reveals that metrics are considerably harder to pick in a more realistic scenario. Where distances was performing as a good all-rounder, now both magnitude and similarity are needed. This has the upside that different metrics correlate with different things, but it also means that there is no single metric that conveys the overall groove.
Neural representations, music2latent in particular, are much more resistant, and their performance is not impacted as much as that of traditional representations. This could be a very interesting feature of this type of representation, but testing on a more varied dataset is needed to confirm its usefulness.
| Representation | Metric | freq shape | amp shape | cutoff shape | all shape |
|---|---|---|---|---|---|
| spectrum | magnitude | 0.52 | 0.52 | 0.49 | 0.51 |
| spectrum | distances | 0.49 | 0.53 | 0.50 | 0.51 |
| spectrum | similarity | 0.52 | 0.62 | 0.61 | 0.59 |
| dac | magnitude | 0.46 | 0.43 | 0.48 | 0.45 |
| dac | distances | 0.37 | 0.57 | 0.55 | 0.50 |
| dac | similarity | 0.39 | 0.60 | 0.57 | 0.52 |
| music2latent | magnitude | 0.56 | 0.59 | 0.60 | 0.58 |
| music2latent | distances | 0.59 | 0.69 | 0.68 | 0.66 |
| music2latent | similarity | 0.54 | 0.68 | 0.64 | 0.62 |
| cqt | magnitude | 0.56 | 0.65 | 0.63 | 0.61 |
| cqt | distances | 0.51 | 0.59 | 0.56 | 0.55 |
| cqt | similarity | 0.50 | 0.53 | 0.51 | 0.51 |
| mfcc | magnitude | 0.57 | 0.65 | 0.64 | 0.62 |
| mfcc | distances | 0.42 | 0.41 | 0.45 | 0.43 |
| mfcc | similarity | 0.41 | 0.44 | 0.44 | 0.43 |

Table 5: Wave shape averages
| Representation | Metric | frequency | amp | cutoff | shape | overall |
|---|---|---|---|---|---|---|
| cqt | distances | 0.65 | 0.71 | 0.59 | 0.55 | 0.63 |
| cqt | magnitude | 0.58 | 0.87 | 0.71 | 0.61 | 0.69 |
| cqt | similarity | 0.65 | 0.71 | 0.59 | 0.51 | 0.62 |
| dac | distances | 0.46 | 0.68 | 0.56 | 0.50 | 0.55 |
| dac | magnitude | 0.51 | 0.48 | 0.46 | 0.45 | 0.48 |
| dac | similarity | 0.49 | 0.69 | 0.57 | 0.52 | 0.57 |
| mfcc | distances | 0.60 | 0.51 | 0.47 | 0.43 | 0.50 |
| mfcc | magnitude | 0.59 | 0.84 | 0.80 | 0.62 | 0.71 |
| mfcc | similarity | 0.61 | 0.55 | 0.49 | 0.43 | 0.52 |
| music2latent | distances | 0.59 | 0.80 | 0.67 | 0.66 | 0.68 |
| music2latent | magnitude | 0.56 | 0.61 | 0.59 | 0.58 | 0.58 |
| music2latent | similarity | 0.59 | 0.81 | 0.66 | 0.62 | 0.67 |
| spectrum | distances | 0.61 | 0.68 | 0.61 | 0.51 | 0.60 |
| spectrum | magnitude | 0.59 | 0.72 | 0.64 | 0.51 | 0.62 |
| spectrum | similarity | 0.60 | 0.77 | 0.71 | 0.59 | 0.67 |

Table 6: Study 2 overall summary, best are in bold
Chapter 6
Conclusions and discussion
6.1 Overview
This work explored sonic trajectories as a way to represent continuous changes in sound, moving beyond onset-based analysis to capture variations in amplitude, pitch, timbre, and spectral shape. A trajectory extraction framework was proposed, combining traditional signal-processing representations and neural latent spaces with simple but robust metrics. Exploratory experiments established the feasibility of the approach, and two systematic studies evaluated the capacity of different representations and metrics to follow both isolated and interacting modulations. Taken together, the results highlight both the promise of trajectory-based methods for music information retrieval and sound analysis, and the challenges that remain when dealing with overlapping or multidimensional changes.
6.2 Representations
The results across both studies show that representation choice has a decisive influence on trajectory extraction. Traditional representations remain highly competitive: Constant-Q Transform proved reliable in tracking both frequency- and amplitude-related changes, while MFCC magnitudes excelled at following filter cutoff modulations. These methods, despite their age, continue to provide robust base-
45
52 BIBLIOGRAPHY
o he 18 h In e na ional Audio Mos ly Con e ence, 136–142 (ACM, Edinbu gh
Uni ed Kingdom, 2023). URL h ps://dl.acm.o g/doi/10.1145/3616195.
3616206.
[8] SMALLEY, D. Spec omo phology: explaining sound-shapes. O ganised Sound
2, 107–126 (1997).
[9] Schö khube , C. Cons an -q ans o m oolbox o music p ocessing (2010).
URL h ps://api.seman icschola .o g/Co pusID:12358579.
[10] Mikolo , T., Chen, K., Co ado, G. & Dean, J. E icien es ima ion o wo d
ep esen a ions in ec o space (2013). URL h ps://a xi .o g/abs/1301.
3781.1301.3781.
[11] Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. Clap: Lea ning audio
concep s om na u al language supe ision (2022). URL h ps://a xi .o g/
abs/2206.04769.2206.04769.
[12] Caillon, A. & Esling, P. Ra e: A a ia ional au oencode o as and high-
quali y neu al audio syn hesis (2021). URL h ps://a xi .o g/abs/2111.
05011.2111.05011.
[13] Zeghidou , N., Luebs, A., Om an, A., Skoglund, J. & Tagliasacchi, M. Sound-
s eam: An end- o-end neu al audio codec. CoRR abs/2107.03312 (2021).
URL h ps://a xi .o g/abs/2107.03312.2107.03312.

Appendix A
Linear Analysis of Modulation Representation
Jupyter notebook available at: sonic-trajectories/blob/main/study-3.ipynb
This appendix details a third study, an extension of the metric-based methods in Studies 1 and 2. The main goal is to find out whether control parameter modulations can be linearly reconstructed from the vectors of an audio representation.
This analysis asks two questions:
1. For a single audio sample, can a linear combination of a representation's dimensions reconstruct the ground-truth modulation signal? How good is this reconstruction?
2. If these linear combinations exist, do they work for other samples? Is a filter cutoff sweep, for example, always encoded in the same dimensions for different sounds?
A.1 Setup
A.1.1 Dataset
The dataset for this study is significantly different from the previous ones. It consists of 1000 10 s (more than 3 h) minisynth synthesized samples with random sub-audio modulations. Samples have 1, 2, 3 or 4 concurrent modulations, all different from each other. This ensures the maximum variety possible for this minisynth class, which means that if the method works on these sounds it should work on any sound the minisynth can produce. The dataset is available at huggingface.co/datasets/inspektral/minisynth1k-sub-points-v1 and can be rendered via minisynth (see notebook).
A.1.2 Methodology
The method is a two-stage analysis using linear regression. The assumption is that the modulation signal $M$ (a time series) is a linear combination of the dimensions of the audio representation $R$ (a matrix of time frames × dimensions). The relationship is:
$$M \approx R \cdot w$$
Where:
• $M$ is the ground-truth modulation vector (length $T$).
• $R$ is the representation matrix ($T \times D$).
• $w$ is a weight vector (length $D$).
The optimal weight vector $w$ is found using ordinary least squares (OLS), which minimizes the squared difference between the real modulation $M$ and the predicted one $R \cdot w$.
A.1.3 Stage 1: Per-Sample Model Fitting
In the first stage, the process is applied to each audio sample. For each sample $i$, with representation $R_i$ and modulation $M_i$, a weight vector $w_i$ is computed.
The model performance is measured with the Pearson correlation between the ground-truth modulation $M_i$ and the reconstructed modulation $R_i \cdot w_i$. A high correlation (close to 1.0) means the modulation is linear for that sample's representation.
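The per-sample fit can be sketched with NumPy's least-squares solver. This is an illustration on synthetic data, not the notebook's code: the representation is random, the modulation is built as an exact linear combination of its dimensions, and no intercept term is added, following the form M ≈ R · w. Function names are hypothetical.

```python
import numpy as np


def fit_weights(R: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Ordinary least squares: the w that minimizes ||M - R @ w||^2."""
    w, *_ = np.linalg.lstsq(R, M, rcond=None)
    return w


def reconstruction_correlation(R: np.ndarray, M: np.ndarray, w: np.ndarray) -> float:
    """Pearson correlation between the modulation and its reconstruction."""
    return float(np.corrcoef(M, R @ w)[0, 1])


# Synthetic sample: T = 200 frames, D = 8 dimensions.
rng = np.random.default_rng(0)
T, D = 200, 8
R = rng.normal(size=(T, D))
true_w = rng.normal(size=D)
M = R @ true_w  # modulation that is exactly linear in the representation

w = fit_weights(R, M)
score = reconstruction_correlation(R, M, w)
print(score)
```

When the number of dimensions D approaches or exceeds the number of frames T, a least-squares fit can reproduce almost any target, so a high per-sample score is easier to reach in that regime.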
A.1.4 Stage 2: Generalization with an Average Model
The second stage tests whether the model generalizes. The weight vectors ($w_i$) of all samples with the same modulated parameter (e.g., filter cutoff) are averaged to create a single vector, $w_{\text{avg}}$:
$$w_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} w_i$$
This average vector is a general model for that parameter and representation. Its performance is tested on each sample $j$. Performance is the Pearson correlation between the ground-truth $M_j$ and the reconstructed modulation $R_j \cdot w_{\text{avg}}$.
Good performance implies the modulation is encoded in the same way for all samples. Poor performance suggests the encoding is context-dependent, using different dimensions or weights for each sample.
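The averaging stage can be sketched in the same way. The data below are synthetic and only contrast two hypothetical situations: one where every sample encodes the modulation with the same weights, and one where each sample uses its own random weights. Function names and sizes are assumptions of this example.

```python
import numpy as np


def fit_weights(R, M):
    """Per-sample OLS weights (stage 1)."""
    w, *_ = np.linalg.lstsq(R, M, rcond=None)
    return w


def average_model(weights):
    """w_avg = (1 / N) * sum_i w_i."""
    return np.mean(np.stack(weights), axis=0)


def generalization_scores(reps, mods, w_avg):
    """Pearson correlation of each M_j with its reconstruction R_j @ w_avg."""
    return [float(np.corrcoef(M, R @ w_avg)[0, 1]) for R, M in zip(reps, mods)]


def run(true_weights, rng, T=200):
    """Build one synthetic sample per weight vector and evaluate the average model."""
    reps = [rng.normal(size=(T, len(w))) for w in true_weights]
    mods = [R @ w for R, w in zip(reps, true_weights)]
    w_avg = average_model([fit_weights(R, M) for R, M in zip(reps, mods)])
    return generalization_scores(reps, mods, w_avg)


rng = np.random.default_rng(0)
N, D = 20, 8
shared_w = rng.normal(size=D)

consistent = run([shared_w] * N, rng)                          # same encoding everywhere
scattered = run([rng.normal(size=D) for _ in range(N)], rng)   # per-sample encoding

print(np.mean(consistent), np.mean(scattered))
```

In the consistent case the averaged weights equal the shared ones, so every sample is reconstructed well; in the scattered case the average no longer matches any single sample, which mirrors the context-dependent behaviour described above.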
A.2 Results
The results from the two-stage analysis answer the initial questions.
A.2.1 Per-Sample Model Performance
The analysis shows that for any sample, a linear combination of dimensions can be found to reconstruct the modulation. As shown in Figure 22, the correlation scores of the per-sample models were close to 1.0 for most parameters and representations. This shows that the modulation information is present and linear within the representation's dimensions.
Figure 22: Correlation histograms for per-sample model fitting. For all four modulation types in the music2latent representation, the correlations are heavily skewed towards 1.0, indicating near-perfect linear reconstruction.
A.2.2 Generalization Model Performance
The performance of the average model depended on the modulated parameter and the representation.
• Good Generalization: Sometimes, the average model performed well, which means the encoding was consistent. For example, amplitude modulation worked well, as most representations capture energy.
• Poor Generalization: In other cases, the model's performance was much worse. The results varied, with poor results for many samples. For example, the music2latent model for filter cutoff gave varied results, while its model for oscillator shape performed poorly overall.
Figure 23: Correlation distribution for the generalized average model on music2latent. The filter cutoff model (left) shows a wide, inconsistent distribution of correlations, while the oscillator shape model (right) performs poorly overall, with most correlations below 0.4.
This shows that while the modulation is linear in each sample, the dimensions and weights change for each sample. A change in base frequency, for example, can make the representation use different dimensions to encode a filter sweep.
A.3 Discussion
These findings add to the results of Studies 1 and 2. The simple metrics of magnitude, distance, and similarity show general change, but this study shows the information is more structured.
The near-perfect per-sample result is important. It suggests that even in complex latent spaces, the representation of features has structure; it is linear on a per-sample basis.
The mixed results of the average model are also important. They answer questions about interpretability and consistency. To be interpretable, a feature like filter cutoff should be represented in the same way. The results show this is not always true; the encoding depends on context. This is a key problem in using audio representations: a feature can be represented differently for each sound.