Master thesis on Sound and Music Computing
Universitat Pompeu Fabra
Improving the Semantic Structure of
Neural Audio Codecs
Ángel Monsalve Fernández
Supervisor: Dr. Lonce Wyse
July 2025
Contents
1 Introduction
1.1 Audio Representations
1.1.1 Latent Spaces in Audio Representation Learning
1.1.2 Semantic Structure in Audio Latent Spaces
1.2 Neural Audio Codecs
1.2.1 From Traditional to Neural Audio Codecs
1.2.2 Neural Audio Codecs as Audio Representation Learners
1.2.3 The Semantic Gap
1.2.4 Emerging Semantic Codecs
1.2.5 ALMTokenizer: Enabling Query-Based Compression
1.3 Motivation
1.4 Objectives
2 Methods
2.1 Architecture
2.1.1 Generator
2.1.2 Discriminator
2.2 Losses
2.2.1 Generator Loss
2.2.2 Discriminator Loss
2.3 Training
2.4 Datasets and Data Preprocessing
2.5 Evaluation
2.5.1 Signal Reconstruction
2.5.2 Semantic Structure
3 Results
3.1 Training
3.2 Signal Reconstruction
3.3 Semantic Structure
3.3.1 Projections
3.3.2 Clustering
3.3.3 Linear Separability Tests
3.3.4 Instrument Classification
3.3.5 Interpolation Smoothness Tests
3.3.6 Zero-shot Timbre Transfer
4 Discussion
4.1 Discussion
4.2 Further Work
4.3 Conclusions
List of Figures
List of Tables
Bibliography
Acknowledgement
I would first like to express my deepest gratitude to my supervisor, Dr. Lonce
Wyse, for his tireless guidance, patience, and support throughout this project. His
dedication made it possible for me to bring this work to completion, and it has been
a real pleasure to work under his supervision.
I am also grateful to my former flatmates, Davide, Diego, and Ma e, for creating
a space outside of academia where I could rest and recharge. Their friendship
offered me the distance I needed to see the challenges of this project from a different
perspective, and for that I owe them much.
My heartfelt thanks go to Amalia, my partner and new flatmate, whose presence
has given me direction and clarity. Her support has been a constant reminder of the
path I want to follow in life.
Finally, I would like to thank the entire Music Technology Group (MTG), and
especially my classmates in the MSc in Sound and Music Computing, for filling this
year with joy, discovery, knowledge, and culture. It has been truly comforting to
share the journey with people who understand the struggle of listening to the same
audio over and over again, waiting for something to change.
Abstract
Neural audio codecs have achieved remarkable compression efficiency by learning
latent representations optimized for waveform fidelity. However, these codecs often
lack explicit semantic structure, limiting their effectiveness for downstream tasks
that require meaningful audio abstractions. Query-based compression, as introduced
by ALMTokenizer, offers a path to infuse global context into discrete audio
tokens by interleaving learnable [CLS] embeddings among frame-level features and
leveraging Transformer attention to aggregate semantic information. This thesis
implements a reproducible pipeline that adapts the ALMTokenizer paradigm using
a frozen EnCodec front-end. By inserting one [CLS] query token every w frames,
the model enables bitrate-on-demand through a tunable window length, while a
Transformer encoder–decoder architecture captures long-range dependencies and
reconstructs waveforms via a paired decoder. Quantization layers are omitted in this
implementation to focus analysis on the raw contextual embeddings. To assess the
semantic organization of the resulting latent space, we extract [CLS] embeddings
from the Good-sounds dataset and perform an evaluation of the resulting latents.
Our analyses show that although ALMTokenizer reconstructions lag behind EnCodec
in perceptual quality, its embeddings exhibit stronger semantic organization.
Clustering, projection, and classification experiments reveal clearer groupings by
instrument, note, and octave, while interpolation suggests smoother latent transitions.
This highlights a trade-off: EnCodec excels at fidelity, whereas ALMTokenizer
provides embeddings better suited for semantic tasks. By releasing the implementation
and methodology, this thesis offers a foundation for future research on semantically
structured audio codecs.
Keywords: neural audio codecs; semantic structure; query-based compression;
audio representation learning; transformer
Chapter 1
Introduction
1.1 Audio Representations
1.1.1 Latent Spaces in Audio Representation Learning
When we hear a sound, our brain does not process the raw waveform sample by
sample. Instead, it constructs an internal representation that captures the essential
features of the auditory event and links them to our prior experiences and memories.
This process, often referred to as auditory imagery or mental representation, allows
us to recognize a voice in a crowd, identify a tune after a single note, or recall the
emotion conveyed by a sound, all without attending to the physical waveform itself
[1]. In machine learning, latent spaces aim to mirror this abstraction: they map
high-dimensional audio signals into a lower-dimensional embedding where semantically
similar inputs lie close together.
A latent space is the continuous, multidimensional feature space learned by the
encoder of an autoencoder, a variational autoencoder (VAE), or similar architecture. It
compresses the data into a compact representation while preserving the information
needed for reconstruction or downstream tasks [2]. In the context of audio, these
embeddings may capture timbral qualities, phonetic content, or rhythmic patterns,
depending on the training objective.
encode the linguistic content [28]. In essence, there is a growing realization that
codec design matters for high-level audio tasks: the latent should ideally preserve
semantic integrity, not just acoustic fidelity.
1.2.4 Emerging Semantic Codecs
Several research efforts now focus explicitly on semantic audio codecs, altering
the codec design or training so that the resulting tokens are both highly
compressive and semantically meaningful.
A popular approach is a two-stage encoding: first extract a high-level
representation, and then an acoustic residual. For example, SemantiCodec
uses a dual-encoder architecture: a semantic encoder (built on AudioMAE features)
produces tokens capturing the semantic content, and an acoustic encoder captures
the remaining low-level details [30]. The semantic encoder output is quantized using
k-means codebooks derived from a large audio dataset, effectively clustering audio
frames into discrete semantic units.
Another example is MimiCodec, which targets ultra-low bitrate (≈1 kbps) by
distilling semantic knowledge from a pretrained model (WavLM) into the first
quantization layer of a codec [22]. MimiCodec thus has a semantic codebook for the first
layer (ensuring tokens correlate with speech content), followed by additional layers
for refinement. However, MimiCodec was tailored to speech and required external
supervision (knowledge distillation) to inject semantics.
Another notable work, X-Codec, explicitly combines a HuBERT-based semantic
encoder with an acoustic encoder, and introduces a semantic reconstruction loss in
training [31]. By quantizing after merging those features, X-Codec produces tokens
that significantly improve phonetic discriminability (measured via ABX tests) and
downstream speech generation word error rates.
In a very recent work, ALMTokenizer pushes this idea further to support both
speech and general audio [23]. It introduces a Transformer-based encoder that
compresses a sequence of audio frames into a smaller set of tokens by attending over
context, rather than treating each frame independently. This query-based compression
allows the model to pick up longer patterns and encode them with fewer tokens,
thereby encoding more semantic context per token. In the following section, we will
dive deep into this novel semantic codec architecture.
1.2.5 ALMTokenizer: Enabling Query-Based Compression
ALMTokenizer is a neural audio codec tokenizer designed to produce discrete,
semantically rich audio tokens at extremely low bitrates. Rather than quantizing
every frame, it interleaves a small number of learnable [CLS] query embeddings
into a Transformer-based encoder–decoder architecture. These query tokens attend
over windows of raw audio frame embeddings, aggregating contextual information
before being quantized via residual vector quantization (RVQ) and decoded back
into waveform through a paired decoder.
Query-based compression, the core innovation of ALMTokenizer, replaces per-frame
quantization with a mechanism in which only these sparse, context-aggregating
queries are retained. By inserting one [CLS] token every w frames and discarding
the intermediate frame slots after encoding, the model compresses a long sequence of
frames into a much shorter sequence of queries. This approach harnesses the
Transformer's self-attention to capture long-range dependencies, enabling each token to
summarize dozens of milliseconds of audio content in a single vector.
This paradigm delivers several key advantages. First, it achieves bitrate
on-demand by simply tuning the window length w: fewer queries mean fewer tokens
per second and hence lower bitrate, while more queries recover fine temporal
detail. Second, because each query token attends to its neighbors, the resulting tokens
carry global semantic context that per-frame quantization often misses. Third,
by compressing context into fewer tokens, ALMTokenizer produces much shorter
sequences, alleviating the computational burden on downstream sequence models
and improving long-term coherence for generative tasks.
Empirical evaluations confirm that ALMTokenizer's query-based compression
delivers perceptual quality comparable to much higher-bitrate neural codecs while
operating at only a fraction of their token rates. Listeners consistently judge its
reconstructions as nearly indistinguishable from richer references, and objective
intelligibility and fidelity metrics place it at the forefront of ultra-low-bitrate
models. Crucially, the semantic density of its [CLS] tokens translates into real gains
on downstream tasks: speech recognition error rates drop, and emotion or
speaker-identification systems perform more accurately when fed ALMTokenizer embeddings
[23].
1.3 Motivation
The motivation for this work stems from both practical and theoretical gaps in
current neural audio codec research. First, although the ALMTokenizer paper
introduces a powerful query-based compression paradigm that promises bitrate
on-demand and richer semantic content, the authors have not released their
implementation alongside the preprint, limiting reproducibility and broader community
adoption. Reproducing and extending such a model is therefore essential to validate
its claims, explore its design choices, and integrate its innovations into downstream
audio-language applications.
Beyond codec design, there is a pressing need to reconcile compression efficiency
with the sequence-modeling demands of audio-language Transformers. High token
rates (e.g., 150 tokens/s in EnCodec) provide fine-grained acoustic detail but lead to
prohibitively long sequences for autoregressive or self-attention models, restricting
their ability to capture long-term structure. Conversely, extreme compression can
degrade perceptual quality. A semantically structured latent space at moderate
token rates could enable Transformers to model content over extended contexts
more effectively, improving tasks like speech recognition, emotion classification, and
audio generation.
This work explores semantic compression by reproducing and simplifying the
query-based framework of ALMTokenizer and evaluating its semantic properties. The goal
is to contribute open, reproducible methods and insights that could support the
development of future audio codecs with improved bitrate efficiency and semantic
clarity.
1.4 Objectives
The primary aim of this thesis is to investigate and enhance the semantic
organization of neural audio codec latents, focusing on the novel query-based compression
approach. To this end, we pursue the following two specific objectives:
1. Reproduce and adapt the ALMTokenizer paradigm: we will implement
a query-based compression pipeline by interleaving learnable [CLS] query
tokens into the latent stream of a frozen EnCodec front-end, enabling bitrate
on-demand through a tunable window length.
2. Characterize the semantic structure of the resulting latent space: we
will extract [CLS] embeddings from the Good-sounds dataset and compare
them against baseline EnCodec latents via (a) projections, (b) unsupervised
clustering and projections, (c) linear separability tests, (d) supervised
instrument classification, (e) interpolation smoothness assessments and (f) timbre
transfer capabilities.
Chapter 2
Methods
2.1 Architecture
This section details the dual-module architecture of our adversarial audio codec,
which consists of a generator that compresses and reconstructs waveforms and an
ensemble of discriminators that provide multiscale perceptual feedback. A schema
of the whole architecture, including both the generator and the discriminator, is
presented in Figure 2.
2.1.1 Generator
The generator part of our custom ALMTokenizer implementation is composed of
the following modules: Patchify and Unpatchify (encoder/decoder front-ends), [CLS]
token interleaving and retrieval functions, Transformer modules, and mask token
interleaving and retrieval functions.
Patchify and Unpatchify Modules
The Patchify and Unpatchify modules serve as the encoder and decoder front-ends,
effectively transforming raw audio signals to vector representations (z). Our model
uses the pretrained 24 kHz EnCodec encoder as the Patchify module and its
corresponding decoder as Unpatchify, both frozen during training to ensure high-fidelity
waveform processing without additional front-end optimization. The encoder
transforms the raw audio waveform into a sequence of frame embeddings via successive
one-dimensional strided convolutions, producing a tensor of shape T×d, where T
is the number of frames and d the embedding dimension (in our case, d = 128).
Unpatchify mirrors this process with transposed convolutions that reconstruct the
waveform from decoded embeddings, guaranteeing convertibility between time and
latent domains.
Figure 2: Schematic of the ALMTokenizer generator (top) and discriminator
(bottom). The discriminator schema is taken from the EnCodec paper [17].
[CLS] Token Interleaving and Retrieval
A custom routine interleaves a learnable [CLS] token immediately after every w
frames, where w is the chosen window size. By varying w, it is possible to control how
many [CLS] tokens are inserted, and thus the effective bitrate, enabling true bitrate
on-demand. Inside the transformer encoder (described below), each [CLS] token
gathers context from the surrounding frame embeddings through self-attention.
The context-rich [CLS] tokens are then retrieved from the encoder output to form
the new semantically rich latent frames.
During training, w is randomly chosen for each batch, taking values in the range
from 2 to 10. This ensures better generalization to different values of w. During
evaluation, we use w = 3, so that after every three frame vectors produced by the
Patchify module, one learnable [CLS] token is inserted.
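As a rough illustration of the routine described above, the following Python sketch
interleaves a placeholder [CLS] token after every w frames and then retrieves the
[CLS] slots. The function names interleave_cls and retrieve_cls, the string stand-ins
for embeddings, and the treatment of a shorter final window are assumptions made
for illustration; they are not taken from the thesis code.

```python
def interleave_cls(frames, cls_token, w):
    """Insert cls_token after every w frames.

    Assumption: a final window shorter than w also receives a [CLS] token.
    Returns the interleaved sequence and a parallel list of [CLS] flags.
    """
    sequence, is_cls = [], []
    for start in range(0, len(frames), w):
        window = frames[start:start + w]
        sequence.extend(window)
        is_cls.extend([False] * len(window))
        sequence.append(cls_token)
        is_cls.append(True)
    return sequence, is_cls


def retrieve_cls(sequence, is_cls):
    """Keep only the positions that hold [CLS] tokens."""
    return [tok for tok, flag in zip(sequence, is_cls) if flag]


# Strings stand in for the d-dimensional frame embeddings.
frames = [f"z{i}" for i in range(6)]
seq, is_cls = interleave_cls(frames, "[CLS]", w=3)
cls_only = retrieve_cls(seq, is_cls)

# Token rate implied by w, using the 75 frames per second quoted in this chapter.
frame_rate = 75
tokens_per_second = frame_rate / 3  # w = 3, as used during evaluation
```

With six frames and w = 3, the interleaved sequence holds eight positions, two of
which are [CLS] slots. At 75 frames per second this corresponds to 25 [CLS] tokens
per second.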
Transformer Modules
Once the [CLS] tokens have been interleaved among the frame embeddings, the
combined sequence is fed into our transformer encoder, which is responsible for
infusing each [CLS] vector with contextual information from its neighbors. We
employ rotary positional embeddings (RoPE) [32] to encode temporal order, and
restrict the attention mechanism to a causal sliding window with a default size of 16
frames (approximately 213 ms, given 75 frames per second). This design enforces
local temporal dependencies while keeping computation efficient.
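The causal sliding-window restriction can be pictured as a boolean attention mask.
The sketch below is a minimal illustration under stated assumptions: the helper
name causal_window_mask is hypothetical, and the window is assumed to count the
current position plus the 15 positions before it. It also reproduces the duration
arithmetic quoted above.

```python
def causal_window_mask(seq_len, window=16):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: the window contains the current position and the
    (window - 1) positions immediately before it; nothing in the future.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = causal_window_mask(seq_len=20, window=16)

# Duration covered by 16 frames at 75 frames per second, in milliseconds.
window_ms = 1000 * 16 / 75
```

Each row of the mask has at most 16 True entries, so the attention cost grows
linearly with sequence length rather than quadratically.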
During decoding, the retrieved [CLS] queries are merged with learnable mask tokens
(described below) and fed into a symmetric transformer decoder, which re-expands
the sequence back to its original length before Unpatchify restores the waveform.
Both the transformer encoder and decoder consist of 12 identical layers, each
featuring 32 attention heads and a feed-forward network of 256 and 512 dimensions for
the encoder and decoder respectively.
Mask Tokens Interleaving and Retrieval
During decoding, we reinstate the original token positions by inserting a single
learnable mask embedding into each frame slot that was dropped during compression and
placing the retrieved [CLS] tokens back in their query locations. This interleaved
sequence is then fed into the transformer decoder, where the mask tokens signal gaps
to be filled using the contextual information carried by the [CLS] vectors. After
decoding, only the outputs at the frame positions (the former mask slots) proceed
to Unpatchify, while the [CLS] slots are discarded.
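The re-expansion step can be sketched in the same spirit. The helper names
expand_with_masks and keep_frame_slots are hypothetical, strings stand in for
embeddings, and the decoder itself is replaced by placeholder outputs; the sketch
only shows how positions are laid out and filtered.

```python
def expand_with_masks(cls_tokens, mask_token, w, num_frames):
    """Rebuild the interleaved layout used before compression.

    Every dropped frame slot receives the same mask token, and each
    [CLS] token returns to the position right after its window.
    """
    sequence, is_frame = [], []
    remaining = num_frames
    for cls in cls_tokens:
        n = min(w, remaining)  # the last window may be shorter (assumption)
        sequence.extend([mask_token] * n)
        is_frame.extend([True] * n)
        sequence.append(cls)
        is_frame.append(False)
        remaining -= n
    return sequence, is_frame


def keep_frame_slots(decoded, is_frame):
    """Discard [CLS] positions and keep only frame-position outputs."""
    return [tok for tok, flag in zip(decoded, is_frame) if flag]


layout, is_frame = expand_with_masks(["c0", "c1"], "[MASK]", w=3, num_frames=6)

# Placeholder for the transformer decoder output, one item per position.
decoded = [f"y{i}" for i in range(len(layout))]
frames_out = keep_frame_slots(decoded, is_frame)
```

Two [CLS] tokens with w = 3 expand back to eight positions, of which the six frame
positions would be passed on to Unpatchify.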
Quantization
The original ALMTokenizer implementation employs three layers of residual vector
quantization (RVQ) to quantize each query embedding in sequence. Each layer has
its own codebook, half of which is initialized using k-means centroids derived from
wav2vec 2.0 features for speech and the other half from BEATs features for general
sounds.
In our implementation, we eliminate quantization entirely. Our goal is not to
minimize the bitrate of the compressed signal but to explore the latent space generated
by the model. Accordingly, we feed the raw [CLS] tokens directly to the decoder,
without any quantization step. This choice simplifies training by removing both the
RVQ module and the autoregressive (AR) loss.
2.1.2 Discriminator
Discrimination is performed using the same multi-scale STFT discriminator
architecture introduced in EnCodec, which we adopt directly from their official
implementation [17]. We use four discriminators, each operating on spectrograms at different
resolutions. Specifically, the STFTs are computed with FFT sizes of 256, 512, 1024,
and 2048, with matching window lengths and hop sizes of 64, 128, 256, and 512
samples, respectively. This multi-scale configuration allows the discriminators to
capture both fine and coarse temporal structures in the audio.
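To make the trade-off between these four resolutions concrete, the sketch below
computes the spectrogram size and the frequency and time resolution of each
configuration for one second of 24 kHz audio. It assumes a one-sided STFT with no
padding; the official EnCodec implementation may pad or center frames differently,
so the frame counts are indicative only. The helper name stft_shape is hypothetical.

```python
SAMPLE_RATE = 24_000  # Hz, matching the 24 kHz EnCodec model used here

# (FFT size, hop size) pairs from the text; window length equals FFT size.
stft_configs = [(256, 64), (512, 128), (1024, 256), (2048, 512)]


def stft_shape(num_samples, n_fft, hop):
    """Frequency bins and frame count of a one-sided STFT without padding."""
    bins = n_fft // 2 + 1
    frames = 1 + (num_samples - n_fft) // hop
    return bins, frames


# Spectrogram size for one second of audio at each resolution.
shapes = {cfg: stft_shape(SAMPLE_RATE, *cfg) for cfg in stft_configs}

# Bin spacing in Hz and hop duration in ms for each FFT size.
resolution = {
    n_fft: (SAMPLE_RATE / n_fft, 1000 * hop / SAMPLE_RATE)
    for n_fft, hop in stft_configs
}
```

The smallest FFT gives coarse frequency bins (93.75 Hz) but short hops of under
3 ms, while the largest gives fine bins (about 11.7 Hz) with hops of about 21 ms.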
Each discriminator follows the EnCodec design: the complex STFT (real and
imaginary parts concatenated) is processed through a stack of 2D convolutional layers,
starting with a 3x9 convolution with 32 filters, and continuing with progressively
deeper convolutions that incorporate strided downsampling along the frequency axis
and dilations of 1, 2, and 4 along the time axis. All layers use LeakyReLU
activations and weight normalization, and the network concludes with a 3x3 convolution
to produce the discriminator logits.
2.2 Losses
In this subsection we introduce the loss terms that guide the training of the model.
We define the time-domain reconstruction loss, the frequency-domain spectral loss,
the adversarial loss and the feature-matching loss. Figure 3 illustrates the set of loss
functions that is computed for the ALMTokenizer model.
2.2.1 Generator Loss
The generator features a composite loss that combines four terms that guide the
model toward perceptually realistic and semantically rich reconstructions.
Reconstruction Loss
Reconstruction error quantifies the discrepancy between the original and
reconstructed audio by combining two complementary measures. First, we compute the
time-domain loss as the pointwise L1 distance between the original waveform (x)
and its reconstruction (x̂):
\[ L_{\text{time}} = \lVert x - \hat{x} \rVert_1 \]
Second, we form the frequency-domain loss using a multiscale mel-spectrogram
criterion. Both the original signal and the reconstruction are transformed into
mel-spectrograms at multiple resolutions. At every scale, we compute (1) the mean
absolute error between the mel-spectrogram magnitudes, and (2) the RMSE error
Figure 3: Schematic of the losses computed during training. All the losses are shown
within a red box.
We deliberately use a very high C parameter in the SVC (C = 10^7), ensuring
that the classifier strongly penalizes misclassifications and fits the data as closely
as possible, so that the reported accuracy reflects the intrinsic linear separability of
the embeddings rather than the effect of strong regularization.
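The role of C can be read off the standard soft-margin objective for a binary linear
SVM, which we assume the SVC follows (multi-class problems are built from several
such binary problems). The symbols β (weight vector), b (bias) and ξ_i (slack
variables) are introduced here for illustration only:

\[
\min_{\boldsymbol{\beta},\, b,\, \boldsymbol{\xi}} \;
\tfrac{1}{2} \lVert \boldsymbol{\beta} \rVert_2^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( \boldsymbol{\beta}^{\top} \mathbf{x}_i + b \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0 .
\]

As C grows, the slack term dominates the margin term, so the solution approaches
a hard-margin classifier in which the regularizer has little influence on the
decision boundary.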
Instrument Classification
To further test the representational power of the latents, we train a small random
forest classifier (RF) with 100 estimators on the embeddings. By comparing
downstream performance when using our query-based [CLS] tokens versus raw EnCodec
frame vectors (without additional processing), we can measure how semantic
structuring in the latent space translates into concrete gains on a classification task.
To assess such performance, we compute and compare the accuracy and precision
of both instances of the classifier.
Interpolation Smoothness Tests
We investigate the interpolation behavior between latent codes from different
instruments. Specifically, we linearly interpolate between the two [CLS] vectors
representing the centroids of all samples belonging to a given instrument, note, or octave.
Each intermediate point is then decoded back to audio, allowing us to listen to the
resulting sequence. In a semantically structured latent space, these traversals should
produce perceptually smooth transitions, with timbre and character morphing
gradually. By contrast, an unstructured space would yield abrupt or incoherent changes.
While this test is inherently subjective, we provide exemplary audio snippets in our
project repository so that readers can evaluate the continuity and plausibility of
these interpolations themselves.
Although Good-Sounds contains carefully recorded samples, an instrument's timbre
can vary significantly over the course of a note. To reduce artifacts in the generated
audio, we compute centroids only from the most stable portions of the signal. For
sounds with available annotations, we used the segment spanning from the annotated
attack or decay to the annotated release or offset. Because no annotations were
provided for the onset of the sustain phase, we relied on these available boundaries
instead. Under these criteria, only the flute, clarinet, and trumpet recordings were
suitable for centroid construction.
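The interpolation between centroids described in this subsection can be sketched
as follows. The two-dimensional toy vectors, the group names, and the helper names
centroid and interpolate are illustrative assumptions; real [CLS] embeddings have
many more dimensions, and each interpolated point would still need to be decoded
to audio.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def interpolate(a, b, steps):
    """Return `steps` evenly spaced points on the segment from a to b, inclusive."""
    return [
        [(1 - t) * x + t * y for x, y in zip(a, b)]
        for t in (i / (steps - 1) for i in range(steps))
    ]


# Toy 2-D embeddings standing in for [CLS] vectors of two instruments.
flute = [[0.0, 2.0], [2.0, 4.0]]
trumpet = [[4.0, 0.0], [6.0, 2.0]]

path = interpolate(centroid(flute), centroid(trumpet), steps=5)
```

Here the path starts at the flute centroid, ends at the trumpet centroid, and
passes through their midpoint at the third step.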
Zero-shot Timbre Transfer
Finally, we explore whether moving the latent representation of an audio towards
that of a target instrument can effectively change its timbre while preserving pitch
and rhythm.
The procedure starts by encoding an input sound into latent frames, both with
EnCodec and with our ALMTokenizer. From these frames, we compute the centroid
of the input audio in latent space. We also compute the centroid of the target
instrument by averaging all latent frames corresponding to the stable part of that
instrument in the Good-Sounds dataset. The difference between these two centroids
defines a timbre direction. Once we have this direction, we shift all latent frames
of the input audio along it until their average aligns with the centroid of the target
instrument. After the shift, the modified latents are decoded back into audio.
We will evaluate the results by listening and assessing whether the transformed
audio contains characteristic elements of the target instrument while still preserving
the pitch and timing of the original audio.
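The centroid shift at the heart of this procedure amounts to adding one constant
vector to every latent frame. The sketch below illustrates it on toy two-dimensional
frames; the helper names mean_vector and shift_toward and the numeric values are
assumptions made for illustration, and encoding and decoding are left out.

```python
def mean_vector(frames):
    """Component-wise mean of a list of latent frames."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def shift_toward(frames, target_centroid):
    """Translate all frames so that their mean lands on target_centroid."""
    source = mean_vector(frames)
    direction = [t - s for t, s in zip(target_centroid, source)]  # timbre direction
    return [[x + d for x, d in zip(frame, direction)] for frame in frames]


# Toy latent frames of an input sound and a toy target-instrument centroid.
input_frames = [[0.0, 1.0], [2.0, 3.0]]
target = [5.0, 0.0]

shifted = shift_toward(input_frames, target)
```

Because every frame moves by the same vector, the differences between consecutive
frames are unchanged. This is one way to read the expectation that frame-to-frame
variation, such as pitch and timing, might survive the shift, although the sketch
itself cannot confirm this.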
Chapter 3
Results
The code used for implementing the architecture and characterizing its output has
been made available at https://github.com/angelmf97/almtokenizer
3.1 Training
In Figure 4, we plot the evolution of each loss term over the first 300 training steps.
Figure 4: Loss curves for the train (blue) and test dataset (orange). Panels: L_time,
L_freq, L_adv, L_feat, L_mae, L_total, and L_disc; horizontal axis: epoch (0 to 70);
vertical axis: loss value.
Overall, the total loss of the generator (L_total) steadily declines, demonstrating that
the model is converging. Looking more closely at the reconstruction losses, we see a
pronounced drop in the frequency-domain spectral loss (L_freq). This indicates that
the spectrograms of the reconstructed audio are becoming ever more similar to those
of the original signal. By contrast, the time-domain L1 loss (L_time) increases
during the first stages of training. This rise is not inherently problematic: time-domain
error measures mathematical discrepancies between waveforms, and two signals can
be perceptually identical despite substantial sample-by-sample differences. The
discriminator is also agnostic to these differences, since it only evaluates spectrograms.
The MAE loss decays smoothly and consistently, showing that the transformer
encoder in the generator is effectively embedding contextual information into each
frame. Even when portions of the input are masked, the model learns to reconstruct
them accurately.
During training, the adversarial loss, discriminator loss, and feature-matching loss
reach a steady equilibrium. All three remain stable, oscillating only within a narrow
range, which suggests that the generator and discriminator are evenly matched and
that neither dominates the other. This implies that, even as the discriminators
increase their strength, the generator learns to produce outputs that better align
with their internal feature representations, pointing to an improvement in perceptual
quality.
3.2 Signal Reconstruction
We provide audio examples comparing reconstructed signals (after encoding and
decoding) with their original counterparts at https://angelmf97.github.io/almtokenizer/
The quality of the reconstructions is far from optimal. The generated audios exhibit
a metallic character and audible artifacts, which shows that the model is not yet
capable of reproducing the full richness of the original signals. Compared to
state-of-the-art codecs such as EnCodec, the gap in perceptual quality is clear. However,
it is important to emphasize that the reconstructed sounds remain perfectly
recognizable, and in the case of speech samples the content is intelligible. Achieving this
level of reconstruction with a relatively small dataset and under limited
computational resources represents a significant accomplishment and provides a solid basis
for further improvement.
3.3 Semantic Structure
3.3.1 Projections
Figure 5 presents the t-SNE projections of the embeddings produced by EnCodec
and ALMTokenizer for the Good-Sounds dataset. We omit the legend, focusing
instead on the overall structure and grouping of the embeddings rather than individual
labels. Visually, the clusters formed by ALMTokenizer embeddings are noticeably
tighter than those of EnCodec across all three semantic attributes examined
(instrument, note, and octave). This tighter grouping suggests that the latent space
learned by ALMTokenizer encodes semantic information more effectively than
traditional neural codecs, which may lead to improved performance in downstream tasks
that depend on semantic distinctions.
In particular, the space shows a clearer organization by pitch, as the naturally
emerging clusters align better with note and octave labels, suggesting that pitch is
a key factor structuring the latent space.
Figure 5: t-SNEs of the ground truth labels of the embeddings generated by EnCodec
(top) and ALMTokenizer (bottom). Panels in each row: instrument, note, octave;
axes: t-SNE 1 and t-SNE 2.
While t-SNE can be a useful tool, it is also prone to producing misleading
impressions of the data. To obtain a more reliable view of class separability, we instead
employ Linear Discriminant Analysis (LDA), which projects the samples onto the
directions that maximize the variability between groups relative to the variability
within them (i.e., those that best separate instruments, notes, or octaves). Figure 6
shows the LDA projections for these three label types, comparing ALMTokenizer
with EnCodec. In ALMTokenizer, the groups appear slightly more distinctly
separated. This suggests that the class boundaries in ALMTokenizer are more linear
and better defined.
Figure 6: LDAs of the ground truth labels of the embeddings generated by EnCodec
(top) and ALMTokenizer (bottom). Panels in each row: instrument, note, octave;
axes: LDA 1 and LDA 2.
3.3.2 Clustering
We explore a range of values of k to identify the number of clusters that best
aligns with the instrument labels. For each k, we compute the Akaike Information
Criterion (AIC). Figure 7 reports the value of this metric across different choices of
k for both EnCodec and ALMTokenizer. Because AIC penalizes model complexity,
the curves typically descend and then rise as k increases, helping avoid overfitting.
The optimal k is 34 and 44 for EnCodec and ALMTokenizer, respectively.
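For reference, the AIC has the following general definition. The sketch assumes
that the clustering model is likelihood-based (for example a mixture model), since
the criterion requires a likelihood; the symbols p_k and \hat{L}_k are introduced
here for illustration:

\[
\mathrm{AIC}(k) = 2\, p_k - 2 \ln \hat{L}_k ,
\qquad
k^{*} = \arg\min_{k} \mathrm{AIC}(k),
\]

where p_k is the number of free parameters of the model with k clusters and
\hat{L}_k is its maximized likelihood. The first term grows with k while the
second usually shrinks, which produces the descend-then-rise shape described above.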
Figure 7 also displays the same t-SNE projections as before, now colored by the
cluster assignments obtained with the optimal number of clusters. The projection
confirms better defined clusters in our model.
Then, we compute the external validation metrics. The Adjusted Rand Index (ARI)
measures how well the predicted clusters match the true labels, correcting for chance.
The Normalized Mutual Information quantifies the amount of shared information
between the clustering and the ground truth. Finally, the Homogeneity Score
evaluates whether each cluster contains only members of a single class. Table 2 reports
the values obtained for each of the three chosen metrics.
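As an illustration of the last of these metrics, the sketch below computes the
homogeneity score from its entropy-based definition, h = 1 - H(class | cluster) /
H(class), using only the Python standard library. The toy labels and the function
names are assumptions for illustration; the thesis results may have been obtained
with a library implementation.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """H(classes | clusters) in nats."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    return -sum(
        (count / n) * log(count / cluster_sizes[k])
        for (k, _), count in joint.items()
    )


def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h_classes = entropy(classes)
    if h_classes == 0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_classes


classes = ["flute", "flute", "trumpet", "trumpet"]
pure = homogeneity(classes, [0, 0, 1, 1])        # each cluster holds one class
merged = homogeneity(classes, [0, 0, 0, 0])      # one cluster mixes both classes
oversplit = homogeneity(classes, [0, 1, 2, 3])   # one cluster per sample
```

Note that the over-split clustering is still perfectly homogeneous: homogeneity does
not penalize splitting a class across several clusters, whereas ARI does. With k
well above the number of instruments, this may partly explain why the ARI values
in Table 2 are much lower than the homogeneity values.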
ALMTokenizer consistently outperforms EnCodec across all of them. These results
indicate that the latent space of ALMTokenizer is better organized with respect to
the semantic structure of instruments, notes and octaves than that of EnCodec.
Metric                                    EnCodec   ALMTokenizer
Adjusted Rand Index (↑)                   0.011     0.029*
Normalized Mutual Information Score (↑)   0.268     0.411*
Homogeneity Score (↑)                     0.206     0.326*
Table 2: External clustering evaluation metrics for EnCodec embeddings vs.
ALMTokenizer embeddings. An asterisk marks the best result for each metric.
Figure 7: Plot of the AICs computed at each value of k (top) and t-SNEs of the
clustering of the embeddings generated by EnCodec and ALMTokenizer (bottom).
Top panels: AIC against number of clusters (20 to 60) for each model; bottom
panels: t-SNE 1 against t-SNE 2, colored by cluster.
3.3.3 Linear Separability Tests
Table 3 reports the raw accuracies achieved by an SVM classifier with a linear
kernel on EnCodec and ALMTokenizer embeddings for three semantic attributes:
instrument, note and octave. ALMTokenizer outperforms EnCodec in every case,
reaching 0.44 versus 0.32 for instrument, 0.30 versus 0.15 for note and 0.34 versus
0.30 for octave. These results indicate that ALMTokenizer embeddings capture
semantic distinctions more effectively than traditional neural codecs, yielding more
discriminative representations for downstream classification tasks.
Feature                        EnCodec   ALMTokenizer
Accuracy on Instrument (↑)     0.32      0.44*
Accuracy on Note (↑)           0.15      0.30*
Accuracy on Octave (↑)         0.30      0.34*
Table 3: Raw accuracies obtained by the SVM classifier for EnCodec embeddings
vs. ALMTokenizer embeddings. An asterisk marks the best result for each feature.
3.3.4 Instrument Classification
Table 4 summarizes the accuracy and precision obtained from training a random
forest classifier to categorize samples by instrument, note, and octave. In all three
cases, ALMTokenizer outperforms the baseline, achieving consistently higher values
for both metrics. This suggests that the representations learned by ALMTokenizer
capture musically relevant features and can be more useful for downstream MIR
tasks such as instrument classification.
Feature     Model         Accuracy (↑)  Precision (↑)
Instrument  EnCodec       0.51          0.53
            ALMTokenizer  0.67          0.70
Note        EnCodec       0.46          0.47
            ALMTokenizer  0.60          0.61
Octave      EnCodec       0.64          0.68
            ALMTokenizer  0.74          0.77
Table 4: Raw accuracies and precisions obtained by the Random Forest classifier
for EnCodec embeddings vs. ALMTokenizer embeddings. Bold for the best result
in each feature.
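For reference, the two metrics in Table 4 can be computed from true and
predicted labels as sketched below. The sketch assumes macro-averaged precision
(an unweighted mean over classes), which is an assumption about the averaging
scheme rather than something stated above, and the labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of TP / (TP + FP)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # True labels of all samples that were predicted as class c.
        predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if predicted_c:
            per_class.append(sum(t == c for t in predicted_c) / len(predicted_c))
        else:
            # A class that is never predicted contributes 0 (one common convention).
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Hypothetical predictions of an instrument classifier on six samples.
y_true = ["flute", "flute", "violin", "violin", "cello", "cello"]
y_pred = ["flute", "violin", "violin", "violin", "cello", "flute"]

print(round(accuracy(y_true, y_pred), 3))
print(round(macro_precision(y_true, y_pred), 3))
```

Accuracy counts correct predictions overall, whereas precision penalizes a
class for every sample wrongly assigned to it, so the two can differ even on
balanced data.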
3.3.5 Interpolation Smoothness Tests
We included a series of examples showcasing sound interpolation (or morphing) at
https://angelm 97.github.io/almtokenizer/.
The procedure consists of selecting two groups of sounds, computing their centroids,
and then generating interpolated vectors between those centroids. These
interpolated vectors are subsequently passed through EnCodec and ALMTokenizer for
audio synthesis.
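The centroid and interpolation steps of this procedure can be sketched as
follows. The sketch assumes linear interpolation with evenly spaced steps and
represents each sound by a single embedding vector; both are simplifying
assumptions, and the decoding step is omitted because it depends on the codec.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def interpolate(start, end, steps):
    """Return `steps` vectors evenly spaced from `start` to `end`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        alpha = i / (steps - 1)  # goes from 0.0 (start) to 1.0 (end)
        path.append([(1 - alpha) * s + alpha * e for s, e in zip(start, end)])
    return path


# Hypothetical 2-D embeddings for two groups of sounds.
group_a = [[0.0, 1.0], [2.0, 3.0]]
group_b = [[4.0, 5.0], [6.0, 7.0]]

path = interpolate(centroid(group_a), centroid(group_b), steps=5)
for vector in path:
    print(vector)
```

Each vector in `path` would then be passed to the decoder of the codec under
test to synthesize one step of the morph.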
Upon listening, we observed that ALMTokenizer produces audio of lower perceptual
quality compared to EnCodec. However, it appears to capture the semantic
structure of the sounds more effectively. In particular, in interpolation cases where
pitch is involved, the audio generated by ALMTokenizer clearly transitions through
intermediate pitches along the path, while EnCodec does not exhibit this behavior.
3.3.6 Zero-shot Timbre Transfer
The experiments on zero-shot timbre transfer can be found at the following page:
https://angelm 97.github.io/almtokenizer/
In these tests, an input sound is shifted in latent space toward the centroid of a target
instrument, and then decoded back to audio. The resulting examples make clear
that, although the overall reconstruction quality is still limited and artifacts remain
audible, the latent space learned by ALMTokenizer encodes semantic structure more
effectively than EnCodec. This stronger organization allows the transferred sounds
to convey a clearer sense of the intended target timbre, making the transformation
feel more purposeful and consistent, even if it is still far from musically convincing.
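The latent shift described above can be sketched as moving every frame of the
input a fraction of the way toward the target centroid. The `strength`
parameter and the frame-wise application are hypothetical choices made for
illustration; how far and how the latents were actually shifted is not
specified here.

```python
def shift_toward(frames, target_centroid, strength):
    """Move each latent frame a fraction `strength` toward `target_centroid`.

    strength = 0.0 leaves the input unchanged; strength = 1.0 collapses
    every frame onto the centroid.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return [
        [x + strength * (c - x) for x, c in zip(frame, target_centroid)]
        for frame in frames
    ]


# Hypothetical latent sequence (two frames, two dimensions) and target centroid.
input_frames = [[0.0, 2.0], [2.0, 4.0]]
target = [4.0, 0.0]

shifted = shift_toward(input_frames, target, strength=0.5)
print(shifted)
```

Intermediate values of `strength` keep part of the temporal variation of the
input while pulling its average position toward the target instrument.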
List of Tables
1 Comparison of frames per second (FPS), tokens per second (TPS),
codebook size (CS) and bitrate (BR) across models. Adapted from [23] . . . . 5
2 External clustering evaluation metrics of EnCodec embeddings vs.
ALMTokenizer embeddings. Bold for the best result in each feature. . . . . 30
3 Raw accuracies obtained by the SVM classifier for EnCodec embeddings
vs. ALMTokenizer embeddings. Bold for the best result in each
feature. . . . . 32
4 Raw accuracies and precisions obtained by the Random Forest classifier
for EnCodec embeddings vs. ALMTokenizer embeddings. Bold
for the best result in each feature. . . . . 32
Bibliography
[1] Hubbard, T. L. Auditory imagery: Empirical findings. Psychological Bulletin
136 (2010).
[2] Bergmann, D. What is latent space? (2025). URL
https://www.ibm.com/think/topics/latent-space.
[3] Baevski, A., Zhou, H., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for
self-supervised learning of speech representations (2020). URL
http://arxiv.org/abs/2006.11477.
[4] Hsu, W.-N. et al. Hubert: Self-supervised speech representation learning by
masked prediction of hidden units (2021). URL http://arxiv.org/abs/2106.07447.
[5] Huang, P.-Y. et al. Masked autoencoders that listen (2023). URL
http://arxiv.org/abs/2207.06405.
[6] Liu, H. et al. Audioldm: Text-to-audio generation with latent diffusion models
(2023). URL http://arxiv.org/abs/2301.12503.
[7] Nakashima, R., Ozaki, R. & Taniguchi, T. Unsupervised phoneme and word
discovery from multiple speakers using double articulation analyzer and neural
network with parametric bias. Frontiers in Robotics and AI 6 (2019).
[8] Hawley, S. H. & Tackett, A. R. Operational latent spaces. Journal of Audio
Engineering Society (2024).
[9] Lu, H. et al. Disentangled speech representation learning for one-shot
cross-lingual voice conversion using β-vae (2022). URL
http://arxiv.org/abs/2210.13771.
[10] Wyse, L., Kamath, P. & Gupta, C. Sound model factory: An integrated system
architecture for generative audio modelling (2022). URL
http://arxiv.org/abs/2206.13085.
[11] García, H. F., Nieto, O., Salamon, J., Pardo, B. & Seetharaman, P.
Sketch2sound: Controllable audio generation via time-varying signals and sonic
imitations (2025). URL http://arxiv.org/abs/2412.08550.
[12] Strawn, J. & Pohlmann, K. C. Principles of digital audio. Computer Music
Journal 10 (1986).
[13] Smith, J. O. & Abel, J. S. Iso11172-3: Information technology - coding of
moving pictures and associated audio for digital storage media at up to about
1.5 mbit/s - part 3: Audio. ISEJTC 129 WG 11 (1993).
[14] Valin, J., Vos, K. & Terriberry, T. Definition of the opus audio codec. Internet
Engineering Task Force (IETF) (2012).
[15] Wu, H. et al. Towards audio language modeling – an overview (2024). URL
http://arxiv.org/abs/2402.13236.
[16] Zeghidour, N., Luebs, A., Omran, A., Skoglund, J. & Tagliasacchi, M.
Soundstream: An end-to-end neural audio codec (2021). URL
http://arxiv.org/abs/2107.03312.
[17] Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. High fidelity neural audio
compression (2022). URL http://arxiv.org/abs/2210.13438.
[18] Kumar, R., Seetharaman, P., Luebs, A., Kumar, I. & Kumar, K. High-fidelity
audio compression with improved rvqgan (2023). URL
http://arxiv.org/abs/2306.06546.
[19] Ji, S. et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for
audio language modeling (2025). URL http://arxiv.org/abs/2408.16532.
[20] Parker, J. D. et al. Scaling transformers for low-bitrate high-quality speech
coding (2024). URL http://arxiv.org/abs/2411.19842.
[21] Zhang, X., Zhang, D., Li, S., Zhou, Y. & Qiu, X. Speechtokenizer: Unified
speech tokenizer for speech large language models (2024). URL
http://arxiv.org/abs/2308.16692.
[22] Défossez, A. et al. Moshi: a speech-text foundation model for real-time dialogue
(2024). URL http://arxiv.org/abs/2410.00037.
[23] Yang, D. et al. Almtokenizer: A low-bitrate and semantic-rich audio codec
tokenizer for audio language modeling (2025). URL
http://arxiv.org/abs/2504.10344.
[24] Juang, B. H. & Gray, A. H. Multiple stage vector quantization for speech
coding. In ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, vol. 1982-May (1982).
[25] Borsos, Z. et al. Audiolm: a language modeling approach to audio generation
(2023). URL http://arxiv.org/abs/2209.03143.
[26] Agostinelli, A. et al. Musiclm: Generating music from text (2023). URL
http://arxiv.org/abs/2301.11325.
[27] Borsos, Z. et al. Soundstorm: Efficient parallel audio generation (2023). URL
http://arxiv.org/abs/2305.09636.
[28] Wang, C. et al. Neural codec language models are zero-shot text to speech
synthesizers (2023). URL http://arxiv.org/abs/2301.02111.
[29] Sun, S., Krishna, K., Mattarella-Micke, A. & Iyyer, M. Do long-range language
models actually use long-range context? (2021). URL
http://arxiv.org/abs/2109.09115.
[30] Liu, H. et al. Semanticodec: An ultra low bitrate semantic audio codec for
general sound (2024). URL http://arxiv.org/abs/2405.00233
http://dx.doi.org/10.1109/JSTSP.2024.3506286.
[31] Ye, Z. et al. Codec does matter: Exploring the semantic shortcoming of codec
for audio language model (2024). URL http://arxiv.org/abs/2408.17175.
[32] Su, J. et al. Roformer: Enhanced transformer with rotary position embedding.
Neurocomputing 568 (2024).
[33] Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. Fsd50k: An open dataset
of human-labeled sound events. IEEE/ACM Transactions on Audio Speech and
Language Processing 30 (2022).
[34] Picas, O. R., Rodriguez, H. P., Dabiri, D. & Serra, X. Good-Sounds Dataset
(2017). URL https://zenodo.org/record/820937.
[35] Maaten, L. V. D. & Hinton, G. Visualizing data using t-sne. Tech. Rep. (2008).
[36] Fisher, R. A. The use of multiple measurements in taxonomic problems.
Annals of Eugenics 7 (1936).
[37] Rao, C. R. The utilization of multiple measurements in problems of biological
classification. Journal of the Royal Statistical Society Series B: Statistical
Methodology 10 (1948).
[38] Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. Clap: Learning audio
concepts from natural language supervision (2022). URL
http://arxiv.org/abs/2206.04769.
[39] Alonso-Jiménez, P., Serra, X. & Bogdanov, D. Efficient supervised training
of audio transformers for music representation learning (2023). URL
https://arxiv.org/abs/2309.16418.