Improving the Semantic Structure of Neural Audio Codecs

Author: Monsalve Fernández, Ángel

Publisher: Zenodo

DOI: 10.5281/zenodo.17303114

Source: https://zenodo.org/records/17303114/files/Angel_Monsalve_SMC_2025_Master_Thesis.pdf

Master Thesis on Sound and Music Computing
Universitat Pompeu Fabra
Improving the Semantic Structure of
Neural Audio Codecs
Ángel Monsalve Fernández
Supervisor: Dr. Lonce Wyse
July 2025
Contents
1 Introduction
1.1 Audio Representations
1.1.1 Latent Spaces in Audio Representation Learning
1.1.2 Semantic Structure in Audio Latent Spaces
1.2 Neural Audio Codecs
1.2.1 From Traditional to Neural Audio Codecs
1.2.2 Neural Audio Codecs as Audio Representation Learners
1.2.3 The Semantic Gap
1.2.4 Emerging Semantic Codecs
1.2.5 ALMTokenizer: Enabling Query-Based Compression
1.3 Motivation
1.4 Objectives
2 Methods
2.1 Architecture
2.1.1 Generator
2.1.2 Discriminator
2.2 Losses
2.2.1 Generator Loss
2.2.2 Discriminator Loss
2.3 Training
2.4 Datasets and Data Preprocessing
2.5 Evaluation
2.5.1 Signal Reconstruction
2.5.2 Semantic Structure
3 Results
3.1 Training
3.2 Signal Reconstruction
3.3 Semantic Structure
3.3.1 Projections
3.3.2 Clustering
3.3.3 Linear Separability Tests
3.3.4 Instrument Classification
3.3.5 Interpolation Smoothness Tests
3.3.6 Zero-shot Timbre Transfer
4 Discussion
4.1 Discussion
4.2 Further Work
4.3 Conclusions
List of Figures
List of Tables
Bibliography
Acknowledgement
I would first like to express my deepest gratitude to my supervisor, Dr. Lonce
Wyse, for his tireless guidance, patience, and support throughout this project. His
dedication made it possible for me to bring this work to completion, and it has been
a real pleasure to work under his supervision.

I am also grateful to my former flatmates, Davide, Diego, and Ma e, for creating
a space outside of academia where I could rest and recharge. Their friendship
offered me the distance I needed to see the challenges of this project from a different
perspective, and for that I owe them much.

My heartfelt thanks go to Amalia, my partner and new flatmate, whose presence
has given me direction and clarity. Her support has been a constant reminder of the
path I want to follow in life.

Finally, I would like to thank the entire Music Technology Group (MTG), and
especially my classmates in the MSc in Sound and Music Computing, for filling this
year with joy, discovery, knowledge, and culture. It has been truly comforting to
share the journey with people who understand the struggle of listening to the same
audio over and over again, waiting for something to change.

Abstract
Neural audio codecs have achieved remarkable compression efficiency by learning
latent representations optimized for waveform fidelity. However, these codecs often
lack explicit semantic structure, limiting their effectiveness for downstream tasks
that require meaningful audio abstractions. Query-based compression, as introduced
by ALMTokenizer, offers a path to infuse global context into discrete audio
tokens by interleaving learnable [CLS] embeddings among frame-level features and
leveraging Transformer attention to aggregate semantic information. This thesis
implements a reproducible pipeline that adapts the ALMTokenizer paradigm using
a frozen EnCodec front-end. By inserting one [CLS] query token every w frames,
the model enables bitrate-on-demand through a tunable window length, while a
Transformer encoder–decoder architecture captures long-range dependencies and
reconstructs waveforms via a paired decoder. Quantization layers are omitted in this
implementation to focus analysis on the raw contextual embeddings. To assess the
semantic organization of the resulting latent space, we extract [CLS] embeddings
from the Good-Sounds dataset and perform an evaluation of the resulting latents.

Our analyses show that although ALMTokenizer reconstructions lag behind EnCodec
in perceptual quality, its embeddings exhibit stronger semantic organization.
Clustering, projection, and classification experiments reveal clearer groupings by
instrument, note, and octave, while interpolation suggests smoother latent transitions.
This highlights a trade-off: EnCodec excels at fidelity, whereas ALMTokenizer
provides embeddings better suited for semantic tasks. By releasing the implementation
and methodology, this thesis offers a foundation for future research on semantically
structured audio codecs.

Keywords: neural audio codecs; semantic structure; query-based compression;
audio representation learning; transformer
Chapter 1
Introduction
1.1 Audio Representations
1.1.1 Latent Spaces in Audio Representation Learning
When we hear a sound, our brain does not process the raw waveform sample by
sample. Instead, it constructs an internal representation that captures the essential
features of the auditory event and links them to our prior experiences and memories.
This process, often referred to as auditory imagery or mental representation, allows
us to recognize a voice in a crowd, identify a tune after a single note, or recall the
emotion conveyed by a sound, all without attending to the physical waveform itself
[1]. In machine learning, latent spaces aim to mirror this abstraction: they map
high-dimensional audio signals into a lower-dimensional embedding where semantically
similar inputs lie close together.

A latent space is the continuous, multidimensional feature space learned by the
encoder of an autoencoder, a variational autoencoder (VAE), or similar architecture. It
compresses the data into a compact representation while preserving the information
needed for reconstruction or downstream tasks [2]. In the context of audio, these
embeddings may capture timbral qualities, phonetic content, or rhythmic patterns,
depending on the training objective.
encode the linguistic content [28]. In essence, there is a growing realization that
codec design matters for high-level audio tasks: the latent should ideally preserve
semantic integrity, not just acoustic fidelity.
1.2.4 Emerging Semantic Codecs
Several research efforts are now explicitly focusing on semantic audio codecs,
aiming at altering the codec design or training to yield tokens that are both highly
compressive and carry semantic meaning.

A popular approach consists in using a two-stage encoding: first extract a
high-level representation and then an acoustic residual. For example, SemantiCodec
uses a dual-encoder architecture: a semantic encoder (built on AudioMAE features)
produces tokens capturing the semantic content, and an acoustic encoder captures
the remaining low-level details [30]. The semantic encoder output is quantized using
k-means codebooks derived from a large audio dataset, effectively clustering audio
frames into discrete semantic units.

Another example is MimiCodec, which targets ultra-low bitrate (≈1 kbps) by
distilling semantic knowledge from a pretrained model (WavLM) into the first
quantization layer of a codec [22]. MimiCodec thus has a semantic codebook for the first
layer (ensuring tokens correlate with speech content), followed by additional layers
for refinement. However, MimiCodec was tailored to speech and required external
supervision (knowledge distillation) to inject semantics.

Another notable work, X-Codec, explicitly combines a HuBERT-based semantic
encoder with an acoustic encoder, and introduces a semantic reconstruction loss in
training [31]. By quantizing after merging those features, X-Codec produces tokens
that significantly improve phonetic discriminability (measured via ABX tests) and
downstream speech generation word error rates.

In a very recent work, ALMTokenizer pushes this idea further to support both
speech and general audio [23]. It introduces a Transformer-based encoder that
compresses a sequence of audio frames into a smaller set of tokens by attending over
context, rather than treating each frame independently. This query-based compression
allows the model to pick up longer patterns and encode them with fewer tokens,
thereby encoding more semantic context per token. In the following section, we will
dive deep into this novel semantic codec architecture.
1.2.5 ALMTokenizer: Enabling Query-Based Compression
ALMTokenizer is a neural audio codec tokenizer designed to produce discrete,
semantically rich audio tokens at extremely low bitrates. Rather than quantizing
every frame, it interleaves a small number of learnable [CLS] query embeddings
into a Transformer-based encoder–decoder architecture. These query tokens attend
over windows of raw audio frame embeddings, aggregating contextual information
before being quantized via residual vector quantization (RVQ) and decoded back
into waveform through a paired decoder.

Query-based compression, the core innovation of ALMTokenizer, replaces per-frame
quantization with a mechanism in which only these sparse, context-aggregating
queries are retained. By inserting one [CLS] token every w frames and discarding
the intermediate frame slots after encoding, the model compresses a long sequence of
frames into a much shorter sequence of queries. This approach harnesses the
Transformer's self-attention to capture long-range dependencies, enabling each token to
summarize dozens of milliseconds of audio content in a single vector.

This paradigm delivers several key advantages. First, it achieves bitrate
on-demand by simply tuning the window length w: fewer queries mean fewer tokens
per second and hence lower bitrate, while more queries recover fine temporal
detail. Second, because each query token attends to its neighbors, the resulting tokens
carry global semantic context that per-frame quantization often misses. Third,
by compressing context into fewer tokens, ALMTokenizer produces much shorter
sequences, alleviating the computational burden on downstream sequence models
and improving long-term coherence for generative tasks.
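
As a rough illustration of the bitrate on-demand idea, the sketch below computes
how many query tokens per second remain for a few window lengths. The frame rate
of 75 frames per second is an assumption made for this example (it is the rate of
the 24 kHz EnCodec front-end), and the function name is hypothetical.

```python
def query_tokens_per_second(frame_rate_hz: float, w: int) -> float:
    """One [CLS] query is kept for every w frames, so the rate is frame_rate / w."""
    if w < 1:
        raise ValueError("window length w must be >= 1")
    return frame_rate_hz / w


# Assumed frame rate of 75 frames per second.
for w in (2, 3, 5, 10):
    print(f"w={w:2d} -> {query_tokens_per_second(75.0, w):.1f} query tokens/s")
```

Larger values of w keep fewer queries per second, which is the lever that trades
temporal detail for a shorter token sequence.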
Empirical evaluations confirm that ALMTokenizer's query-based compression
delivers perceptual quality comparable to much higher-bitrate neural codecs while
operating at only a fraction of their token rates. Listeners consistently judge its
reconstructions as nearly indistinguishable from richer references, and objective
intelligibility and fidelity metrics place it at the forefront of ultra-low-bitrate
models. Crucially, the semantic density of its [CLS] tokens translates into real gains
on downstream tasks: speech recognition error rates drop, and emotion or
speaker-identification systems perform more accurately when fed ALMTokenizer embeddings
[23].
1.3 Motivation
The motivation for this work stems from both practical and theoretical gaps in
current neural audio codec research. First, although the ALMTokenizer paper
introduces a powerful query-based compression paradigm that promises bitrate
on-demand and richer semantic content, the authors have not released their
implementation alongside the preprint, limiting reproducibility and broader community
adoption. Reproducing and extending such a model is therefore essential to validate
its claims, explore its design choices, and integrate its innovations into downstream
audio-language applications.

Beyond codec design, there is a pressing need to reconcile compression efficiency
with the sequence-modeling demands of audio-language Transformers. High token
rates (e.g., 150 tokens/s in EnCodec) provide fine-grained acoustic detail but lead to
prohibitively long sequences for autoregressive or self-attention models, restricting
their ability to capture long-term structure. Conversely, extreme compression can
degrade perceptual quality. A semantically structured latent space at moderate
token rates could enable Transformers to model content over extended contexts
more effectively, improving tasks like speech recognition, emotion classification, and
audio generation.

This work explores semantic compression by reproducing and simplifying the
query-based framework of ALMTokenizer and evaluating its semantic properties. The goal
is to contribute open, reproducible methods and insights that could support the
development of future audio codecs with improved bitrate efficiency and semantic
clarity.
1.4 Objectives
The primary aim of this thesis is to investigate and enhance the semantic
organization of neural audio codec latents, focusing on the novel query-based compression
approach. To this end, we pursue the following two specific objectives:
1. Reproduce and adapt the ALMTokenizer paradigm: we will implement
a query-based compression pipeline by interleaving learnable [CLS] query
tokens into the latent stream of a frozen EnCodec front-end, enabling bitrate
on-demand through a tunable window length.
2. Characterize the semantic structure of the resulting latent space: we
will extract [CLS] embeddings from the Good-Sounds dataset and compare
them against baseline EnCodec latents via (a) projections, (b) unsupervised
clustering and projections, (c) linear separability tests, (d) supervised
instrument classification, (e) interpolation smoothness assessments and (f) timbre
transfer capabilities.
Chapter 2
Methods
2.1 Architecture
This section details the dual-module architecture of our adversarial audio codec,
which consists of a generator that compresses and reconstructs waveforms and an
ensemble of discriminators that provide multiscale perceptual feedback. A schema
of the whole architecture, including both the generator and the discriminator, is
presented in Figure 2.
2.1.1 Generator
The generator part of our custom ALMTokenizer implementation is composed of
the following modules: Patchify and Unpatchify (encoder/decoder front ends), [CLS]
tokens interleaving and retrieval functions, Transformer modules and mask tokens
interleaving and retrieval functions.
Patchify and Unpatchify Modules
The Patchify and Unpatchify modules serve as the encoder and decoder front-ends,
effectively transforming raw audio signals to vector representations (z). Our model
uses the pretrained 24 kHz EnCodec encoder as the Patchify module and its
corresponding decoder as Unpatchify, both frozen during training to ensure high-fidelity
waveform processing without additional front-end optimization. The encoder
transforms the raw audio waveform into a sequence of frame embeddings via successive
one-dimensional strided convolutions, producing a tensor of shape T×d, where T
is the number of frames and d the embedding dimension (in our case, d = 128).
Unpatchify mirrors this process with transposed convolutions that reconstruct the
waveform from decoded embeddings, guaranteeing convertibility between time and
latent domains.

Figure 2: Schematic of the ALMTokenizer generator (top) and discriminator
(bottom). The discriminator schema is taken from the EnCodec paper [17].

[CLS] Token Interleaving and Retrieval
A custom routine interleaves a learnable [CLS] token immediately after every w
frames, where w is the chosen window size. By varying w, it is possible to control how
many [CLS] tokens are inserted, and thus the effective bitrate, enabling true bitrate
on-demand. Once these tokens pass through the transformer encoder (described
below), each [CLS] will gather context from the surrounding frame embeddings
through self-attention. After passing through the transformer encoder, the
context-rich [CLS] tokens are retrieved from the sequence, to form the new semantic-rich
latent frames.

During training, w is randomly chosen for each batch, taking values in the range
from 2 to 10. This ensures better generalization to different values of w. During
evaluation, we use w = 3, so that after every three frame vectors produced by the
Patchify module, one learnable [CLS] token is inserted.
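
The interleaving and retrieval steps can be sketched with NumPy as follows. This
is a minimal illustration, not the thesis code: the function names are
hypothetical, and the handling of a trailing partial window (no [CLS] is added
when fewer than w frames remain) is an assumption of this sketch.

```python
import numpy as np


def interleave_cls(frames: np.ndarray, cls: np.ndarray, w: int):
    """Insert a copy of `cls` after every w frames.

    frames: (T, d) frame embeddings; cls: (d,) learnable query vector.
    Returns the interleaved sequence and a boolean mask marking [CLS] slots.
    """
    T, d = frames.shape
    n_cls = T // w  # assumption: a trailing partial window gets no [CLS]
    out = np.empty((T + n_cls, d), dtype=frames.dtype)
    is_cls = np.zeros(T + n_cls, dtype=bool)
    # The k-th [CLS] (0-based) sits after (k + 1) * w frames and k earlier [CLS].
    k = np.arange(n_cls)
    is_cls[(k + 1) * w + k] = True
    out[is_cls] = cls
    out[~is_cls] = frames
    return out, is_cls


def retrieve_cls(sequence: np.ndarray, is_cls: np.ndarray) -> np.ndarray:
    """Keep only the [CLS] slots (in the model, after the transformer encoder)."""
    return sequence[is_cls]


frames = np.arange(12, dtype=np.float32).reshape(6, 2)  # T = 6, d = 2
cls = np.full(2, -1.0, dtype=np.float32)
seq, is_cls = interleave_cls(frames, cls, w=3)
queries = retrieve_cls(seq, is_cls)
print(seq.shape, queries.shape)
```

With w = 3, six frames become an eight-slot sequence, and only the two [CLS]
slots are kept as the compressed representation.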
Transformer Modules
Once the [CLS] tokens have been interleaved among the frame embeddings, the
combined sequence is fed into our transformer encoder, which is responsible for
infusing each [CLS] vector with contextual information from its neighbors. We
employ rotary positional embeddings (RoPE) [32] to encode temporal order, and
restrict the attention mechanism to a causal sliding window with a default size of 16
frames (approximately 213 ms, given 75 frames per second). This design enforces
local temporal dependencies while keeping computation efficient.

During decoding, the retrieved [CLS] queries are merged with learnable mask tokens
(described below) and fed into a symmetric transformer decoder, which re-expands
the sequence back to its original length before Unpatchify restores the waveform.
Both the transformer encoder and decoder consist of 12 identical layers, each
featuring 32 attention heads and a feed-forward network of 256 and 512 dimensions for
the encoder and decoder, respectively.
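
A causal sliding-window restriction of this kind can be expressed as a boolean
attention mask. The sketch below is illustrative only: the function name is
hypothetical, and it assumes that the window counts the current position and
that every slot of the sequence counts toward the window.

```python
import numpy as np


def causal_sliding_window_mask(seq_len: int, window: int = 16) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; True where query i may attend to key j.

    Position i sees itself and the (window - 1) positions before it,
    and never any future position.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


mask = causal_sliding_window_mask(seq_len=5, window=3)
print(mask.astype(int))

# Temporal span of a 16-frame window at 75 frames per second, in milliseconds.
print(round(16 / 75 * 1000))
```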
Mask Tokens Interleaving and Retrieval
During decoding, we reinstate the original token positions by inserting a single
learnable mask embedding into each frame slot that was dropped during compression and
placing the retrieved [CLS] tokens back in their query locations. This interleaved
sequence is then fed into the transformer decoder, where the mask tokens signal gaps
to be filled using the contextual information carried by the [CLS] vectors. After
decoding, only the frame-position outputs proceed to Unpatchify, while the mask and
[CLS] slots are discarded.
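
The re-expansion step can be sketched as below. This is a hedged illustration
with hypothetical function names; it assumes a boolean array is_cls that records
which slots of the full-length sequence held [CLS] tokens during encoding.

```python
import numpy as np


def reexpand_with_masks(queries: np.ndarray, mask_token: np.ndarray,
                        is_cls: np.ndarray) -> np.ndarray:
    """Build the decoder input: mask embedding in frame slots, queries in [CLS] slots."""
    d = queries.shape[1]
    seq = np.empty((is_cls.shape[0], d), dtype=queries.dtype)
    seq[~is_cls] = mask_token  # the same learnable vector fills every dropped frame
    seq[is_cls] = queries      # queries return to their original positions
    return seq


def select_frame_outputs(decoded: np.ndarray, is_cls: np.ndarray) -> np.ndarray:
    """After the decoder, keep only the outputs at frame positions."""
    return decoded[~is_cls]


is_cls = np.array([False, False, False, True] * 2)  # layout for T = 6, w = 3
queries = np.ones((2, 4))
mask_token = np.zeros(4)
decoder_input = reexpand_with_masks(queries, mask_token, is_cls)
frame_outputs = select_frame_outputs(decoder_input, is_cls)
print(decoder_input.shape, frame_outputs.shape)
```

In the real model the transformer decoder sits between the two functions; it is
omitted here so that the example stays self-contained.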
Quantization
The original ALMTokenizer implementation employs three layers of residual vector
quantization (RVQ) to quantize each query embedding in sequence. Each layer has
its own codebook, half of which is initialized using k-means centroids derived from
wav2vec 2.0 features for speech and the other half from BEATs features for general
sounds.

In our implementation, we eliminate quantization entirely. Our goal is not to
minimize the bitrate of the compressed signal but to explore the latent space generated
by the model. Accordingly, we feed the raw [CLS] tokens directly, without any
quantization step. This choice simplifies training by removing both the RVQ module and
the autoregressive (AR) loss.
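
For readers unfamiliar with RVQ, the sketch below shows the general technique
that our implementation leaves out: each layer quantizes the residual left by the
previous one. It is a generic illustration, not the ALMTokenizer code; the
function name and the tiny hand-written codebooks are made up for the example.

```python
import numpy as np


def rvq_encode(x: np.ndarray, codebooks: list):
    """Residual vector quantization of a single vector.

    codebooks: list of (K, d) arrays, one per layer.
    Returns the chosen index per layer and the summed quantized vector.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Nearest codeword to what is still unexplained.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        quantized += codebook[k]
        residual -= codebook[k]
    return indices, quantized


# Three illustrative layers with two codewords each (values are arbitrary).
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.5, 0.0]]),
    np.array([[0.0, 0.0], [-0.1, 0.0]]),
]
indices, quantized = rvq_encode(np.array([1.4, 1.0]), codebooks)
print(indices, quantized)
```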
2.1.2 Discriminator
Discrimination is performed using the same multi-scale STFT discriminator
architecture introduced in EnCodec, which we adopt directly from their official
implementation [17]. We use four discriminators, each operating on spectrograms at different
resolutions. Specifically, the STFTs are computed with FFT sizes of 256, 512, 1024,
and 2048, with matching window lengths and hop sizes of 64, 128, 256, and 512
samples, respectively. This multi-scale configuration allows the discriminators to
capture both fine and coarse temporal structures in the audio.

Each discriminator follows the EnCodec design: the complex STFT (real and
imaginary parts concatenated) is processed through a stack of 2D convolutional layers,
starting with a 3x9 convolution with 32 filters, and continuing with progressively
deeper convolutions that incorporate strided downsampling along the frequency axis
and dilations of 1, 2, and 4 along the time axis. All layers use LeakyReLU
activations and weight normalization, and the network concludes with a 3x3 convolution
to produce the discriminator logits.
2.2 Losses
In this subsection we introduce the loss terms that guide the training of the model.
We define the time-domain reconstruction loss, the frequency-domain spectral loss,
the adversarial loss and the feature-matching loss. Figure 3 illustrates the set of loss
functions that is computed for the ALMTokenizer model.
2.2.1 Generator Loss
The generator features a composite loss that combines four terms that guide the
model toward perceptually realistic and semantically rich reconstructions.
Reconstruction Loss
Reconstruction error quantifies the discrepancy between the original and
reconstructed audio by combining two complementary measures. First, we compute the
time-domain loss as the pointwise L1 distance between the original waveform (x)
and its reconstruction (x̂):
$$ L_{\text{time}} = \lVert x - \hat{x} \rVert_1 $$
Second, we form the frequency-domain loss using a multiscale mel-spectrogram
criterion. Both the original signal and the reconstruction are transformed into
mel-spectrograms at multiple resolutions. At every scale, we compute (1) the mean
absolute error between the mel-spectrogram magnitudes, and (2) the RMSE error

Figure 3: Schematic of the losses computed during training. All the losses are shown
within a red box.

We deliberately use a very high C parameter in the SVC (C = 10^7), ensuring
that the classifier strongly penalizes misclassifications and fits the data as closely
as possible, so that the reported accuracy reflects the intrinsic linear separability of
the embeddings rather than the effect of strong regularization.
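
A linear-kernel classifier with such a weakly regularized setting can be sketched
with scikit-learn as below. This is an assumption-laden illustration: the helper
name and the toy data are hypothetical, and this section does not state whether
accuracy is measured on held-out samples, so the train/test split is part of the
sketch rather than the documented protocol.

```python
import numpy as np
from sklearn.svm import SVC


def linear_separability_accuracy(X_train, y_train, X_test, y_test, C: float = 1e7) -> float:
    """Fit a linear-kernel SVC with a very large C and return its accuracy."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    return float(clf.score(X_test, y_test))


# Toy stand-in for embeddings: two well-separated classes in four dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 0.5, (20, 4)), rng.normal(5.0, 0.5, (20, 4))])
y = np.array([0] * 20 + [1] * 20)
acc = linear_separability_accuracy(X[::2], y[::2], X[1::2], y[1::2])
print(acc)
```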
Instrument Classification
To further test the representational power of the latents, we train a small random
forest classifier (RF) with 100 estimators on the embeddings. By comparing
downstream performance when using our query-based [CLS] tokens versus raw
EnCodec frame vectors (without additional processing), we can measure how semantic
structuring in the latent space translates into concrete gains on a classification task.
To assess such performance, we compute and compare the accuracy and precision
of both instances of the classifier.
Interpolation Smoothness Tests
We investigate the interpolation behavior between latent codes from different
instruments. Specifically, we linearly interpolate between the two [CLS] vectors
representing the centroids of all samples belonging to a given instrument, note, or octave.
Each intermediate point is then decoded back to audio, allowing us to listen to the
resulting sequence. In a semantically structured latent space, these traversals should
produce perceptually smooth transitions, with timbre and character morphing
gradually. By contrast, an unstructured space would yield abrupt or incoherent changes.
While this test is inherently subjective, we provide exemplary audio snippets in our
project repository so that readers can evaluate the continuity and plausibility of
these interpolations themselves.
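
The centroid interpolation can be sketched as follows. The function name, the
number of steps, and the toy arrays are hypothetical; decoding each point back
to audio is left out because it depends on the trained model.

```python
import numpy as np


def interpolate_centroids(group_a: np.ndarray, group_b: np.ndarray, steps: int = 5) -> np.ndarray:
    """Linear path from the centroid of group_a to the centroid of group_b.

    group_a, group_b: (N, d) arrays of latent vectors for two classes.
    Returns a (steps, d) array; row 0 is centroid A, the last row is centroid B.
    """
    c_a = group_a.mean(axis=0)
    c_b = group_b.mean(axis=0)
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * c_a + alphas * c_b


group_a = np.zeros((3, 2))
group_b = np.full((4, 2), 2.0)
path = interpolate_centroids(group_a, group_b, steps=5)
print(path)
```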
Although Good-Sounds contains carefully recorded samples, an instrument's timbre
can vary significantly over the course of a note. To reduce artifacts in the generated
audio, we compute centroids only from the most stable portions of the signal. For
sounds with available annotations, we used the segment spanning from the annotated
attack or decay to the annotated release or offset. Because no annotations were
provided for the onset of the sustain phase, we relied on these available boundaries
instead. Under these criteria, only the flute, clarinet, and trumpet recordings were
suitable for centroid construction.
Zero-shot Timbre Transfer
Finally, we explore whether moving the latent representation of an audio towards
that of a target instrument can effectively change its timbre while preserving pitch
and rhythm.

The procedure starts by encoding an input sound into latent frames, both with
EnCodec and with our ALMTokenizer. From these frames, we compute the centroid
of the input audio in latent space. We also compute the centroid of the target
instrument by averaging all latent frames corresponding to the stable part of that
instrument in the Good-Sounds dataset. The difference between these two centroids
defines a timbre direction. Once we have this direction, we shift all latent frames
of the input audio along it until their average aligns with the centroid of the target
instrument. After the shift, the modified latents are decoded back into audio.

We will evaluate the results by listening and assessing whether the transformed
audio contains characteristic elements of the target instrument while still preserving
the pitch and timing of the original audio.
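
The centroid shift described in this procedure can be sketched in a few lines.
The function name and the strength parameter are hypothetical additions; a
strength of 1.0 corresponds to aligning the input average with the target
centroid, as in the text.

```python
import numpy as np


def shift_toward_target(input_frames: np.ndarray, target_frames: np.ndarray,
                        strength: float = 1.0) -> np.ndarray:
    """Translate every input latent frame along the timbre direction.

    input_frames: (T, d) latents of the input sound.
    target_frames: (N, d) latents from the stable part of the target instrument.
    """
    direction = target_frames.mean(axis=0) - input_frames.mean(axis=0)
    return input_frames + strength * direction


rng = np.random.default_rng(0)
input_frames = rng.normal(0.0, 1.0, (10, 4))
target_frames = rng.normal(3.0, 1.0, (50, 4))
shifted = shift_toward_target(input_frames, target_frames)
print(np.allclose(shifted.mean(axis=0), target_frames.mean(axis=0)))
```

Because every frame receives the same offset, frame-to-frame differences in the
input are left unchanged by the shift.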
Chapter 3
Results
The code used for implementing the architecture and characterizing its output has
been made available at https://github.com/angelmf97/almtokenizer
3.1 Training
In Figure 4, we plot the evolution of each loss term over the first 300 training steps.
[Figure 4, "Training Metrics". Panels: L_time, L_freq, L_adv, L_feat, L_mae, L_total, L_disc. x-axis: Epoch; y-axis: Value. Curves: Train, Test.]
Figure 4: Loss curves for the train (blue) and test dataset (orange).
Overall, the total loss of the generator (L_total) steadily declines, demonstrating that
the model is converging. Looking more closely at the reconstruction losses, we see a
pronounced drop in the frequency-domain spectral loss (L_freq). This indicates that
the spectrograms of the reconstructed audio are becoming ever more similar to those
of the original signal. By contrast, the time-domain L1 loss (L_time) increases
during the first stages of training. This rise is not inherently problematic: time-domain
error measures mathematical discrepancies between waveforms, and two signals can
be perceptually identical despite substantial sample-by-sample differences. Moreover,
the discriminator is agnostic to these differences, since it only evaluates spectrograms.

The MAE loss decays smoothly and consistently, showing that the transformer
encoder in the generator is effectively embedding contextual information into each
frame. Even when portions of the input are masked, the model learns to reconstruct
them accurately.

During training, the adversarial loss, discriminator loss, and feature-matching loss
reach a steady equilibrium. All three remain stable, oscillating only within a narrow
range, which suggests that the generator and discriminator are evenly matched and
that neither dominates the other. This implies that, even as the discriminators
increase their strength, the generator learns to produce outputs that better align
with their internal feature representations, pointing to an improvement in perceptual
quality.
3.2 Signal Reconstruction
We provide audio examples comparing reconstructed signals (after encoding and
decoding) with their original counterparts at https://angelmf97.github.io/almtokenizer/

The quality of the reconstructions is far from optimal. The generated audios exhibit
a metallic character and audible artifacts, which shows that the model is not yet
capable of reproducing the full richness of the original signals. Compared to
state-of-the-art codecs such as EnCodec, the gap in perceptual quality is clear. However,
it is important to emphasize that the reconstructed sounds remain perfectly
recognizable, and in the case of speech samples the content is intelligible. Achieving this
level of reconstruction with a relatively small dataset and under limited
computational resources represents a significant accomplishment and provides a solid basis
for further improvement.
3.3 Semantic Structure
3.3.1 Projections
Figure 5 presents the t-SNE projections of the embeddings produced by EnCodec
and ALMTokenizer for the Good-Sounds dataset. We omit the legend, focusing
instead on the overall structure and grouping of the embeddings rather than individual
labels. Visually, the clusters formed by ALMTokenizer embeddings are noticeably
tighter than those of EnCodec across all three semantic attributes examined
(instrument, note, and octave). This tighter grouping suggests that the latent space
learned by ALMTokenizer encodes semantic information more effectively than
traditional neural codecs, which may lead to improved performance in downstream tasks
that depend on semantic distinctions.
In particular, the space shows a clearer organization by pitch, as the naturally
emerging clusters align better with note and octave labels, suggesting that pitch is
a key factor structuring the latent space.

[Figure 5 panels: t-SNE 1 vs. t-SNE 2 scatter plots colored by instrument, note, and octave; top row EnCodec, bottom row ALMTokenizer.]
Figure 5: t-SNEs of the ground truth labels of the embeddings generated by EnCodec
(top) and ALMTokenizer (bottom).

While t-SNE can be a useful tool, it is also prone to producing misleading
impressions of the data. To obtain a more reliable view of class separability, we instead
employ Linear Discriminant Analysis (LDA), which projects the samples onto the
directions that maximize variability between groups (i.e., those that best separate
instruments, notes, or octaves). Figure 6 shows the LDA projections for these three
label types, comparing ALMTokenizer with EnCodec. In ALMTokenizer, the groups
appear slightly more distinctly separated. This suggests that the class boundaries
in ALMTokenizer are more linear and better defined.

[Figure 6 panels: LDA 1 vs. LDA 2 scatter plots colored by instrument, note, and octave; top row EnCodec, bottom row ALMTokenizer.]
Figure 6: LDAs of the ground truth labels of the embeddings generated by EnCodec
(top) and ALMTokenizer (bottom).

3.3.2 Clustering
We explore a range of values of k to identify the number of clusters that best
aligns with the instrument labels. For each k, we compute the Akaike Information
Criterion (AIC). Figure 7 reports the value of this metric across different choices of
k for both EnCodec and ALMTokenizer. Because AIC penalizes model complexity,
the curves typically descend and then rise as k increases, helping avoid overfitting.
The optimal k is 34 and 44 for EnCodec and ALMTokenizer, respectively.

Figure 7 also displays the same t-SNE projections as before, now colored by the
cluster assignments obtained with the optimal number of clusters. The projection
confirms better defined clusters in our model.

Then, we compute the external validation metrics. The Adjusted Rand Index (ARI)
measures how well the predicted clusters match the true labels, correcting for chance.
The Normalized Mutual Information quantifies the amount of shared information
between the clustering and the ground truth. Finally, the Homogeneity Score
evaluates whether each cluster contains only members of a single class. Table 2 reports
the values obtained for each of the three chosen metrics.

ALMTokenizer consistently outperforms EnCodec across all of them. These results
indicate that the latent space of ALMTokenizer is better organized with respect to
the semantic structure of instruments, notes and octaves than that of EnCodec.
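
The three external metrics are available in scikit-learn, and the sketch below
shows one way to compute them for a set of labels and cluster assignments. The
helper name and the toy label lists are hypothetical; the use of scikit-learn is
an assumption of this example.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    homogeneity_score,
    normalized_mutual_info_score,
)


def external_clustering_scores(true_labels, cluster_ids) -> dict:
    """Compare a clustering against ground-truth labels with three external metrics."""
    return {
        "ari": adjusted_rand_score(true_labels, cluster_ids),
        "nmi": normalized_mutual_info_score(true_labels, cluster_ids),
        "homogeneity": homogeneity_score(true_labels, cluster_ids),
    }


true_labels = [0, 0, 1, 1, 2, 2]
perfect = external_clustering_scores(true_labels, [1, 1, 0, 0, 2, 2])
over_split = external_clustering_scores(true_labels, [0, 1, 2, 3, 4, 5])
print(perfect)
print(over_split)
```

The second call illustrates a general property of these metrics: splitting the
data into many small clusters keeps homogeneity high while the chance-corrected
ARI stays low.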

Feature                                     EnCodec   ALMTokenizer
Adjusted Rand Index (↑)                     0.011     0.029
Normalized Mutual Information Score (↑)     0.268     0.411
Homogeneity Score (↑)                       0.206     0.326
Table 2: External clustering evaluation metrics for EnCodec embeddings vs.
ALMTokenizer embeddings. Bold for the best result in each feature.

[Figure 7 panels: AIC vs. number of clusters (model selection) for EnCodec and ALMTokenizer; t-SNE 1 vs. t-SNE 2 scatter plots colored by cluster for each model.]
Figure 7: Plot of the AICs computed at each value of k (top) and t-SNEs of the
clustering of the embeddings generated by EnCodec and ALMTokenizer (bottom).

3.3.3 Linear Separability Tests
Table 3 reports the raw accuracies achieved by an SVM classifier with a linear
kernel on EnCodec and ALMTokenizer embeddings for three semantic attributes:
instrument, note and octave. ALMTokenizer outperforms EnCodec in every case,
reaching 0.44 versus 0.32 for instrument, 0.30 versus 0.15 for note and 0.34 versus
0.30 for octave. These results indicate that ALMTokenizer embeddings capture
semantic distinctions more effectively than traditional neural codecs, yielding more
discriminative representations for downstream classification tasks.

Feature                        EnCodec   ALMTokenizer
Accuracy on Instrument (↑)     0.32      0.44
Accuracy on Note (↑)           0.15      0.30
Accuracy on Octave (↑)         0.30      0.34
Table 3: Raw accuracies obtained by the SVM classifier for EnCodec embeddings
vs. ALMTokenizer embeddings. Bold for the best result in each feature.
3.3.4 Instrument Classification
Table 4 summarizes the accuracy and precision obtained from training a random
forest classifier to categorize samples by instrument, note, and octave. In all three
cases, ALMTokenizer outperforms the baseline, achieving consistently higher values
for both metrics. This suggests that the representations learned by ALMTokenizer
capture musically relevant features and can be more useful for downstream MIR
tasks such as instrument classification.
Feature      Model          Accuracy (↑)   Precision (↑)
Instrument   EnCodec        0.51           0.53
             ALMTokenizer   0.67           0.70
Note         EnCodec        0.46           0.47
             ALMTokenizer   0.60           0.61
Octave       EnCodec        0.64           0.68
             ALMTokenizer   0.74           0.77
Table 4: Raw accuracies and precisions obtained by the Random Forest classifier
for EnCodec embeddings vs. ALMTokenizer embeddings. Bold for the best result
in each feature.
3.3.5 Interpolation Smoothness Tests
We included a series of examples showcasing sound interpolation (or morphing) at
https://angelm 97.github.io/almtokenizer/.
The procedure consists of selecting two groups of sounds, computing their centroids,
and then generating interpolated vectors between those centroids. These interpolated
vectors are subsequently passed through EnCodec and ALMTokenizer for audio
synthesis.
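The centroid and interpolation steps can be sketched as below. This is a minimal sketch under stated assumptions: each sound is represented by a single fixed-size embedding vector, the interpolation is linear with evenly spaced steps (the text does not specify the scheme), and the function names and toy 2-D vectors are hypothetical. Decoding the resulting vectors with a codec is not shown.

```python
import numpy as np


def centroid(embeddings):
    """Mean embedding of a group; `embeddings` has shape (n_sounds, dim)."""
    return np.asarray(embeddings, dtype=float).mean(axis=0)


def interpolate(start, end, steps):
    """Return `steps` evenly spaced vectors from `start` to `end`, endpoints included."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]  # shape (steps, 1) for broadcasting
    return (1.0 - alphas) * start + alphas * end


# Hypothetical 2-D embeddings for two groups of sounds.
group_a = np.array([[0.0, 0.0], [2.0, 0.0]])  # centroid [1, 0]
group_b = np.array([[4.0, 4.0], [6.0, 4.0]])  # centroid [5, 4]

# Five vectors along the straight line between the two centroids.
path = interpolate(centroid(group_a), centroid(group_b), steps=5)
```

Each row of `path` would then be passed to the decoder of the codec under test to synthesize one step of the morph.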
Upon listening, we observed that ALMTokenizer produces audio of lower perceptual
quality compared to EnCodec. However, it appears to capture the semantic
structure of the sounds more effectively. In particular, in interpolation cases where
pitch is involved, the audio generated by ALMTokenizer clearly transitions through
intermediate pitches along the path, while EnCodec does not exhibit this behavior.
3.3.6 Zero-shot Timbre Transfer
The experiments on zero-shot timbre transfer can be found at the following page:
https://angelm 97.github.io/almtokenizer/
In these tests, an input sound is shifted in latent space toward the centroid of a target
instrument, and then decoded back to audio. The resulting examples make clear
that, although the overall reconstruction quality is still limited and artifacts remain
audible, the latent space learned by ALMTokenizer encodes semantic structure more
effectively than EnCodec. This stronger organization allows the transferred sounds
to convey a clearer sense of the intended target timbre, making the transformation
feel more purposeful and consistent, even if it is still far from musically convincing.
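The latent shift described above can be sketched as a move of the input embedding along the straight line to the target centroid. This is a minimal sketch, not the exact procedure used in the experiments: the `strength` parameter, the function name and the toy vectors are hypothetical, and the text does not state how far the input is shifted.

```python
import numpy as np


def shift_toward(z, target_centroid, strength):
    """Move latent `z` a fraction `strength` in [0, 1] of the way to `target_centroid`."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    z = np.asarray(z, dtype=float)
    target = np.asarray(target_centroid, dtype=float)
    # strength = 0 returns the input unchanged; strength = 1 lands on the centroid.
    return z + strength * (target - z)


# Hypothetical 2-D latent of an input sound and centroid of a target instrument.
z_input = np.array([1.0, 1.0])
target = np.array([3.0, 5.0])
z_shifted = shift_toward(z_input, target, strength=0.5)  # halfway: [2.0, 3.0]
```

The shifted vector would then be decoded back to audio. Under this formulation, a larger `strength` trades more of the input sound's identity for the target timbre.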
List of Tables
1 Comparison of frames per second (FPS), tokens per second (TPS), codebook size (CS) and bitrate (BR) across models. Adapted from [23]
2 External clustering evaluation metrics for EnCodec embeddings vs. ALMTokenizer embeddings. Bold for the best result in each feature.
3 Raw accuracies obtained by the SVM classifier for EnCodec embeddings vs. ALMTokenizer embeddings. Bold for the best result in each feature.
4 Raw accuracies and precisions obtained by the Random Forest classifier for EnCodec embeddings vs. ALMTokenizer embeddings. Bold for the best result in each feature.

Bibliography
[1] Hubbard, T. L. Auditory imagery: Empirical findings. Psychological Bulletin 136 (2010).
[2] Bergmann, D. What is latent space? (2025). URL https://www.ibm.com/think/topics/latent-space.
[3] Baevski, A., Zhou, H., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations (2020). URL http://arxiv.org/abs/2006.11477.
[4] Hsu, W.-N. et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units (2021). URL http://arxiv.org/abs/2106.07447.
[5] Huang, P.-Y. et al. Masked autoencoders that listen (2023). URL http://arxiv.org/abs/2207.06405.
[6] Liu, H. et al. Audioldm: Text-to-audio generation with latent diffusion models (2023). URL http://arxiv.org/abs/2301.12503.
[7] Nakashima, R., Ozaki, R. & Taniguchi, T. Unsupervised phoneme and word discovery from multiple speakers using double articulation analyzer and neural network with parametric bias. Frontiers in Robotics and AI 6 (2019).
[8] Hawley, S. H. & Tackett, A. R. Operational latent spaces. Journal of Audio Engineering Society (2024).
[9] Lu, H. et al. Disentangled speech representation learning for one-shot cross-lingual voice conversion using β-vae (2022). URL http://arxiv.org/abs/2210.13771.
[10] Wyse, L., Kamath, P. & Gupta, C. Sound model factory: An integrated system architecture for generative audio modelling (2022). URL http://arxiv.org/abs/2206.13085.
[11] García, H. F., Nieto, O., Salamon, J., Pardo, B. & Seetharaman, P. Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations (2025). URL http://arxiv.org/abs/2412.08550.
[12] Strawn, J. & Pohlmann, K. C. Principles of digital audio. Computer Music Journal 10 (1986).
[13] Smith, J. O. & Abel, J. S. Iso11172-3: Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 mbit/s - part 3: Audio. ISEJTC 129 WG 11 (1993).
[14] Valin, J., Vos, K. & Terriberry, T. Definition of the opus audio codec. Internet Engineering Task Force (IETF) (2012).
[15] Wu, H. et al. Towards audio language modeling – an overview (2024). URL http://arxiv.org/abs/2402.13236.
[16] Zeghidour, N., Luebs, A., Omran, A., Skoglund, J. & Tagliasacchi, M. Soundstream: An end-to-end neural audio codec (2021). URL http://arxiv.org/abs/2107.03312.
[17] Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. High fidelity neural audio compression (2022). URL http://arxiv.org/abs/2210.13438.
[18] Kumar, R., Seetharaman, P., Luebs, A., Kumar, I. & Kumar, K. High-fidelity audio compression with improved rvqgan (2023). URL http://arxiv.org/abs/2306.06546.
[19] Ji, S. et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling (2025). URL http://arxiv.org/abs/2408.16532.
[20] Parker, J. D. et al. Scaling transformers for low-bitrate high-quality speech coding (2024). URL http://arxiv.org/abs/2411.19842.
[21] Zhang, X., Zhang, D., Li, S., Zhou, Y. & Qiu, X. Speechtokenizer: Unified speech tokenizer for speech large language models (2024). URL http://arxiv.org/abs/2308.16692.
[22] Défossez, A. et al. Moshi: a speech-text foundation model for real-time dialogue (2024). URL http://arxiv.org/abs/2410.00037.
[23] Yang, D. et al. Almtokenizer: A low-bitrate and semantic-rich audio codec tokenizer for audio language modeling (2025). URL http://arxiv.org/abs/2504.10344.
[24] Juang, B. H. & Gray, A. H. Multiple stage vector quantization for speech coding. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1982-May (1982).
[25] Borsos, Z. et al. Audiolm: a language modeling approach to audio generation (2023). URL http://arxiv.org/abs/2209.03143.
[26] Agostinelli, A. et al. Musiclm: Generating music from text (2023). URL http://arxiv.org/abs/2301.11325.
[27] Borsos, Z. et al. Soundstorm: Efficient parallel audio generation (2023). URL http://arxiv.org/abs/2305.09636.
[28] Wang, C. et al. Neural codec language models are zero-shot text to speech synthesizers (2023). URL http://arxiv.org/abs/2301.02111.
[29] Sun, S., Krishna, K., Mattarella-Micke, A. & Iyyer, M. Do long-range language models actually use long-range context? (2021). URL http://arxiv.org/abs/2109.09115.
[30] Liu, H. et al. Semanticodec: An ultra low bitrate semantic audio codec for general sound (2024). URL http://arxiv.org/abs/2405.00233, http://dx.doi.org/10.1109/JSTSP.2024.3506286.
[31] Ye, Z. et al. Codec does matter: Exploring the semantic shortcoming of codec for audio language model (2024). URL http://arxiv.org/abs/2408.17175.
[32] Su, J. et al. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024).
[33] Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. Fsd50k: An open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio Speech and Language Processing 30 (2022).
[34] Picas, O. R., Rodriguez, H. P., Dabiri, D. & Serra, X. Good-Sounds Dataset (2017). URL https://zenodo.org/record/820937.
[35] Maaten, L. V. D. & Hinton, G. Visualizing data using t-sne. Tech. Rep. (2008).
[36] Fisher, R. A. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (1936).
[37] Rao, C. R. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society Series B: Statistical Methodology 10 (1948).
[38] Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. Clap: Learning audio concepts from natural language supervision (2022). URL http://arxiv.org/abs/2206.04769.
[39] Alonso-Jiménez, P., Serra, X. & Bogdanov, D. Efficient supervised training of audio transformers for music representation learning (2023). URL https://arxiv.org/abs/2309.16418.
