CODICODEC: UNIFYING CONTINUOUS AND DISCRETE COMPRESSED
REPRESENTATIONS OF AUDIO
Marco Pasini¹, Stefan Lattner², György Fazekas¹
¹Queen Mary University of London, UK  ²Sony Computer Science Laboratories, Paris, France
[email protected]
ABSTRACT
Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations by both efficiently encoding global features via summary embeddings, and by producing both compressed continuous embeddings at ~11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
1. INTRODUCTION
Efficient, compact audio representations are crucial for applications in Music Information Retrieval (MIR), generative modelling, and compression. While recent advances in deep learning have demonstrated impressive results in the learning of compressed representations, several key challenges remain. These include balancing high compression ratios with reconstruction fidelity, enabling both discrete and continuous latent representations for diverse downstream applications, and achieving efficient training and inference without resorting to a complex and unstable training process.
Existing audio autoencoders often fall short in one or more of these areas. Vector Quantization (VQ)-based approaches, such as SoundStream [1], EnCodec [2], and Descript Audio Codec (DAC, [3]), can excel at high-fidelity reconstruction and are well-suited for training autoregressive language models on the resulting discrete latent tokens [4–6]. However, their discrete nature makes them less compatible with continuous generative frameworks (e.g., GANs [7], diffusion models [8–10]), as their pre-quantization continuous features are typically high-dimensional and unsuitable for efficient latent modelling. Continuous autoencoders, such as those used in Moûsai [11], in Musika [12, 13], and in the Stable Audio family of generative models [14–16], address the compatibility issue with continuous latent generative models. However, they often require multi-stage training procedures, unstable adversarial training objectives, or slow iterative decoding processes. While Music2Latent [17] introduces a consistency-based autoencoder that achieves single-step decoding and single-loss end-to-end training, it is limited to continuous representations. Furthermore, most continuous autoencoders encode audio into temporally ordered sequences, leading to redundancy by repeatedly encoding global features across embeddings.
© M. Pasini, S. Lattner, and G. Fazekas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: M. Pasini, S. Lattner, and G. Fazekas, “CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
This paper introduces CoDiCodec (Continuous-Discrete Codec), a novel audio autoencoder that addresses these limitations. CoDiCodec achieves the following key objectives:
• Encoding to both compressed continuous embeddings (~11 Hz) and discrete tokens (2.38 kbps) of 44.1 kHz stereo audio from a single model, offering flexibility for downstream tasks without the need for separate models.
• Use of summary embeddings [18] to capture global features, reducing redundancy compared to ordered sequences for better fidelity at similar compression.
• Leveraging consistency models [19, 20], CoDiCodec is trained end-to-end using a single loss, simplifying the training process and avoiding the complexities of adversarial training or multi-stage procedures.
• Support for both autoregressive and a novel, faster parallel decoding strategy for long sequences.
• Introduction of FSQ-dropout, enabling higher-quality continuous decoding by bypassing quantization, while promoting informative embeddings suitable for downstream modeling.
• An improved architecture designed to increase the proportion of parameters used by the transformer layers compared to convolutional ones, which simplifies the process of scaling, while achieving faster inference speed compared to Music2Latent2.
To our knowledge, this is the first work unifying summary embeddings, consistency-based training, and the generation of both continuous and discrete representations from a single audio autoencoder. Our experiments show that CoDiCodec outperforms existing continuous and discrete autoencoders in terms of reconstruction quality measured by FAD [21] with different backbones. We present comprehensive ablation studies validating the design choices.
2. RELATED WORK
2.1 Audio Autoencoders
Audio autoencoders aim to learn compressed latent representations of audio signals, typically for dimensionality reduction, generative modeling, or MIR tasks. These can be broadly divided into those producing discrete and continuous compressed latent representations.
Discrete Latent Representations: Vector Quantization (VQ [22, 23]) has been a dominant technique for learning discrete audio representations. SoundStream [1], EnCodec [2], and Descript Audio Codec (DAC) [3] use Residual Vector Quantization (RVQ) to achieve high-fidelity audio reconstruction. These models are particularly well-suited for training autoregressive language models on the resulting discrete tokens [4–6]. However, their discrete nature limits compatibility with continuous generative frameworks, and they often yield lower temporal compression, resulting in longer sequences for downstream tasks compared to continuous methods.
Continuous Latent Representations: Several approaches learn continuous latent representations of audio. The autoencoder used in Musika [12] reconstructs both magnitude and phase components of a spectrogram, enabling fast inference. However, it relies on a two-stage training process and an adversarial objective. Moûsai [11] uses a diffusion autoencoder, achieving end-to-end training but requiring expensive iterative sampling for decoding. Stable Audio and Stable Audio 2 [14–16] leverage continuous representations to train diffusion-based audio generation models, but the proposed autoencoders still require an objective with multiple adversarial and reconstruction losses. Music2Latent [17] introduces a consistency-based autoencoder, achieving single-step decoding and end-to-end training with a single loss function. However, it is limited to producing ordered sequences of continuous representations. Music2Latent2 [18] introduces summary embeddings [24] that are able to more efficiently encode global features from the input samples, while still encoding to continuous-valued latents. CoDiCodec, in contrast, can encode both continuous and discrete representations, while still using summary embeddings.
2.2 Consistency Models
Consistency models [19, 20] represent a class of generative models that enables fast one-step generation. While showing impressive results in image generation [25], their application to audio remains under-explored. CoMoSpeech [26] explores consistency distillation for speech synthesis, requiring a pre-trained teacher. Music2Latent [17] and Music2Latent2 [18] were the first to use consistency models in an end-to-end audio autoencoder framework.
3. BACKGROUND
3.1 Consistency Models
Consistency models [19, 20] are a class of generative models that learn to map any point on a diffusion process trajectory back to the origin of that trajectory. They are based on the probability flow (PF) ordinary differential equation (ODE) [27], which describes the evolution of a data sample $x$ perturbed by Gaussian noise with standard deviation $\sigma$:

$$\frac{dx}{d\sigma} = -\sigma \nabla_x \log p_\sigma(x), \qquad \sigma \in [\sigma_{\min}, \sigma_{\max}], \tag{1}$$

where $p_\sigma(x)$ is the perturbed data distribution, and $\nabla_x \log p_\sigma(x)$ is the score function. The PF ODE defines trajectories mapping noisy samples $x_\sigma$ to the clean sample $x_{\sigma_{\min}}$ (where $\sigma_{\min} \approx 0$). Consistency models learn a consistency function $f(x_\sigma, \sigma)$ that directly maps any point on this trajectory to its origin: $f : (x_\sigma, \sigma) \mapsto x_{\sigma_{\min}}$, while satisfying the boundary condition $f(x_{\sigma_{\min}}, \sigma_{\min}) = x_{\sigma_{\min}}$. A consistency model $f_\theta(x_\sigma, \sigma)$ is a neural network parameterized by $\theta$ that approximates the true consistency function. To enforce the boundary condition, consistency models are typically parameterized as $f_\theta(x_\sigma, \sigma) = c_{\text{skip}}(\sigma)\, x_\sigma + c_{\text{out}}(\sigma)\, F_\theta(x_\sigma, \sigma)$, where $F_\theta(x_\sigma, \sigma)$ is a neural network, and $c_{\text{skip}}(\sigma)$ and $c_{\text{out}}(\sigma)$ are chosen such that $c_{\text{skip}}(\sigma_{\min}) = 1$ and $c_{\text{out}}(\sigma_{\min}) = 0$ to satisfy the boundary condition.
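As a minimal sketch of how such a parameterization can enforce the boundary condition, the snippet below uses the $c_{\text{skip}}$ and $c_{\text{out}}$ choices from the original consistency models paper [19]. This is an illustrative assumption: the text above only states the constraints the two functions must satisfy. The constants `SIGMA_MIN` and `SIGMA_DATA` and the function `toy_network` (a stand-in for $F_\theta$) are hypothetical.

```python
import math

SIGMA_MIN = 0.002   # assumed illustrative value, not taken from this paper
SIGMA_DATA = 0.5    # assumed illustrative value, not taken from this paper


def c_skip(sigma: float) -> float:
    """Skip weight; equals 1 at sigma = SIGMA_MIN."""
    return SIGMA_DATA**2 / ((sigma - SIGMA_MIN) ** 2 + SIGMA_DATA**2)


def c_out(sigma: float) -> float:
    """Output weight; equals 0 at sigma = SIGMA_MIN."""
    return SIGMA_DATA * (sigma - SIGMA_MIN) / math.sqrt(SIGMA_DATA**2 + sigma**2)


def consistency_fn(x_sigma: float, sigma: float, network) -> float:
    """f_theta(x, sigma) = c_skip(sigma) * x + c_out(sigma) * F_theta(x, sigma)."""
    return c_skip(sigma) * x_sigma + c_out(sigma) * network(x_sigma, sigma)


def toy_network(x: float, sigma: float) -> float:
    """Hypothetical stand-in for the neural network F_theta."""
    return 0.1 * x - sigma
```

Whatever the network outputs, the model returns its input unchanged at $\sigma = \sigma_{\min}$, which is exactly the boundary condition.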
3.2 Consistency Training
Consistency models can be trained via Consistency Distillation (CD), requiring a pre-trained diffusion model, or Consistency Training (CT). CoDiCodec uses CT, allowing for training in isolation without a pretrained teacher model. In CT, the continuous PF ODE (Eq. 1) is discretized using a sequence of noise levels $\sigma_{\min} = \sigma_1 < \sigma_2 < \cdots < \sigma_N = \sigma_{\max}$. The consistency model is trained by minimizing the following loss:

$$\mathcal{L}_{CT} = \mathbb{E}\left[\lambda(\sigma_i, \sigma_{i+1})\, d\big(f_\theta(x_{\sigma_{i+1}}, \sigma_{i+1}),\, f_{\theta^-}(x_{\sigma_i}, \sigma_i)\big)\right], \tag{2}$$

where $x \sim p_{\text{data}}$ is a training sample, $\sigma_i$ and $\sigma_{i+1}$ are adjacent noise levels, $x_{\sigma_i}$ and $x_{\sigma_{i+1}}$ are corresponding noisy versions of $x$, $d(x, y)$ is a distance metric, and $\lambda(\sigma_i, \sigma_{i+1})$ is a weighting function. $f_\theta$ is the student model, and $f_{\theta^-}$ is the teacher, with parameters $\theta^- \leftarrow \text{stopgrad}(\theta)$.
The loss minimizes the distance between model outputs at adjacent noise steps $\sigma_i, \sigma_{i+1}$, using the teacher $f_{\theta^-}$ to provide the targets. Post-training, generation from noise $x_{\sigma_{\max}}$ can occur in one step ($x = f_\theta(x_{\sigma_{\max}}, \sigma_{\max})$), where $x_{\sigma_{\max}} \sim \mathcal{N}(0, \sigma_{\max}^2 I)$, or multiple steps.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 1 diagram: Encoder, Upsampler, and Consistency Decoder, built from Patchifier / De-Patchifier stages and Transformer (T) blocks. Labelled elements: Clean Spectrogram, Noisy Spectrogram, Reconstructed Spectrogram, Summary Embeddings, Cross-connections, Autoregressive.]
Figure 1. Training process. Transformer modules are represented with T, audio embeddings with A, learned/summary embeddings with L, and mask embeddings with M. We represent chunked causal masking with a curved arrow.
3.3 Finite Scalar Quantization (FSQ)
Finite Scalar Quantization (FSQ) [28] is a simple quantization technique and, unlike Vector Quantization (VQ [22, 23]), it does not require additional loss terms. It is also shown to achieve almost perfect codebook utilization even with large codebook sizes. FSQ bounds a value $x$ in $[-N, N]$, where $N$ is an integer, rounds it, and rescales:

$$\hat{x} = \frac{\mathrm{round}(N \cdot \tanh(x))}{N}, \tag{3}$$

where $\hat{x}$ is the quantized value, and $\mathrm{round}(\cdot)$ denotes the rounding operation. Applied element-wise to a $D$-dimensional vector $x$, each element $\hat{x}_i$ takes one of $2N + 1$ discrete values in $[-1, 1]$, yielding an implicit codebook of size $(2N + 1)^D$. The gradient of the non-differentiable rounding operation is approximated using the straight-through estimator [29].
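The forward pass of Eq. 3 can be sketched in a few lines of plain Python. This is only an illustration of the quantization rule: the function names are hypothetical, and the straight-through gradient is not shown because the sketch uses no automatic differentiation.

```python
import math


def fsq_quantize(x: float, n: int) -> float:
    """Eq. 3: bound with tanh, scale to [-n, n], round, rescale to [-1, 1]."""
    return round(n * math.tanh(x)) / n


def fsq_vector(xs, n: int):
    """Apply FSQ element-wise to a D-dimensional vector."""
    return [fsq_quantize(x, n) for x in xs]


def codebook_size(n: int, d: int) -> int:
    """Implicit codebook size (2n + 1)^d."""
    return (2 * n + 1) ** d
```

With `n = 5` every output is a multiple of 0.2 in $[-1, 1]$, so a 4-dimensional vector can take $11^4$ distinct values.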
4. CODICODEC
Following previous work [17, 18, 30, 31], CoDiCodec operates on complex Short-Time Fourier Transform (STFT) spectrograms. To address the skewed distribution of different frequency bins, we apply an amplitude transformation [32]: $\tilde{c} = \beta |c|^{\alpha} e^{i \angle(c)}$, where $c$ and $\tilde{c}$ are the original and transformed STFT coefficients, $\alpha \in (0, 1]$ is a compression exponent that emphasizes lower-energy components, $\angle(c)$ is the phase angle of $c$, and $\beta \in \mathbb{R}^+$ is a scaling factor. We treat the complex spectrogram as a two-channel (real/imaginary) representation.
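A minimal sketch of this amplitude transformation on a single STFT coefficient is given below. The values of `alpha` and `beta` used in any call are up to the caller (the paper's values are not stated here), and the inverse function is our own addition obtained by solving the formula for $c$.

```python
import cmath


def amplitude_transform(c: complex, alpha: float, beta: float) -> complex:
    """Compress the magnitude of an STFT coefficient while keeping its phase."""
    magnitude, phase = abs(c), cmath.phase(c)
    return beta * (magnitude ** alpha) * cmath.exp(1j * phase)


def inverse_amplitude_transform(c_t: complex, alpha: float, beta: float) -> complex:
    """Undo the transform: recover the original magnitude, keep the phase."""
    magnitude, phase = abs(c_t), cmath.phase(c_t)
    return ((magnitude / beta) ** (1.0 / alpha)) * cmath.exp(1j * phase)


def to_two_channels(c: complex):
    """Represent a complex coefficient as (real, imaginary) channels."""
    return (c.real, c.imag)
```

Because only the magnitude is rescaled, the phase survives unchanged, which is what allows the transformed spectrogram to be split into real and imaginary channels and later inverted.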
4.1 Architecture
The proposed architecture (Fig. 1) consists of an encoder, an upsampler, and a consistency model decoder. The model operates on pairs of consecutive audio chunks.
Encoder: It takes a spectrogram chunk $x \in \mathbb{R}^{C \times F \times T}$ ($C = 2 \times \text{channels}$; $F$ and $T$ are the number of frequency bins and time frames) and downsamples it via a convolutional patchifier. The flattened features (audio embeddings) are concatenated with $K$ learnable summary embeddings and fed into transformer blocks [33] (T in Fig. 1) for summary embeddings to gather global context. Only the $K$ summary embeddings are retained, projected to $d_{\text{lat}}$, and processed via $\tanh$ (for continuous output) or FSQ (for discrete tokens converted to indices at inference).
Upsampler: It mirrors the encoder structure but upsamples instead of downsampling. It takes $K$ summary embeddings (discrete tokens are mapped back to vectors), concatenates learnable mask embeddings, and processes them through transformer blocks to “de-compress” information from the summary embeddings. The resulting audio embeddings are reshaped and upsampled by a convolutional de-patchifier. Its sole purpose is providing intermediate feature maps as cross-connections to the decoder: since the consistency model decoder generates samples in one step, it is crucial to provide information about which sample to reconstruct to the first layers of the decoder [17].
Consistency Decoder: It is trained to map a noisy spectrogram $x_\sigma$ to a clean one, conditioned on upsampler cross-connections. A patchifier downsamples the input noisy spectrogram $x_\sigma$. Cross-connections from the upsampler are added to feature maps at each resolution level: this is possible because of the exact symmetry of the patchifier with respect to the de-patchifier of the upsampler. The output is flattened and fed into a stack of transformer blocks. Crucially, transformers operate on consecutive chunk pairs ($x_{\sigma,\text{left}}, x_{\sigma,\text{right}}$) with chunked causal masking (right chunk attends to left, not vice-versa) to enable autoregressive decoding. A de-patchifier upsamples the output to the original spectrogram dimension. Skip connections additively combine the feature maps from the patchifier to the corresponding ones in the de-patchifier. The forward pass is:

$$\hat{x}_{\text{left}}, \hat{x}_{\text{right}} = \mathrm{Dec}_{\sigma_{\text{left}}, \sigma_{\text{right}}}\big(\mathrm{Up}(\mathrm{Enc}(x_{\text{left}})),\; x_{\text{left}} + \sigma_{\text{left}}\,\epsilon_{\text{left}},\; \mathrm{Up}(\mathrm{Enc}(x_{\text{right}})),\; x_{\text{right}} + \sigma_{\text{right}}\,\epsilon_{\text{right}}\big)$$

where Enc, Up, and Dec are the Encoder, Upsampler, and Decoder. $\epsilon \sim \mathcal{N}(0, I)$ and noise levels $\sigma$ are sampled independently. End-to-end training uses the consistency loss [20]:

$$\mathcal{L} = \mathbb{E}\left[\frac{1}{\Delta\sigma}\, d\Big(\mathrm{Dec}_{\sigma_{\text{left}} + \Delta\sigma,\, \sigma_{\text{right}} + \Delta\sigma},\; \mathrm{sg}\big(\mathrm{Dec}_{\sigma_{\text{left}}, \sigma_{\text{right}}}\big)\Big)\right]$$

with Pseudo-Huber distance $d(\cdot)$, step $\Delta\sigma$, and stop-gradient sg. We use the EDM parameterization [19, 34], continuous log-normal noise sampling [34], and an exponential $\Delta\sigma$ schedule [17, 35].
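For reference, the Pseudo-Huber distance named above is commonly defined as $d(x, y) = \sqrt{\lVert x - y \rVert_2^2 + c^2} - c$. The sketch below implements that definition for two flat sequences; the default constant `c` is a hypothetical placeholder, since the value used for CoDiCodec is not given in this section.

```python
import math


def pseudo_huber(x, y, c: float = 0.03) -> float:
    """sqrt(||x - y||^2 + c^2) - c for two equally sized sequences."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(squared_norm + c * c) - c
```

The distance behaves quadratically for differences much smaller than `c` and linearly for larger ones, which makes it less sensitive to outliers than a squared error.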
(a) Standard FSQ  (b) FSQ-dropout p=0.75
Figure 2. Distribution of continuous latent embeddings of an evaluation audio sample before the rounding operation (a) with standard FSQ, and (b) with FSQ-dropout with p=0.75. FSQ-dropout encourages a more uniform distribution, utilizing the full range between -1 and 1.
FSQ-dropout: To enable decoding from both discrete FSQ tokens and more expressive continuous embeddings using the same model, we introduce FSQ-dropout. Standard FSQ training causes continuous pre-quantization values ($\tanh(z)$) to cluster near quantization levels (Fig. 2(a)), limiting expressiveness. Even if the encoder did produce a more uniform distribution of continuous values, we would be forced to apply the FSQ rounding operation before decoding, thus rounding away the additional information, since during training FSQ is always enabled. FSQ-dropout addresses this: during training, with probability $p$, we bypass FSQ’s rounding step, feeding the continuous $\tanh(z)$ directly to the upsampler; otherwise, we apply standard FSQ rounding:

$$\tilde{z} = \begin{cases} \tanh(z), & \text{with probability } p \\[4pt] \dfrac{\mathrm{round}(N \cdot \tanh(z))}{N}, & \text{with probability } 1 - p \end{cases} \tag{4}$$

where choosing $N$ results in $2N + 1$ FSQ quantization levels. This encourages the encoder to produce more informative continuous embeddings across the full $[-1, 1]$ range (Fig. 2(b)) and trains the decoder to accept both discrete and continuous inputs, enabling higher-fidelity continuous reconstruction at inference. We note that a similar technique is proposed in [36], using a combination of FSQ and uniform noise dithering.
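Eq. 4 can be sketched as follows. The granularity of the random choice (per element, per embedding, or per training sample) is not specified above, so this sketch assumes a single draw per call; the function name and the explicit `rng` argument are hypothetical.

```python
import math
import random


def fsq_dropout(z, n: int, p: float, rng: random.Random):
    """Eq. 4 applied to a vector z during training (one random draw per call)."""
    bounded = [math.tanh(v) for v in z]
    if rng.random() < p:
        # Bypass rounding: pass the continuous tanh(z) values through.
        return bounded
    # Standard FSQ rounding to 2n + 1 levels in [-1, 1].
    return [round(n * b) / n for b in bounded]
```

Setting `p = 0` recovers standard FSQ training, while `p = 1` turns the bottleneck into a purely continuous $\tanh$ bottleneck.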
Random Mixing: We also introduce random mixing as a data augmentation technique. With a probability of 0.5, two randomly selected training samples are mixed (added together) to create a new training sample. This encourages the model to be robust to complex audio scenes with multiple sources. We ablate the effectiveness of this technique in Section 5.
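A minimal sketch of random mixing on a batch of equally long waveforms (plain lists of floats) could look like this. How the mixing partner is drawn and whether any gain normalization follows are not specified above; here the partner is drawn uniformly from the same batch (possibly the sample itself) and no normalization is applied, both as assumptions.

```python
import random


def random_mix(batch, rng: random.Random, p: float = 0.5):
    """With probability p, replace a sample by its sum with another random sample."""
    mixed = []
    for sample in batch:
        if rng.random() < p:
            other = rng.choice(batch)  # assumed: partner drawn from the same batch
            sample = [a + b for a, b in zip(sample, other)]
        mixed.append(sample)
    return mixed
```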
4.2 Decoding Process
CoDiCodec supports two decoding strategies: autoregressive decoding, and a novel parallel decoding strategy.
Autoregressive Decoding: Autoregressive decoding is well-suited for interactive applications requiring low latency. In this mode, CoDiCodec generates audio sequentially, chunk by chunk, conditioning the generation of each new chunk on the previously decoded one. For a detailed formalization, we refer the reader to the Music2Latent2 paper [18].
Parallel Decoding: While autoregressive decoding is suitable for interactive applications, it can be inefficient for decoding long sequences, as each chunk must be processed sequentially. We introduce a novel parallel decoding strategy that addresses this limitation.
At a high level, we decode adjacent pairs of compressed latents in parallel, and shift the pairs by one at each denoising step to avoid boundary artifacts. More specifically, given a sequence of $T$ sets of summary embeddings, each set encoding information about an audio chunk, we split them into $\lceil T/2 \rceil$ pairs. If $T$ is odd, the last set is paired with a set of zeroed-out summary embeddings. Each pair of summary embeddings is then processed independently. The decoding process involves multiple denoising steps ($S$).
Step 1: Each pair of summary embeddings is decoded by the consistency model in parallel, starting from pure noise representations for both the left and right chunks.
Step $s$ ($1 < s \le S$): The previously decoded chunks are concatenated, and the pairs are shifted by one position. For example, if chunks 0 and 1 were paired in the previous step, chunks 1 and 2 are paired in the current step. Gaussian noise with a decreasing standard deviation $\sigma_{\text{cond},s}$ is added to all chunks. The consistency model then denoises each pair of chunks, conditioned on the corresponding summary embeddings. A linearly decreasing noise schedule ensures that the model gradually refines the decoded audio samples.
This iterative process, with shifting pairs, effectively allows information to propagate across the sequence, mitigating boundary artifacts that would arise from independently decoding fixed pairs. The number of steps, $S$, controls the trade-off between computational cost and reconstruction quality. While the memory usage of autoregressive decoding is constant regardless of the length of the sequence, for parallel decoding it scales linearly with length (number of chunks), since the model performs multiple decoding steps at the same time.
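The pairing pattern of the parallel strategy can be illustrated with a small scheduling sketch. It only computes which chunk indices are paired at each denoising step; the denoising itself is omitted. Two details are assumptions on our part, since the description above does not fix them: the shift alternates between the two possible alignments, and chunks left without a partner at either end of the sequence are paired with a padding slot (`None`, standing for zeroed-out summary embeddings).

```python
def pair_schedule(num_chunks: int, num_steps: int):
    """Return, for each denoising step, the list of (left, right) chunk indices.

    None marks a padding slot (zeroed-out summary embeddings).
    """
    schedule = []
    for step in range(num_steps):
        offset = step % 2  # assumed: a one-position shift alternates the alignment
        indices = [None] * offset + list(range(num_chunks))
        if len(indices) % 2 == 1:
            indices.append(None)  # pad so that every chunk has a partner
        pairs = [(indices[i], indices[i + 1]) for i in range(0, len(indices), 2)]
        schedule.append(pairs)
    return schedule
```

For five chunks and three steps, the first step pairs (0, 1), (2, 3), (4, pad), the second pairs (pad, 0), (1, 2), (3, 4), and the third returns to the first alignment, so every chunk boundary falls inside some pair within two steps.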
4.3 Implementation Details
Architecture: CoDiCodec features a scaled-up architecture compared to Music2Latent2 [18], prioritizing transformer blocks over convolutional layers for ease of scalability [37, 38]. The STFT representation uses hop=1024, compared to 512 in Music2Latent, and window=2048. The convolutional patchifiers and de-patchifiers have 5 resolution levels, compared to 7 in Music2Latent2. We use [3, 3, 3, 1] convolutional layers per level, and [64, 128, 256, 512] channels per level. Downsampling/upsampling are performed 3 times, with a factor of 2 along both time and frequency, except for the middle level, where only the frequency axis is downsampled/upsampled by a factor of 4. The encoder, upsampler, and consistency model each have 12 transformer blocks. These blocks have a hidden_dim=512 (compared to 256 in Music2Latent2), head_dim=128, and mlp_mult=4. For each input chunk, the encoder produces $K = 128$ summary embeddings, each with a dimensionality of $d_{\text{lat}} = 4$. We can then reshape them to 8 embeddings with 64 channels (resulting in ~11 Hz representations of stereo 44.1 kHz audio). Since they are not a temporally ordered sequence, they can be freely reshaped to different time-dimension vs. channels trade-offs. Noise levels ($\sigma$) are encoded using sinusoidal embeddings [33] with 512 channels. Training uses audio excerpts of 67,072 samples (approximately 1.5 seconds at 44.1 kHz), with STFT spectrograms split into two consecutive 32-frame chunks. We use a batch size of 20 and train for 2 million iterations. We use RAdam [39] with a learning rate of $1 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. A cosine learning rate decay is applied, reaching a final learning rate of 0. An Exponential Moving Average (EMA) of the model parameters is maintained with a momentum of 0.9999. For FSQ, we use $N = 5$, resulting in 11 quantization levels per dimension and an implicit codebook size of $11^4 = 14{,}641$, which is much lower than what modern LLMs [40–42] use. Given the 128 tokens per chunk, this results in a 2.38 kbps rate for stereo 44.1 kHz audio. FSQ-dropout is used with $p = 0.75$, following our ablation results. We use the consistency training framework of [17], with an initial consistency step of $\Delta t_0 = 0.1$ and a final exponent of $e_K = 2$. Random mixing data augmentation is applied with a probability of 0.5. Training is performed on a single A100 GPU and takes ~two weeks. The model has ~150 million parameters.
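The reported rates can be checked with a short calculation. The sketch below assumes that one chunk covers 32 frames × 1024 hop samples of audio (ignoring the STFT window overlap at the edges) and that each token costs $\log_2(11^4)$ bits; the variable names are ours.

```python
import math

SAMPLE_RATE = 44_100      # Hz
HOP = 1024                # STFT hop size in samples
FRAMES_PER_CHUNK = 32     # STFT frames per chunk
AUDIO_CHANNELS = 2        # stereo
K, D_LAT = 128, 4         # summary embeddings per chunk and their dimensionality
FSQ_N = 5                 # FSQ parameter N

chunk_seconds = FRAMES_PER_CHUNK * HOP / SAMPLE_RATE          # duration of one chunk
codebook_size = (2 * FSQ_N + 1) ** D_LAT                      # 11^4 implicit codes
bits_per_token = math.log2(codebook_size)
bitrate_kbps = K * bits_per_token / chunk_seconds / 1000      # discrete rate
latent_rate_hz = (K * D_LAT // 64) / chunk_seconds            # 8 x 64-channel embeddings
compression_ratio = FRAMES_PER_CHUNK * HOP * AUDIO_CHANNELS / (K * D_LAT)
```

Under these assumptions the numbers come out at roughly 2.38 kbps, about 10.8 Hz (reported as ~11 Hz), and a 128x ratio of waveform values to latent values.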
5. EXPERIMENTS AND RESULTS
Data: We train CoDiCodec on a combination of three datasets: MTG-Jamendo [43] for music (~3k hours), the speech (~800 hours) and general audio (~200 hours) samples from DNS Challenge 4 [44], and M4Singer [45] for singing voice (~30 hours). We sample the training datasets with weights [4, 1.5, 4, 1], respectively, during training. We choose these weights in order to train CoDiCodec to be robust to speech and general audio, while still focusing on music. Since we are mainly interested in the performance of our model on musical samples, we use MusicCaps [6] as the evaluation dataset. We manually verify that none of the samples in MusicCaps are present in the training sets.
Baselines: For continuous representation baselines, we include: Musika [12], an autoencoder reconstructing magnitude and phase spectrograms; LatMusic [13], an autoencoder designed for latent diffusion models in music accompaniment generation; Moûsai [11], which provides two diffusion autoencoder models (v2 and v3) with differing compression ratios; Music2Latent [17] and Music2Latent2 [18], two consistency-based autoencoders; and the autoencoder used in Stable Audio Open [15, 16]. All these models have compression ratios from 32x to 128x, calculated as waveform values in divided by latent values out. We also include Descript Audio Codec (DAC) [3], a high-fidelity RVQ-based autoencoder producing discrete representations, using both its 2.67 kbit/s and 8 kbit/s configurations.
Metrics: We use: SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) [46], which measures the distance between the reconstructed and original waveforms; ViSQOL (Virtual Speech Quality Objective Listener) [47–49], which estimates pair-wise perceptual audio quality, providing a MOS-like score; FAD (Fréchet Audio Distance) [21], which measures the distance between the distributions of real and generated audio features from a pretrained VGGish [50], assessing overall audio quality; FAD_clap, a variant of FAD that uses CLAP [51] features, shown to better correlate with human perception of audio quality [52].
5.1 Ablation Study
We conduct an ablation study to validate the key design choices of CoDiCodec. We train all ablated models for 400k iterations with a batch size of 20, keeping other training parameters and dataset consistent with the full model.

                 Continuous             Discrete
Model            FAD_clap ↓   FAD ↓     FAD_clap ↓   FAD ↓
M2L2             0.0218       0.784     -            -
+ mix aug.       0.0208       0.745     -            -
+ new arch.      0.0178       0.635     -            -
+ 128 lat.       0.0154       0.568     -            -
+ FSQ            -            -         0.0182       0.704
d.o. p=0.25      0.0173       0.628     0.0184       0.725
d.o. p=0.5       0.0169       0.618     0.0191       0.718
d.o. p=0.75      0.0161       0.599     0.0187       0.716
Table 1. Incremental ablation study.

We start by evaluating the same architecture presented in Music2Latent2. We then incrementally add changes to individually evaluate their effect. We first use the random mixing augmentation, then change to our proposed architecture, then use 128 4-dimensional summary embeddings instead of 8 64-dimensional summary embeddings (both having the same total dimensionality and resulting compression ratio), and finally use FSQ-dropout with varying values of the dropout probability $p$. For each configuration, we report FAD and FAD_clap for both continuous embeddings and discrete tokens, when applicable. Table 1 shows how using the random mixing augmentation, changing to our proposed architecture, and re-distributing the same latent space dimensionality from 8 latents with 64 channels to 128 latents with 4 channels, all independently contribute to lower FAD_clap and FAD. Introducing FSQ performs slightly worse (as expected, due to quantization), but enables discrete tokens. FSQ-dropout with $p = 0.75$ allows us to recover both discrete-token performance similar to the standard FSQ variant and continuous-embedding performance similar to the fully continuous variant. We thus use this configuration for the remaining experiments.
Figure 3. Downstream generative modeling FAD_clap with respect to number of denoising steps.

Model                   Stereo  Representation  Compression Ratio  Bitrate    SI-SDR ↑  ViSQOL ↑  FAD_clap ↓  FAD ↓
Musika                  ✗       Continuous      64x                -          -25.81    3.80      0.103       2.308
LatMusic                ✗       Continuous      64x                -          -27.32    3.95      0.050       1.630
Moûsai_v2               ✓       Continuous      64x                -          -21.44    2.36      0.731       4.687
Moûsai_v3               ✓       Continuous      32x                -          -17.47    2.28      0.647       4.473
Music2Latent            ✗       Continuous      64x                -          -3.85     3.84      0.036       1.176
Music2Latent2           ✓       Continuous      128x               -          -2.29     3.91      0.023       0.717
Stable Audio            ✓       Continuous      64x                -          6.04      4.08      0.107       1.017
CoDiCodec (AR)          ✓       Continuous      128x               -          -0.28     3.95      0.0120      0.390
CoDiCodec (Par., s=3)   ✓       Continuous      128x               -          -0.08     3.94      0.0114      0.355
CoDiCodec (Par., s=4)   ✓       Continuous      128x               -          -0.01     3.95      0.0112      0.344
DAC                     ✗       Discrete        -                  2.67 kbps  2.80      3.87      0.174       3.791
DAC                     ✗       Discrete        -                  8 kbps     9.48      4.21      0.041       0.966
CoDiCodec (AR)          ✓       Discrete        -                  2.38 kbps  -0.95     3.89      0.0136      0.485
CoDiCodec (Par., s=3)   ✓       Discrete        -                  2.38 kbps  -0.74     3.88      0.0130      0.431
CoDiCodec (Par., s=4)   ✓       Discrete        -                  2.38 kbps  -0.66     3.90      0.0127      0.427
Table 2. Audio quality and reconstruction metrics.

5.2 Downstream Generative Modeling
To assess how the FSQ-based constraint on the compressed embeddings affects downstream generative modeling, we train unconditional generative models using Rectified Flow [53] on continuous latent representations from two configurations:
1. Continuous: Embeddings from the “+ 128 lat.” model from the ablation study (Section 5.1), which does not use any FSQ, but a simple tanh bottleneck. The distribution of the latent values follows a Gaussian-like distribution, which we scale to have unit standard deviation over the training data.
2. FSQ-dropout: Embeddings from the “d.o. p=0.75” model, taken without the FSQ rounding operation. In this case, we first apply an atanh operation to project the FSQ-dropout continuous values from a uniform distribution (Fig. 2(b)) into a comparable Gaussian-resembling distribution, and then rescale them to have unit standard deviation.
For each setting, we train a ~100M parameter Rectified Flow DiT [54] for 200k iterations with a batch size of 128, using latents of 10-second samples. We use an internal dataset of 100k single instrument sources as training data. We then generate 1000 samples and evaluate them using FAD_clap, varying the number of DiT denoising steps during generation. In Fig. 3 we show that while both configurations converge to a comparable FAD_clap with a large number of denoising steps, the model trained on FSQ-dropout embeddings (Setting 2) achieves slightly lower FAD_clap when using fewer than 32 denoising steps. We hypothesize that the implicit regularization provided by FSQ-dropout can be beneficial for latent generative modelling: the decoder appears to be slightly more “robust” to noisy generations of the downstream model. We will further investigate this hypothesis in future work.
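The latent preprocessing described for Setting 2 (atanh followed by rescaling to unit standard deviation) can be sketched as below. The clipping constant `eps` is our own assumption, added because atanh diverges at ±1, and the standard deviation is computed over the values passed in, whereas in practice it would be estimated once over the training latents.

```python
import math
import statistics


def gaussianize(latents, eps: float = 1e-6):
    """Map values in (-1, 1) through atanh, then rescale to unit standard deviation.

    Returns the rescaled values and the scale, which is needed to invert the mapping.
    """
    clipped = [min(max(v, -1.0 + eps), 1.0 - eps) for v in latents]
    unbounded = [math.atanh(v) for v in clipped]
    scale = statistics.pstdev(unbounded)
    return [u / scale for u in unbounded], scale
```

Before decoding generated latents, the mapping is undone with `tanh(value * scale)`, which returns values to the $[-1, 1]$ range expected by the upsampler.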
Model                 Encoding (s)  Decoding (s)
Music2Latent2 (AR)    0.44          4.53
Ours (AR)             0.34          3.22
Ours (Par., s=3)      0.34          2.23
Ours (Par., s=4)      0.34          2.89
Ours (Par., s=5)      0.34          3.51
Table 3. Inference speed comparison (60-second audio).
5.3 Audio Quality and Reconstruction
We evaluate CoDiCodec trained as described in Sec. 4.3. Table 2 presents the audio quality and reconstruction accuracy results. We evaluate both autoregressive (AR) and parallel (Par., using 3 and 4 denoising steps) decoding. We also evaluate both continuous (Cont.) and discrete (Disc.) representations. CoDiCodec significantly outperforms all continuous autoencoder baselines in terms of FAD and FAD_clap. While some baselines achieve higher SI-SDR and ViSQOL, they are explicitly trained with reconstruction losses, while CoDiCodec only uses a generative loss: general audio quality is thus prioritised over reconstruction of the exact same signal, which hurts these pairwise metrics. Crucially, the proposed parallel decoding strategy achieves the best audio quality results, for both continuous and discrete representations. We provide samples here¹.
5.4 Inference Speed
We measure the encoding and decoding speed by processing a 60-second audio sample on a single RTX 3090 GPU. Table 3 shows that CoDiCodec achieves faster encoding than Music2Latent2, and also substantially faster decoding using the exact same autoregressive decoding strategy. Parallel decoding can further provide lower times if using fewer than 5 steps. Assuming unlimited memory at our disposal for parallel processing, decoding even longer samples would inevitably widen the gap.
6. CONCLUSION
This paper introduced a novel audio autoencoder producing both continuous embeddings and discrete tokens from a single model, trained end-to-end with a single consistency loss. This is achieved via finite scalar quantization and our proposed FSQ-dropout technique, which allows for expressive continuous latents that perform well for downstream generative modelling. CoDiCodec leverages summary embeddings for high compression and supports both autoregressive and a novel, faster parallel decoding strategy, outperforming existing autoencoders in audio quality metrics. A novel architecture is designed for scalability, focusing on transformer layers. Future work will explore scaling up the model, applying it to diverse audio domains, and investigating its representations for a broader range of MIR tasks, fully exploring the potential of unifying compressed continuous and discrete representations under a single model.
¹ sonycslparis.github.io/codicodec
7. ACKNOWLEDGEMENTS
This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (EP/S022694/1) and Sony Computer Science Laboratories Paris.
8. REFERENCES
[1] N. Zeghidour, A. Luebs et al., “SoundStream: An End-to-End Neural Audio Codec,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 30, 2022.
[2] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
[3] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[4] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[5] P. Dhariwal, H. Jun et al., “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[6] A. Agostinelli, T. I. Denk et al., “MusicLM: Generating Music From Text,” Jan. 2023, arXiv:2301.11325 [cs, eess].
[7] I. J. Goodfellow, J. Pouget-Abadie et al., “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Dec. 2014.
[8] J. Sohl-Dickstein, E. A. Weiss et al., “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, vol. 37, 2015.
[9] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021.
[10] J. Ho, A. Jain et al., “Denoising Diffusion Probabilistic Models,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[11] F. Schneider, Z. Jin et al., “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion,” Jan. 2023, arXiv:2301.11757 [cs, eess].
[12] M. Pasini and J. Schlüter, “Musika! Fast Infinite Waveform Music Generation,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022.
[13] M. Pasini, M. Grachten et al., “Bass accompaniment generation via latent diffusion,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[14] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” in Forty-first International Conference on Machine Learning, 2024.
[15] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Long-form music generation with latent diffusion,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024, San Francisco, California, USA and Online, November 10-14, 2024, 2024.
[16] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[17] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent: Consistency autoencoders for latent audio compression,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024, San Francisco, California, USA and Online, November 10-14, 2024, 2024.
[18] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent2: Audio compression with summary embeddings and autoregressive decoding,” in ICASSP 2025-2025 IEEE International Conference on
Acous ics, Speech and Signal P ocessing (ICASSP).
IEEE, 2025, pp. 1–5.
[19]
Y. Song, P. Dha iwal e al., “Consis ency Models,” May
2023, a Xi :2303.01469 [cs, s a ].
[20]
Y. Song and P. Dha iwal, “Imp o ed echniques
o aining consis ency models,” a Xi p ep in
a Xi :2310.14189, 2023.
[21]
K. Kilgou , M. Zuluaga e al., “F éche audio dis ance:
A e e ence- ee me ic o e alua ing music enhance-
men algo i hms,” in 20 h Annual Con e ence o he
In e na ional Speech Communica ion Associa ion (IN-
TERSPEECH), Sep. 2019.
[22]
A. an den Oo d, O. Vinyals e al., “Neu al disc e e
ep esen a ion lea ning,” in Ad ances in Neu al In o -
ma ion P ocessing Sys ems 30, Dec. 2017.
[23]
A. Raza i, A. an den Oo d e al., “Gene a ing di e se
high- ideli y images wi h VQ-VAE-2,” in Ad ances in
Neu al In o ma ion P ocessing Sys ems 32, Dec. 2019.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
439
[24]
Q. Yu, M. Webe , X. Deng, X. Shen, D. C eme s, and L.-
C. Chen, “An image is wo h 32 okens o econs uc-
ion and gene a ion,” a Xi p ep in a Xi :2406.07550,
2024.
[25]
S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao, “La-
en consis ency models: Syn hesizing high- esolu ion
images wi h ew-s ep in e ence,” a Xi p ep in
a Xi :2310.04378, 2023.
[26]
Z. Ye, W. Xue e al., “Comospeech: One-s ep speech
and singing oice syn hesis ia consis ency model,” in
P oceedings o he 31s ACM In e na ional Con e ence
on Mul imedia, MM 2023, O awa, ON, Canada, 29
Oc obe 2023- 3 No embe 2023, 2023.
[27]
J. Song, C. Meng e al., “Denoising Di usion Implici
Models,” in 9 h In e na ional Con e ence on Lea ning
Rep esen a ions, ICLR 2021, Vi ual E en , Aus ia,
May 3-7, 2021, 2021.
[28]
F. Men ze , D. Minnen, E. Agus sson, and M. Tschan-
nen, “Fini e scala quan iza ion: VQ-VAE made simple,”
in The Twel h In e na ional Con e ence on Lea ning
Rep esen a ions, ICLR 2024, Vienna, Aus ia, May 7-11,
2024, 2024.
[29]
Y. Bengio, N. Léona d, and A. Cou ille, “Es ima -
ing o p opaga ing g adien s h ough s ochas ic neu-
ons o condi ional compu a ion,” a Xi p ep in
a Xi :1308.3432, 2013.
[30]
J. Nis al, S. La ne e al., “DRUMGAN: syn hesis o
d um sounds wi h imb al ea u e condi ioning using
gene a i e ad e sa ial ne wo ks,” in P oceedings o he
21 h In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence (ISMIR), Oc . 2020.
[31]
J. Nis al, S. La ne , and G. Richa d, “Compa ing ep-
esen a ions o audio syn hesis using gene a i e ad e -
sa ial ne wo ks,” in 28 h Eu opean Signal P ocessing
Con e ence (EUSIPCO), Jan. 2020.
[32]
J. Rich e , S. Welke e al., “Speech enhancemen and
de e e be a ion wi h di usion-based gene a i e mod-
els,” IEEE ACM T ans. Audio Speech Lang. P ocess.,
ol. 31, 2023.
[33]
A. Vaswani, N. Shazee e al., “A en ion is all you
need,” in Ad ances in Neu al In o ma ion P ocessing
Sys ems 30, Dec. 2017.
[34]
T. Ka as, M. Ai ala e al., “Elucida ing he Design
Space o Di usion-Based Gene a i e Models,” Oc .
2022, a Xi :2206.00364 [cs, s a ].
[35]
Z. Geng, A. Pokle, W. Luo, J. Lin, and J. Z. Kol e ,
“Consis ency models made easy,” in The Thi een h In-
e na ional Con e ence on Lea ning Rep esen a ions,
2025.
[36]
J. D. Pa ke , A. Smi no , J. Pons, C. Ca , Z. Zukowski,
Z. E ans, and X. Liu, “Scaling ans o me s o low-
bi a e high-quali y speech coding,” a Xi p ep in
a Xi :2411.19842, 2024.
[37]
J. Kaplan, S. McCandlish, T. Henighan, T. B. B own,
B. Chess, R. Child, S. G ay, A. Rad o d, J. Wu, and
D. Amodei, “Scaling laws o neu al language models,”
a Xi p ep in a Xi :2001.08361, 2020.
[38]
J. Ho mann, S. Bo geaud, A. Mensch, E. Bucha skaya,
T. Cai, E. Ru he o d, D. d. L. Casas, L. A. Hen-
d icks, J. Welbl, A. Cla k e al., “T aining compu e-
op imal la ge language models,” a Xi p ep in
a Xi :2203.15556, 2022.
[39]
L. Liu, H. Jiang e al., “On he a iance o he adap i e
lea ning a e and beyond,” in 8 h In e na ional Con e -
ence on Lea ning Rep esen a ions, ICLR 2020, Addis
Ababa, E hiopia, Ap il 26-30, 2020, 2020.
[40]
H. Tou on, T. La il, G. Izaca d, X. Ma ine , M.-A.
Lachaux, T. Lac oix, B. Roziè e, N. Goyal, E. Hamb o,
F. Azha e al., “Llama: Open and e icien ounda ion
language models,” a Xi p ep in a Xi :2302.13971,
2023.
[41]
H. Tou on, L. Ma in, K. S one, P. Albe , A. Alma-
hai i, Y. Babaei, N. Bashlyko , S. Ba a, P. Bha ga a,
S. Bhosale e al., “Llama 2: Open ounda ion and ine-
uned cha models,” a Xi p ep in a Xi :2307.09288,
2023.
[42]
A. G a a io i, A. Dubey, A. Jauh i, A. Pandey, A. Ka-
dian, A. Al-Dahle, A. Le man, A. Ma hu , A. Schel en,
A. Vaughan e al., “The llama 3 he d o models,” a Xi
p ep in a Xi :2407.21783, 2024.
[43]
D. Bogdano , M. Won e al., “The m g-jamendo da ase
o au oma ic music agging,” in Machine Lea ning o
Music Disco e y Wo kshop, In e na ional Con e ence
on Machine Lea ning (ICML 2019), Long Beach, CA,
Uni ed S a es, 2019.
[44]
H. Dubey, V. Gopal e al., “Icassp 2022 deep noise sup-
p ession challenge,” in IEEE In e na ional Con e ence
on Acous ics, Speech and Signal P ocessing, ICASSP
2022, Vi ual and Singapo e, 23-27 May 2022, 2022.
[45]
L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He,
R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4singe :
A mul i-s yle, mul i-singe and musical sco e p o ided
manda in singing co pus,” in Ad ances in Neu al In-
o ma ion P ocessing Sys ems, ol. 35, 2022, pp. 6914–
6926.
[46]
J. L. Roux, S. Wisdom e al., “SDR - hal -baked o well
done?” in IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing, ICASSP 2019, B igh on,
Uni ed Kingdom, May 12-17, 2019, 2019.
[47]
A. Hines, J. Skoglund e al., “Visqol: an objec i e
speech quali y model,” EURASIP J. Audio Speech Mu-
sic. P ocess., ol. 2015, 2015.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
440
[48]
C. Sloan, N. Ha e e al., “Objec i e assessmen o pe -
cep ual audio quali y using isqolaudio,” IEEE T ans.
B oadcas ., ol. 63, no. 4, 2017.
[49]
M. Chinen, F. S. C. Lim e al., “Visqol 3: An open
sou ce p oduc ion eady objec i e speech and audio me -
ic,” in Twel h In e na ional Con e ence on Quali y o
Mul imedia Expe ience, QoMEX 2020, A hlone, I eland,
May 26-28, 2020, 2020.
[50]
S. He shey, S. Chaudhu i, D. P. W. Ellis, J. F. Gem-
meke, A. Jansen, R. C. Moo e, M. Plakal, D. Pla , R. A.
Sau ous, B. Seybold, M. Slaney, R. J. Weiss, and K. W.
Wilson, “CNN a chi ec u es o la ge-scale audio clas-
si ica ion,” in 2017 IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing, ICASSP 2017,
New O leans, LA, USA, Ma ch 5-9, 2017. IEEE, 2017,
pp. 131–135.
[51]
Y. Wu, K. Chen e al., “La ge-scale con as i e
language-audio p e aining wi h ea u e usion and
keywo d- o-cap ion augmen a ion,” in IEEE In e na-
ional Con e ence on Acous ics, Speech and Signal P o-
cessing ICASSP 2023, Rhodes Island, G eece, June
4-10, 2023, 2023.
[52]
M. Tailleu , J. Lee e al., “Co ela ion o
’eche au-
dio dis ance wi h human pe cep ion o en i onmen-
al audio is embedding dependan ,” a Xi p ep in
a Xi :2403.17508, 2024.
[53]
X. Liu, C. Gong, and Q. Liu, “Flow s aigh and as :
Lea ning o gene a e and ans e da a wi h ec i ied
low,” in The Ele en h In e na ional Con e ence on
Lea ning Rep esen a ions, ICLR 2023, Kigali, Rwanda,
May 1-5, 2023, 2023.
[54]
W. Peebles and S. Xie, “Scalable di usion models wi h
ans o me s,” in IEEE/CVF In e na ional Con e ence
on Compu e Vision, ICCV 2023, Pa is, F ance, Oc o-
be 1-6, 2023, 2023.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
441