QUANTIZE & FACTORIZE: A FAST YET EFFECTIVE UNSUPERVISED
AUDIO REPRESENTATION WITHOUT DEEP LEARNING
Jaehun Kim Matthew C. McCallum Andreas F. Ehmann
SiriusXM+Pandora, USA
[email protected]
ABSTRACT
Foundation models have become increasingly prevalent in tackling Music
Information Retrieval (MIR) tasks. Although they can be a powerful tool
for understanding music, the computation required for the training and
inference of these models continues to grow as they become more complex.
Specialized acceleration, such as Graphics Processing Units (GPUs), has
become necessary for operating these models, as they are mostly based on
large Deep Learning (DL) architectures. Furthermore, it is difficult for
users to interpret them due to their black-box nature. In this work, we
propose Quantizers and Factorizers for Music embeddings (QFM), a fast,
unsupervised audio representation for music understanding backed by a
wide range of rich MIR features and efficient feature learners.
Experimental results show that QFM models perform within the range of
results achieved by recent open source DL models on all evaluated tasks,
with competitive results on a subset. This is surprising given the
significantly smaller computational requirements of QFM models for
training and inference.
1. INTRODUCTION
Computing data representations through pre-trained foundation models is
a common Machine Learning (ML) practice which is widely applied to the
field of MIR [1]. These models essentially replace the role of
conventional music audio features [2,3]. Typically, DL models have been
the core of the recent evolution of foundation models in MIR. Various
works explored new architectures and learning objectives, leading to
substantial improvement of foundation models recently [4, 5, 6, 7, 8].
However, non-DL representation learning in MIR is notably understudied,
leaving only a handful of works [9, 10, 11] after DL methods became
almost ubiquitous in ML practices. However, given increasing concerns
about some weaknesses, such as lack of interpretability [12] or extensive
energy consumption [13], it seems worthwhile to revisit non-DL
alternatives to the foundation models, especially with regard to
computational efficiency of training and embedding computation, and
possibly for better interpretability.

© J. Kim, M. C. McCallum, A. F. Ehmann. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
J. Kim, M. C. McCallum, A. F. Ehmann, “Quantize & Factorize: A fast yet
effective unsupervised audio representation without deep learning”, in
Proc. of the 26th Int. Society for Music Information Retrieval Conf.,
Daejeon, South Korea, 2025.
This work proposes a framework for learning unsupervised music
representation as an efficient yet effective alternative to DL models.
The primary goal of the framework is to address the computational
efficiency of training and inference while maintaining comparable
effectiveness. Specifically, we achieve these goals by employing
established MIR features and encoding their information through efficient
parallel dictionary learners, namely Quantization – Factorization (QF)
modules. We conduct experiments similar to recent literature on music
foundation models [8,7,6] to show the effectiveness of QFM embeddings.
The results suggest that the method can achieve effectiveness comparable
to some recent DL models while being significantly faster in training
and inference.
Our contributions in this work are as follows: 1) we introduce QFM, a
low-resource foundation model useful for various MIR downstream tasks.
2) we provide QFM as a generic architecture in which feature engineering
and machine learning practices are almost equally important, encouraging
collaborations between wider audiences within the community. We stress
that this work does not aim to show state-of-the-art performance.
Instead, we aim to provide an alternative, MIR feature-oriented ML model
comparable to some DL-based models in effectiveness, while achieving
notable computational efficiency.
2. RELATED WORK
Early works in unsupervised music representation learning include
probabilistic component models such as the Gaussian Mixture Model (GMM)
or its extension to infinite mixtures [14,9]. Other popular methods are
dictionary learning approaches such as sparse coding or vector
quantization [15, 10, 16, 17]. Essentially, both approaches try to
represent the music as a sparse response vector to the learned latent
components.
Later, the field evolved to seek more high-capacity, nonlinear models.
Deep Belief Networks (DBN) were explored as an unsupervised, nonlinear
representation learner for spectral feature frames from short chunks of
audio [18, 19]. Later, various types of Convolutional Neural Network
(CNN) architectures were explored for representation learning in the
transfer learning context [2, 20, 21, 22, 23, 24]. Another breakthrough
was contrastive learning [25], which is one of the most popular and
successful learning objectives today for unsupervised representation
learning [6, 7]. Recent developments in natural language modeling
inspired new architectural approaches, such as transformers, and have
also been reported to be effective in learning music representation
[8, 5]. Other notable developments include semi-supervised approaches
[26, 27], multi-modal learning approaches [28,29,4] and the use of music
generative models as pre-trained foundation models [30,31].

[Figure 1 diagram. (a) QFM embeddings: each of the feature sets 1..N
feeds a G1 module and a QF module, and a QF total module covers all
feature sets; the outputs pass through Scale and PCA. (b) QF module:
feature vectors, ZScore, PCA (whiten), KMeans/GMM (Quantization),
UniGram, WMF (Factorization), embedding.]
Figure 1: The overall architecture of QFM (a) and detailed illustration
of submodules of each QF module (b).
However, these high capacity models often require extensive data and
computation for training and inference, raising concerns about excessive
energy consumption [13]. In addition, the lack of interpretability of
such models is considered one of the main drawbacks and is extensively
tackled within the MIR field [32,33] and beyond [12].
3. METHODOLOGY
To address some of the aforementioned shortcomings of DL-based
foundation models, we propose an unsupervised music representation
learning framework that is highly efficient, yet effective. At the
framework’s core are Quantization–Factorization (QF) modules and
multi-aspect feature fusion. QF modules serve as dictionary learners for
each feature set, comprising groups of MIR features. Figure 1
illustrates the diagrams of the general architecture of QFM (a) and the
QF module (b). In this section, we describe each component of QFM in
detail.
3.1 MIR Feature Sets
The features are the most critical building block of QFM, as we employ
shallow feature learners. However, the rationale of QFM is that shallow
learners provide sufficient representation if enough feature sets are
provided. We use 5 sets of features in total, which are listed in
Table 1.
The selected features are grouped into 5 categories following the
categorization of librosa [40]¹. From an input audio chunk, all features
are computed and grouped, and then passed to the corresponding
group-specific QF modules. For audio features, we set the sampling rate
and hop size to 22kHz and 512, respectively. Other parameters vary
across features, where we use the defaults specified in the librosa
package. 2-D patches are randomly sampled from a nonoverlapping 9×9 tile
grid on a 96-bin log mel spectrogram.

¹ https://librosa.org/doc/0.10.2/index.html

Group       Sub Features                                          Dim
MFCC        96 MFCC coefficients excluding the first               95
CQT         84-bin log Constant-Q power spectrum                   84
Rhythmic∗   Multi-band Onset Strength† (4),                        18
            Tempogram Ratio [34,35] (13),
            normalized BPM estimation (1)
Spectral    Spectral { Centroid [36] (1), Bandwidth [36] (1),      18
            Contrast [37] (7), Flatness [38] (1), Rolloff (1) },
            Zero-crossing Rate (1), Tonnetz [39] (6)
Patches     (9×9) patches from log mel spectrogram                 81
Total                                                             296

Table 1: List of feature groups and the features belonging to each
group. The numbers in parentheses give the dimensionality of the
feature. ∗We apply the log transform on the Rhythmic feature group.
†Each 4 bands are divided by {16,32,64}-th bins among 96 mel bins.
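The patch sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_patches`, sampling without replacement, and dropping the 6 mel bins that do not fill a complete 9-bin tile are all assumptions, and the random array only stands in for a real log mel spectrogram.

```python
import numpy as np

def sample_patches(log_mel, n_patches, patch=9, seed=0):
    """Sample flattened patch x patch tiles from a non-overlapping grid."""
    n_mels, n_frames = log_mel.shape
    rows, cols = n_mels // patch, n_frames // patch  # complete tiles only (assumption)
    rng = np.random.default_rng(seed)
    # Pick distinct tile positions on the grid, then cut each tile out.
    picks = rng.choice(rows * cols, size=min(n_patches, rows * cols), replace=False)
    tiles = []
    for idx in picks:
        r, c = divmod(int(idx), cols)
        tile = log_mel[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        tiles.append(tile.ravel())  # 9 x 9 tile -> 81-dimensional vector
    return np.stack(tiles)

# Stand-in for a 96-bin log mel spectrogram of one audio chunk.
log_mel = np.random.default_rng(1).normal(size=(96, 388))
patches = sample_patches(log_mel, n_patches=60)
print(patches.shape)  # (60, 81)
```

Each row is one 81-dimensional observation, matching the dimensionality of the Patches group in Table 1.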
3.2 QF module
A QF module is composed of cascading submodules of quantization and
factorization. The role of the quantization module is to map each input
feature vector within the input audio chunk to a sequence of discrete
codes. For instance, a set of feature vectors is computed from audio
frames and then a quantizer, such as K-Means clustering, can assign each
vector to the closest cluster centroid index. Such quantization has been
a common method in dictionary learning [17,10] and also recently in
language-model-inspired audio models [8].
The factorization submodule encodes the quantized code sequence into
latent vectors using factorization models. In the aforementioned
transformer models [8], transformer layers are chosen for this step.
Instead, we take a simplified assumption of code exchangeability within
an audio chunk [41], with which we wish to capture a sufficient amount
of information while enjoying significant computational efficiency. We
can then formulate a sparse unigram matrix whose rows and columns
correspond to each audio chunk within the dataset and the codes (i.e.,
cluster index), respectively. Thus, each entry indicates how frequently
each code is present for a given audio chunk after quantization. With
such simplification, an efficient latent model such as Weighted Matrix
Factorization (WMF) [42] can be applied to encode the resulting factors.
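As a minimal sketch of the two steps just described, the snippet below assigns each feature vector to its closest centroid and then counts code occurrences for one chunk. The centroids and frames are made-up toy values, and the helper names `quantize` and `unigram` are hypothetical; a trained K-Means model would supply the centroids in practice.

```python
import numpy as np

def quantize(features, centroids):
    """Return the index of the closest centroid for every feature vector."""
    # Squared Euclidean distances, shape (n_vectors, n_codes).
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def unigram(codes, n_codes):
    """Count how often each code occurs in one chunk (order is ignored)."""
    return np.bincount(codes, minlength=n_codes)

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])            # toy codebook
frames = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]])  # toy chunk
codes = quantize(frames, centroids)
counts = unigram(codes, n_codes=len(centroids))
print(codes.tolist())   # [0, 1, 1, 2]
print(counts.tolist())  # [1, 2, 1]
```

Stacking one such count vector per audio chunk gives the rows of the sparse unigram matrix; because only counts are kept, any reordering of the frames yields the same row, which is the exchangeability assumption in code form.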
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

While the QF module is a generic framework that suits many submodule
options, we chose a specific configuration: 1) For quantization, we use
a pipeline combining an optional standard scaler, then whitening
Principal Component Analysis (PCA), followed by K-Means clustering or
Gaussian Mixture Models (GMMs), which has been proven effective [17].
2) The factorization includes the unigram representation [43] and WMF.
The training of QF modules is conducted module-by-module sequentially,
from the first submodule of the quantization to the last submodule of
factorization. Once trained, the new embeddings can be efficiently
inferred from the unigram code frequency vectors of unseen feature
vectors [42].
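One way to picture inference for an unseen chunk is a "fold-in" step: the learned code factors stay fixed and only the chunk's latent vector is solved for. The sketch below assumes a common implicit-feedback WMF formulation (confidence `1 + alpha * count`, binary presence targets, L2 regularization); whether this matches [42] or the exact settings used here is an assumption, and `fold_in`, `alpha`, and `reg` are illustrative names.

```python
import numpy as np

def fold_in(counts, code_factors, alpha=5.0, reg=0.5):
    """Infer the latent vector of one unseen chunk with the code factors fixed."""
    n_codes, dim = code_factors.shape
    conf = 1.0 + alpha * counts        # confidence weight per code (assumed form)
    pref = (counts > 0).astype(float)  # binary "code is present" target
    # Normal equations of the weighted ridge regression for this chunk.
    A = code_factors.T @ (conf[:, None] * code_factors) + reg * np.eye(dim)
    b = code_factors.T @ (conf * pref)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 3))  # toy factors: 8 codes, 3 latent dimensions
counts = np.array([4, 0, 0, 1, 0, 0, 2, 0], dtype=float)
embedding = fold_in(counts, Y)
print(embedding.shape)  # (3,)
```

The cost per chunk is one small linear solve of size D, which is part of why embedding computation after feature extraction is cheap.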
3.3 G1 and QF total
Other feature encoders within QFM are G1 and QF total. G1 refers to a
single multivariate Gaussian, where we compute the mean and standard
deviation of each feature set from an audio chunk. The concatenation of
these Gaussian parameters is often used as a baseline feature in the
literature [2, 9, 31]. However, we include it as a submodule of QFM to
preserve the original raw characteristic of the features [17], which we
lose during the QF embedding process as factorizers only observe
quantized codes, not the raw features directly. Also, since QFs do not
consider sequential dependency, the standard deviations of G1 provide a
valuable summary of temporal variability within features. We compute the
G1 and QF factors for each feature set we consider.
The QF total is a separate factorization module that factorizes the
total count matrix where per feature-set unigram count matrices are
stacked horizontally. The process thus captures the interaction between
feature sets, whereas independent QF modules are isolated to each
feature set.
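Both encoders reduce to simple array operations, sketched below under stated assumptions: the random arrays stand in for real features, every feature set is given the same toy number of frames, and the count matrices are made-up values. The feature-set dimensionalities are the ones listed in Table 1.

```python
import numpy as np

def g1(features):
    """Mean and standard deviation over time for a (n_frames, dim) feature set."""
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])

# Dimensionalities of the five feature sets listed in Table 1.
dims = {"MFCC": 95, "CQT": 84, "Rhythmic": 18, "Spectral": 18, "Patches": 81}
rng = np.random.default_rng(0)
chunk = {name: rng.normal(size=(50, d)) for name, d in dims.items()}  # stand-in data

g1_vector = np.concatenate([g1(x) for x in chunk.values()])
print(g1_vector.shape)  # (592,), i.e. 2 * 296

# QF total input: per-feature-set unigram counts stacked side by side (toy sizes).
counts_a = np.array([[2, 0, 1], [0, 3, 0]])  # 2 chunks x 3 codes
counts_b = np.array([[1, 1], [0, 2]])        # 2 chunks x 2 codes
total_counts = np.hstack([counts_a, counts_b])
print(total_counts.shape)  # (2, 5)
```

The 592 dimensions are two statistics for each of the 296 feature dimensions, and the stacked matrix lets a single factorization see co-occurrences of codes from different feature sets.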
3.4 QFM Inference
QFM computes an embedding for a given audio chunk as follows: 1) For a
given audio chunk, N feature sets are computed. 2) These values are
passed to G1 modules, feature-specific QF modules, and the QF total
module. Finally, 3) scaling and dimensionality reduction are performed
using a “robust” scaling² and PCA on the concatenated total embedding.
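Robust scaling replaces the mean and standard deviation of standard scaling with the median and interquartile range. A minimal NumPy sketch follows; the helper name `robust_scale` and the guard for constant columns are assumptions added for illustration.

```python
import numpy as np

def robust_scale(X):
    """Center each column by its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = np.where(q3 - q1 > 0, q3 - q1, 1.0)  # guard against constant columns
    return (X - median) / iqr

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier
print(robust_scale(X).ravel().tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier barely affects how the other four values are scaled, whereas a mean and standard deviation computed on the same column would be dominated by it.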
4. EXPERIMENT
To verify the effectiveness of QFM, we conducted an experiment inspired
by previous work. The experiment is designed to test a learned music
audio representation on a range of downstream tasks, employing several
publicly available datasets. We generally follow the high-level design
and details from [7] yet simplify the dataset selection, adding a few
additional analyses such as an ablation study.
4.1 QFMs
We compare the usefulness of representations: 1) within different
configurations of QFMs, and 2) between selected open source foundation
models established recently.
The details of the QFM configurations can be found in Table 2. We
consider 5 different QFM configurations to examine the factors affecting
performance and efficiency. To that end, the model ranges from small to
large in terms of parameterization and volume of the training dataset.
For example, the number of components K is set to 512 for the nano
model, and 8192 for the large model, while we generally apply the same
factor dimensionality D to factorizers per feature set. The final
dimensionality of PCA D′ is automatically determined by the desired
explained variance ratio of the PCA, set to 97%. We use diagonal
covariance GMMs for the smaller models (micro, nano) instead of K-Means
since they are designed to test faster inference by setting the number
of components K small. As GMMs tend to better fit the data due to
increased flexibility, we use them to compensate for the performance
trade-off due to the small model size.

² The robust scaling method is a non-parametric alternative to the
standard scaling where we use the median and interquartile range instead
of mean and standard deviation, which leads to a scaling more robust to
outliers [44].

QFM/Models  Trainset  Quantizer  W(s)   K     D    D′
nano        FMA       GMM        9      512   128  762
micro       FMA       GMM        9      1024  256  1350
small       FMA       K-Means    9      4096  256  1266
medium      FMA       K-Means    9      8192  256  1219
large       MSD       K-Means    9      8192  256  1167
G1                               9                 592
CLMR        MTT                  2.6               512
MusicFM     FMA       Random     29.1              1024
MULE        Musicset             3                 1728

Table 2: Configurations of QFM models and of the other foundation models
compared. W, K, D, D′ refer to the input audio length in seconds, the
number of components/clusters of quantizers, the dimensionality of the
WMF factors, and the final dimensionality of resulting embeddings.
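Choosing D′ from a target explained variance ratio can be sketched with a plain SVD. This is an illustrative stand-in for a PCA implementation, with the hypothetical helper `n_components_for` and a toy data matrix; tie-breaking at the threshold may differ slightly from a given library.

```python
import numpy as np

def n_components_for(X, target=0.97):
    """Smallest number of principal components reaching the target variance ratio."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative explained variance ratio
    return int(np.searchsorted(ratio, target) + 1)

# Toy data: the first axis carries about 99% of the variance.
X = np.array([[10.0, 0.0], [-10.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(n_components_for(X, target=0.97))   # 1
print(n_components_for(X, target=0.995))  # 2
```

Because D′ depends on the data rather than being set by hand, the final dimensionalities in Table 2 differ between configurations even when K and D match.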
For each feature set, we apply slightly different configurations of QF
modules for better performance: 1) we generally apply the standard
scaling as an initial step of quantization, except for the CQT, where we
assume that the average magnitude of specific notes or octaves can be
useful information. 2) We set a different hyperparameter for each PCA
within each quantizer, where we set 95%, 94%, and 99% of the explained
variance ratio for the MFCC, CQT, and Patches feature groups,
respectively. In contrast, we use the automatic method [45] to find the
optimal dimensionality for the Rhythmic and Spectral features, where the
number of features is smaller. Finally, while we generally apply the
same dimensionality reported in Table 2 for WMF, we halve the number for
the Rhythmic features. This is due to the observation that the number of
unique codes with nonzero frequency in this feature group is much lower
for each audio chunk, meaning the code count matrix is sparser. This is
primarily because the tempogram ratio feature is generally stable over
the given audio chunk. Another exception is QF total, where we double
the dimensionality D as it factorizes concatenated, wider matrices that
include all features.
Finally, there are some hyperparameters that are applied globally: we
set 5 and 0.5 for the weighting and regularization coefficients of WMF,
and we filter components (columns) in a unigram matrix if they are
present in less than 0.1% of songs within the training set. Also, the
length of the audio chunk during training time is set to 9 seconds due
to the default setup of the tempogram ratio features. The embeddings can
still be computed with shorter audio chunks at inference time, with a
potential loss of certain information in the rhythmic features.
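The rare-code filter can be sketched as a column mask over the unigram matrix. In this illustration each row is treated as one song, which is an assumption (the training rows are chunks sampled from songs), and the helper name `keep_frequent_codes` is hypothetical; the toy call uses a large threshold only so the effect is visible on four rows.

```python
import numpy as np

def keep_frequent_codes(counts, min_fraction=0.001):
    """Drop unigram columns whose code appears in too few rows."""
    presence = (counts > 0).mean(axis=0)  # fraction of rows containing each code
    keep = presence >= min_fraction
    return counts[:, keep], keep

counts = np.array([[1, 0, 2],
                   [3, 0, 0],
                   [0, 0, 1],
                   [2, 1, 0]])
filtered, keep = keep_frequent_codes(counts, min_fraction=0.5)  # toy threshold
print(keep.tolist())   # [True, False, True]
print(filtered.shape)  # (4, 2)
```

With the default `min_fraction=0.001`, the same function corresponds to the 0.1% threshold stated above.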
4.2 Baselines and Other Foundation Models
We compare QFMs to audio foundation models from the literature,
specifically within the scope of unsupervised unimodal audio
representations.
G1 is the concatenation of the G1 features described in Section 3.3,
followed by the robust scaling. We include all 5 feature sets in
Table 1, which results in a total dimensionality of 592. This serves as
the baseline for all comparisons. We take 9 second audio chunks with a
3 second overlap as the standard input following QFM models.
CLMR [6] is a 1D CNN model, which shows the effectiveness of the
contrastive loss [25] on the unsupervised music representation when
combined with data augmentation. We use non-overlapping 2.6 second audio
chunks as input following the original work [6].
MULE [7] is another unsupervised audio foundation model trained with a
similar objective. The main differences from CLMR are the training
dataset and the architecture. MULE employs a 2D CNN based on [46]
instead of a 1D CNN and is trained on Musicset, which is a proprietary
dataset of only music audio unlike datasets like Audioset [47,7]. It
also applies advanced sampling and data augmentation strategies. The
work took the global average pooling on timeline embeddings produced
from overlapping 3 second audio chunks, used later for downstream tasks.
As an exception, we reprint the results of MULE from its original work
instead of reproducing them.
MusicFM [8] is a transformer-based foundation model employing masked
token modeling, inspired by language models. We chose a version trained
on the FMA-large dataset, which is the same dataset used for training
QFM models and is publicly available. We use non-overlapping 29.1s
window lengths for input audio chunking³.
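The different chunking schemes can be compared with a small helper that lists window start times. This is only a sketch: `chunk_starts` is a hypothetical name, and dropping a trailing remainder shorter than one window is an assumption, since the handling of clip ends is not specified here.

```python
def chunk_starts(clip_seconds, window, overlap=0.0):
    """Start times (in seconds) of all full windows that fit inside the clip."""
    hop = window - overlap
    starts = []
    i = 0
    # Small tolerance so a window ending exactly at the clip end is kept.
    while i * hop + window <= clip_seconds + 1e-9:
        starts.append(round(i * hop, 3))
        i += 1
    return starts

print(chunk_starts(30.0, window=9.0, overlap=3.0))  # [0.0, 6.0, 12.0, 18.0]
print(len(chunk_starts(30.0, window=2.6)))          # 11
print(len(chunk_starts(29.1, window=29.1)))         # 1
```

For a 30-second clip this gives four overlapping 9-second windows for QFM and G1, eleven 2.6-second windows for CLMR, and a single window for MusicFM on a 29.1-second clip.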
4.3 Datasets
4.3.1 Training Set
For training QFM models, we employ the audio data from the FMA
dataset [48] and the Million Song Dataset (MSD) [49]. In particular, we
use the ‘FMA-large’ version, which contains 106,574 30-second song
segments, for all QFM variants introduced in Table 2, except the large
model, where we employ the entire one million song previews from MSD.
For efficient training, we sample 3 9-second audio chunks from each
audio file and use these samples as observations. The per-feature vector
models within each quantizer module (i.e., PCA, K-Means) are trained on
further subsampled vectors from these chunks, where we sample 60 vectors
per chunk, except the large model, where we sample 200 vectors per chunk
instead.
4.3.2 Downstream Datasets
For downstream datasets, we generally follow [7], selecting a subset.
The overall description of the downstream datasets can be found in
Table 3.
GSkey dataset is the 24-way key-scale classification dataset introduced
in [50]. Following previous work [31, 7], we use the second version of
the dataset⁴ for training and validation, and the first version is used
for testing. We apply the weighted accuracy to measure performance⁵.

³ https://github.com/minzwon/musicfm
⁴ https://github.com/GiantSteps/giantsteps-key-dataset

Dataset   Task        Probe  #Items  #Labels  Avg. Secs  Hrs
MTT       tagging     MLP    26k     50       29         209
NSynthP   pitch       LR     306k    112      4          340
NSynthI   instrument  LR     306k    11       4          340
GTZAN     genre       LR     930     10       30         7.8
Emo       emotion     MLP    744     N/A      45         9.3
Jam-MT    mood/theme  MLP    18k     56       219        1.1k

Table 3: Details of downstream task datasets. The probe setup follows
previous works [7,31]. MLPs have a single hidden layer with 512 units
except on the Emo dataset, which has 1024 units. LR refers to Logistic
Regression.
MTT is a popular music tagging dataset [51]. Following [7] we use the 50
most frequent tags with the commonly used data split from [52]. The
approach leaves a small subset of songs not annotated with any tags,
which we still keep to better compare with the previous works. The
effectiveness of this task is measured by AUC-ROC (AUC) [53] and mean
Average Precision (mAP).
NSynth contains audio recordings of various instrumental sounds
generated from commercial audio sample libraries [54]. Two main tasks
are pitch detection and instrument classification. We take both tasks
and use the official split from the dataset. For both tasks, we measure
effectiveness by classification accuracy.
GTZAN is a dataset for genre classification [55]. We apply the
fault-filtered split developed in [56,57]⁶. Similarly to the NSynth
tasks, we measure accuracy.
Emo Music dataset provides time-continuous annotation of coordinates in
the Arousal-Valence (AV) space, collected by multiple annotators for a
total of 744 songs [58]. We simplify the problem to a per-song
multi-output regression where we take the average value of each arousal
and valence dimension, across annotators and over the timeline, as the
global coordinate estimates of the song in the AV space. We use the
artist-based split in [31], and the coefficient of determination
($R^2$) is measured separately on arousal ($R^2_a$) and valence
($R^2_v$) values, to assess performance.
Jam-MT is a part of the MTG-Jamendo dataset specialized in music mood
tagging [59]. We employ the official split provided, using the full 56
tags. Similarly to MTT, we use AUC and mAP metrics for this dataset.
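For the Emo task, the coefficient of determination is computed per output dimension. A minimal sketch with made-up arousal targets and predictions follows; the helper name `r2` is hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

arousal_true = np.array([0.1, 0.5, 0.9])  # toy per-song averages
arousal_pred = np.array([0.2, 0.5, 0.8])
print(round(float(r2(arousal_true, arousal_pred)), 4))  # 0.9375

# Predicting the mean for every song scores 0; worse predictions go negative.
print(float(r2(arousal_true, np.full(3, 0.5))))  # 0.0
```

Unlike accuracy or AUC, this metric has no lower bound, so a poorly fitted probe can produce strongly negative values.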
4.4 Other Experimental Details
We employed a simplified model selection protocol for the main
downstream task experiments while generally following [7, 31]. We choose
the best model type for each task found by [7] and run a hyperparameter
tuning similar to [31] within each model setup per dataset. For example,
we use a Multi-Layer Perceptron (MLP) classifier for the MTT dataset
with a single hidden layer of dimensionality 512. At the same time, the
optimal hyperparameters for other factors such as learning rate,
regularization, and standardization are determined according to best
performance on the validation set, for QFM and each of the other
foundation models with which we compare.

⁵ https://www.music-ir.org/mirex/wiki/2021:Audio_Key_Detection
⁶ https://github.com/jongpillee/music_dataset_split

For all QFM models and other foundation models, we apply a
multiple-instance learning approach [60] to downstream tasks when the
audio clip length exceeds their recommended input window length [46],
except when the audio input is shorter than that (i.e., NSynth dataset).
The probability estimates⁷ of trained probe models are averaged over the
timeline to determine the final prediction. We apply this score-pooling
approach to all downstream tasks. All QFMs use a 9s window with 3s
overlap, and other models use the window and overlap size described in
Section 4.2. Further, we run 5 runs considering the randomness of MLP
training and report the distribution through boxplot visualization. In
contrast, linear models report no variability in the final result.
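Score pooling amounts to averaging the per-chunk outputs of the probe before deciding. The sketch below uses made-up probabilities for a three-class task; `pool_scores` is a hypothetical helper, and taking the argmax applies to single-label classification only (tagging tasks would keep the averaged scores).

```python
import numpy as np

def pool_scores(chunk_probs):
    """Average per-chunk class probabilities into one clip-level prediction."""
    clip_probs = np.mean(chunk_probs, axis=0)
    return clip_probs, int(np.argmax(clip_probs))

# Toy probe outputs for three chunks of one clip, three classes.
chunk_probs = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.3, 0.6, 0.1]])
clip_probs, label = pool_scores(chunk_probs)
print(label)  # 1
```

Although the first chunk favors class 0, the clip-level average favors class 1, which illustrates how pooling smooths over locally misleading chunks.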
We implement QFM through ML modules provided by scikit-learn [61] and
employ librosa [40] for feature computation. Finally, we use implicit⁸
for the WMF implementation with a simple custom scikit-learn wrapper. As
for hardware, for all computations, including training, embedding
extraction, and experiments, we employ a Virtual Machine (VM) with 32
virtual CPU (vCPU) cores and 117 GB memory without any accelerators such
as GPUs, where each vCPU core runs a single thread. We run the
experiment on an Ubuntu 24.04 Linux system, with Python 3.10.
5. RESULT AND DISCUSSION
5.1 Effectiveness

[Figure 2 plots. Panes: NSynth i (Acc), NSynth p (Acc), GS key (W.Acc),
GTZAN (Acc), MTT (AUC), MTT (mAP), Emo ($R^2_a$), Emo ($R^2_v$),
JAM-MT (AUC), JAM-MT (mAP). Horizontal axis: nano, micro, small, medium,
large. Legend (model): G1, CLMR, MusicFM, MULE.]
Figure 2: Performance on test datasets. Boxplots in each pane indicate
the performance distribution of QFM variants, while horizontal lines
refer to the mean performance of baseline foundation models. We indicate
the metrics used for each dataset in the box on top of each pane.

⁷ logits are used for regression problems.
⁸ https://github.com/benfred/implicit

QFM/Models  Avg. Perf.  HMTT (Hrs)
G1          0.5964      0.3416 (±0.0024)
nano        0.6343      0.4561 (±0.0050)
large       0.6480      0.7008 (±0.0081)
CLMR        0.5269      2.7503 (±0.0010)
MusicFM     0.6638      8.8634 (±0.1744)

Table 4: Overall performance and efficiency of selected foundation
models measured on the MTT dataset. ‘Avg. Perf.’ refers to the averaged
performance measure across all tasks considered.
We first examine the overall effectiveness of QFM in downstream tasks.
As depicted in Figure 2 and Table 4, the QFM embeddings are more
effective than the CLMR and G1 embeddings, although generally worse than
the more recent DL models MULE and MusicFM. In particular, there are a
few tasks in which some QFMs are on par with or outperform MULE and
MusicFM, such as GSkey, NSynthP, and Emo (Arousal). On the other hand,
QFM shows weakness especially in tagging tasks such as MTT and Jam-MT,
while they might still outperform some DL models like CLMR. The average
result suggests that QFMs are generally comparable to DL
representations, substantially better than CLMR and a few points behind
MULE and MusicFM, as shown in Table 4 and Figure 2.
Within QFM, the overall effectiveness roughly positively correlates with
the ‘size’ of the QFM configuration. Some exceptions are the cases where
all QFM models show similarly good performance, such as GSkey and Emo
(Arousal), or NSynthI, where large performs relatively worse than
smaller models.
5.2 Efficiency
We evaluate the efficiency of QFM by measuring the embedding computation
time and compare with other foundation models⁹. We measure inference
time over the MTT dataset consisting of 25,860 29.1-second preview audio
clips, equivalent to approximately 209 audio hours. We measure HMTT, the
time spent in hours computing the embeddings of the entire audio clips
within MTT, using each model in the computing environment described in
Section 4.4. We report the average and standard deviation after running
5 runs.
The result suggests that the QFMs are significantly faster than both DL
foundation models. In particular, QFM large and nano are faster than
MusicFM by factors of more than 12 and nearly 20, respectively. Compared
to the relatively lightweight CLMR, they are faster by factors of
approximately 4 and 6, respectively. Compared to G1, QFMs are slightly
more expensive, suggesting that the majority of computation time is
feature computation. Although we do not formally compare training time,
the training cost of QFMs is also substantially lower than that of DL
models; the training time of the medium model is about 3 hours on a
32-core virtual machine without GPU, which is significantly faster than
those of DL foundation models¹⁰.
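The speed-up factors quoted above follow directly from the HMTT column of Table 4, as the short calculation below shows. The dictionary simply restates the reported mean hours; the real-time factor in the last line is a derived quantity added here for illustration.

```python
# Mean embedding time over MTT in hours, as reported in Table 4.
hours = {"G1": 0.3416, "nano": 0.4561, "large": 0.7008,
         "CLMR": 2.7503, "MusicFM": 8.8634}

for baseline in ("MusicFM", "CLMR"):
    for qfm in ("large", "nano"):
        print(f"{qfm} vs {baseline}: {hours[baseline] / hours[qfm]:.1f}x")
# large vs MusicFM: 12.6x
# nano vs MusicFM: 19.4x
# large vs CLMR: 3.9x
# nano vs CLMR: 6.0x

audio_hours = 25860 * 29.1 / 3600  # total audio duration of MTT
print(f"{audio_hours:.0f} audio hours")  # 209 audio hours
print(f"large processes audio about {audio_hours / hours['large']:.0f}x faster than real time")
```

The same table shows that G1 alone already takes roughly half of the large model's time, consistent with the observation that feature computation dominates.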
⁹ We exclude MULE as we do not reproduce the result, and we only choose
the largest and smallest QFM configurations for simplicity.
¹⁰ The large model’s training time is about 30 hours on the same 32-core
VM we described earlier, as we employed MSD which is about 10 times
larger than the FMA-large dataset.

The efficient computation of QFM provides several benefits: 1) It can
lead to a faster development cycle. Due to the modular nature of QF
modules for each feature set, it allows one to experiment with various
features and model configurations thereof, which can translate into
better performance. 2) One can compute QFMs at scale with fewer
computational resources. Given that QFMs offer reasonable music
understanding and sometimes even outperform the most significant DL
models in some downstream tasks, the ease of computation can be
appealing for employing QFMs either as a complement to DL models, or as
a main embedding when computation requirements dictate.
5.3 Ablation Study
5.3.1 Feature Ablation
We conducted two ablation studies to understand the contribution of some
components to QFM model performance. In the first study, we focus on the
effect of feature groups by comparing QFMs with different subsets of
feature groups. We chose the large variant as the reference QFM model,
with a full set of features. We compare 5 QFM variants in which only a
subset of feature sets is present. For simplicity, these models include
an increasing number of feature sets in the order of feature sets
presented in Table 1. We refer to them based on the initials of the
included feature sets (i.e., the ‘M’ model only includes MFCC, and ‘MCR’
includes MFCC, CQT, and Rhythmic feature sets). For these feature-subset
models, we exclude the final PCA and QF total. Hence, the dimensionality
of subset embeddings can be larger or smaller than the large model.
Experimental results can be found in Fig 3 (a).
In general, the addition of each feature set significantly improves the
overall effectiveness of QFMs in all downstream tasks. Especially the
first three feature sets – MFCC, CQT, and Rhythmic – bring notable
improvements compared to the Spectral features and Patches. It may
indicate that the Spectral and Patches feature sets do not provide much
additional information beyond the first three feature sets. In some
tasks, such as GTZAN and GSkey, the MCR variant outperforms the versions
with more feature sets. However, other tasks such as Emo and the tagging
datasets benefit from the complete set of features.
5.3.2 Model Configuration
We compare three variants to examine the effect of model sub-components
on model performance: the G1 model, the medium model, and finally the
medium model with randomized SVD [62] instead of WMF with the same final
dimensionality. As shown in Fig 3 (b), the medium model outperforms both
the G1 and SVD variants, suggesting: 1) QF modules add substantially
more information on top of G1, and furthermore, 2) given that the use of
SVD substantially underperforms when compared to the WMF (and is even
worse than the G1 model), WMF is crucial to the effectiveness of QFMs.
[Figure 3 plots. Panes in both panels: NSynth i (Acc), NSynth p (Acc),
GS key (W.Acc), GTZAN (Acc), MTT (AUC), MTT (mAP), Emo ($R^2_a$),
Emo ($R^2_v$), JAM-MT (AUC), JAM-MT (mAP).
(a) Feature Ablation Study. Horizontal axis: M, MC, MCR, MCRS, MCRSP,
large.
(b) Module Ablation Study. Horizontal axis: G1, medium SVD, medium WMF.]
Figure 3: Ablation study result.
6. CONCLUSION AND FUTURE WORKS
We propose QFM as a non-DL foundation model empowered by the ensemble of
rich MIR features utilizing efficient shallow feature learners. The
experimental result shows that it performs comparably to some recent DL
models while being significantly more efficient, implying that QFM
models can be a complement to DL audio representations in
industrial-scale applications. Furthermore, QFMs provide an alternative
where the cost of DL model inference is prohibitive.
Despite these benefits, we observe weaknesses, especially in some
downstream tasks compared to the recent DL-based foundation models.
Research towards additional features or their manipulations with QF
modules may further improve the overall effectiveness, as shown in the
results. Given that there are many music/audio features that have not
been explored and further uncharted new features that may be introduced
in the future, QFM can be an alternative platform that the MIR community
can contribute to, to improve music representations.
Another potential benefit is that the interpretable linear and
clustering operators in QFM models open up alternative avenues to
interpreting music audio representations, which can be studied further
in a scope beyond that of the current work.
7. REFERENCES
[1] Y. Ma, A. Øland, A. Ragni, B. M. D. Sette, C. Saitis, C. Donahue,
C. Lin, C. Plachouras, E. Benetos, E. Quinton, E. Shatri, F. Morreale,
G. Zhang, G. Fazekas, G. Xia, H. Zhang, I. Manco, J. Huang, J. Guinot,
L. Lin, L. Marinelli, M. W. Y. Lam, M. Sharma, Q. Kong,
R. B. Dannenberg, R. Yuan, S. Wu, S. Wu, S. Dai, S. Lei, S. Kang,
S. Dixon, W. Chen, W. Huang, X. Du, X. Qu, X. Tan, Y. Li, Z. Tian,
Z. Wu, Z. Wu, Z. Ma, and Z. Wang, “Foundation models for music: A
survey,” CoRR, vol. abs/2408.14340, 2024.
[2] K. Choi, G. Fazekas, M. B. Sandler, and K. Cho, “Transfer learning
for music classification and regression tasks,” in Proceedings of the
18th International Society for Music Information Retrieval Conference,
ISMIR 2017, Suzhou, China, October 23-27, 2017, S. J. Cunningham,
Z. Duan, X. Hu, and D. Turnbull, Eds., 2017, pp. 141–149.
[3] E. J. Humphrey, J. P. Bello, and Y. LeCun, “Moving beyond feature
design: Deep architectures and automatic feature learning in music
informatics,” in Proceedings of the 13th International Society for Music
Information Retrieval Conference, ISMIR 2012, Mosteiro S.Bento Da
Vitória, Porto, Portugal, October 8-12, 2012, F. Gouyon, P. Herrera,
L. G. Martins, and M. Müller, Eds. FEUP Edições, 2012, pp. 403–408.
[4] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and
S. Dubnov, “Large-scale contrastive language-audio pretraining with
feature fusion and keyword-to-caption augmentation,” in IEEE
International Conference on Acoustics, Speech and Signal Processing
ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023. IEEE, 2023,
pp. 1–5.
[5] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin,
A. Ragni, E. Benetos, N. Gyenge, R. B. Dannenberg, R. Liu, W. Chen,
G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, “MERT: acoustic
music understanding model with large-scale self-supervised training,” in
The Twelfth International Conference on Learning Representations, ICLR
2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[6] J. Spijkervet and J. A. Burgoyne, “Contrastive learning of musical
representations,” in Proceedings of the 22nd International Society for
Music Information Retrieval Conference, ISMIR 2021, Online, November
7-12, 2021, J. H. Lee, A. Lerch, Z. Duan, J. Nam, P. Rao,
P. van Kranenburg, and A. Srinivasamurthy, Eds., 2021, pp. 673–681.
[7] M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, and
A. F. Ehmann, “Supervised and unsupervised learning of audio
representations for music understanding,” in Proceedings of the 23rd
International Society for Music Information Retrieval Conference, ISMIR
2022, Bengaluru, India, December 4-8, 2022, P. Rao, H. A. Murthy,
A. Srinivasamurthy, R. M. Bittner, R. C. Repetto, M. Goto, X. Serra, and
M. Miron, Eds., 2022, pp. 256–263.
[8] M. Won, Y. Hung, and D. Le, “A foundation model for music
informatics,” in IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19,
2024. IEEE, 2024, pp. 1226–1230.
[9] J. Kim and C. C. S. Liem, “The power of deep without going deep? A
study of HDPGMM music representation learning,” in Proceedings of the
23rd International Society for Music Information Retrieval Conference,
ISMIR 2022, Bengaluru, India, December 4-8, 2022, P. Rao, H. A. Murthy,
A. Srinivasamurthy, R. M. Bittner, R. C. Repetto, M. Goto, X. Serra, and
M. Miron,
Eds., 2022, pp. 116–124.
[10] Y. Vaizman, B. McFee, and G. R. G. Lanck ie ,
“Codebook-based audio ea u e ep esen a ion o mu-
sic in o ma ion e ie al,” IEEE ACM T ans. Audio
Speech Lang. P ocess., ol. 22, no. 10, pp. 1483–1493,
2014.
[11] H. Eghbal-zadeh, B. Lehne , M. Schedl, and G. Wid-
me , “I- ec o s o imb e-based music simila i y and
music a is classi ica ion,” in P oceedings o he 16 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2015, Málaga, Spain, Oc obe 26-
30, 2015, M. Mülle and F. Wie ing, Eds., 2015, pp.
554–560.
[12] C. Molna , In e p e able machine lea ning. Lulu.
com, 2020.
[13] A. Holzap el, A. Kaila, and P. Jääskeläinen, “G een
mi ? in es iga ing compu a ional cos o ecen music-
ai esea ch in ISMIR,” in P oceedings o he 25 h In e -
na ional Socie y o Music In o ma ion Re ie al Con-
e ence, ISMIR 2024, San F ancisco, Cali o nia, USA
and Online, No embe 10-14, 2024, B. Kaneshi o,
G. J. Myso e, O. Nie o, C. Donahue, C. A. Huang, J. H.
Lee, B. McFee, and M. C. McCallum, Eds., 2024, pp.
371–380.
[14] M. D. Ho man, D. M. Blei, and P. R. Cook, “Con en -
based musical simila i y compu a ion using he hie -
a chical di ichle p ocess,” in ISMIR 2008, 9 h In e -
na ional Con e ence on Music In o ma ion Re ie al,
D exel Uni e si y, Philadelphia, PA, USA, Sep embe
14-18, 2008, J. P. Bello, E. Chew, and D. Tu nbull,
Eds., 2008, pp. 349–354.
[15] B. McFee, L. Ba ing on, and G. R. G. Lanck ie ,
“Lea ning con en simila i y o music ecommenda-
ion,” IEEE T ans. Speech Audio P ocess., ol. 20,
no. 8, pp. 2207–2218, 2012.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
793
[16] J. Nam, J. He e a, M. Slaney, and J. O. S. III, “Lea n-
ing spa se ea u e ep esen a ions o music anno a ion
and e ie al,” in P oceedings o he 13 h In e na ional
Socie y o Music In o ma ion Re ie al Con e ence,
ISMIR 2012, Mos ei o S.Ben o Da Vi ó ia, Po o, Po -
ugal, Oc obe 8-12, 2012, F. Gouyon, P. He e a, L. G.
Ma ins, and M. Mülle , Eds. FEUP Edições, 2012,
pp. 565–570.
[17] S. Dieleman and B. Sch auwen, “Mul iscale ap-
p oaches o music audio ea u e lea ning,” in
P oceedings o he 14 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence, ISMIR 2013,
Cu i iba, B azil, No embe 4-8, 2013, A. de Souza
B i o J ., F. Gouyon, and S. Dixon, Eds., 2013, pp.
3–8.
[18] J. Nam, J. He e a, and K. Lee, “A deep bag-o -
ea u es model o music au o- agging,” CoRR, ol.
abs/1508.04999, 2015.
[19] H. Lee, P. T. Pham, Y. La gman, and A. Y. Ng, “Un-
supe ised ea u e lea ning o audio classi ica ion us-
ing con olu ional deep belie ne wo ks,” in Ad ances
in Neu al In o ma ion P ocessing Sys ems 22: 23 d
Annual Con e ence on Neu al In o ma ion P ocessing
Sys ems 2009. P oceedings o a mee ing held 7-10 De-
cembe 2009, Vancou e , B i ish Columbia, Canada,
Y. Bengio, D. Schuu mans, J. D. La e y, C. K. I.
Williams, and A. Culo a, Eds. Cu an Associa es,
Inc., 2009, pp. 1096–1104.
[20] J. Kim, J. U bano, C. C. S. Liem, and A. Hanjalic,
“One deep music ep esen a ion o ule hem all? A
compa a i e analysis o di e en ep esen a ion lea n-
ing s a egies,” Neu al Compu . Appl., ol. 32, no. 4,
pp. 1067–1093, 2020.
[21] J. Lee, J. Pa k, K. L. Kim, and J. Nam, “Samplecnn:
End- o-end deep con olu ional neu al ne wo ks using
e y small il e s o music classi ica ion,” Applied Sci-
ences, ol. 8, no. 1, 2018.
[22] J. Pons, T. Lidy, and X. Se a, “Expe imen ing wi h
musically mo i a ed con olu ional neu al ne wo ks,” in
2016 14 h In e na ional Wo kshop on Con en -Based
Mul imedia Indexing (CBMI), 2016, pp. 1–6.
[23] S. He shey, S. Chaudhu i, D. P. W. Ellis, J. F. Gem-
meke, A. Jansen, R. C. Moo e, M. Plakal, D. Pla ,
R. A. Sau ous, B. Seybold, M. Slaney, R. J. Weiss,
and K. W. Wilson, “CNN a chi ec u es o la ge-scale
audio classi ica ion,” in 2017 IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing,
ICASSP 2017, New O leans, LA, USA, Ma ch 5-9,
2017. IEEE, 2017, pp. 131–135.
[24] R. A andjelo ic and A. Zisse man, “Look, lis en and
lea n,” in IEEE In e na ional Con e ence on Compu e
Vision, ICCV 2017, Venice, I aly, Oc obe 22-29, 2017.
IEEE Compu e Socie y, 2017, pp. 609–617.
[25] T. Chen, S. Ko nbli h, M. No ouzi, and G. E. Hin on,
“A simple amewo k o con as i e lea ning o isual
ep esen a ions,” in P oceedings o he 37 h In e na-
ional Con e ence on Machine Lea ning, ICML 2020,
13-18 July 2020, Vi ual E en , se . P oceedings o Ma-
chine Lea ning Resea ch, ol. 119. PMLR, 2020, pp.
1597–1607.
[26] M. Won, K. Choi, and X. Se a, “Semi-supe ised
music agging ans o me ,” in P oceedings o he
22nd In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, ISMIR 2021, Online, No embe
7-12, 2021, J. H. Lee, A. Le ch, Z. Duan, J. Nam,
P. Rao, P. an K anenbu g, and A. S ini asamu hy,
Eds., 2021, pp. 769–776.
[27] J. Guino , E. Quin on, and G. Fazekas, “Semi-
supe ised con as i e lea ning o musical ep esen a-
ions,” in P oceedings o he 25 h In e na ional Soci-
e y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2024, San F ancisco, Cali o nia, USA and On-
line, No embe 10-14, 2024, B. Kaneshi o, G. J.
Myso e, O. Nie o, C. Donahue, C. A. Huang, J. H. Lee,
B. McFee, and M. C. McCallum, Eds., 2024, pp. 571–
579.
[28] A. Fe a o, J. Kim, S. O amas, A. F. Ehmann, and
F. Gouyon, “Con as i e lea ning o c oss-modal a is
e ie al,” in P oceedings o he 24 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2023, Milan, I aly, No embe 5-9, 2023, A. Sa i,
F. An onacci, M. Sandle , P. Bes agini, S. Dixon,
B. Liang, G. Richa d, and J. Pauwels, Eds., 2023, pp.
375–382.
[29] J. C ame , H. Wu, J. Salamon, and J. P. Bello, “Look,
lis en, and lea n mo e: Design choices o deep au-
dio embeddings,” in IEEE In e na ional Con e ence
on Acous ics, Speech and Signal P ocessing, ICASSP
2019, B igh on, Uni ed Kingdom, May 12-17, 2019.
IEEE, 2019, pp. 3852–3856.
[30] P. Dha iwal, H. Jun, C. Payne, J. W. Kim, A. Rad o d,
and I. Su ske e , “Jukebox: A gene a i e model o
music,” CoRR, ol. abs/2005.00341, 2020.
[31] R. Cas ellon, C. Donahue, and P. Liang, “Codi ied au-
dio language modeling lea ns use ul ep esen a ions
o music in o ma ion e ie al,” in P oceedings o he
22nd In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, ISMIR 2021, Online, No embe
7-12, 2021, J. H. Lee, A. Le ch, Z. Duan, J. Nam,
P. Rao, P. an K anenbu g, and A. S ini asamu hy,
Eds., 2021, pp. 88–96.
[32] S. Mish a, B. L. S u m, and S. Dixon, “Local in e -
p e able model-agnos ic explana ions o music con-
en analysis,” in P oceedings o he 18 h In e na ional
Socie y o Music In o ma ion Re ie al Con e ence,
ISMIR 2017, Suzhou, China, Oc obe 23-27, 2017,
S. J. Cunningham, Z. Duan, X. Hu, and D. Tu nbull,
Eds., 2017, pp. 537–543.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
794
[33] ——, “Unde s anding a deep machine lis ening model
h ough ea u e in e sion,” in P oceedings o he 19 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2018, Pa is, F ance, Sep embe
23-27, 2018, E. Gómez, X. Hu, E. Humph ey, and
E. Bene os, Eds., 2018, pp. 755–762.
[34] M. P ockup, A. F. Ehmann, F. Gouyon, E. M. Schmid ,
and Y. E. Kim, “Modeling musical hy hma scale wi h
he music genome p ojec ,” in 2015 IEEE Wo kshop on
Applica ions o Signal P ocessing o Audio and Acous-
ics (WASPAA), 2015, pp. 1–5.
[35] G. Pee e s, “Rhy hm classi ica ion using spec al
hy hm pa e ns,” in ISMIR 2005, 6 h In e na ional
Con e ence on Music In o ma ion Re ie al, London,
UK, 11-15 Sep embe 2005, P oceedings, 2005, pp.
644–647.
[36] D. Fi zGe ald and J. Paulus, Unpi ched Pe cussion
T ansc ip ion. Bos on, MA: Sp inge US, 2006, pp.
131–162.
[37] D. Jiang, L. Lu, H. Zhang, J. Tao, and L. Cai, “Mu-
sic ype classi ica ion by spec al con as ea u e,” in
P oceedings o he 2002 IEEE In e na ional Con e -
ence on Mul imedia and Expo, ICME 2002, Lausanne,
Swi ze land. Augus 26-29, 2002. Volume I. IEEE
Compu e Socie y, 2002, pp. 113–116.
[38] S. Dubno , “Gene aliza ion o spec al la ness mea-
su e o non-gaussian linea p ocesses,” IEEE Signal
P ocess. Le ., ol. 11, no. 8, pp. 698–701, 2004.
[39] C. Ha e, M. Sandle , and M. Gasse , “De ec ing ha -
monic change in musical audio,” in P oceedings o he
1s ACM Wo kshop on Audio and Music Compu ing
Mul imedia, se . AMCMM ’06. New Yo k, NY, USA:
Associa ion o Compu ing Machine y, 2006, p. 21–26.
[40] B. McFee, C. Ra el, D. Liang, D. P. W. El-
lis, M. McVica , E. Ba enbe g, and O. Nie o, “li-
b osa: Audio and music signal analysis in py hon,” in
P oceedings o he 14 h Py hon in Science Con e ence
2015 (SciPy 2015), Aus in, Texas, July 6 - 12, 2015,
K. Hu and J. Be gs a, Eds. scipy.o g, 2015, pp.
18–24.
[41] D. M. Blei, A. Y. Ng, and M. I. Jo dan, “La en di ich-
le alloca ion,” in Ad ances in Neu al In o ma ion P o-
cessing Sys ems 14 [Neu al In o ma ion P ocessing
Sys ems: Na u al and Syn he ic, NIPS 2001, Decem-
be 3-8, 2001, Vancou e , B i ish Columbia, Canada],
T. G. Die e ich, S. Becke , and Z. Ghah amani, Eds.
MIT P ess, 2001, pp. 601–608.
[42] Y. Hu, Y. Ko en, and C. Volinsky, “Collabo a i e il e -
ing o implici eedback da ase s,” in P oceedings o
he 8 h IEEE In e na ional Con e ence on Da a Min-
ing (ICDM 2008), Decembe 15-19, 2008, Pisa, I aly.
IEEE Compu e Socie y, 2008, pp. 263–272.
[43] Z. S. Ha is, “Dis ibu ional s uc u e,” WORD, ol. 10,
no. 2-3, pp. 146–162, 1954.
[44] H. Kal enbach, A Concise Guide o S a is ics, se .
Sp inge B ie s in S a is ics. Sp inge Be lin Heidel-
be g, 2011.
[45] T. P. Minka, “Au oma ic choice o dimensionali y o
PCA,” in Ad ances in Neu al In o ma ion P ocessing
Sys ems 13, Pape s om Neu al In o ma ion P ocess-
ing Sys ems (NIPS) 2000, Den e , CO, USA, T. K.
Leen, T. G. Die e ich, and V. T esp, Eds. MIT P ess,
2000, pp. 598–604.
[46] L. Wang, P. Luc, Y. Wu, A. Recasens, L. Smai a,
A. B ock, A. Jaegle, J. Alay ac, S. Dieleman, J. Ca -
ei a, and A. an den Oo d, “Towa ds lea ning uni e -
sal audio ep esen a ions,” in IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing,
ICASSP 2022, Vi ual and Singapo e, 23-27 May 2022.
IEEE, 2022, pp. 4593–4597.
[47] J. F. Gemmeke, D. P. W. Ellis, D. F eedman, A. Jansen,
W. Law ence, R. C. Moo e, M. Plakal, and M. Ri e ,
“Audio se : An on ology and human-labeled da ase
o audio e en s,” in 2017 IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing,
ICASSP 2017, New O leans, LA, USA, Ma ch 5-9,
2017. IEEE, 2017, pp. 776–780.
[48] M. De e a d, K. Benzi, P. Vande gheyns , and
X. B esson, “FMA: A da ase o music analysis,”
in P oceedings o he 18 h In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2017,
Suzhou, China, Oc obe 23-27, 2017, S. J. Cunning-
ham, Z. Duan, X. Hu, and D. Tu nbull, Eds., 2017, pp.
316–323.
[49] T. Be in-Mahieux, D. P. W. Ellis, B. Whi man, and
P. Lame e, “The million song da ase ,” in P oceedings
o he 12 h In e na ional Socie y o Music In o ma ion
Re ie al Con e ence, ISMIR 2011, Miami, Flo ida,
USA, Oc obe 24-28, 2011, A. Klapu i and C. Leide ,
Eds. Uni e si y o Miami, 2011, pp. 591–596.
[50] P. Knees, Á. Fa aldo, P. He e a, R. Vogl, S. Böck,
F. Hö schläge , and M. L. Go , “Two da a se s
o empo es ima ion and key de ec ion in elec onic
dance music anno a ed om use co ec ions,” in
P oceedings o he 16 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence, ISMIR 2015,
Málaga, Spain, Oc obe 26-30, 2015, M. Mülle and
F. Wie ing, Eds., 2015, pp. 364–370.
[51] E. Law, K. Wes , M. I. Mandel, M. Bay, and J. S.
Downie, “E alua ion o algo i hms using games: The
case o music agging,” in P oceedings o he 10 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2009, Kobe In e na ional Con e -
ence Cen e , Kobe, Japan, Oc obe 26-30, 2009, K. Hi-
a a, G. Tzane akis, and K. Yoshii, Eds. In e na-
ional Socie y o Music In o ma ion Re ie al, 2009,
pp. 387–392.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
795