BEYOND GENRE: DIAGNOSING BIAS IN MUSIC EMBEDDINGS USING
CONCEPT ACTIVATION VECTORS
Roman B. Gebhardt
Cyanite
[email protected]
Arne Kuhle
Cyanite
[email protected]
Eylül Bektur
Cyanite, TU Berlin
[email protected]
ABSTRACT
Music representation models are widely used for tasks such as tagging, retrieval, and music understanding. Yet, their potential to encode cultural bias remains underexplored. In this paper, we apply Concept Activation Vectors (CAVs) to investigate whether non-musical singer attributes - such as gender and language - influence genre representations in unintended ways. We analyze four state-of-the-art models (MERT, Whisper, MuQ, MuQ-MuLan) using the STraDa dataset, carefully balancing training sets to control for genre confounds. Our results reveal significant model-specific biases, aligning with disparities reported in MIR and music sociology. Furthermore, we propose a post-hoc debiasing strategy using concept vector manipulation, demonstrating its effectiveness in mitigating these biases. These findings highlight the need for bias-aware model design and show that conceptualized interpretability methods offer practical tools for diagnosing and mitigating representational bias in MIR.
1. INTRODUCTION
Model bias is a well-known challenge across machine learning domains. While extensively studied in NLP and computer vision [1, 2], it has recently also gained growing attention in MIR [3–5]. Beyond degrading performance, model bias also poses serious challenges to ML fairness [3]. In MIR, models may learn to reflect or amplify societal imbalances present in the music industry, reinforcing stereotypes in classification, recommendation, and musical understanding. This phenomenon has been empirically demonstrated in prior work on artist gender bias in music recommendation systems [4].
Recent research in explainable AI for MIR has shown that multi-modal large language models (LLMs) trained for music understanding often struggle to utilize audio content, at times ignoring it entirely in favor of accompanying text data [6]. This over-reliance on the textual modality can lead to auditory hallucinations, where models generate plausible-sounding but inaccurate descriptions. We hypothesize that such failures may also partially stem from biased internal audio representations learned by neural models, where stereotypical musical patterns or demographic correlations overly influence the internal encoding. Already flawed audio representations will naturally hinder the model’s ability to generalize and reason faithfully about music.
We therefore shift our focus to analyzing the audio representations themselves, where such biases may originate but remain largely unexplored. To address this gap, we employ Concept Activation Vectors (CAVs) [7] to systematically probe for unwanted concept entanglement. We investigate whether non-musical factors like singer gender and language influence genre representations. While these attributes should not affect genre representation, we hypothesize that skews in training data lead models to associate them with specific genres. Prior work suggests such imbalances are genre-dependent [8–11]: Metal and Hip-Hop are heavily male-dominated, while genres like Pop, Electronic, and R&B are more balanced [10]. A model might thus associate male vocals with Metal and female vocals with Pop, even though vocal gender is not a genre-defining trait. Similarly, while language may serve as a genre cue in specific cases (e.g., Portuguese in Brazilian music), over-reliance on dominant languages risks marginalizing others [12].
To investigate these biases, we quantify how strongly non-musical attributes are reflected in the model’s latent space. By adapting Testing with CAVs (TCAV) [7] to frozen audio encoders, we estimate how consistently genre-specific audio embeddings align with a given concept direction. Secondly, we explore the possibilities of applying CAVs for concept removal or addition, representing a simple post-hoc de-biasing strategy. We publish our code on GitHub.¹
© Roman B. Gebhardt, Arne Kuhle, and Eylül Bektur. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Roman B. Gebhardt, Arne Kuhle, and Eylül Bektur, “Beyond Genre: Diagnosing Bias in Music Embeddings Using Concept Activation Vectors”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
This work aims to provide insights into how state-of-the-art music representation models encode and propagate bias, advocating for more fairness-aware design in MIR. While we focus on audio encoders, our approach generalizes to other music-related models. It offers a lightweight, interpretable framework to surface and mitigate biases using small, targeted datasets.
2. RELATED WORK
2.1 Concept-based Explanations
Concept-based explanations have emerged in machine learning as a way to make models more interpretable by aligning their latent representations with human reasoning and intuitive concepts, rather than solely relying on low-level input features such as individual pixels or raw data points [13]. Concept-based interpretability for neural networks typically follows two main approaches: (1) designing neural network architectures that are intrinsically interpretable and (2) generating post-hoc explanations for already-trained networks. Intrinsic, concept-aware methods from the first category typically achieve high task accuracies and link concepts directly to predictions through learning, which enables more effective interventions of concepts on the model’s decisions [14, 15]. However, these intrinsic approaches require training with labelled concept data, which can be costly to obtain [15].
¹ https://github.com/WhoCares96/CAV-MIR
In contrast, post-hoc concept-based methods, notably the Concept Activation Vector (CAV) method introduced by Kim et al. [7], have proven particularly effective due to their explicit alignment with human-understandable and domain-relevant concepts without the need for annotated concept labels or model retraining [16–18]. Recent research applies CAV-based methods to verify whether models focus on desired semantic concepts, like object shapes, or undesired spurious signals [19]. Therefore, CAV-based methods have had successful applications in explainability, understanding model representations, concept entanglement detection, spatial dependency evaluations [20] and bias evaluation [21]. The related Testing with CAV (TCAV) method quantifies the influence of each concept on the model’s predictions by computing directional derivatives along the corresponding CAVs [7].
In the MIR domain, researchers have applied post-hoc interpretability methods that operate directly on the input spectrogram, perturbing small time–frequency patches and tracking the resulting output changes to reveal which regions most strongly influence the classifier’s decision [22]. These methods typically lack the broader conceptual insights that methods such as CAVs offer by directly associating predictions with human-defined concepts [7], as well as the explanations of model behaviour that TCAV provides.
Motivated by this gap, recent MIR work incorporates concept-based methods, such as CAVs, for structuring complex genre and mood categories into hierarchical representations [23], and generating explanations tailored explicitly for musicologists by visualizing interpretable musical concepts [16]. Building upon these developments, our work applies CAV-based approaches that explicitly align model bias with human-defined concepts. This follows methodologies introduced for bias exploration and for debiasing retrieval systems with CAVs.
2.2 Bias Exploration and Mitigation
Existing studies interpret bias via individual neurons, higher-dimensional subspaces, or linear directions in latent space [24–26]. By adapting CAV, we employ the linear-directions approach, defining concepts as linear directions learned through supervised training on model activations.
Recent research has utilized CAV methods particularly in computer vision to detect and quantify undesirable model reliance on demographic biases [27, 28]. To the best of our knowledge, the speech-audio TCAV/CAV literature has thus far focused mainly on accent-related bias, with English as the input language [21]. In contrast, our research explicitly examines biases arising from differences across multiple spoken languages in audio data, making it the first to explore language-related bias across both speech and MIR domains.
Methods concerned with bias in the audio domain can be grouped into techniques applied before or during model training, and post-hoc approaches aimed at identifying and mitigating biases after training, primarily through debiasing latent representations. Pre-or-during strategies include mitigation of data and annotation biases including careful dataset selection and improved transparency through detailed documentation [29], counterfactual attention learning [30], and token masking to prevent overfitting to the dataset during learning [31]. Post-hoc methods in MIR that employ bias exploration utilize statistical significance tests to compare performance distributions across affected and advantaged groups [32], and analyze performance disparities by evaluating differences between universal and culturally adapted models [33].
Our approach in this paper is most similar to that of [5], who apply a dimensionality-reduction method, Linear Discriminant Analysis (LDA), to identify the bias direction in pre-trained audio embeddings by training LDA to separate two different datasets. Wang et al. [5] address the dataset bias that is influenced by both the training approach of the embeddings and the alignment of class vocabularies between datasets. In contrast, our CAV-based method defines undesirable biases as concepts, which allows us to analyze high-level biases related to how the artist representations manifest themselves in the embeddings. Whereas Wang et al. [5] remove domain-sensitivity bias by subtracting a linear projection from the embedding itself, our approach instead adjusts the linear CAV classifier that probes the embedding. Both strategies are linear, but they act at different stages of the pipeline.
Therefore, our work expands upon existing post-hoc bias exploration and retrieval-system debiasing efforts by applying CAV-based interpretability to address and mitigate the effects of demographic and sociocultural factors in music representation models.
3. MUSIC REPRESENTATION MODELS
We evaluate four state-of-the-art music representation models in their publicly available, pretrained form: MERT, Whisper, MuQ, and MuQ-MuLan. All four can be used to generate audio embeddings for tasks such as music tagging, zero-shot and downstream classification, music retrieval, and as audio encoders in music understanding LLMs.
MERT (MERT-v1-95M) [34] is a self-supervised Transformer model trained on large-scale music datasets using masked modeling and pseudo-labels from acoustic and musical teacher models. It captures musical structure and semantics, making it particularly effective for tasks like music-text retrieval and genre classification. It is widely used in open-source Music-LLMs [6].
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Whisper (Whisper-large-v2) [35] is a speech recognition model trained on 680,000 hours of multilingual and multitask audio data. While originally designed for automatic speech recognition (ASR), it has shown some capability in processing music audio, particularly for transcription tasks [36]. Like MERT, Whisper is often used as an encoder in music LLM pipelines [6], contributing to music retrieval and captioning tasks. Unlike the other models, Whisper was not trained with music; instead, it is optimized to capture linguistic structure, phonetic features, and prosody, which presumably could lead to less entanglement between language and musical genre, as the model’s main focus should be the singers’ voices instead of the musical content.
MuQ (MuQ-large-msd-iter) [37] is a self-supervised model that learns discrete music representations via Mel Residual Vector Quantization (Mel-RVQ). It is trained solely on audio data, without manual labels, and excels at zero-shot tagging and instrument classification.
MuQ-MuLan (MuQ-MuLan-large) [37] extends MuQ by incorporating a joint training objective with a text encoder (MuLan) on a large-scale corpus of paired music–text data. This allows MuQ-MuLan to align musical and textual features in a shared embedding space, enabling music–text retrieval and captioning.
In our study, we include both MuQ and MuQ-MuLan to assess how text supervision affects concept encoding and entanglement. While MuQ captures structure-driven representations grounded in audio alone, MuQ-MuLan - through its alignment with text - may encode stronger correlations between musical and non-musical attributes, potentially leading to increased cultural or linguistic bias.
4. DATASET
We use STraDa (Singer Traits Dataset) [38], a large-scale dataset designed for analyzing singer-related attributes in music. Specifically, we leverage the automatic-strada subset, which includes metadata for over 25,000 tracks. This metadata - covering lead singer gender (male, female, non-binary) and language - is cross-validated across multiple sources to ensure reliability. To obtain audio, we use the Deezer API to retrieve 30-second audio previews and successfully collect 22,168 tracks. To address underrepresentation of certain genre–gender combinations in STraDa, we supplement the dataset with 251 additional tracks from Deezer playlists specifically curated around the underrepresented concepts.² These playlists provide an external, non-manual source of curation. For quality assurance, based on playlist name and listening, we manually annotate each track’s genre, and the singer’s assumed gender and language, discarding mismatches with the intended playlist theme. Acknowledging potential curation bias, we publicly release all playlist IDs, track IDs, and metadata in the additional material.
² Affected genres are marked with * in Figure 1.
From this corpus, we construct balanced train/test datasets for nine binary classification tasks: gender (male, female) and the seven most common languages in STraDa (en, fr, it, pt, ja, es, de). Training sets are used to learn CAVs; test sets to compute TCAV scores. Following [7], we consider poorly projecting CAVs as incapable of reliably representing a concept and thus unsuitable for bias analysis.
To construct the CAV training and test sets, we stratify the data across all combinations of language, genre, and gender. For each binary concept (e.g., gender or language), we sample an equal number of positive (e.g., gender = female; language = Portuguese) and randomly selected non-positive samples (e.g., gender ≠ female; language ≠ Portuguese) within each genre to prevent genre-based confounds. When data is abundant, we fix a maximum of 50 samples per (label, genre) cell for training and assign the remainder to testing; when scarce, we downscale proportionally. This yields genre-balanced, disjoint train/test splits with broad subgroup diversity. To reduce concept entanglement, we cap the number of training samples per (concept, genre) pair, preventing dominant subgroups from overwhelming the concept definition.
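The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `balanced_split`, the dictionary fields, and in particular the rule that a scarce cell contributes half of its samples to training are assumptions made for the example, since the text only states that scarce cells are downscaled proportionally.

```python
import random
from collections import defaultdict


def balanced_split(tracks, concept_key, max_per_cell=50, seed=0):
    """Build genre-balanced, disjoint train/test sets for one binary concept.

    tracks: list of dicts with an "id", a "genre" and a boolean `concept_key`
    field (hypothetical schema, chosen for this sketch).
    """
    rng = random.Random(seed)
    # Group tracks into (genre, label) cells.
    cells = defaultdict(list)
    for track in tracks:
        cells[(track["genre"], bool(track[concept_key]))].append(track)

    train, test = [], []
    for genre in sorted({g for g, _ in cells}):
        pos = list(cells[(genre, True)])
        neg = list(cells[(genre, False)])
        rng.shuffle(pos)
        rng.shuffle(neg)
        # Equal number of positive and non-positive samples within each genre.
        n = min(len(pos), len(neg))
        # Assumption: at most `max_per_cell` per cell for training; a scarce
        # cell gives half of its samples to training, the rest to testing.
        n_train = min(max_per_cell, n // 2)
        train += pos[:n_train] + neg[:n_train]
        test += pos[n_train:n] + neg[n_train:n]
    return train, test


# Toy metadata: Metal is scarce for the concept, Pop is abundant.
tracks = []
tracks += [{"id": f"metal-f{i}", "genre": "Metal", "female": True} for i in range(10)]
tracks += [{"id": f"metal-m{i}", "genre": "Metal", "female": False} for i in range(30)]
tracks += [{"id": f"pop-f{i}", "genre": "Pop", "female": True} for i in range(200)]
tracks += [{"id": f"pop-m{i}", "genre": "Pop", "female": False} for i in range(150)]

train, test = balanced_split(tracks, "female")
print(len(train), len(test))
```

Because positives and non-positives are matched within every genre, a classifier trained on `train` cannot separate the concept by exploiting genre frequency alone, which is the confound the balancing is meant to rule out.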
Such careful balancing is essential to ensure that observed bias reflects the model’s internal representation - not artifacts of the dataset. While our strategy controls for known attributes using available metadata, it cannot eliminate latent or unobserved factors. For example, female- and male-led English-language jazz may differ in timbre or arrangement, but we assume they remain acoustically comparable in terms of genre-defining traits. Our CAVs aim to isolate the intended concept under this approximation, acknowledging potential residual entanglement.
5. METHOD
CAVs assume that a target concept is linearly separable in a model’s latent space - that is, there exists a hyperplane that distinguishes between samples with and without the concept. We construct a CAV by training a logistic regression model on the final-layer latent embeddings $x$, which learns the decision function:
$$\hat{y} = \sigma(w^\top \cdot x + b) \tag{1}$$
where $\sigma$ is the sigmoid function. The hyperplane $w^\top \cdot x + b = 0$ defines the decision boundary, and the CAV is the normal vector $w$ orthogonal to this boundary. To validate its reliability, we can evaluate the classifier’s accuracy. High performance indicates that the concept is linearly encoded; otherwise, the CAV is considered unreliable. This does not necessarily mean the concept is absent from the embedding space, but rather that it cannot be fully captured by our linear analysis. Once a CAV is learned, it can be used to measure how strongly individual audio samples align with a given concept direction. This enables intuitive inspection of model behavior - for example, identifying whether certain genres systematically reflect demographic attributes. To quantify such patterns, we adapt the TCAV approach to estimate how often a category’s embeddings align with the concept, which may indicate potential bias in the representation.
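As an illustration of the CAV construction in Eq. (1), the sketch below fits a logistic regression with plain batch gradient descent on toy 2-D vectors standing in for final-layer embeddings. It uses only the Python standard library; the function names, the learning rate, the number of epochs, and the synthetic data are assumptions for the example and do not describe the training setup used in this paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_cav(pos, neg, lr=0.5, epochs=300):
    """Fit y_hat = sigmoid(w.x + b) by batch gradient descent; return (w, b).

    The weight vector w is the CAV: the normal of the decision hyperplane.
    """
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / len(data) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(w, b, pos, neg):
    """Share of samples on the correct side of the hyperplane w.x + b = 0."""
    hits = sum(1 for x in pos if sum(wj * xj for wj, xj in zip(w, x)) + b > 0)
    hits += sum(1 for x in neg if sum(wj * xj for wj, xj in zip(w, x)) + b <= 0)
    return hits / (len(pos) + len(neg))


# Toy 2-D "embeddings": the concept lives along the first axis only.
rng = random.Random(0)
pos = [[rng.gauss(1.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(50)]
neg = [[rng.gauss(-1.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(50)]

cav, bias = train_cav(pos, neg)
# Accuracy is computed on the training data here for brevity; a reliability
# check should use held-out samples.
print(cav, bias, accuracy(cav, bias, pos, neg))
```

In this toy setting the learned vector points mainly along the first axis, which is the direction separating the two groups; a low accuracy would signal that the concept is not linearly encoded and that the CAV should not be trusted.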
5.1 Measuring Concept Alignment Using TCAV
TCAV [7] provides a statistical framework to quantify the extent to which a model’s internal representations rely on a given concept, enabling structured and scalable analysis of concept influence.
Unlike the original TCAV method - which computes directional derivatives of class logits from a trained classifier - our approach slightly deviates, as it operates on frozen audio encoders without access to gradients from a softmax layer or any downstream prediction head. Instead, we evaluate alignment with a concept by testing whether genre-specific final-layer embeddings fall on the positive side of the learned hyperplane. We omit the sigmoid and use its logit as the projection,
$$p_{\mathrm{CAV}}(x) = \mathrm{CAV}^\top \cdot x + b \tag{2}$$
This projection reflects the sample’s alignment with the learned concept. Including the bias term $b$ is essential here - unlike in the original TCAV formulation - since we do not compute directional derivatives, where the bias would cancel out. Instead, we interpret the CAV as a complete decision function.
Our TCAV score is then defined as the fraction of samples with a positive projection, indicating how often the model’s representations align with the concept:
$$\mathrm{TCAV} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(p_{\mathrm{CAV}}(x_i) > 0\big) \tag{3}$$
where $\mathbb{I}$ denotes the indicator function, returning 1 if the condition is true and 0 otherwise. By comparing TCAV scores across different genres, we assess the extent to which the model’s genre representations relate to the given concept. Because the concept test sets are balanced, the expected TCAV score under the null hypothesis (no alignment) is 0.5. Scores significantly above or below this threshold indicate systematic alignment - and thus potential bias. To ensure the robustness of our findings, we follow the original TCAV protocol: for each concept, we train 500 CAVs on independently sampled, balanced training subsets comprising 25% of the data. This yields a distribution of TCAV scores per concept–genre pair. We then conduct a two-sided t-test to evaluate whether the mean TCAV score significantly deviates from 0.5, and apply a Bonferroni correction to account for multiple comparisons.
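The scoring and testing steps of Eqs. (2) and (3) can be sketched as below. The helper names are hypothetical, and one simplification is made explicitly: the p-value uses a normal approximation to the t distribution (reasonable for hundreds of CAV runs, rough for the five toy scores used here), because the Python standard library has no Student-t CDF. The number of comparisons passed to the Bonferroni check is likewise an arbitrary example value.

```python
import math
import statistics


def project(cav, b, x):
    """Eq. (2): signed logit of embedding x with respect to the CAV hyperplane."""
    return sum(c * xi for c, xi in zip(cav, x)) + b


def tcav_score(cav, b, embeddings):
    """Eq. (3): fraction of embeddings with a positive projection."""
    return sum(1 for x in embeddings if project(cav, b, x) > 0) / len(embeddings)


def p_value_vs_chance(scores, chance=0.5):
    """Two-sided test of the mean score against `chance`.

    Assumption: normal approximation to the t distribution.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    if sd == 0.0:
        return 1.0 if mean == chance else 0.0
    t = (mean - chance) / (sd / math.sqrt(n))
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


def is_significant(p, n_comparisons, alpha=0.05):
    """Bonferroni correction: compare p against alpha / number of comparisons."""
    return p < alpha / n_comparisons


# One CAV applied to four toy genre embeddings: three project positively.
genre_embeddings = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [0.5, 3.0]]
score = tcav_score([1.0, 0.0], 0.0, genre_embeddings)

# Toy TCAV scores from repeated CAV trainings for two concept-genre pairs.
aligned_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
chance_scores = [0.48, 0.52, 0.50, 0.49, 0.51]
print(score)
print(is_significant(p_value_vs_chance(aligned_scores), n_comparisons=36))
print(is_significant(p_value_vs_chance(chance_scores), n_comparisons=36))
```

A pair whose scores cluster well away from 0.5 is flagged as significant after correction, while scores scattered around 0.5 are not.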
5.2 Adjusting Bias via Concept Vector Manipulation
To explore and mitigate bias in genre representations, we modify genre-specific CAVs using bias-related concept directions. As a case study, we focus on Hip-Hop, which we expect to be negatively aligned with the Female vocal concept. We create a Hip-Hop CAV based on a gender-balanced dataset and rank tracks in a similarly balanced test set using their Hip-Hop projection scores $p_{\mathrm{CAV}}(x)$ (Eq. 2). If bias is encoded, male-vocal tracks should appear near the top. To attenuate this effect, we adjust the Hip-Hop CAV by interpolating toward the Female vocal concept:
$$\mathrm{CAV}^{\mathrm{adj}}_{\mathrm{hiphop}} = (1 - \lambda) \cdot \mathrm{CAV}_{\mathrm{hiphop}} + \lambda \cdot \mathrm{CAV}_{\mathrm{female}},$$
with $\lambda \in [0, 1]$ controlling the adjustment strength. This shifts the decision boundary toward the female vocal direction, reducing male dominance in the top-ranked tracks. Subtracting the Male vocal CAV should achieve a similar effect, given that the two concept vectors are approximately opposites by construction.
This procedure reveals how semantically meaningful directions can influence genre alignment and provides a simple, reproducible post-hoc technique to analyze and mitigate representational bias. While we adjust CAV directions here, similar strategies could in principle be applied to embeddings directly.
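A toy version of the interpolation and ranking procedure is sketched below. All vectors and tracks are made up for illustration, and the bias term is left out of the ranking score because adding a constant to every projection does not change the order; how the bias of the adjusted CAV is handled in the released code is not stated in the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adjust_cav(cav_genre, cav_concept, lam):
    """Interpolate the genre CAV toward a concept CAV with weight lam in [0, 1]."""
    return [(1.0 - lam) * g + lam * c for g, c in zip(cav_genre, cav_concept)]


def male_ratio_top_half(cav, tracks):
    """Rank (embedding, is_male) pairs by projection; return the male share of the top 50%."""
    ranked = sorted(tracks, key=lambda t: dot(cav, t[0]), reverse=True)
    top = ranked[: len(ranked) // 2]
    return sum(1 for _, is_male in top if is_male) / len(top)


# Hypothetical 2-D embeddings: axis 0 ~ "Hip-Hop-ness", axis 1 ~ "female vocals".
cav_hiphop = [1.0, -0.5]  # toy genre CAV that leaks a male-vocal preference
cav_female = [0.0, 1.0]   # toy Female vocals CAV
tracks = [
    ([1.0, -1.0], True), ([0.8, -1.0], True),   # male-vocal tracks
    ([1.0, 1.0], False), ([0.8, 1.0], False),   # female-vocal tracks
]

for lam in (0.0, 0.25, 0.5):
    adjusted = adjust_cav(cav_hiphop, cav_female, lam)
    print(lam, male_ratio_top_half(adjusted, tracks))
```

With only four tracks the ratio can only jump between coarse values; on a realistic test set the same loop over `lam` would trace a curve of male share against adjustment strength.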
6. RESULTS
To investigate the influence of non-musical attributes on genre representation, we analyze the model responses to two representative concepts in depth: Female vocals and Portuguese language. The two discussed concepts were selected for their illustrative power - Female vocals as a proxy for the singer’s gender (noting that the Male vocals concept yields largely inverse results), and Portuguese as a representative example of linguistic variation in music. Portuguese was specifically chosen due to its strong association with genres such as Latin American Music and Brazilian Music, where meaningful entanglement might be expected, in contrast to other genres where such associations should be less likely. Additional results for all concepts are provided in the supplementary material and largely mirror the trends described in the following sections.
6.1 Evaluation of the Female Concept
We first assess the TCAV scores across genres for the concept of Female vocals across the four models, displayed in Figure 1. The average classification accuracy of the trained CAVs exceeds 80% for all models except MuQ-MuLan, indicating that the concept is linearly encoded in their latent spaces and thus reliably captured by the CAVs. Most TCAV scores deviate significantly from the chance level, indicated by the Bonferroni-corrected 95% confidence intervals not spanning across the 0.5 mark. This suggests the presence of bias in the internal genre representations of all models. The fact that these scores often diverge in direction between models - despite being trained on the same balanced data - strongly suggests that the observed biases reflect genuine differences in how each model encodes the concept.
MERT displays significant negative biases for genres such as Metal, Rock, and Hip-Hop, with TCAV scores substantially below 0.5. In contrast, genres like Electronic, R&B, and Soul show clear positive associations with the Female vocals concept. These patterns align with common vocal stereotypes associated with these genres. Interestingly, Classical music reveals the second-strongest negative bias in MERT, despite showing a significantly positive association with female vocals in all other models.
Figure 1. Per-model TCAV evaluation of concepts Gender-Female (upper) and Language-Portuguese (lower). Violin plots represent the distribution of TCAV scores across 500 CAV trainings. Boxes indicate a Bonferroni-corrected 95% confidence interval. Blue/orange coloring indicates significant positive/negative correlation between genre and concept.
Whisper demonstrates a notably different pattern. While it agrees with expectations for a few genres (e.g., Metal, Hip-Hop), its TCAV scores diverge - sometimes substantially - in other genres. These inconsistencies suggest that Whisper’s internal representations differ from those of MERT. One plausible explanation is Whisper’s origin as an automatic speech recognition (ASR) model, which may render it more sensitive to vocal characteristics such as pitch, timbre, or even lyrics, leading to model-specific associations between vocal traits and genre.
MuQ’s distribution patterns interestingly resemble those of Whisper, with both models showing relatively subtle genre-specific deviations. Neither model is exposed to music-specific textual labels during training. This absence of genre- or culturally-aligned text input may encourage more structurally grounded representations, resulting in similar concept entanglement with non-musical attributes.
MuQ-MuLan, while aligned in general bias direction with its sibling MuQ, reveals much stronger and more polarized TCAV scores. The model shows significantly negative scores for many genres, and amplified positive scores in others. This suggests that MuQ-MuLan, trained with text supervision, encodes stronger cultural associations between gender and genre. Notably, MuQ-MuLan exhibits strong and statistically significant TCAV scores despite lower CAV test accuracy. This suggests that the model’s representations are aligned with the concept in a more entangled or diffuse manner - capturing meaningful bias even when the concept is not cleanly linearly separable.
In summary, all models exhibit genre-specific biases related to female vocals, with significant variation in both magnitude and direction. These findings suggest that the training objective, modality, and supervision signal have a substantial influence on how gendered information becomes encoded, and underscore the need for awareness of such effects in downstream MIR applications.
6.2 Evaluation of the Portuguese Language Concept
We now turn to the TCAV evaluation of the Portuguese language concept across genres (Figure 1). While Whisper’s accuracy is close to perfect, the average CAV classification accuracy of the other models is lower than for the gender-based concepts. Although it is still above chance level, the results for these models should be interpreted with caution.
MERT shows strong positive TCAV scores for Latin American Music and Brazilian Music, which aligns with expectations given the natural linguistic-cultural overlap. Surprisingly, MERT shows the strongest bias for Hip-Hop. In contrast, genres such as Rock, Christian, and Blues exhibit significant negative biases, suggesting a possible entanglement of Portuguese with stylistic or rhythmic features more prevalent in other genres.
Whisper, by contrast, yields TCAV scores that remain close to the null hypothesis of 0.5 for most genres, indicating relatively little concept alignment. This supports the interpretation that Whisper - trained primarily for multilingual speech recognition - encodes language in a more disentangled fashion, possibly ignoring most musical information. A notable exception is a small but significant positive bias for Brazilian Music, hypothetically reflecting Whisper’s heightened sensitivity to vocal or phonetic traits present in Brazilian Portuguese singing.
Figure 2. Effect of concept vector debiasing on the Hip-Hop CAV. We show the ratio of male singers among the top 50% of ranked tracks when gradually adding the Female vocals CAV (blue) or subtracting the Male vocals CAV (orange), as a function of the adjustment weight $\lambda$.
MuQ captures the expected positive associations between Portuguese and the Latin American and Brazilian Music genres, while showing negative or neutral biases for other genres. These biases are more strongly amplified in MuQ-MuLan, where all genre-specific TCAV scores exhibit clearer polarity. This again may highlight the effect of multimodal training: while MuQ learns from acoustic features alone, MuQ-MuLan’s text alignment appears to reinforce cultural and linguistic correlations, magnifying concept entanglement.
6.3 Concept Debiasing
Figure 2 visualizes the effect of our vector-based debiasing strategy applied to the Hip-Hop CAV of MuQ-MuLan. Here, $\lambda = 0$ represents the original CAV-based sorting, where a strong male overrepresentation is observed in the top 50%, as expected from the earlier TCAV analysis. As $\lambda$ increases, we either add the Female vocals CAV or subtract the Male vocals CAV, and monitor the proportion of male singers among the top 50% ranked tracks in a gender-balanced Hip-Hop test set.
Notably, both operations lead to a nearly linear reduction in male dominance as $\lambda$ increases, suggesting a meaningful semantic direction in the latent space. This indicates that the Male and Female vocals CAVs encode well-isolated representations of vocal gender. The consistent effects across both directions support their intuitive symmetry - expected due to their mutually exclusive and balanced construction in training. A qualitative review of male-labeled tracks that remained highly ranked after debiasing revealed that many were in fact wrongly labeled, and instead feature female vocals, further reinforcing the reliability of the learned CAVs in capturing vocal gender. Our findings highlight that the learned CAVs capture meaningful and robust concept directions, and that concept vector manipulation can serve as a simple post-hoc strategy for adjusting model behavior, as showcased here in a ranking scenario.
7. LIMITATIONS AND FUTURE WORK
Concept disentanglement is essential for interpretable representations. In our work, we mitigate spurious associations between non-musical concepts (e.g., gender, language) and genre by carefully balancing datasets to avoid subgroup overrepresentation. However, even with this balancing, learned CAVs may still be entangled with latent or unobserved factors, especially when the target concept is not cleanly separable in the embedding space. Noisy or ambiguous concept labels may further degrade the clarity of the resulting CAVs. As a result, TCAV scores may reflect not only the intended concept but also correlated dimensions, limiting interpretability. Future work could explore orthogonal CAV training strategies (e.g., [18]) to better isolate individual concepts by explicitly reducing overlap between concept vectors in the latent space. Additionally, our linear analysis assumes a roughly interpretable embedding geometry, which may not capture complex concept interactions. Simple, acoustically salient concepts (like gendered voice timbre) may yield clearer, more interpretable CAVs than abstract, culturally embedded ones (like the singers’ age or regional associations). This could unintentionally bias the analysis toward more acoustically grounded attributes.
8. CONCLUSION
We systematically investigated non-musical bias in state-of-the-art music embedding models using Concept Activation Vectors (CAVs) and an adapted TCAV pipeline. Our results reveal significant and meaningful entanglements between genre representations and attributes such as singer gender and language, with variation across models. These patterns reflect known disparities in the music industry and highlight the need for bias-aware model development in MIR. Beyond diagnostic insights, we demonstrate that CAVs can serve as an intuitive and lightweight tool for post-hoc debiasing through concept vector manipulation. Our approach generalizes beyond music representation models: it can be readily applied to any MIR system that produces latent embeddings, including genre classifiers, taggers, or retrieval models. Crucially, it requires only a small set of curated concept examples, making it practical and accessible for real-world deployment.
With this work, we aim to encourage broader research into how MIR models understand and represent music - and the social and cultural implications that follow. While we use concept-based analysis, regardless of method, our primary goal is to foster critical reflection on the biases and assumptions embedded in music technologies.
9. ETHICS STATEMENT
This work investigates representational biases in music embedding models, focusing on demographic and linguistic attributes. Our goal is to expose how models may encode and propagate social and cultural imbalances, aiming to promote fairer and more inclusive MIR systems.
We acknowledge that concepts like gender and language are complex, fluid, and socially constructed. Our binary treatment of gender (male/female) reflects limitations in available metadata and is not an endorsement of reductive framings. We recognize the broader spectrum of gender identities and emphasize the need for more inclusive data collection practices in future research. The absence of non-binary classes in our study is due to insufficient annotated data, and we encourage the community to expand upon these axes with more representative datasets. In our data augmentation process, we were not able to formally verify gender identity, and relied on vocal characteristics to infer gender, introducing a potential source of labeling uncertainty.
All datasets used in this study were sourced from publicly available resources and supplemented with carefully annotated samples to improve representation across groups. We are committed to transparency and reproducibility in our research practices and publish the supplemented metadata alongside this work.
While our focus is on diagnosing and mitigating biases, we also acknowledge the broader ethical implications of our work. This includes the potential misuse of debiasing techniques and the unintended consequences of highlighting biases. Engaging with communities affected by these biases is crucial to ensuring that our research is grounded in real-world experiences and needs.
Our findings are intended to foster critical reflection on the biases and assumptions embedded in music technologies. We hope this work encourages broader research into how MIR models understand and represent music, and the social and cultural implications that follow.
10. REFERENCES
[1] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago,
and L. A. Ureña-López, “A survey on bias
in deep nlp,” APPLIED SCIENCES, vol. 11, no. 7.
[2] E. Ntoutsi, P. Fafalios, U. Gadiraju, V. Iosifidis, W. Nejdl,
M.-E. Vidal, S. Ruggieri, F. Turini, S. Papadopoulos,
E. Krasanakis et al., “Bias in data-driven artificial
intelligence systems – an introductory survey,”
WILEY INTERDISCIPLINARY REVIEWS: DATA MINING
AND KNOWLEDGE DISCOVERY, vol. 10, no. 3.
[3] A. Holzapfel, B. L. Sturm, and M. Coeckelbergh, “Ethical
dimensions of music information retrieval technology,”
TRANSACTIONS OF THE INTERNATIONAL
SOCIETY FOR MUSIC INFORMATION RETRIEVAL,
Sep 2018.
[4] D. Shakespeare, L. Porcaro, E. Gómez, and C. Castillo,
“Exploring artist gender bias in music recommendation,”
2020.
[5] C. Wang, G. Richard, and B. McFee, “Transfer learning
and bias correction with pre-trained audio embeddings,”
in Proceedings of the 24th International Society
for Music Information Retrieval Conference (ISMIR),
2023.
[6] B. Weck, I. Manco, E. Benetos, E. Quinton,
G. Fazekas, and D. Bogdanov, “Muchomusic: Evaluating
music understanding in multimodal audio-language
models,” in Proceedings of the 25th International
Society for Music Information Retrieval Conference
(ISMIR), 2024.
[7] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler,
F. Viegas, and R. Sayres, “Interpretability beyond feature
attribution: Testing with concept activation vectors
(tcav),” in Advances in Neural Information Processing
Systems (NeurIPS), 2018.
[8] B. C. Richardson, R. Yoder, and T. F. P. II, “Gender and
perception of music genre in college students,”
MODERN PSYCHOLOGICAL STUDIES, 2022.
[9] C. Tabak, “Gender and music: Gender roles and
the music industry,” THE JOURNAL OF WORLD
WOMEN STUDIES, 2023.
[10] A. Epps-Darling, H. Cramer, and R. T. Bouyer, “Artist
gender representation in music streaming,” in Proceedings
of the 21st International Society for Music Information
Retrieval Conference (ISMIR), 2020.
[11] A. Ferraro, X. Serra, and C. Bauer, “Break the loop:
Gender imbalance in music recommenders,” in Proceedings
of the 2021 Conference on Human Information
Interaction and Retrieval. New York, NY, USA:
Association for Computing Machinery, 2021.
[12] S. Howard, C. N. Silla, and C. G. Johnson, “Automatic
lyrics-based music genre classification in a multilingual
setting,” 2011.
[13] C.-K. Yeh, B. Kim, and P. Ravikumar, “Human-centered
concept explanations for neural networks,”
2022.
[14] Y. Zhang, D. S. Carvalho, and A. Freitas, “Learning
disentangled semantic spaces of explanations via invertible
neural networks,” in Proceedings of the 62nd
Annual Meeting of the Association for Computational
Linguistics (ACL), 2024.
[15] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann,
E. Pierson, B. Kim, and P. Liang, “Concept bottleneck
models,” in Proceedings of the 37th International Conference
on Machine Learning, ser. Proceedings of Machine
Learning Research, H. D. III and A. Singh, Eds.
[16] F. Foscarin, K. Hoedt, V. Praher, A. Flexer, and G. Widmer,
“Concept-based techniques for ‘musicologist-friendly’
explanations in a deep music classifier,” in
Proceedings of the Int. Society for Music Information
Retrieval Conf., 2022.
[17] K. A. Thakoor, S. C. Koorathota, D. C. Hood, and
P. Sajda, “Robust and interpretable convolutional neural
networks to detect glaucoma in optical coherence
tomography images,” IEEE TRANSACTIONS ON
BIOMEDICAL ENGINEERING, 2021.
[18] E. Erogullari, S. Lapuschkin, W. Samek, and F. Pahde,
“Post-hoc concept disentanglement: From correlated
to isolated concept representations,” 2025.
[19] C. J. Anders, L. Weber, D. Neumann, W. Samek,
K. R. Müller, and S. Lapuschkin, “Finding and removing
clever hans: Using explanation methods to debug
and improve deep models,” INFORMATION FUSION,
2022.
[20] A. Nicolson, L. Schut, J. A. Noble, and Y. Gal, “Explaining
explainability: Recommendations for effective
use of concept activation vectors (cavs),” 2025.
[21] Z. Wei, A. Caines, P. Buttery, and M. Gales,
“Analysing bias in spoken language assessment using
concept activation vectors,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2021.
[22] S. Mishra, B. L. Sturm, and S. Dixon, “Local interpretable
model-agnostic explanations for music content
analysis,” in Proceedings of the 18th International
Society for Music Information Retrieval Conference
(ISMIR), Suzhou, China.
[23] D. Afchar, R. Hennequin, and V. Guigue, “Learning
unsupervised hierarchies of audio concepts,” in Proceedings
of the 23rd International Society for Music
Information Retrieval Conference (ISMIR), Bengaluru,
India, 2022.
[24] Z. Yu and S. Ananiadou, “Understanding and mitigating
gender bias in llms via interpretable neuron editing,”
arXiv preprint arXiv:2501.14457, 2025.
[25] A. Krishnan, B. M. Abdullah, and D. Klakow, “On the
encoding of gender in transformer-based asr representations,”
in Proc. of Interspeech 2024. ISCA, September
2024.
[26] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and
A. Kalai, “Man is to computer programmer as woman
is to homemaker? debiasing word embeddings,” arXiv
preprint arXiv:1607.06520, 2016.
[27] R. Correa, K. Pahwa, B. Patel, C. M. Vachon, J. W.
Gichoya, and I. Banerjee, “Efficient adversarial debiasing
with concept activation vector – medical image
case-studies,” JOURNAL OF BIOMEDICAL
INFORMATICS.
[28] X. Tong and L. Kagal, “Investigating bias in image
classification using model explanations,” 2020.
[29] L. S. Maia, M. Rocamora, L. W. P. Biscainho, and
M. Fuentes, “Selective annotation of few data for beat
tracking of latin american music using rhythmic features,”
TRANSACTIONS OF THE INTERNATIONAL
SOCIETY FOR MUSIC INFORMATION RETRIEVAL,
May 2024.
[30] Y.-X. Lin, J.-C. Lin, W.-L. Wei, and J.-C. Wang,
“Learnable counterfactual attention for music classification,”
IEEE TRANSACTIONS ON AUDIO, SPEECH
AND LANGUAGE PROCESSING, 2025.
[31] Z. Zhao, “Let network decide what to learn: Symbolic
music understanding model based on large-scale adversarial
pre-training,” 2025.
[32] F. Yesiler, M. Miron, J. Serrà, and E. Gómez, “Assessing
algorithmic biases for musical version identification,”
in Proceedings of the Fifteenth ACM International
Conference on Web Search and Data Mining,
2022.
[33] A. Holzapfel, F. Krebs, and A. Srinivasamurthy,
“Tracking the “odd”: Meter inference in a culturally
diverse music corpus,” in Proceedings of the 15th International
Society for Music Information Retrieval Conference
(ISMIR), 2014.
[34] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg,
R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang,
Y. Guo, and J. Fu, “Mert: Acoustic music understanding
model with large-scale self-supervised training,”
2023.
[35] A. Radford, J. W. Kim, T. Xu, G. Brockman,
C. McLeavey, and I. Sutskever, “Robust speech recognition
via large-scale weak supervision,” arXiv preprint
arXiv:2212.04356, 2022.
[36] L. Zhuo, R. Yuan, J. Pan, Y. Ma, Y. Li, G. Zhang,
S. Liu, R. Dannenberg, J. Fu, C. Lin, E. Benetos,
W. Chen, W. Xue, and Y. Guo, “Lyricwhiz: Robust
multilingual zero-shot lyrics transcription by whispering
to chatgpt,” in Proceedings of the 24th International
Society for Music Information Retrieval Conference
(ISMIR), 2023.
[37] H. Zhu, Y. Zhou, H. Chen, J. Yu, Z. Ma, R. Gu, Y. Luo,
W. Tan, and X. Chen, “Muq: Self-supervised music
representation learning with mel residual vector quantization,”
arXiv preprint arXiv:2501.01108, 2025.
[38] Y. Kong, V.-A. Tran, and R. Hennequin, “Strada: A
singer traits dataset,” in Proceedings of Interspeech
2024, 2024.