
Identification and Clustering of Unseen Ragas in Indian Art Music

Author: Parampreet Singh; Adwik Gupta; Aakarsh Mishra; Vipul Arora
Publisher: Zenodo
DOI: 10.5281/zenodo.17706596
Source: https://zenodo.org/records/17706596/files/000093.pdf
IDENTIFICATION AND CLUSTERING OF UNSEEN RAGAS IN INDIAN ART MUSIC
Parampreet Singh†, Adwik Gupta, Aakarsh Mishra, Vipul Arora
Indian Institute of Technology, Kanpur
{params21, adwikg22, aakarsh21, vipular}@iitk.ac.in
ABSTRACT
Raga classification in Indian Art Music is an open-set problem in which unseen
classes may appear during testing. However, traditional approaches often treat
it as a closed-set problem, ignoring the possibility of encountering unseen
classes. In this work, we tackle this problem by first employing
uncertainty-based Out-Of-Distribution (OOD) detection, given a set containing
known and unknown classes. Next, for the audio samples identified as OOD, we
employ a Novel Class Discovery (NCD) approach to cluster them into distinct
unseen Raga classes. We achieve this by harnessing information from labelled
data and further applying contrastive learning on unlabelled data. With
thorough analysis, we demonstrate the influence of different components of the
loss function on clustering performance and examine how varying openness
affects the NCD task at hand.
1. INTRODUCTION
Ragas form the core melodic framework of Indian Art Music (IAM), each
characterized by a distinct set of notes and improvisational rules that evoke
specific emotions or moods [1]. Identifying Ragas in audio recordings has
various applications, including music recommendation systems, cultural
preservation, and music education [1]. While traditional methods would rely on
handcrafted features and expert knowledge, recent advancements in deep learning
have enabled automated Raga identification [2–6], where the shortage of labeled
datasets remains a significant challenge. Labeling Raga audios in MIR is costly
and labor-intensive, requiring domain expertise, while variations in style and
recording conditions further complicate annotation. The problem of Raga
identification is inherently an open-set problem, since the number of Ragas is
not fixed, and new, unseen classes can emerge during testing, making
classification more challenging. However, existing approaches have largely
treated it as a closed-set problem [2, 3, 5, 6], limiting their ability to
handle novel Raga classes during testing.
© P. Singh, A. Gupta, A. Mishra and V. Arora. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
P. Singh, A. Gupta, A. Mishra and V. Arora, “Identification and Clustering of
Unseen Ragas in Indian Art Music”, in Proc. of the 26th Int. Society for Music
Information Retrieval Conf., Daejeon, South Korea, 2025.
This work tackles the challenge of unknown Raga classes through the following
approach. First, we perform Out-of-Distribution (OOD) detection by using
uncertainty estimates from a model trained only on seen classes, identifying
unseen Ragas without prior exposure to them. Next, we frame this as a Novel
Class Discovery (NCD) problem, where the OOD Raga samples are assumed to belong
to distinct, previously unseen classes and are clustered in a self-supervised
manner. For NCD, we would generally have target classes <= training classes.
So, we define openness of the NCD problem in a similar manner to open-set [7]
problems as:
$$O_{\mathrm{NCD}} = 1 - \sqrt{\frac{2 \times |\text{training classes}|}{2 \times |\text{training classes}| + |\text{test classes}|}} \tag{1}$$
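As a quick check of Eq. (1), the following minimal sketch evaluates the openness for the class counts used later in the paper (12 training classes with 5 or 12 unseen classes). The function name `ncd_openness` is only illustrative and is not taken from the authors' code.

```python
import math


def ncd_openness(n_train_classes: int, n_test_classes: int) -> float:
    """Openness of an NCD problem following Eq. (1)."""
    ratio = (2 * n_train_classes) / (2 * n_train_classes + n_test_classes)
    return 1.0 - math.sqrt(ratio)


# 12 known Raga classes with 5 or 12 unseen classes.
print(round(ncd_openness(12, 5), 2))   # 0.09
print(round(ncd_openness(12, 12), 2))  # 0.18
```

Both values agree with the openness levels of 0.09 and 0.18 reported in the openness study.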
For our task, we define two disjoint subsets of Raga classes: a closed-set
training set $C_{\text{train}}$ consisting of 12 known Raga classes belonging
to the PIM [6] dataset (sourced from Prasar Bharati¹ audios), and a held-out
target set $C_{\text{test}}$ comprising novel Raga classes that are entirely
unseen during training, belonging to both the Saraga (Hindustani) [8] and
PIM [6] datasets. We analyze our approach on varying levels of openness on both
datasets.
By utilizing this framework, we can effectively tap into the vast amount of
freely available, unlabeled Raga recordings from online platforms like YouTube,
significantly reducing dependence on manually labeled data. Our approach not
only addresses the challenge posed by limited labeled datasets but also
enhances the ability of MIR systems to recognize a broader range of Ragas,
providing a scalable and adaptive solution for music classification. The codes,
metadata, and other resources can be accessed at the dedicated Github
Repository.
¹ Prasar Bharati is India’s public broadcasting agency, comprising Doordarshan
Television Network and All India Radio. It maintains an extensive archive of
Indian classical music recordings.
2. RELATED WORKS
2.1 OOD Detection
Uncertainty estimation is a well-established field in machine learning that
focuses on evaluating the confidence of model predictions for given test
examples. Various approaches utilize uncertainty for identifying OOD samples.
The work [9] proposes using maximum softmax probabilities as uncertainty
indicators. Deep ensembles [10] combine multiple models to achieve robust
uncertainty estimates. Bayesian Neural Networks offer principled uncertainty
quantification through posterior distribution approximation. Other methods
include techniques which train an auxiliary model to predict confidence scores
[11–13]. Monte Carlo dropout (MC-dropout) [14] applies dropout during inference
to simulate Bayesian sampling. In our work, we utilize uncertainty scores from
MC-dropout for OOD detection, leveraging our pre-trained model without
requiring additional training.
2.2 Novel Class Discovery (NCD)
Novel Class Discovery focuses on clustering unknown classes in unlabeled data
while utilizing knowledge from labeled data of known classes [15–19]. Unlike
semi-supervised learning [20, 21], which assumes shared label spaces, or
zero-shot learning [22, 23], which requires human-defined semantic attributes,
NCD enables discovery of novel categories without such dependencies. This makes
it particularly valuable for music applications, where new classes continuously
emerge.
In the image domain, NCD approaches have explored various contrastive learning
techniques. Han et al. [18] introduce a graph-based approach for transferring
knowledge from labeled to unlabeled data. Ranking statistics [16] have been
introduced to construct negative samples for contrastive loss, while
Neighborhood Contrastive Learning (NCL) [17] replaces ranking statistics with
cosine similarity and proposes methods for generating hard negatives.
2.3 Self-supervised Learning in Music Classification
Several works in music classification have explored self-supervised learning
techniques. Differentiable ranking [24] techniques on spectrogram patches
improve instrument classification and pitch estimation, though this approach is
computationally intensive. [25] utilizes self-supervised contrastive learning
for singing voice analysis by applying audio-specific transformations such as
time-stretching and pitch-shifting to distinguish vocal timbre and expression.
Another study, [26], integrates the Swin Transformer into a contrastive
learning framework for music genre classification, demonstrating strong
performance with limited labeled data. Additionally, [27] explores the
reordering of shuffled spectrogram segments to improve learned audio
representations for tasks such as instrument classification and pitch
estimation.
In our work, for NCD, we build on Neighborhood Contrastive Learning (NCL) [17]
with tailored modifications in positive/negative pair generation and better
transformations for consistency loss for our task. We train a supervised model
to learn meaningful representations, and then use these representations to
train another model in a self-supervised manner to discover and categorize
novel Raga classes in the unlabeled dataset.
Figure 1. Block diagram illustrating the overall system workflow: audio input
is first converted to a chromagram and processed by a feature extractor.
Extracted features are then used for classification, out-of-distribution (OOD)
detection, and subsequent clustering of OOD samples, enabling both
in-distribution classification and unsupervised grouping of OOD data.
3. METHOD
The overall flow of the whole process is shown in Figure 1. We construct a
labeled subset $S_l$ containing $N$ 30-second audio clips $x_i^l$, sourced from
the PIM dataset [6], each belonging to one of the $c$ predefined Raga classes.
We pre-process to remove speech segments, discard audio clips shorter than 30
seconds, and subsequently extract tonic-normalized chromagram features [6],
which form the input to train the Raga classifier $f(\cdot)$. Formally, the
labeled subset $S_l$ is defined as $S_l = \{(x_i^l, c_i)\}_{i=1}^{N}$, where
$c_i \in C_l$ corresponds to its ground-truth Raga label. Similarly, we define
an unlabeled subset of $M$ samples $S_u = \{x_i^u\}_{i=1}^{M}$, where the
corresponding class labels are assumed to be absent. The set of unseen classes
$C_u$ is varied in size based on the openness of the problem.
3.1 Supervised pre-training
For classification, we split $S_l$ into training, validation, and test subsets,
and train a CNN-LSTM model $f(\cdot)$ in a fully supervised manner using
categorical cross-entropy loss. Once trained, this CNN-LSTM model serves as a
feature extractor by removing the final softmax layer. The resulting feature
extractor, denoted as $f_{\text{feat}}(\cdot)$, generates embeddings $y_i$ for
both $S_l$ and $S_u$, which are later used for OOD detection and NCD.
3.2 OOD Detection
Monte Carlo (MC) Dropout [14] is a technique for estimating epistemic
uncertainty in deep learning models. Given a pre-trained CNN-LSTM classifier
$f(\cdot)$, we enable dropout at inference time to approximate a Bayesian
neural network. The predictive uncertainty is estimated by performing $T$
stochastic forward passes, yielding a set of softmax outputs. By doing this, we
assume that the network’s parameters $W$ vary under different dropout masks.
The variance of these predictions quantifies uncertainty values. Higher
variance indicates greater uncertainty, suggesting a higher likelihood of the
sample belonging to an OOD class.
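The variance-over-passes idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: a toy linear layer stands in for the CNN-LSTM classifier, the per-class variances are averaged into one scalar score (an assumption, since the text does not say how the variance is aggregated), and `THRESHOLD` is a hypothetical value that would have to be tuned.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def mc_dropout_variance(x, weights, n_passes=50, drop_prob=0.5, seed=0):
    """Return (mean softmax, uncertainty score) over stochastic passes.

    x: (d,) input features; weights: (d, n_classes) toy linear classifier.
    """
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_passes):
        # Fresh Bernoulli mask on every pass, with inverted-dropout rescaling.
        mask = rng.random(x.shape) >= drop_prob
        dropped = x * mask / (1.0 - drop_prob)
        outputs.append(softmax(dropped @ weights))
    outputs = np.stack(outputs)            # (n_passes, n_classes)
    # Variance across passes, averaged over classes (aggregation is assumed).
    score = outputs.var(axis=0).mean()
    return outputs.mean(axis=0), score


rng = np.random.default_rng(1)
weights = rng.normal(size=(16, 12))        # 12 known classes
x = rng.normal(size=16)
mean_probs, score = mc_dropout_variance(x, weights)
THRESHOLD = 0.01                           # hypothetical, tuned on held-out data
is_ood = score > THRESHOLD
```

With dropout disabled (`drop_prob=0.0`) every pass is identical and the score collapses to zero, which is why the stochastic masks are what carry the uncertainty signal.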
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.3 Novel Class Discovery
3.3.1 BCE Loss
For an input audio clip $x_i^u \in S_u$, let $y_i^u = f_{\text{feat}}(x_i^u)$
be the embeddings using the pre-trained feature extractor
$f_{\text{feat}}(\cdot)$. The cosine similarity $\varepsilon$ between a pair of
feature embeddings $(y_i^u, y_j^u)$ is given by:
$$\varepsilon(y_i^u, y_j^u) = \frac{(y_i^u)^\top y_j^u}{\|y_i^u\| \, \|y_j^u\|} \tag{2}$$
Now, using this, we assign a pairwise pseudo-label $t_{i,j}$ as:
$$t_{i,j} = \mathbb{1}\left\{\varepsilon(y_i^u, y_j^u) \ge \delta\right\}, \tag{3}$$
where $\delta$ is a similarity threshold that determines whether the two
samples belong to the same latent class. Furthermore, if two audio samples
$x_i^u$ and $x_j^u$ are formed by splitting from the same audio file, they are
assigned $t_{i,j} = 1$ as they definitely belong to the same class.
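The pseudo-labelling rule, including the same-file override, can be sketched in a few lines of NumPy. The function name, the toy embeddings, and the threshold value `delta=0.9` are illustrative assumptions; the paper does not state the value of the threshold here.

```python
import numpy as np


def pairwise_pseudo_labels(embeddings, file_ids, delta=0.9):
    """Binary pseudo-label matrix: 1 where cosine similarity >= delta,
    or where both clips were cut from the same audio file."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = unit @ unit.T                      # Eq. (2) for all pairs
    labels = (cosine >= delta).astype(int)      # Eq. (3)
    same_file = file_ids[:, None] == file_ids[None, :]
    labels[same_file] = 1                       # same-file override
    return labels


y = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]])
file_ids = np.array([0, 1, 2, 2])               # clips 2 and 3 share a file
labels = pairwise_pseudo_labels(y, file_ids)
print(labels)
```

In this toy example clips 0 and 1 are paired by similarity, while clips 2 and 3 are paired only because they share a source file (their cosine similarity is about 0.71, below the threshold).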
These pairwise pseudo-labels are used to train a self-attention encoder model
$g(\cdot)$, which incorporates a multi-head self-attention mechanism utilizing
scaled dot-product attention, along with layer normalization and feedforward
sub-layers. The network consists of multiple such stacked layers, with the
input being the embedding $y_i^u$ and its output denoted as
$z_i^u = g(y_i^u)$. For BCE loss between the given pair of inputs, we define
the normalized dot product between the output embeddings, given by $p_{i,j}$:
$$p_{i,j} = \frac{(z_i^u)^\top z_j^u}{\|z_i^u\| \cdot \|z_j^u\|} \tag{4}$$
The BCE loss function is defined as:
$$\ell_{bce} = t_{i,j} \log(p_{i,j}) + (1 - t_{i,j}) \log(1 - p_{i,j}). \tag{5}$$
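A minimal NumPy sketch of the pairwise BCE term follows. Three choices here are assumptions rather than details from the paper: the summand of Eq. (5) is negated and averaged over all pairs so that the result is a quantity to minimize, and the normalized dot product is clipped into (0, 1) so that both logarithms are defined.

```python
import numpy as np


def pairwise_bce(z, pseudo_labels, eps=1e-7):
    """Negated mean of the Eq. (5) summand over all embedding pairs."""
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    p = unit @ unit.T                     # Eq. (4), values in [-1, 1]
    p = np.clip(p, eps, 1.0 - eps)        # assumption: keep both logs finite
    t = pseudo_labels
    per_pair = t * np.log(p) + (1 - t) * np.log(1 - p)
    return float(-per_pair.mean())        # lower is better


t = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
z_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # agrees with t
z_bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # contradicts t
loss_good = pairwise_bce(z_good, t)
loss_bad = pairwise_bce(z_bad, t)
```

Embeddings that agree with the pseudo-labels give a loss near zero, while embeddings that contradict them are penalized heavily.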
3.3.2 Consistency Loss
To enforce consistency under transformations, we introduce a loss ensuring that
an audio sample $x_i$ and its transformed versions $\tilde{x}_i$ yield similar
outputs. We generate alternate views by time shifting, where for a given audio
clip, we create two transformed versions by slightly shifting its start and end
times (by 2 seconds) within the original audio, and by volume modification
(increase and decrease). We then extract embeddings from the transformed audio
$\tilde{x}_i$, obtaining $\tilde{z}_i = g(f(\tilde{x}_i))$, and apply MSE loss
as:
$$\ell_{mse} = \frac{1}{C_l} \sum_{i=1}^{C_l} \left(z_i^l - \tilde{z}_i^l\right)^2 + \frac{1}{C_u} \sum_{j=1}^{C_u} \left(z_j^u - \tilde{z}_j^u\right)^2. \tag{6}$$
3.3.3 Contrastive Learning
To define contrastive loss, we construct positive and negative pairs for our
dataset. For negative pairs, each $y_i^u$ is compared with all embeddings
$y_n \in S_u \cup S_l$, using cosine similarity $\varepsilon(y_i^u, y_n)$. We
then create a list $\zeta_i^u$:
$$\zeta_i^u = \mathrm{list}\left(\varepsilon(y_i^u, y_n)\right), \quad \forall \{y_i^u \in S_u\}. \tag{7}$$
Algorithm 1 Novel Raga Clustering
Require: OOD dataset $S_u$, feature extractor $f_{\text{feat}}(\cdot)$
Require: Encoder model $g(\cdot)$, scaling hyperparameters $\beta$, $\gamma$, temperature parameter $\tau$
Extract chromagram features from all $x_i^u \in S_u$
Use $f_{\text{feat}}(\cdot)$ to compute embeddings $y_i^u$
for each pair $(y_i^u, y_j^u) \in S_u$ do
    Compute cosine similarity $\varepsilon(y_i^u, y_j^u)$
    Assign pseudo-label $t_{i,j}$ based on threshold $\delta$
    Compute $p_{i,j}$ using Eq. 4
    Compute BCE Loss $\ell_{bce}$
end for
for each sample $x_i^u$ do
    Apply time and volume shifts on $x_i^u$ to get $\hat{x}_i^u$
    Compute transformed embeddings $\hat{y}_i^u = f(\hat{x}_i^u)$
    Compute Consistency Loss $\ell_{mse}$
end for
for each sample $x_i^u$ do
    Select $H$ hardest negative samples $\xi_m$ and positive samples $\phi$
    Define contrastive loss $\ell_{cl}$ using positive and negative pairs
end for
Compute total loss: $\ell = \ell_{bce} + \beta \ell_{cl} + \gamma \ell_{mse}$
for epoch = 1 to $E$ do
    Train $g(\cdot)$ using total loss $\ell$
end for
Output: Trained model $g(\cdot)$
The similarities are ranked in ascending order, and the $H$ least similar
embeddings are selected as hard negatives:
$$\xi_h = \mathrm{argtop}_h(\zeta_i^u), \quad \forall i. \tag{8}$$
For positive pairs, similar to BCE loss, all the audio samples $x_i^u$ and
$x_j^u$ originating from the same audio file are considered to belong to the
same class and hence treated as positive pairs. Their corresponding embeddings
$\hat{z}_i^u$ are stored in the set $\phi$, which is defined as:
$$\phi = \{\hat{z}_i^u \mid z_i^u \text{ shares the same source audio file}\}$$
We now define the contrastive loss $\ell_{cl}$ [17] as:
$$\ell_{cl} = -\frac{1}{k} \sum_{\hat{z}_i^u \in \phi} \log \frac{e^{\varepsilon(z_i^u, \hat{z}_i^u)/\tau}}{e^{\varepsilon(z_i^u, \hat{z}_i^u)/\tau} + \sum_{\bar{z} \in \xi_m} e^{\varepsilon(z_i^u, \bar{z}_m^u)/\tau}}, \tag{9}$$
where $\tau$ is a temperature parameter that controls the concentration of
similarity scores. This loss function optimizes embeddings by bringing each
sample closer to its positive counterpart $\hat{z}_i^u$ while pushing it away
from hard negatives $\bar{z}_m^u$. Finally, we get a unified objective function
$\ell$ by combining Eqs. 5, 6, and 9:
$$\ell = \ell_{bce} + \beta \ell_{cl} + \gamma \ell_{mse}. \tag{10}$$
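The negative selection and the contrastive term can be sketched as follows. This is an illustrative NumPy version under stated assumptions: negatives are chosen as the least similar candidates, following the ranking described in the text; $k$ is taken to be the number of positives, which the text does not define; and the temperature `tau=0.1` and the toy vectors are hypothetical.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def hard_negatives(anchor, candidates, n_hard):
    """Indices of the n_hard candidates least similar to the anchor."""
    sims = np.array([cosine(anchor, c) for c in candidates])
    return np.argsort(sims)[:n_hard]     # ascending order: least similar first


def contrastive_loss(anchor, positives, negatives, tau=0.1):
    """Average over positives of -log(pos / (pos + sum over negatives))."""
    neg_sum = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    terms = []
    for p in positives:
        pos = np.exp(cosine(anchor, p) / tau)
        terms.append(-np.log(pos / (pos + neg_sum)))
    return float(np.mean(terms))         # assumption: k = number of positives


anchor = np.array([1.0, 0.0])
positives = [np.array([0.9, 0.1])]
candidates = [np.array([0.0, 1.0]), np.array([-1.0, 0.0]), np.array([0.8, 0.2])]
idx = hard_negatives(anchor, candidates, n_hard=2)
loss = contrastive_loss(anchor, positives, [candidates[i] for i in idx])
```

The loss is small here because the selected negatives are far from the anchor; replacing them with the nearby candidate `[0.8, 0.2]` raises it noticeably.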
Here, $\beta$ and $\gamma$ are the scaling hyperparameters, as the magnitude of
these losses varies significantly. Proper tuning of these hyperparameters is
critical for achieving optimal performance. This combined loss is used to train
the self-attention encoder $g(\cdot)$, which learns to differentiate unseen
raga classes in a self-supervised manner. This whole training process is
explained in Algorithm 1.
3.4 Clustering Techniques
Given the embeddings $z_i$ from the encoder model $g(\cdot)$, we experiment
with three different approaches for grouping the embeddings to assign predicted
labels:
(i) Computing a cosine similarity matrix across embedding pairs $(z_i, z_j)$,
and given a threshold $h$, grouping those with similarity $\varepsilon > h$
into the same cluster.
(ii) Applying K-means clustering to group the embeddings into $K$ clusters.
(iii) Reducing the dimensionality of embeddings using UMAP for visualization,
followed by K-means clustering on the transformed representations.
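Approach (i) leaves open how pairs above the threshold are merged into clusters. One plausible reading, sketched here as an assumption, is to treat every pair with similarity above $h$ as an edge and take connected components of the resulting graph. The threshold `h=0.8` and the toy embeddings are hypothetical.

```python
import numpy as np


def threshold_clusters(z, h=0.8):
    """Label embeddings by connected components of the graph that has an
    edge wherever cosine similarity exceeds h."""
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    adjacent = (unit @ unit.T) > h
    n = len(z)
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first traversal from an unlabelled point.
        stack = [start]
        labels[start] = current
        while stack:
            node = stack.pop()
            for nb in np.flatnonzero(adjacent[node]):
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels


z = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95], [-1.0, 0.0]])
cluster_labels = threshold_clusters(z)
print(cluster_labels)  # [0 0 1 1 2]
```

Unlike K-means, this grouping does not need the number of clusters in advance, but the result depends strongly on the choice of threshold.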
3.5 Evaluation Metrics
We assess the quality of the clusters so formed using both label-independent
and label-dependent evaluation metrics.
(i) Silhouette Score (SS) [28] is a label-independent metric, which evaluates
how well a data point is situated within its designated cluster in relation to
other clusters, without considering the ground truth of those clusters. The
score falls between -1 and 1. For well-separated clusters, SS approaches 1, and
it approaches -1 for poorly formed clusters.
(ii) Adjusted Rand Index (ARI) [29] is a label-dependent metric, which compares
the similarity between predicted clusters and actual ground truth clusters,
with an adjustment for random assignments. The score is at most 1, where 1
represents perfect alignment with the ground truth; it is close to 0 for random
assignments and can be negative for worse-than-chance agreement.
(iii) Mutual Information (MI) [30] measures the amount of information shared
between the true clusters ($c_t$) and predicted clusters ($c_p$). It captures
how much knowing the predicted cluster assignment reduces uncertainty about the
true cluster assignment. MI is non-negative and has no fixed upper bound, with
higher values indicating that the predicted clustering is more aligned with the
actual class structure.
(iv) Clustering Accuracy (ACC) evaluates how well the predicted clusters align
with the true labels. For each ground truth cluster $c_t$, we identify the
predicted cluster $c_p$ that has the highest overlap with $c_t$. The subset of
embeddings that belong to both $c_t$ and $c_p$ is represented as:
$$c_{pt} = \{z_i \mid z_i \in c_p \text{ and } z_i \in c_t\}.$$
ACC for a given true cluster $c_t$ is then computed as:
$$\mathrm{ACC}(c_t) = \frac{|c_{pt}|}{|c_t|} \times 100.$$
Misclassified points are those that do not belong to any matched cluster.
Furthermore, if a predicted cluster $c_p$ is mapped to multiple true clusters
$c_t$, the clustering is considered invalid, and accuracy, along with other
performance metrics, is not calculated.
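The ACC rule, including the invalid case, can be sketched in plain Python. The function name and the toy labels are illustrative; ties between equally overlapping predicted clusters are broken by first occurrence here, which is an assumption because the text does not cover ties.

```python
from collections import Counter


def clustering_accuracy(true_labels, pred_labels):
    """Per-class ACC (%) following the matching rule described above.

    Returns None when one predicted cluster is the best match for several
    true classes, which the text treats as an invalid clustering.
    """
    per_class = {}
    used = set()
    for t in sorted(set(true_labels)):
        preds = [p for tl, p in zip(true_labels, pred_labels) if tl == t]
        best, overlap = Counter(preds).most_common(1)[0]
        if best in used:
            return None
        used.add(best)
        per_class[t] = 100.0 * overlap / len(preds)
    return per_class


true = ["A", "A", "A", "A", "B", "B", "B", "B"]
pred = [0, 0, 0, 1, 1, 1, 1, 1]
acc = clustering_accuracy(true, pred)
print(acc)  # {'A': 75.0, 'B': 100.0}
```

If every point were assigned to a single predicted cluster, both true classes would map to it and the function would return `None`, mirroring the invalid case.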
4. EXPERIMENTAL RESULTS
The labeled dataset $S_l$ consists of 141 audio files sourced from the PIM [6]
dataset, segmented into 5,734 audio samples, with a total duration of
approximately 47.78 hours. A CNN-LSTM model $f(\cdot)$ is trained in a
supervised manner on this dataset for multi-class classification across 12 Raga
classes, achieving an F1-score of 0.89 through cross-validation. This trained
model serves as a feature extractor for downstream tasks, where representations
for OOD detection and NCD are obtained by extracting features from different
depths of the network. We construct another set $S_u$ for which the Raga labels
are discarded, treating it as unlabeled data. We conduct a range of OOD and NCD
experiments using both the PIM [6] and Saraga [8] (Hindustani) datasets at
different stages, as summarized in Table 1.
Experiment                Dataset       Description
OOD detection             PIM / Saraga  Carry out OOD detection for both datasets separately using $f(\cdot)$; results in Table 2
Feature ablation          PIM           Compare Chromagram vs Melody [31] vs MERT [32] features; results in Table 3
Loss component ablation   PIM           Test $\ell_{bce}$ / $\ell_{cl}$ / $\ell_{mse}$ contributions; results in Table 5
Clustering comparison     PIM / Saraga  Evaluate Cosine-sim vs K-Means vs UMAP+K-means; results in Table 4
Openness study            PIM           Analyze performance at openness = 0.09 & 0.18; results in Table 6
Table 1. Summary of all experimental setups, datasets, and their corresponding
result locations in the paper.
Metric/Dataset   Saraga   PIM
OOD Accuracy     85.6%    80.87%
Table 2. Comparison of OOD detection accuracy for the Saraga and PIM datasets
4.1 OOD
For OOD detection, we select test files from five unseen classes in the PIM and
Saraga datasets, prioritizing those with higher representation. From PIM, we
use 41 audio files, resulting in 2,435 audio clips (20.29 hours), belonging to
5 Raga classes: Bageshri, Bhopali, Jog-Kauns, Mishra-Khamaj, and Puriya-Kalyan.
From Saraga, we use 14 audio files, yielding 1,136 audio clips (9.46 hours)
belonging to 5 Raga classes: Bhopali, Bhimpalasi, Marwa, Shree, and Todi. An
equal number of files from $S_l$ (only from the PIM dataset) is included for
comparison. The $f(\cdot)$ model is trained with MC-dropout, with $T = 50$
forward passes for each $x_i$, and a variance-based threshold is applied to
classify samples as OOD or in-distribution. Results, presented in Table 2,
demonstrate OOD detection performance. The model performs better on Saraga for
OOD detection. Since the $f(\cdot)$ model is trained on the PIM dataset, the
OOD recordings from PIM may share acoustic similarities with the training data,
making OOD detection more challenging. In contrast, the Saraga dataset,
recorded in different acoustic environments, serves as a more distinct and thus
easier target for OOD detection.
4.2 Feature Ablation
For $S_l$, we extract embeddings using the pre-trained MERT model [32],
melody-based embeddings from [31], and the feature extractor
$f_{\text{feat}}(\cdot)$, which extracts embeddings from the penultimate layer
of the CNN-LSTM classifier $f(\cdot)$. These embeddings are then clustered
using cosine similarity, as described in Section 3.4.
The clustering outcomes for $S_l$ are summarized in Table 3. The results
indicate that embeddings from both MERT and melody-based models yield subpar
performance, even when evaluated with label-independent metrics. In contrast,
$f_{\text{feat}}(\cdot)$ provides significantly better clustering results. So,
we adopt $f_{\text{feat}}(\cdot)$ as the feature extractor for the remainder of
our study.
Metric   MERT    Melody   $f_{\text{feat}}(\cdot)$
SS       0.13    -0.01    0.54
ARI      0.00    0.08     0.83
MI       0.02    0.22     1.99
ACC      11.15   25.04    90.05
Table 3. Comparison of MERT, Melody extraction tool (Mel), and
$f_{\text{feat}}(\cdot)$ for clustering using k-means on $S_l$
4.3 NCD
4.3.1 Comparison with baseline
For the baseline, clustering is performed directly on the embeddings $y_i$
using the three clustering methods described in Section 3.4. In our proposed
approach, we train the encoder model $g(\cdot)$ using the combined loss $\ell$
(Eq. 10) on both the PIM and Saraga datasets. The resulting clustering
performance for both baseline and proposed methods is presented in Table 4. As
expected, the baseline results for $S_u$ are significantly worse than those for
the labeled dataset. This outcome is anticipated since the feature extractor
$f_{\text{feat}}(\cdot)$ is not trained on $S_u$, and $S_u$ and $S_l$ contain
disjoint Raga classes. Consequently, clustering performance is poor for both
label-dependent and label-independent clustering metrics under the baseline.
Fig. 2 shows the confusion matrix for classification of 5 unknown Raga classes
from the PIM dataset: Bhopali, Bageshri, Jog-Kauns, Mishra-Khamaj, and
Puriya-Kalyan. We compute F1-scores based on the confusion matrix, and observe
that the model performs well for Bageshri (F1: 0.85) and Bhopali (F1: 0.92),
which are more distinct and straightforward Ragas. However, it struggles with
Mishra-Khamaj (F1: 0.51), Jog-Kauns (F1: 0.69), and Puriya-Kalyan (F1: 0.60).
These Ragas, being Mishra (mixed) Ragas, inherently share musical similarities
with more than one Raga in their structure itself, making them more challenging
to distinguish and often leading to confusion for the model. This highlights
the intrinsic complexity of Mishra Ragas and emphasizes the need for more
refined approaches to accurately classify such Ragas.
[Figure 2: confusion matrix heatmap; rows are true classes, columns are predicted classes]
True \ Predicted   Bg    Bp    JK    MK    PK
Bg                 422     3    42    88     0
Bp                   7   625    10    27    23
JK                   2     0   335    28   154
MK                   3     7     9   150    44
PK                   7    25    50    78   282
Figure 2. Confusion matrix for $S_u$ showing classification performance on the
PIM dataset for five Ragas: Bhopali (Bp), Bageshri (Bg), Jog-Kauns (JK),
Mishra-Khamaj (MK), and Puriya-Kalyan (PK).
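The per-class F1-scores quoted in the text can be recomputed from the confusion matrix in Figure 2. The sketch below assumes that rows are true classes and columns are predicted classes, as the axis labels indicate, and that the fused entry in the Bhopali row reads as 7 followed by 625.

```python
import numpy as np

classes = ["Bg", "Bp", "JK", "MK", "PK"]
# Rows are true classes, columns are predicted classes (values from Figure 2).
confusion = np.array([
    [422,   3,  42,  88,   0],
    [  7, 625,  10,  27,  23],
    [  2,   0, 335,  28, 154],
    [  3,   7,   9, 150,  44],
    [  7,  25,  50,  78, 282],
])

true_pos = np.diag(confusion)
precision = true_pos / confusion.sum(axis=0)   # column sums: predicted counts
recall = true_pos / confusion.sum(axis=1)      # row sums: true counts
f1 = 2 * precision * recall / (precision + recall)
f1_by_class = {c: round(float(v), 2) for c, v in zip(classes, f1)}
print(f1_by_class)
# {'Bg': 0.85, 'Bp': 0.92, 'JK': 0.69, 'MK': 0.51, 'PK': 0.6}
```

These values match the F1-scores reported for the five Ragas, which supports the row and column reading used above.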
For the Saraga dataset, trained on Raga Bhopali, Bhimpalasi, Marwa, Todi, and
Shree in their set $S_u$, the confusion matrix (not shown) reveals significant
overlap between Raag Shree and Marwa. This can be attributed to their
structural similarities: they both belong to the Marwa thaat², share common
notes with one exception, omit the Pancham³ note in Ascent (Aaroh), and are
sung at the same time of the day. We also find that the audio recordings of
these 2 Ragas feature the same singers in the dataset, and also come from the
same concert, leading to shared tonal and acoustic characteristics, which may
have caused them to cluster closely and, hence, to poorer clustering
performance compared to the PIM dataset. In addition, in the Saraga dataset the
representation of each Raga class is limited to at most 3 audio files, whereas
in PIM, we have at least 7 audio files for each of the unlabeled classes.
4.3.2 Loss component Ablation
To understand the individual contributions of different components in our final
loss function $\ell$, we train the encoder model $g(\cdot)$ separately using
each component: Binary Cross-Entropy (BCE) loss ($\ell_{bce}$), Contrastive
loss ($\ell_{cl}$), their sum ($\ell_{cl+bce}$), and the full combined loss
$\ell$ (Eq. 10). For this comparison, we apply K-means clustering on the
resulting embeddings using only the PIM dataset. The clustering performance of
each setup is summarized in Table 5.
² A thaat is a parent scale in Hindustani Music that defines the set of notes
used in ragas. If two ragas belong to the same thaat, they are likely to share
similar notes, making them more acoustically similar.
³ The fifth note in the scale; when omitted in the Aaroh of ragas from the same
thaat, it further reduces their melodic distinctiveness.

Dataset   Clustering          Method     SS     ARI    MI     ACC (%)
PIM       Cosine Similarity   Baseline   0.22   0.50   0.87   58.96
PIM       Cosine Similarity   Proposed   0.75   0.58   1.17   72.10
PIM       K-Means             Baseline   0.36   0.48   0.79   70.75
PIM       K-Means             Proposed   0.85   0.64   0.94   79.34
PIM       UMAP                Baseline   0.63   0.53   0.83   71.48
PIM       UMAP                Proposed   0.79   0.60   0.85   72.99
Saraga    Cosine Similarity   Baseline   0.15   0.30   0.65   53.61
Saraga    Cosine Similarity   Proposed   0.71   0.43   0.86   78.37
Saraga    K-Means             Baseline   0.40   0.41   0.79   75.44
Saraga    K-Means             Proposed   0.82   0.44   0.82   81.04
Saraga    UMAP                Baseline   0.60   0.44   0.81   73.85
Saraga    UMAP                Proposed   0.66   0.47   0.85   78.88
Table 4. Performance comparison of clustering methods on the PIM and Saraga datasets
Metric    $\ell_{cl}$   $\ell_{bce}$   $\ell_{cl+bce}$   $\ell$
SS        0.39          0.59           0.62              0.85
ARI       0.52          0.55           0.59              0.64
MI        0.76          0.84           0.87              0.94
ACC (%)   70.16         75.43          76.04             79.34
Table 5. Comparison of clustering metrics for training $g(\cdot)$ using
$\ell_{cl}$, $\ell_{bce}$, $\ell_{cl+bce}$, and $\ell$, after clustering
$z_i^u$ using K-means clustering
We observe that $\ell_{cl}$ forms poor clusters, as evident from the plot (not
shown), where all the samples are spread out as if plotted along the boundary
of a circle. As explained in [33], contrastive learning (CL) pushes dissimilar
samples apart without preserving semantic structure, sometimes grouping
unrelated samples while separating similar ones, which is evident here as well.
BCE performs better by focusing on confidently similar pairs and ignoring
uncertain ones. $\ell_{cl+bce}$ combines the strengths of both, further
improving clustering. Adding MSE enhances semantic consistency, making $\ell$
the most effective, outperforming all three across all metrics.
4.3.3 Openness Study
We analyze the impact of openness on clustering performance. As defined in
Section 1, openness is determined by the number of labeled classes $|C_l|$ and
the number of unseen classes $|C_u|$. In our case, $|C_l|$ is fixed to 12, but
we now experiment with values 5 and 12 for $|C_u|$, resulting in openness
values of 0.09 and 0.18, respectively, for the PIM dataset. A higher openness
value corresponds to a more challenging problem, as is observed in Table 6. We
observe a significant drop in performance, particularly in ACC, suggesting that
some classes are being clustered poorly or even randomly, despite a relatively
good SS score. This may be due to reduced representation of certain classes as
the number of samples per class decreases. Increasing the sample size could
potentially improve clustering performance.
Metric    $O_{\mathrm{NCD}}$ = 0.09   $O_{\mathrm{NCD}}$ = 0.18
SS        0.85                        0.50
ARI       0.64                        0.44
MI        0.94                        0.83
ACC (%)   79.34                       55.68
Table 6. Clustering comparison for different levels of openness
$O_{\mathrm{NCD}}$ (Eq. 1)
Our results show that the proposed method achieves clustering quality
comparable to supervised approaches, which is valuable for MIR tasks like Raga
Identification where labeled data is limited. It enables scalable use of
unlabeled recordings, expanding Raga datasets without heavy reliance on manual
labeling.
5. CONCLUSION AND FUTURE SCOPE
In this study, we propose a novel approach for identifying and clustering
unseen Raga classes in Indian Art Music. We first use Uncertainty Estimation
for Out-of-Distribution (OOD) detection on both the Saraga and PIM datasets,
effectively distinguishing unknown Ragas from known ones. Then, we apply a
contrastive learning-based Novel Class Discovery (NCD) method in a
self-supervised setting to cluster the OOD Ragas into distinct clusters. Our
approach demonstrates strong cross-dataset generalization, as features
extracted from PIM were successfully used to train and cluster for Saraga.
Additionally, we analyze the impact of varying openness values, showing that
higher openness yields poorer clustering performance, highlighting the need for
further improvements.
Future work can focus on better handling of Mishra Ragas to reduce confusion
with parent Ragas. Expanding Raga Identification datasets and exploring
multimodal or hierarchical learning could enhance adaptability and may help
mitigate performance drops at higher openness. Framing the task as a
Generalized Category Discovery (GCD) task, where the model learns from both
labeled and unlabeled sets simultaneously, could be a good future direction.
6. ACKNOWLEDGEMENTS
This work was supported by Prasar Bharati, India’s public broadcasting agency.
7. REFERENCES
[1] X. Serra, “The computational study of a musical culture through its digital
traces,” Acta Musicologica, vol. 89, no. 1, pp. 24–44, Jun. 2017.
[2] S. Chowdhuri, “Phononet: multi-stage deep neural networks for raga
identification in hindustani classical music,” in ICMR, 2019.
[3] S. Paschalidou and I. Miliaresi, “Multimodal Deep Learning Architecture for
Hindustani Raga Classification,” Sensors & Transducers, vol. 261, no. 2,
pp. 77–86, Feb. 2024.
[4] A. A. Bidkar, R. S. Deshpande, and Y. H. Dandawate, “A north indian raga
recognition using ensemble classifier,” IJEET, vol. 12, no. 6, pp. 251–258,
2021.
[5] S. T. Madhusudhan and G. V. Chowdhary, “Deepsrgm - sequence classification
and ranking in indian classical music via deep learning,” ArXiv,
vol. abs/2402.10168, 2024. [Online]. Available:
https://api.semanticscholar.org/CorpusID:208334841
[6] P. Singh and V. Arora, “Explainable deep learning analysis for raga
identification in indian art music,” 2024. [Online]. Available:
https://arxiv.org/abs/2406.02443
[7] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward
open set recognition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
[8] A. Srinivasamurthy, S. Gulati, R. Caro Repetto, and X. Serra, “Saraga: Open
datasets for research on indian art music,” Empirical Musicology Review,
vol. 16, no. 1, pp. 85–98, Dec. 2021.
[9] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and
out-of-distribution examples in neural networks,” arXiv preprint
arXiv:1610.02136, 2016.
[10] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable
predictive uncertainty estimation using deep ensembles,” Advances in Neural
Information Processing Systems, vol. 30, 2017.
[11] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez, “Addressing
failure prediction by learning model confidence,” Advances in Neural
Information Processing Systems, vol. 32, 2019.
[12] C. Corbière, N. Thome, A. Saporta, T.-H. Vu, M. Cord, and P. Pérez,
“Confidence estimation via auxiliary models,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6043–6055, 2021.
[13] S. Kumar, P. Singh, and V. Arora, “Confidence-enhanced models for indian
art music analysis,” in 2025 IEEE International Conference on Acoustics,
Speech, and Signal Processing Workshops (ICASSPW), 2025.
[14] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation:
Representing model uncertainty in deep learning,” in International Conference
on Machine Learning. PMLR, 2016, pp. 1050–1059.
[15] Y. J. Lee and K. Grauman, “Object-graphs for context-aware category
discovery,” in 2010 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2010, pp. 1–8.
[16] K. Han, S.-A. Rebuffi, S. Ehrhardt, A. Vedaldi, and A. Zisserman,
“Automatically discovering and learning new visual categories with ranking
statistics,” in International Conference on Learning Representations, 2020.
[17] Z. Zhong, E. Fini, S. Roy, Z. Luo, E. Ricci, and N. Sebe, “Neighborhood
contrastive learning for novel class discovery,” in 2021 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10862–10870.
[18] K. Han, A. Vedaldi, and A. Zisserman, “Learning to discover novel visual
categories via deep transfer clustering,” 2019 IEEE/CVF International
Conference on Computer Vision (ICCV), pp. 8400–8408, 2019. [Online]. Available:
https://api.semanticscholar.org/CorpusID:201646290
[19] Y.-C. Hsu, Z. Lv, J. Schlosser, P. Odom, and Z. Kira, “Multi-class
classification without multi-class labels,” in International Conference on
Learning Representations, 2019. [Online]. Available:
https://openreview.net/forum?id=SJzR2iRcK7
[20] X. Yang, Z. Song, I. King, and Z. Xu, “A survey on deep semi-supervised
learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 9,
pp. 8934–8954, 2023.
[21] Y. Yang, N. Jiang, Y. Xu, and D.-C. Zhan, “Robust semi-supervised learning
by wisely leveraging open-set data,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, pp. 1–15, 2024.
[22] Y. Xian, B. Schiele, and Z. Akata, “Zero-shot learning - the good, the bad
and the ugly,” in 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017, pp. 3077–3086.
[23] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning - a
comprehensive evaluation of the good, the bad and the ugly,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2251–2265,
2019.
[24] A. N. Carr, Q. Berthet, M. Blondel, O. Teboul, and N. Zeghidour,
“Self-supervised learning of audio representations from permutations with
differentiable ranking,” IEEE Signal Processing Letters, vol. 28, pp. 708–712,
2021.
[25] H. Yakura, K. Watanabe, and M. Goto, “Self-supervised contrastive learning
for singing voices,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 30, pp. 1614–1623, 2022.
[26] H. Zhao, C. Zhang, B. Zhu, Z. Ma, and K. Zhang, “S3T: Self-supervised
pre-training with swin transformer for music classification,” in ICASSP 2022 -
2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2022, pp. 606–610.
[27] E. Fonseca, D. Ortego, K. McGuinness, N. E. O’Connor, and X. Serra,
“Unsupervised contrastive learning of sound event representations,” in ICASSP
2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2021, pp. 371–375.
[28] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis,” Journal of Computational and Applied
Mathematics, vol. 20, pp. 53–65, 1987.
[29] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for
clusterings comparison: Variants, properties, normalization and correction for
chance,” Journal of Machine Learning Research, vol. 11, no. 95, pp. 2837–2854,
2010. [Online]. Available: http://jmlr.org/papers/v11/vinh10a.html
[30] A. Strehl and J. Ghosh, “Cluster ensembles - a knowledge reuse framework
for combining multiple partitions,” Journal of Machine Learning Research,
vol. 3, pp. 583–617, 01 2002.
[31] K. R. Saxena and V. Arora, “Interactive singing melody extraction based on
active adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 32, pp. 2729–2738, 2024.
[32] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin,
A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia,
Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, “MERT: Acoustic music
understanding model with large-scale self-supervised training,” in The Twelfth
International Conference on Learning Representations, 2024.
[33] F. Wang and H. Liu, “Understanding the behaviour of contrastive loss,” in
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2021, pp. 2495–2504.