A Fine Tuning Strategy to Improve Musical Source Separation Quality for Indian Carnatic Music

Author: Schweinitz, Serafin

Publisher: Zenodo

DOI: 10.5281/zenodo.17304796

Source: https://zenodo.org/records/17304796/files/Serafin-Schweinitz_SMS_2025_Master_Thesis.pdf

Master in Sound and Music Computing
Universitat Pompeu Fabra
A Fine Tuning Strategy to Improve
Musical Source Separation Quality for
Indian Carnatic Music
Serafin Schweinitz
Supervisor: Martín Rocamora
Co-Supervisors: Adithi Shankar, Genís Plaja-Roglans
July 2025
Contents
1 Introduction
1.1 Challenges in Carnatic Source Separation
1.2 Motivation and Objectives
1.2.1 Carnatic Musicology
1.2.2 Objectives
1.2.3 Carnatic Instrumentation
1.3 Structure of the Report
2 State of the Art
2.1 Carnatic Music Source Separation
2.2 Microphone Bleeding
2.2.1 Bleeding Aware Source Separation
2.3 Bleeding Unaware Source Separation
2.4 Data-Domain Adaptation Challenges in Music Source Separation
2.5 U-Nets for Musical Source Separation
2.5.1 (Hybrid) (Transformer) Demucs
2.5.2 SCNet
2.6 Datasets
2.6.1 Saraga
2.6.2 MUSDB18
2.6.3 Sanidha
2.6.4 Carnatic Multi-stem Clean (CMC)
2.6.5 Bach Violin Dataset
2.6.6 Deep Noise Suppression Dataset (DNS)
3 A Finetuning Strategy - Methodology
3.1 Data Domain Investigation: Carnatic Music vs. Western Pop and Rock
3.1.1 Instrumentation Differences and Challenges in Separating Carnatic Music with Out-of-Domain MSS Models
3.1.2 Genre-Specific Challenges in Carnatic MSS
3.1.3 Tailoring a Training Set and Benchmarks to Improve Carnatic MSS
3.2 Data Augmentations
3.2.1 Aligning SCNet Data Augmentations with the Carnatic Data Domain
3.2.2 Violin Data Augmentation
3.2.3 Bleeding Augmentations
3.3 Experimental Setup
3.3.1 Finetuning on CMC
3.3.2 Finetuning on CMC plus MUSDB18
3.3.3 Finetuning on CMC plus MUSDB18 with Violin Augmentation
3.3.4 Finetuning on CMC plus MUSDB18 with Violin and Bleeding Augmentations
3.4 Evaluation
3.4.1 SDR
3.4.2 Perceptual Evaluation
3.4.3 Baselines
4 Results
4.1 Improving the MSS quality for Carnatic music
4.1.1 SDR Evaluation
4.1.2 On the Effectiveness of the Proposed Data Augmentations
4.2 Perceptual Evaluation: SCNet c,m, vs HTDemucs
4.2.1 Mridangam Separation
4.2.2 Vocal Separation
4.2.3 Violin and Tanpura Separation
4.3 Training Stability: Extending vs Replacing the Fine-tuning Dataset
5 Conclusion and Discussion
List of Figures
List of Tables
Bibliography
A Appendix

Abstract
The computational analysis of Carnatic music from audio remains a field of high research interest due to the genre's rich melodic and rhythmic complexity. However, despite the availability of large multitrack collections such as Saraga, the live-recorded nature of this repertoire leads to a scarcity of truly clean instrument and vocal stems, posing significant challenges to both musicological and technological studies. State-of-the-art music source separation (MSS) models perform poorly on Carnatic music due to a pronounced domain mismatch with their training data.
This work proposes a fine-tuning strategy for improving separation of vocals, mridangam, and violin plus tanpura stems in Carnatic music. The approach uses a Sparse Compression U-Net (SCNet) pretrained on MUSDB18, extended with a curated training set combining clean Carnatic multitrack recordings and out-of-domain data. To further reduce the domain gap, three data augmentations are introduced: (i) violin sampling augmentation, (ii) microphone-bleeding simulation, and (iii) room impulse response convolution.
The proposed model achieves substantial SDR improvements over the baselines on a clean Carnatic benchmark derived from the Sanidha dataset, and a perceptual evaluation on Saraga confirms significant quality gains on all 3 separated sources. On the benchmark, the best configuration outperforms all baselines by a large margin in SDR, while training in under two days on a single 40 GB GPU - making it considerably less resource-exhaustive than many similar deep learning-based MSS domain adaptation methods.
All pretrained models, code, a cleaned version of the Saraga dataset, and the Sanidha benchmark are released alongside this work.
Keywords: Musical Source Separation, Carnatic music
Chapter 1. Introduction
Figure 2: A spectral analysis of the Mridangam strokes conducted by [15].
• Provide a bleeding-reduced version of the Saraga AV dataset to support future research in Carnatic musicology and audiovisual source separation.
1.2.3 Carnatic Instrumentation
The remainder of this chapter provides a brief introduction to Carnatic instrumentation. Figure 6 depicts a Carnatic concert from the Saraga Audiovisual dataset. In the performance, a violin follows the improvised vocal line, while a mridangam player provides rhythmic accompaniment.
Carnatic instruments can be broadly categorized into three groups: melodic instruments, rhythmic instruments, and drones. Melodic instruments include the violin and the Saraswati Veena. Drum instruments such as the Mridangam and Ghatam provide rhythm and percussive textures. Finally, drone instruments like the Tanpura create a wall-of-sound drone that typically remains in the background.
Mridangam
The mridangam is the primary percussion instrument, the main instrument used in Carnatic music concerts to keep the performance in a rhythmic pattern [4]. It is mentioned in historical manuscripts as far back as 200 B.C. and has gradually developed into the most prominent percussion instrument played in South Indian classical music [15]. The mridangam is a double-headed drum, see Figure 1.3(a). It is played using several different strokes on the treble (right) and bass (left) membrane. Different finger positions and methods of striking the drumheads produce a variety of tones, with some strokes producing harmonic sounds with a recognizable pitch, and some producing tonic-independent sounds [16]. These strokes, many of which acquired names over time, create a vocabulary of different sound types. The authors of [15] categorize these strokes into 3 classes:
• Ringing string-like tones played on the treble membrane
  – Dhin, Cha or Bheem
• Flat, closed, crisp sounds
  – Thi, Ta and Num
• Resonant strokes played on the bass membrane
  – Thom
Figure 2 shows spectrogram representations of the different stroke types introduced above. The presence of horizontal lines in the spectrograms indicates the harmonic-rich nature of mridangam decays. In particular, the strokes Bheem, Cha, and Dhin - played on the treble side of the mridangam - exhibit long, sustained harmonic fade-outs. This stands in contrast to classic Western drum kits, whose percussive elements typically lack discernible harmonics in their spectral representations.
Tanpura
The tanpura is a multi-stringed, fretless accompanying drone instrument used extensively in classical music in India. Instrumentalists create an underlying drone background sound by plucking the tanpura with the fingers. Jitter, shimmer and complexity perturbations are also found in tanpura signals [17]. Figure 1.3(b) shows a photo of a tanpura.
(a) Mridangam (b) Tanpura [18] (c) Violin
Figure 3: The 3 Carnatic instruments separated in this work
Violin
Violin-like string instruments were used in Indian classical music before the Western violin was introduced in India around the 18th century. Carnatic musicians play Western violins with a different posture and tuning - generally much lower and with different intervals. For a tonic C, the four strings would be tuned to C3 - G3 - C4 - G4. Furthermore, the instrument is played in gāyaki style (see Section 1.2.1), usually following the vocalist's improvisation. In contrast to the Western violin, it follows a different scale with intervals smaller than the semitone. The violin also plays Gamakas - analogous to vibrato or glissando in Western music. Figure 1.3(c) illustrates that the Carnatic violin is built identically to its Western counterpart.
1.3 Structure of the Report
This report is structured as follows:
The state of the art (Chapter 2) situates this work within prior research I accompanied at the MTG - Music Technology Group, Barcelona - on Carnatic source separation and bleeding-aware source separation. Furthermore, important developments in both areas are discussed, along with related works on source separation domain adaptation and a presentation of the neural networks used in this project.
The methodology (Chapter 3) presents the proposed fine-tuning strategy. First, the data-domain mismatch between the Carnatic domain and Western pop/rock music - represented by MUSDB18 - is analyzed. Next, I introduce a fine-tuning dataset and data augmentations tailored specifically to bridge these domain gaps.
The experimental setup (Chapter 3.3) describes the four different configurations of the proposed models, the baselines and the evaluation procedure of the experiments.
Finally, the results (Chapter 4) present the evaluation outcomes, followed by the conclusion and discussion (Chapter 5), where they are summarized and critically discussed.
Chapter 2
State of the Art
This chapter provides an overview of the current state of research in Carnatic source separation by:
• Surveying relevant literature and highlighting the link to research in the field of bleeding source separation due to the Carnatic MSS datasets available
• Introducing research works on similar source-separation domain adaptation problems
• Presenting the architectural developments in source separation neural networks with a focus on models used within this work
• Reviewing available datasets for Carnatic MSS and related collections that serve as the foundation of this thesis
2.1 Carnatic Music Source Separation
Several studies have noted that publicly available pre-trained source separation models often exhibit poor generalization to Carnatic music [19, 20, 21, 22]. Due to the limited availability of source-leakage-free Carnatic multistem audio, most literature as well as my own preceding research focuses on leveraging the bleeding-containing Saraga stems to improve the performance of Carnatic source separation systems. Even if these systems did not yet outperform baselines like Hybrid Transformer Demucs [23], trained on clean out-of-domain data, significant improvements over baselines trained solely on the Saraga bleeding stems have been made. These results suggest that source separation models are capable of deriving single-source information from training data that contains inter-source bleeding. This indicates that, in data domains where microphone leakage is common, bleeding-aware approaches are likely to outperform models that do not explicitly account for such interference in the future.
The following chapter introduces microphone bleeding and different bleeding-aware and bleeding-unaware approaches to Carnatic MSS.
2.2 Microphone Bleeding
Microphone bleeding or microphone leakage describes a phenomenon of live multi-microphone recordings with multiple instruments/sound sources [24]. Figure 4 provides a simplified illustration of microphone bleeding.
Figure 4: Microphone bleeding [24]
In a recording scenario with $N$ sound sources $s_n(k)$ in a reverberant environment, $M$ microphones capture signals denoted as $x_m(k)$. Let $h_{mn}(k)$ represent the room impulse response modeling the acoustic path from source $n$ to microphone $m$. Assuming that each microphone is primarily intended to capture a single target source, the signal at microphone $m$ can be expressed as:
\[ x_m(k) = s_m(k) * h_{mm}(k) + \sum_{\substack{n=1 \\ n \neq m}}^{M} s_n(k) * h_{mn}(k) \]
Then, the direct source, the convolution of the target source $s_m(k)$ with its direct path $h_{mm}(k)$, is defined via the first part of the equation:
\[ \hat{s}_m(k) = s_m(k) * h_{mm}(k) \]
The second term, accounting for the contributions from all other sources due to room reflections and cross-talk, defines the bleeding component:
\[ u_m(k) = \sum_{\substack{n=1 \\ n \neq m}}^{M} s_n(k) * h_{mn}(k) \]
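The decomposition into a direct and a bleeding component can be illustrated numerically. The following is a minimal sketch, assuming NumPy, random toy sources and made-up three-tap impulse responses; none of these values come from the thesis, and `mic_signal` is a hypothetical helper name.

```python
import numpy as np

def mic_signal(sources, rirs, m):
    """Return (x_m, direct, bleed) for microphone m.

    sources: list of 1-D arrays s_n(k), one per source
    rirs:    rirs[m][n] is the impulse response h_mn(k) from source n to mic m
    """
    length = max(len(s) + len(rirs[m][n]) - 1 for n, s in enumerate(sources))
    direct = np.zeros(length)
    bleed = np.zeros(length)
    for n, s in enumerate(sources):
        y = np.convolve(s, rirs[m][n])      # s_n(k) * h_mn(k)
        if n == m:
            direct[: len(y)] += y           # direct source, first term
        else:
            bleed[: len(y)] += y            # bleeding component u_m(k)
    return direct + bleed, direct, bleed

rng = np.random.default_rng(0)
sources = [rng.standard_normal(1000) for _ in range(3)]
# Toy impulse responses (assumption): unit direct path, weak delayed cross-talk.
rirs = [[np.array([1.0]) if n == m else np.array([0.0, 0.0, 0.1])
         for n in range(3)] for m in range(3)]
x0, direct0, bleed0 = mic_signal(sources, rirs, m=0)
```

In this toy setup the bleed carries far less energy than the direct source, which matches the assumption that each microphone mainly captures its own target.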
Neural-network-based MSS systems commonly minimize reconstruction losses such as the L1 distance or the mean squared error in a supervised training setup that requires stem data as ground truth. If the stems contain leakage, the model therefore learns to reproduce an average bleeding component $\hat{u}_m(k)$, derived from all the bleeding components of the target sources in the dataset.
Additionally, evaluation of the training success becomes increasingly challenging with the bleeding level of the dataset. Standard MSS evaluation metrics such as SDR (see Section 3.4.1) and ISR are designed to assess the overall signal quality and the level of interference between stems. However, these metrics can become misleading in bleeding scenarios. For example, if the target source is almost silent during a specific interval in the benchmark, the separation output may still contain a residual mixture signal due to learned bleeding patterns. As a result, metrics like SDR and ISR may yield highly negative scores, as they interpret the presence of any residual energy as distortion or interference, despite the model reproducing the typical leakage observed during training. The following paragraphs summarize developments in bleeding-aware approaches to Carnatic MSS that rethink training objectives to tackle the aforementioned challenges.
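The effect of a near-silent target on SDR can be reproduced with a few lines. This is a minimal sketch assuming the basic energy-ratio definition SDR = 10 log10(||s||² / ||s - ŝ||²); the exact variant used in this work is defined in Section 3.4.1, and the signal levels below are made-up toy values.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-10):
    """Basic SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2) + eps
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)

rng = np.random.default_rng(0)
leak = 0.05 * rng.standard_normal(44100)      # residual reproduced by the model

active = rng.standard_normal(44100)           # target clearly audible
silent = 1e-4 * rng.standard_normal(44100)    # target almost silent

sdr_active = sdr(active, active + leak)       # same residual, loud target
sdr_silent = sdr(silent, silent + leak)       # same residual, quiet target
```

The residual is identical in both cases, yet the score is strongly positive for the active segment and strongly negative for the near-silent one, because the reference energy in the numerator collapses.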
2.2.1 Bleeding Aware Source Separation
Recent work by Plaja et al. [22] introduces bleeding-aware techniques to improve source separation performance for Carnatic singing voice using diffusion-based approaches. However, these approaches still face two key limitations: the overall separation quality does not yet match state-of-the-art results in the broader literature, and current efforts focus exclusively on separating the singing voice, neglecting other important sources.
In a following paper, "Disentangling Overlapping Sources: Improving Vocal and Violin Source Separation in Carnatic Music" [25], written by A. Shankar, myself, G. Plaja and M. Rocamora, we proposed a bleeding-aware two-stage training procedure targeting both voice and violin stems. This approach draws inspiration from advances in speech denoising and enhancement, where learned loss functions that approximate perceptual metrics - such as PESQ or STOI - have been shown to improve separation quality [26, 27].
For instance, Wuxuan et al. [28] train a dense convolutional neural network to predict PESQ and STOI scores, both of which are meant to correlate with the perceived cleanness of speech signals. During training, a random PESQ target value $y_{pesq}$ is sampled, and noise $x_{noise}$ is scaled and added to a clean speech signal $x_{speech}$ such that:
\[ y_{pesq} = \mathrm{PESQ}(x_{speech},\, x_{speech} + s \cdot x_{noise}) \]
The model then learns to map the noisy input $x_{speech} + s \cdot x_{noise}$ to the target perceptual score $y_{pesq}$.
In [25], we adapted this framework to the bleeding problem by simulating artificial bleed for Carnatic voice and violin stems. The leakage component is modeled following an algorithm introduced in the SDX Bleeding Challenge 2023 [29]. The provided benchmark and training dataset are generated via simulated inter-source leakage on the multi-stem sources from MUSDB18.
The authors simulated bleeding following these assumptions:
• The amount of bleeding in a recording is usually low
  – Authors define the bleeding level via source separation evaluation, aiming for a difference of 1 dB in SDR when models train on the bleeding-containing data, compared to the same model trained on clean MUSDB18
• Every single file contains bleeding
• Every stem bleeds into every other stem in the same song
• The bleeding component of one stem to another stem is obtained by:
  – Apply gain reduction, random between -7 dB and -12 dB
  – Filter: Choose bandpass or lowpass filter with p = 0.5
  – Order: between 3 and 10
  – Lowpass: Cutoff frequency between 900 and 9000 Hz
  – Bandpass: Low cutoff between 200 and 600 Hz, high cutoff between 8 and 10 kHz
All values are sampled from uniform distributions. The ranges were obtained empirically, compromising between bleeding realism and the desired goal of -1 dB in SDR [29].
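The parameter ranges above can be turned into a small bleed generator. The sketch below is only an illustration under stated assumptions: it uses NumPy, and it replaces the (unspecified) filter implementation of the challenge with a zero-phase frequency-domain filter that has a Butterworth-style magnitude response. The function names are hypothetical.

```python
import numpy as np

def sample_bleed_params(rng):
    """Draw one parameter set from the uniform ranges listed above."""
    params = {
        "gain_db": rng.uniform(-12.0, -7.0),
        "order": int(rng.integers(3, 11)),              # 3..10 inclusive
        "kind": "lowpass" if rng.random() < 0.5 else "bandpass",
    }
    if params["kind"] == "lowpass":
        params["low_hz"], params["high_hz"] = None, rng.uniform(900.0, 9000.0)
    else:
        params["low_hz"] = rng.uniform(200.0, 600.0)
        params["high_hz"] = rng.uniform(8000.0, 10000.0)
    return params

def butterworth_gain(freqs, order, low_hz, high_hz):
    """Butterworth-style magnitude response per FFT bin (assumed filter shape)."""
    gain = 1.0 / np.sqrt(1.0 + (freqs / high_hz) ** (2 * order))
    if low_hz is not None:
        with np.errstate(divide="ignore"):              # freqs[0] == 0 -> gain 0
            gain = gain / np.sqrt(1.0 + (low_hz / freqs) ** (2 * order))
    return gain

def simulate_bleed(stem, sr, rng):
    """Filter and attenuate one stem so it can be added to another stem as bleed."""
    p = sample_bleed_params(rng)
    spectrum = np.fft.rfft(stem)
    freqs = np.fft.rfftfreq(len(stem), d=1.0 / sr)
    spectrum = spectrum * butterworth_gain(freqs, p["order"], p["low_hz"], p["high_hz"])
    filtered = np.fft.irfft(spectrum, n=len(stem))
    return filtered * 10.0 ** (p["gain_db"] / 20.0), p

rng = np.random.default_rng(0)
sr = 44100
stem = rng.standard_normal(sr)
bleed, params = simulate_bleed(stem, sr, rng)
```

A time-domain IIR filter would be closer to a typical audio pipeline; the frequency-domain variant is chosen here only to keep the example dependency-free.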
Following this procedure, we implemented a PyTorch [30] dataloader that generates synthetic bleeding mixtures from clean Carnatic multi-stem studio recordings during each training step. The clean source material is sampled from the CMC dataset (see Section 2.6.4). Before training, a target stem is selected - either vocal or violin - while the remaining stems (mridangam left, mridangam right, and tanpura) are treated as sources of leakage, following the approach described above.
A bleeding level $b \in [0, 1]$ is randomly sampled for each example. Both the target stem and the combined bleeding stems are loudness-normalized. The final mixture is then computed as:
\[ x_{train} = (1 - b) \cdot x_{target} + b \cdot x_{bleed} \]
Note that $x_{target}$ is either $x_{vocal}$ or $x_{violin}$, while $x_{bleed}$ is a filtered and gain-reduced sum of all the remaining sources of the CMC dataset.
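The mixing step can be sketched as follows. This is an illustration only: it assumes NumPy, uses plain RMS normalization as a stand-in for the loudness normalization (the exact method is not specified here), and the helper names and toy signals are hypothetical.

```python
import numpy as np

def rms_normalize(x, target_rms=0.1, eps=1e-8):
    """Scale x to a fixed RMS level (assumed stand-in for loudness normalization)."""
    return x * (target_rms / (np.sqrt(np.mean(x ** 2)) + eps))

def make_training_example(x_target, x_bleed, rng):
    """Mix target and bleed with a random bleeding level b in [0, 1]."""
    b = rng.uniform(0.0, 1.0)
    x_train = (1.0 - b) * rms_normalize(x_target) + b * rms_normalize(x_bleed)
    return x_train, b          # b is the regression label for the estimator

rng = np.random.default_rng(0)
x_vocal = rng.standard_normal(44100)           # toy target stem
x_bleed = 0.3 * rng.standard_normal(44100)     # toy filtered bleed sum
x_train, b = make_training_example(x_vocal, x_bleed, rng)
```

Because both signals are normalized before mixing, `b` directly controls the relative level of bleed in the example, independent of the original stem loudness.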
A convolutional neural network, structurally identical to the discriminator used in MelGAN [31], is trained to learn the mapping from the input $x_{train}$ to the bleeding level $b$.
We then pre-train a U-Net architecture [32] on the Saraga dataset for 300,000 iterations, minimizing the L1 distance computed against bleeding-containing targets. Through this process, the model learns to suppress background instrumentation to match the average bleeding level present in the Saraga recordings.
In the second stage, we fine-tune the pre-trained model by replacing the L1 loss with the learned bleeding estimator introduced earlier.
During the fine-tuning, the model learns to lower the bleeding residuals significantly. The pre-training was performed on approximately 60 hours of Saraga data, while fine-tuning required only 300 iterations on a subset of 2 hours of Saraga. This suggests that the model already learns an internal representation of isolated, bleeding-free source components during pre-training. However, the discriminator loss occasionally introduced audible artifacts, steadily increasing during the fine-tuning stage. Due to the resulting limitation in the effective duration of this second training phase, the final separation quality does not yet match that of state-of-the-art systems trained
Figure 6: A Carnatic concert from the Saraga Audiovisual dataset. Instrumentation: Mridangam (left), Violin (right), Vocalist (center)
2.6 Datasets
In the following section, all datasets used within this project are presented:
2.6.1 Saraga
The Saraga-Audiovisual dataset [14] is the largest open-access, multistem collection of Carnatic music, comprising 64.8 hours of concert recordings. It consists exclusively of live Carnatic performances captured with multiple microphones on stage. Compared to the original Saraga dataset [47], the number of artists and rāgas has approximately doubled, while the overall duration and number of recordings remain comparable [14]. Moreover, Saraga reflects the stylistic diversity, musical beauty, and nuanced performance practices of Carnatic music. This stands in contrast to other widely used open musical source separation datasets such as MUSDB18, which primarily feature studio-produced tracks with a commercial or advertisement-oriented sound.
2.6.2 MUSDB18
MUSDB18 [1] is the most widely used MSS dataset. The stems include: Vocals, Drums, Other and Bass. Other features many kinds of melodic instruments and electronic sounds. Bass includes basslines played by bass guitars and sub-frequency synthesizers.
2.6.3 Sanidha
The Sanidha dataset [48] is a multimodal, audiovisual, multistem Carnatic music collection with little to no leakage between sources. It contains recordings of five Carnatic concerts performed at the School of Music, Georgia Institute of Technology (USA), featuring fifteen professional Carnatic musicians from Atlanta: three male vocalists, two female vocalists, four violinists, and six percussionists. The performances were captured in four soundproof rooms equipped with acoustic curtains, significantly reducing room reverberation. For synchronization purposes, each performer received a personalized monitor mix consisting of the other performers' microphones, combined with an artificial tanpura drone.
In the initial recording sessions, some performers required higher monitoring levels, which led to slight leakage from the headphones into their microphones. This issue was addressed in later sessions by switching to in-ear monitoring. As a result, concerts 2 and 3 are completely free from leakage, whereas the remaining three concerts exhibit minor bleed, especially on the vocalist tracks in the form of a metronome-like click sound.
2.6.4 Carnatic Multi-stem Clean (CMC)
The Carnatic Multi-stem Clean (CMC) dataset is a private collection of 58 multistem Carnatic music recordings, totaling approximately 5 hours of audio. Each recording includes one or two lead vocals, violin, mridangam, and tanpura, with some concerts also featuring tāla and veena. The dataset was provided by Shaale, a Bangalore-based company specializing in Indian art music education. It was originally recorded for educational purposes, enabling Carnatic musicians to practice alongside a complete ensemble. Notably, each instrument was captured in acoustic isolation, ensuring that no bleed occurs between tracks.
Figure 1 compares the stems included in the four presented MSS datasets.
2.6.5 Bach Violin Dataset
The Bach Violin Dataset [49] contains high-quality public recordings of Johann Sebastian Bach's sonatas and partitas for solo violin (BWV 1001–1006). It comprises 6.5 hours of performances by 17 professional violinists, recorded in a variety of sessions. In addition to the audio, the dataset provides reference scores and estimated alignments between the recordings and the corresponding scores, enabling score-informed processing and analysis. The dataset is free of bleeding and noise.
2.6.6 Deep Noise Suppression Dataset (DNS)
The DNS Challenge [50] is an annual competition that provides benchmarks and datasets for various speech enhancement tasks. It has been hosted at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, ICASSP 2022, and ICASSP 2023. In the 2022 edition, the organizers released an additional curated dataset comprising 48 real and approximately 60,000 simulated room impulse responses (RIRs) to support research on joint speech denoising and dereverberation [51]. The RIRs were sourced from the OpenSLR26 and OpenSLR28 datasets [52] without further modification or specification.
Chapter 3
A Finetuning Strategy - Methodology
Previous research around Carnatic musical source separation shows that, at the time of this work, MSS models trained solely on 8 hours of the out-of-domain, leakage-free dataset MUSDB18 outperform models trained in a bleeding-aware manner on more than 60 hours of in-domain Carnatic data from the Saraga-AV dataset (see Section 2.2.1).
Furthermore, related research on MSS domain adaptation highlights the superiority of fine-tuning pre-trained models with in-domain data over training from scratch using only in-domain data (see Section 2.4).
Based on these findings, this work proposes a strategy for curating datasets and designing data augmentations to fine-tune a pre-trained SCNet using leakage-free data. First, the domain characteristics of Carnatic music are analysed, with emphasis on instrumentation, genre-specific differences compared to Western pop and rock, and recording/performance conditions. Next, I present a new dataset that draws from both out-of-domain and in-domain sources to represent the Carnatic music domain as comprehensively as possible. Additionally, multiple data augmentation techniques are introduced to further bridge the gap between data domains. Finally, a dedicated test set is proposed as a benchmark for evaluating separation quality using standard MSS metrics as well as a perceptual test.

Table 1: Correlating stems per dataset.
Dataset  | Vocals              | Drums         | Other               | Bass | Unused
Saraga   | Vocals              | Mridangam L/R | Violin              | –    | Ghatam; Veena
MUSDB18  | Vocals              | Drums         | Other               | Bass | –
CMC      | Vocals              | Mridangam     | Tanpura; Violin     | –    | Taala; Veena
Sanidha  | Vocals; Vocals 1+2  | Mridangam L/R | Violin 1–2; Tanpura | –    | Ghatam Far; Ghatam Close
3.1 Data Domain Investigation: Carnatic Music vs. Western Pop and Rock
To curate a fine-tuning dataset that accurately represents the Carnatic music domain and its specific challenges for MSS, this chapter examines the musicological characteristics of Carnatic music (see Section 1.2.1) in relation to the available datasets (see Section 2.6). The goal is to identify and analyse the data-domain gap between Carnatic music and the MUSDB18 dataset, with a particular focus on MSS performance in the Carnatic context.
The following aspects are investigated:
• Instrumentation characteristics and their domain-specific MSS challenges
• Genre-specific differences and their MSS-related implications
• MSS challenges arising from the live-recorded nature of the performances
3.1.1 Instrumentation Differences and Challenges in Separating Carnatic Music with Out-of-Domain MSS Models
As described in Section 1.2.3, Carnatic music features a partly distinctive set of instruments that, apart from the violin, are not typically found in Western popular music. In preliminary experiments, I applied pre-trained MSS models - Hybrid Transformer Demucs (HTDemucs) and SCNet - both trained on MUSDB18, to separate Carnatic music recordings.
MUSDB18 consists of four stems for all tracks: vocals, drums, bass, and other (see Section 2.6.2). When applied to Carnatic music, models trained on that data tend to produce the following mapping:
• Vocals contain the lead vocal tracks.
• Drums contain mridangam and ghatam.
• Other contains taala, tanpura, veena, and violin.
• Bass typically contains silence or low-frequency residual noise, but no isolated instrument stems.
All separated stems contain residual bleed and, in some cases, omit portions of the target instruments. This work focuses on separating vocals, violin, mridangam, and tanpura, as these four instruments appear in every recording within the CMC, Sanidha, and Saraga datasets.¹
As discussed in Section 1.2.3, the mridangam produces both harmonic and noise-transient components. The spectral analysis (Figure 2) shows the long harmonic decays of strokes such as Bheem, Cha, and Dhin. When separating the mridangam with a model trained on MUSDB18, the ringing, string-like strokes often lose substantial portions of their long decay envelopes. Figure 7 illustrates how the harmonic components of these strokes are largely removed by HTDemucs, despite preserving the initial transients.
¹ Note: In the Saraga dataset, the tanpura is not provided as an isolated stem, but is present as leakage in every recording.
3.1.2 Genre-Specific Challenges in Carnatic MSS
The gāyaki style in Carnatic music (see Section 1.2.1) creates a high melodic correlation and constant frequency overlap between the melodic instruments, particularly the violin and the vocal stems. In contrast, Western pop and rock arrangements rarely feature instrumental parts that mimic the lead vocal so closely. State-of-the-art MSS models such as SCNet and HTDemucs operate entirely or partially in the frequency domain, using spectrograms as their primary input representation (see Sections 2.5.1 and 2.5.2). As a result, frequency-domain separation models trained primarily on Western data often struggle with Carnatic material, as the presence of violin in the same frequency bins as the vocal leads to significantly more residuals in the separated vocal stem and vice versa.
(a) Clean reference (b) HTDemucs separation
Figure 7: Comparison of spectrograms: Mridangam audio separated from Sanidha concert 2 using an HTDemucs model (right) and the clean reference (left). The HTDemucs model preserves the transient components (vertical lines) but removes a significant portion of the harmonic decay (horizontal lines).
Furthermore, Carnatic music utilises intervals that are smaller than the semitone, the smallest interval in Western music [12]. This might further complicate separation tasks for models trained exclusively on Western datasets with larger and overall more discrete pitch scales.
3.1.3 Tailoring a Training Set and Benchmarks to Improve Carnatic MSS
Table 2 compares the four MSS datasets used in this work. Saraga, CMC, and Sanidha constitute the three Carnatic MSS datasets available at the time of writing.

Table 2: Overview of datasets used in this work.
Dataset  | Length | Genre        | Recording setup | Bleeding | Usage
Saraga   | 64.8 h | Carnatic     | Live on stage   | Yes      | Evaluation
MUSDB18  | 10 h   | Pop and Rock | Studio recorded | No       | Training
CMC      | 8 h    | Carnatic     | Studio recorded | No       | Training
Sanidha  | 4 h    | Carnatic     | Studio recorded | Partial  | Evaluation

Figure 8: Stem mapping between MUSDB18 and the CMC dataset. MUSDB18 is standardized to four stems for every track. In CMC, additional tracks (tāla, vīna) are present in some concerts but are not used in this work.
Sanidha contains six concerts, of which concerts 2 and 3 are completely free of leakage and are therefore used as the primary evaluation benchmark for this work. MUSDB18 and CMC are leakage-free and thus suitable for training. I fine-tune models on CMC alone as well as on a combination of CMC and MUSDB18. In the latter case, Figure 8 shows the mapping between stems, as described in Section 3.1.1.
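To make the stem mapping concrete, the following sketch shows one way CMC stems could be summed into the four-stem MUSDB18 layout. It is an illustration under assumptions: the stem names, the helper `to_musdb_layout`, and the choice to fill the bass stem with silence are hypothetical and not taken from the released code.

```python
import numpy as np

# Mapping of CMC stems onto the four MUSDB18 stem names (see Table 1 / Figure 8).
CMC_TO_MUSDB = {
    "vocals": ["vocals"],
    "drums": ["mridangam"],
    "other": ["tanpura", "violin"],
    "bass": [],                       # no CMC counterpart (assumed silent)
}

def to_musdb_layout(cmc_stems, num_samples):
    """Sum CMC stems into the four-stem layout; missing stems become silence."""
    out = {}
    for target, names in CMC_TO_MUSDB.items():
        mix = np.zeros(num_samples)
        for name in names:
            if name in cmc_stems:     # e.g. tracks without violin
                mix += cmc_stems[name]
        out[target] = mix
    return out

rng = np.random.default_rng(0)
cmc = {name: rng.standard_normal(1000)
       for name in ["vocals", "mridangam", "tanpura", "violin", "taala"]}
stems = to_musdb_layout(cmc, 1000)    # "taala" is unused and therefore dropped
```

With this layout, CMC and MUSDB18 examples share the same four targets and can be mixed freely within one training set.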
Of the 58 tracks in the CMC dataset, only 24 include violin. Furthermore, all violin-containing tracks are recorded with the same Carnatic ensemble - featuring a single violinist and a single recording setup. Since models often struggle with violin–vocal separation, I oversample these 24 tracks by including them twice in the training set, applying the data augmentations described in Section 3.2.1 to increase variety.
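The oversampling step amounts to duplicating the violin-containing entries in the training list. A minimal sketch, assuming hypothetical track metadata with a `has_violin` flag (the actual dataset layout is not described here):

```python
def build_training_list(tracks):
    """Return the track list with violin-containing tracks included twice."""
    violin_tracks = [t for t in tracks if t["has_violin"]]
    return tracks + violin_tracks

# Toy metadata using the counts from the text: 58 tracks, 24 of them with violin.
tracks = [{"id": i, "has_violin": i < 24} for i in range(58)]
training_list = build_training_list(tracks)   # 58 + 24 = 82 entries
```

Since augmentations are drawn randomly at every training step, the two copies of a track will generally not produce identical training examples.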
The Saraga dataset, with its stylistic diversity and real live-recorded performances, serves as the best available testing set for Carnatic music. Non-multistem Carnatic recordings, such as publicly available concert recordings, are not suitable for the perceptual evaluation used in this work. State-of-the-art MSS models often remove parts of the target source during separation. To evaluate the completeness of the separated source, a comparative listening test with a reference is required. If the reference is the full mixture, assessing degradation becomes more difficult than when using the partially isolated, bleeding-containing stems of Saraga, where the target source is clearly in the foreground and perceptually distinct. Furthermore, this work aims to clean the Saraga AV dataset stems (see Chapter 1.2.2).
3.2 Data Augmentations
3.2.1 Aligning SCNet Data Augmentations with the Carnatic Data Domain
The original SCNet training procedure uses an augmentation pipeline consisting of four operations, each applied sequentially with its own probability. While these augmentations are designed to improve generalization, their effectiveness depends heavily on the match between the augmentation strategy and the target data domain. In this work, I propose a modified pipeline that retains three of the original augmentations and introduces three new ones specifically tailored to the characteristics of Carnatic music.
The authors of SCNet propose the following augmentations:
Flip Channels - Swaps the left and right stereo channels of each source with a probability of p = 0.5.
Flip Sign - Inverts the phase of a source with a probability of p = 0.5.
Scale - Multiplies each source by a random scaling factor $s_{scale} \in [0.25, 1.25]$ with a probability of p = 0.3.
Remix - Randomly shuffles the same type of source across different mixtures within a batch.
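The first three augmentations can be sketched in a few lines. This is a NumPy illustration rather than the SCNet implementation; it assumes sources stored as an array of shape (sources, 2, time) and that each operation is drawn independently per source, which the description above suggests but does not fully specify.

```python
import numpy as np

def augment(sources, rng):
    """Apply Flip Channels, Flip Sign and Scale to an array (sources, 2, time)."""
    out = sources.copy()
    for i in range(out.shape[0]):
        if rng.random() < 0.5:                    # Flip Channels
            out[i] = out[i, ::-1].copy()          # swap left and right
        if rng.random() < 0.5:                    # Flip Sign
            out[i] = -out[i]
        if rng.random() < 0.3:                    # Scale
            out[i] = out[i] * rng.uniform(0.25, 1.25)
    return out

rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 2, 1024))       # e.g. vocals, drums, other, bass
augmented = augment(sources, rng)
mixture = augmented.sum(axis=0)                   # mixture is rebuilt after augmenting
```

Rebuilding the mixture from the augmented sources keeps inputs and targets consistent, so the model sees a new mixdown of the same material at every step.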
The Flip Channels, Flip Sign, and Scale augmentations reduce overfitting and increase generalization capacity by varying the model input at every training step. Furthermore, the Scale augmentation is crucial for adapting to different mixdowns. For example, the CMC dataset contains only a single violin concert with a consistent loudness ratio between violin, vocal, and mridangam. Scaling these sources prepares the model for scenarios where the violin is mixed or played louder, or is more subdued in the background.
The Remix augmentation, originally proposed in the Demucs paper [40] and shown
to improve separation quality and generalization in MUSDB18-trained models, is
excluded from the modified SCNet pipeline. Preliminary experiments indicated
that Remix degraded vocal and violin separation quality in the Carnatic context
by disrupting melodic correlation. When violin stems are shuffled across mixtures,
the number of temporally aligned samples with strong frequency overlap with the
vocal is reduced. This negatively impacts performance for Carnatic violin and vocal
separation.
3.2.2 Violin Data Augmentation
This augmentation addresses the instrumentation- and genre-specific MSS challenges
in Carnatic music, particularly the gāyaki violin style (see Sections 3.1.2 and 3.1.1).
Baseline models struggle with the strong frequency overlap between vocals and
violin in Carnatic recordings, resulting in residual vocals in the violin stem and vice
versa. Even after oversampling the violin-containing tracks in the CMC dataset
(see Section 3.1.3), the models continue to have difficulty cleanly separating the two
sources. This is exacerbated by the fact that CMC features only a single violinist,
recorded in one session under identical conditions.
To improve generalization in Carnatic violin separation, I introduce a violin data
augmentation that adds more varied violin material to the training set. The
augmentation is applied during training with a probability of p = 0.4. If the current
sample does not already contain a violin, a random 11-second segment from the
Bach Violin Dataset (see Section 2.6.5) is selected and added to the other stem of [...]
Group      Experiment        Training Data       Fine-tuning Data
Baselines  HDemucs^mmi       MUSDB               800 songs
           HTDemucs          MUSDB               -
           HTDemucs^ft       MUSDB + 800 songs   MUSDB + 800 songs
           SCNet             MUSDB               -
Proposed   SCNet^c           MUSDB               CMC
           SCNet^{c,m}       MUSDB               MUSDB, CMC
           SCNet^{c,m,v}     MUSDB               MUSDB, CMC; Violin augment.
           SCNet^{c,m,v,b}   MUSDB               MUSDB, CMC; Violin augment.; Bleeding augment.
Table 3: List of baselines and proposed models with their training and fine-tuning
datasets.
• Fine-tuned Hybrid Transformer Demucs (HTDemucs^ft, see Section 2.5.1)
- included as the strongest baseline for Carnatic music. This model is trained
on MUSDB18 plus 800 additional songs and averages the outputs of four
different Hybrid-Demucs models during inference to maximize performance. Two
additional Demucs configurations are also included:
• Hybrid Demucs mmi (HDemucs^mmi) - the preceding Demucs generation,
which achieved state-of-the-art separation quality at the time of release. It is
trained on MUSDB18 plus 800 additional songs and is included here due to
its exceptional mridangam separation quality.
• Hybrid Transformer Demucs (HTDemucs) - trained solely on MUSDB18,
included for comparison with the fine-tuned variant.
• Not fine-tuned SCNet (SCNet, see Section 2.5.2) - serves both as a baseline
and as a direct point of comparison to evaluate the impact of fine-tuning. This
configuration is trained exclusively on MUSDB18.
Table 3 gives an overview of the training and fine-tuning dataset configurations
of all baselines and proposed experiments.

Chapter 4
Results
4.1 Improving the MSS quality of Carnatic music
The best checkpoint from each experiment is evaluated on the Sanidha and CMC
benchmarks (see Section 3.4) and compared with the baselines (Section 3.4.3).
Table 4 reports the complete SDR evaluation across all models for the Sanidha
benchmark, Table 5 (Appendix A) for the CMC test set. Audio samples are also provided
for comparative listening¹.
4.1.1 SDR Evaluation
Overall, all four proposed models outperform all baselines across all three sources
and on both test sets. Compared to the SCNet trained solely on MUSDB18,
fine-tuning yields substantial improvements - up to +5.4 dB for vocals, +6.9 dB for
violin + tanpura, and +5.5 dB for the mridangam source on the Sanidha benchmark. In
particular, performance on the other stem (violin + tanpura) is noteworthy, with SDR
values reaching nearly three times those of the best baseline model.
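For readers unfamiliar with the metric, the following is a minimal sketch of a signal-to-distortion ratio in its simplest energy-ratio form, SDR = 10 log10(||s||² / ||s − ŝ||²). Whether the evaluation in this work uses exactly this form is an assumption; the BSS Eval definition [53] additionally decomposes the error into several terms. The function name and the small `eps` guard are illustrative.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB between a reference and an estimated signal.

    Both arguments are equal-length sequences of samples. `eps` only guards
    against division by zero and log of zero for silent or perfect inputs.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# An estimate that is the reference attenuated to 90% leaves an error of 10%
# of the amplitude, i.e. 1% of the energy, which corresponds to 20 dB.
reference = [1.0, -2.0, 3.0]
estimate = [0.9 * x for x in reference]
print(round(sdr_db(reference, estimate), 2))
```

On this scale, a gain of roughly 6 dB corresponds to about a quarter of the error energy at the same source energy.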
The CMC evaluation delivers similar results - only the violin+tanpura source is
more than 3 dB higher than in the Sanidha evaluation, suggesting that the model might
overfit on the CMC violin. It should be noted that the CMC benchmark contains
only one song, performed by an ensemble that the proposed models encountered
during training.
¹ Google Drive link with evaluation files
Group      Experiment        Vocal    Violin+Tanpura   Mridangam
Baselines  HDemucs^mmi       11.86    3.13             10.07†
           HTDemucs          12.27    3.52             9.74
           HTDemucs^ft       12.63†   3.92†            8.65
           SCNet             12.59    2.66             8.64
Proposed   SCNet^c           14.76    7.13             13.77
           SCNet^{c,m}       16.67    8.08             13.06
           SCNet^{c,m,v}     18.02*   9.65*            14.16
           SCNet^{c,m,v,b}   16.30    8.52             14.18*
Table 4: Signal-to-Distortion Ratio (dB) on the Sanidha benchmark. † highest
baseline value per column, * highest value overall per column.
4.1.2 On the Effectiveness of the Proposed Data Augmentations
Violin Augmentation
The SCNet^{c,m,v} configuration achieves the highest overall performance, exceeding
the SDR of any other model by more than +1 dB for all sources except drums.
A direct comparison with SCNet^{c,m} - trained without the violin augmentation -
demonstrates the effectiveness of the proposed violin augmentation. Figure 11
illustrates the impact. Vocals, violin + tanpura, and mridangam all show substantial
SDR gains compared to the preceding experiment, confirming the augmentation's
value for addressing the strong vocal-violin overlap in Carnatic music.
Bleeding Augmentation
In contrast, the bleeding augmentation applied in the SCNet^{c,m,v,b} experiment slightly
reduces SDR across all sources except drums. For drums, the improvement over
SCNet^{c,m,v} is marginal at +0.02 dB - well within the range of measurement noise
and thus not statistically meaningful.
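The per-source differences behind this comparison follow directly from the Sanidha values in Table 4. A minimal sketch of the arithmetic; the dictionary layout and the names are illustrative, the numbers are copied from the table:

```python
# Sanidha SDR values (dB) copied from Table 4.
violin_aug = {"vocal": 18.02, "violin+tanpura": 9.65, "mridangam": 14.16}
violin_and_bleeding_aug = {"vocal": 16.30, "violin+tanpura": 8.52, "mridangam": 14.18}


def sdr_delta(new, old):
    """Per-source SDR change in dB, rounded to two decimals like the table."""
    return {source: round(new[source] - old[source], 2) for source in old}


delta = sdr_delta(violin_and_bleeding_aug, violin_aug)
for source, change in delta.items():
    print(f"{source}: {change:+.2f} dB")
# vocal: -1.72 dB
# violin+tanpura: -1.13 dB
# mridangam: +0.02 dB
```

The drop for vocals and violin + tanpura is two orders of magnitude larger than the gain for the mridangam.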
[Plot "Sanidha Benchmark SDR: with vs without violin augmentation"; y-axis: SDR (dB);
Vocal / Other / Drums: SCNet^{c,m} 16.67 / 8.08 / 13.06, SCNet^{c,m,v} 18.02 / 9.65 / 14.16]
Figure 11: Impact of the violin augmentation on the SDR evaluation on the Sanidha
dataset.
Yet, a perceptual comparison between the models with and without bleeding
augmentation, as described in Chapter 3.4.2, shows the improved generalization capacities
regarding the Carnatic violin due to the bleeding augmentation. Figure 12 shows that,
for the violin/tanpura stem, SCNet^{c,m,v,b} yielded perceptually better separations in
42% and worse in only 18% of the samples. The bleeding augmentation led to 16
percentage points fewer samples containing artifacts/interference and 10 percentage
points fewer corrupted separations (see Figure 18). Yet, the perceptual quality
decreased significantly for mridangam and vocal due to the increase of residuals of
other sources in the separations.
4.2 Perceptual Evaluation: SCNet^{c,m,v} vs HTDemucs^ft
A second perceptual test, as described in Section 3.4.2, is conducted to evaluate
(1) whether the best proposed method, SCNet^{c,m,v}, generalizes to more realistic,
non-studio-recorded data, and (2) whether perceptual quality correlates with the
SDR measurement. The overall perceptual quality comparison with the strongest
baseline, HTDemucs^ft, over 100 random Saraga AV samples is shown in Figure 15.
Only 13% of the Demucs separations exhibit higher perceptual quality, while 27%
show no perceivable difference. In 60% of the cases, the proposed method is rated
as sounding perceivably better than the strongest baseline.
[Chart "Saraga Perceptual Evaluation: Overall perceptual quality, with vs without
bleeding augmentation"; SCNet^{c,m,v} better / SCNet^{c,m,v,b} better / same quality:
Vocals 60% / 4% / 36%, Mridangam 66% / 2% / 32%, Violin/Tanpura 18% / 42% / 40%]
Figure 12: Impact of the bleeding augmentation on perceptual separation quality,
computed on Saraga AV.
Figure 13 compares the cleanliness and preservation quality between the two models.
The proposed model removes substantially fewer frequencies than the baseline. Only
2% of the vocal separations and 4% of the mridangam samples are corrupted by
SCNet^{c,m,v}. The violin corruption rate is higher, at 26%, yet still significantly lower
than the baseline's 54%.
4.2.1 Mridangam Separation
The mridangam emerges as the perceptually cleanest separated source. The best
proposed model improves SDR by +4.11 dB over the strongest baseline (HDemucs^mmi).
Figure 14 illustrates how the proposed model preserves significantly more harmonic
content (visible as horizontal lines in the spectrogram) from the resonant, tonal
strokes - an area where all baselines struggled (compare Figure 7). Perceptually,
SCNet^{c,m,v} sounds almost as rich as the original source on nearly all test samples -
there is as good as no source corruption (see Figure 13).
The compared baseline, HTDemucs^ft, produces cleaner separations - 73% contain
no perceivable instrument residuals or artifacts, compared to 61% for the proposed
model. However, this cleanliness is largely due to the heavy gating of the Demucs
model, which results in separations containing only transient sounds. In contrast,
the proposed model's separations preserve the harmonic decays of the mridangam,
although some resonant transients still contain residuals from string instruments or,
more rarely, vocals performed at the same pitch as the drum. Overall, the proposed
model is rated at least as high as the strongest baseline in 92% of the mridangam
samples in the perceptual test (67% better, 25% same quality; see Figure 15).
[Chart "Saraga Perceptual Evaluation: Cleanliness and preservation quality, strongest
proposed model vs strongest baseline"; Vocals / Mridangam / Violin/Tanpura.
Free of artifacts and interference, clean samples (%): SCNet^{c,m,v} 32 / 61 / 39,
HTDemucs^ft 21 / 73 / 35. Source preserved, not-corrupted samples (%):
SCNet^{c,m,v} 98 / 96 / 74, HTDemucs^ft 98 / 25 / 46]
Figure 13: Amount of samples without artifacts/residuals and without corruption
per source for HTDemucs^ft and SCNet^{c,m,v}.
4.2.2 Vocal Separation
In contrast to the mridangam, the vocal stem is generally well preserved by the
strongest baseline, HTDemucs^ft (see Figure 13). The observed SDR improvement
of +5.39 dB for SCNet^{c,m,v} is primarily due to better removal of interfering sources,
particularly the violin. Separation outputs from SCNet^{c,m,v} contain significantly
fewer violin residuals than those from the baselines. While 68% of the samples in
the perceptual test contain instrument residuals and artifacts, these are considerably
quieter than in the baseline, resulting in an overall substantially higher perceptual
quality (see Figure 15).
[Spectrograms: (a) SCNet, (b) SCNet^{c,m,v}]
Figure 14: Comparison of spectrograms: Mridangam audio separated from Sanidha
concert 2 with the pretrained SCNet vs the fine-tuned SCNet^{c,m,v}. The fine-tuned
model preserves the harmonics while the pretrained model mainly preserves transients.
Compare with Figure 7 for the reference and HTDemucs separations.
4.2.3 Violin and Tanpura Separation
The other stem, consisting of violin and tanpura, shows the largest relative
improvement over the baselines. This is particularly important given the strong melodic
correlation between violin and vocals in Carnatic music. The violin augmentation
appears to play a key role here, as SCNet^{c,m,v} achieves the highest SDR on both
benchmarks for this stem.
Perceptually, the violin and tanpura stem remains the weakest source of the proposed
model on the Saraga dataset: frequency components are removed in 26% of the
samples, and 61% contain artifacts and/or residuals. Loud, slightly distorted violin
sounds in the Saraga dataset are sometimes not correctly identified by the proposed
model. Nevertheless, even for this source, the perceptual test shows a significant
improvement in overall separation quality relative to the strongest baseline (see
Figure 15).
[Chart "Saraga Perceptual Evaluation: Overall perceptual quality, best proposed model
vs best baseline"; SCNet^{c,m,v} best / HTDemucs^ft best / both same quality:
Vocals 51% / 13% / 36%, Mridangam 67% / 8% / 25%, Violin/Tanpura 62% / 18% / 20%]
Figure 15: Perceptual evaluation: Overall best separation quality on Saraga AV.
4.3 Training Stability: Extending vs Replacing the Fine-tuning Dataset
Similar work on MSS domain adaptation has highlighted the advantages of
fine-tuning pre-trained models, even when pre-training is on out-of-domain data
(Section 2.4). This work further suggests that including the out-of-domain dataset in
the fine-tuning process can be beneficial.
Figure 16 compares SDR curves on a subset of Sanidha for SCNet^c (CMC only)
and SCNet^{c,m} (CMC + MUSDB18). Including the pretraining dataset MUSDB18
enables longer, more stable fine-tuning and yields higher SDR results. The SDR
values in this figure differ from Table 4 because only a subset of the Sanidha
benchmark is used.
There are several possible explanations for this behaviour. Training exclusively
on CMC may lead to overfitting, given its smaller size and limited variability. The
hyperparameters (e.g., learning rate) may not be optimally tuned to CMC.
Furthermore, the SCNet checkpoints provided by the authors have been extensively trained
on MUSDB18; removing this data from fine-tuning might destabilize training by
moving the model too far from its pre-trained distribution.
[Plot "SDR development with finetuning epochs"; panels: Vocal, Violin + Tanpura,
Mridangam; x-axis: Epoch (0 to 35); y-axis: SDR (dB); curves: SCNet^c - Finetuning on
CMC, SCNet^{c,m} - Finetuning on CMC and MusDB]
Figure 16: SDR development per fine-tuning epoch and source on a subset of the
Sanidha benchmark. SCNet^c shows a rapid quality boost after the first epoch
followed by a decline. SCNet^{c,m} declines only after 25 epochs.
BIBLIOGRAPHY
[16] Chandramouli, K. & Sethares, W. Automatic transcription of drum strokes
in carnatic music (2022). URL https://arxiv.org/abs/2211.15185.
[17] Sengupta, R., Dey, N., Datta, A. K. & Ghosh, D. Assessment of
musical quality of tanpura by fractal-dimensional analysis. Fractals 13, 245–
252 (2005). URL https://doi.org/10.1142/S0218348X05002891.
[18] Salamon, J. Melody Extraction from Polyphonic Music Signals. Ph.D. thesis
(2013).
[19] Clayton, M., Rao, P., Shikarpur, N. N., Roychowdhury, S. & Li, J. Raga
classification from vocal performances using multimodal analysis. In Proc. of the
23rd Int. Society for Music Information Retrieval (ISMIR), Bengaluru, India,
283–290 (2022).
[20] Nuttall, T., Plaja-Roglans, G., Pearson, L. & Serra, X. The matrix profile
for motif discovery in audio - an example application in carnatic music. In Int.
Symposium on Computer Music Multidisciplinary Research, Tokyo, Japan, 228–
237 (2021).
[21] Nuttall, T., Plaja-Roglans, G., Pearson, L. & Serra, X. In search of sañcāras:
tradition-informed repeated melodic pattern recognition in carnatic music. In
Proc. of the 23rd Int. Society for Music Information Retrieval Conf. (ISMIR),
Bengaluru, India, 337–344 (2022).
[22] Plaja-Roglans, G., Miron, M., Shankar, A. & Serra, X. Carnatic singing voice
separation using cold diffusion on training data with bleeding. In 24th Int.
Society for Music Information Retrieval Conf. (ISMIR), Milano, Italy (2023).
[23] Rouard, S., Massa, F. & Défossez, A. Hybrid transformers for music source
separation (2022). URL https://arxiv.org/abs/2211.08553.

[24] Kokkinis, E. K., Reiss, J. D. & Mourjopoulos, J. A wiener filter approach to
microphone leakage reduction in close-microphone applications. IEEE
Transactions on Audio, Speech, and Language Processing 20, 767–779 (2012).
[25] Shankar, A., Schweinitz, S., Plaja-Roglans, G., Serra, X. & Rocamora, M.
Disentangling overlapping sources: Improving vocal and violin source separation
in carnatic music. IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) (2025).
[26] Fu, S.-W., Liao, C.-F. & Tsao, Y. Learning with learned loss function:
Speech enhancement with quality-net to improve perceptual evaluation of
speech quality. IEEE Signal Processing Letters 27, 26–30 (2020). URL
http://dx.doi.org/10.1109/LSP.2019.2953810.
[27] Xin Bai, H. Z., Xueliang Zhang & Huang, H. Perceptual loss function for speech
enhancement based on generative adversarial learning. In 2022 Asia-Pacific
Signal and Information Processing Association Annual Summit and Conference
(APSIPA ASC) (2022).
[28] Wuxuan Gong, Y. L., Jing Wang & Yang, H. A no-reference speech quality
assessment method based on neural network with densely connected convolutional
architecture. In INTERSPEECH 2023, Dublin, Ireland (2023).
[29] Fabbro, G. et al. The Sound Demixing Challenge 2023: Music Demixing Track
(2023). URL http://arxiv.org/abs/2308.06979.
[30] Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning
library (2019). URL https://arxiv.org/abs/1912.01703.
[31] Kumar, K. et al. Melgan: Generative adversarial networks for conditional
waveform synthesis (2019).
[32] Kim, M., Choi, W., Chung, J., Lee, D. & Jung, S. KUIELab-MDX-Net: a
two-stream neural network for music demixing (2021). URL
http://arxiv.org/abs/2111.12203.
[33] Kang, D. & Hashimoto, T. B. Improved natural language generation via loss
truncation. In Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. (eds.)
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, 718–731 (Association for Computational Linguistics, Online, 2020).
[34] Choi, W., Kim, M., Chung, J., Lee, D. & Jung, S. Investigating u-nets with
various intermediate blocks for spectrogram-based singing voice separation (2020).
URL https://arxiv.org/abs/1912.02591.
[35] Sivasankar, A. S. Vocal source separation for carnatic music (2023). URL
https://doi.org/10.5281/zenodo.8380379.
[36] Hennequin, R., Khlif, A., Voituret, F. & Moussallam, M. Spleeter: a fast and
efficient music source separation tool with pre-trained models. Journal of Open
Source Software 1–4 (2020).
[37] Chen, Z. Singing vocal source separation on jingju music (2024). URL
https://doi.org/10.5281/zenodo.13863042.
[38] Stoller, D., Ewert, S. & Dixon, S. Wave-u-net: A multi-scale neural network
for end-to-end audio source separation (2018). URL
https://arxiv.org/abs/1806.03185.
[39] Lin, K. W. E., T., B. B., Koh, E., Lui, S. & Herremans, D. Singing voice
separation using a deep convolutional neural network trained by ideal binary
mask and cross entropy (2018). URL https://arxiv.org/abs/1812.01278.
[40] Défossez, A., Usunier, N., Bottou, L. & Bach, F. Demucs: Deep extractor
for music sources with extra unlabeled data remixed (2019). URL
https://arxiv.org/abs/1909.01174.
[41] Tong, W. et al. Tfcnet: Time-frequency domain corrector for speech separation.
In ICASSP, 1–5 (2023). URL https://doi.org/10.1109/ICASSP49357.2023.10096785.
[42] Tong, W. et al. Scnet: Sparse compression network for music source separation
(2024). URL https://arxiv.org/abs/2401.13276.
[43] Su, J., Jin, Z. & Finkelstein, A. Hifi-gan: High-fidelity denoising and
dereverberation based on speech deep features in adversarial networks (2020). URL
https://arxiv.org/abs/2006.05694.
[44] Bahmaninezhad, F. et al. A comprehensive study of speech separation:
spectrogram vs waveform separation (2019). URL https://arxiv.org/abs/1905.07497.
[45] Luo, Y. & Yu, J. Music source separation with band-split rnn. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 31, 1893–1901 (2023).
[46] Shazeer, N. Glu variants improve transformer (2020). URL
https://arxiv.org/abs/2002.05202.
[47] Srinivasamurthy, A., Gulati, S., Repetto, R. C. & Serra, X. Saraga: Open
datasets for research on indian art music. Empirical Musicology Review 16,
85–98 (2021).
[48] Krishnan, V. V., Alben, N., Nair, A. A. & Condit-Schultz, N. Sanidha: A
studio quality multi-modal dataset for carnatic music, San Francisco, United
States. In Proc. of the 25th Int. Society for Music Information Retrieval Conf.
(2024). URL http://arxiv.org/abs/2501.06959.
[49] Dong, H.-W., Zhou, C., Berg-Kirkpatrick, T. & McAuley, J. Deep performer:
Score-to-audio music performance synthesis (2022). URL
https://arxiv.org/abs/2202.06034.
[50] Dubey, H. et al. Icassp 2023 deep noise suppression challenge (2023). URL
https://arxiv.org/abs/2303.11510.
[51] Dubey, H. et al. Icassp 2022 deep noise suppression challenge (2022). URL
https://arxiv.org/abs/2202.13288.
[52] Ko, T., Peddinti, V., Seltzer, M. & Khudanpur, S. A study on data
augmentation of reverberant speech for robust speech recognition. 5220–5224 (2017).
[53] Vincent, E., Gribonval, R. & Fevotte, C. Performance measurement in blind
audio source separation. IEEE Transactions on Audio, Speech, and Language
Processing 14, 1462–1469 (2006).
Appendix A
Appendix
Group      Experiment        Vocal    Violin+Tanpura   Mridangam
Baselines  HDemucs^mmi       8.63     2.92             8.64†
           HTDemucs          8.98     3.17             7.71
           HTDemucs^ft       9.89†    4.06†            7.66
           SCNet             9.38     2.70             4.18
Proposed   SCNet^c           12.93    7.97             10.37
           SCNet^{c,m}       15.68    10.46            13.04
           SCNet^{c,m,v}     17.74*   12.76*           13.65*
           SCNet^{c,m,v,b}   15.95    10.76            12.60
Table 5: Signal-to-Distortion Ratio (dB) on the CMC benchmark. † highest
baseline value per column, * highest value overall per column. The CMC benchmark
only contains one track, not seen during training. Yet, the results are comparable to
the Sanidha benchmark, see Table 4.
[Plot "Sanidha Benchmark SDR: with vs without bleeding augmentation"; y-axis: SDR (dB);
Vocal / Other / Drums: SCNet^{c,m,v} 18.02 / 9.65 / 14.16, SCNet^{c,m,v,b} 16.30 / 8.52 / 14.18]
Figure 17: Impact of the bleeding augmentation on the SDR evaluation, computed
for the Sanidha benchmark.
[Chart "Saraga Perceptual Evaluation: Cleanliness and preservation quality, with vs
without bleeding augmentation"; Vocals / Mridangam / Violin/Tanpura.
Free of artifacts and interference, clean samples (%): SCNet^{c,m,v} 32 / 64 / 34,
SCNet^{c,m,v,b} 26 / 30 / 50. Source preserved, not-corrupted samples (%):
SCNet^{c,m,v} 98 / 98 / 72, SCNet^{c,m,v,b} 98 / 94 / 82]
Figure 18: Impact of the bleeding augmentation on perceptual separation quality,
computed on Saraga AV. Compared is the amount of samples without
artifacts/residuals and without corruption per source.
