A Fine Tuning Strategy to Improve Musical Source Separation Quality for Indian Carnatic Music

Author: Schweinitz, Serafin
Publisher: Zenodo
DOI: 10.5281/zenodo.17304796
Source: https://zenodo.org/records/17304796/files/Serafin-Schweinitz_SMS_2025_Master_Thesis.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
A Fine Tuning Strategy to Improve
Musical Source Separation Quality for
Indian Carnatic Music
Serafin Schweinitz
Supervisor: Martín Rocamora
Co-Supervisors: Adithi Shankar, Genís Plaja-Roglans
July 2025
Contents
1 Introduction
1.1 Challenges in Carnatic Source Separation
1.2 Motivation and Objectives
1.2.1 Carnatic Musicology
1.2.2 Objectives
1.2.3 Carnatic Instrumentation
1.3 Structure of the Report
2 State of the Art
2.1 Carnatic Music Source Separation
2.2 Microphone Bleeding
2.2.1 Bleeding Aware Source Separation
2.3 Bleeding Unaware Source Separation
2.4 Data-Domain Adaptation Challenges in Music Source Separation
2.5 U-Nets for Musical Source Separation
2.5.1 (Hybrid) (Transformer) Demucs
2.5.2 SCNet
2.6 Datasets
2.6.1 Saraga
2.6.2 MUSDB18
2.6.3 Sanidha
2.6.4 Carnatic Multi-stem Clean (CMC)
2.6.5 Bach Violin Dataset
2.6.6 Deep Noise Suppression Dataset (DNS)
3 A Finetuning Strategy - Methodology
3.1 Data Domain Investigation: Carnatic Music vs. Western Pop and Rock
3.1.1 Instrumentation Differences and Challenges in Separating Carnatic Music with Out-of-Domain MSS Models
3.1.2 Genre-Specific Challenges in Carnatic MSS
3.1.3 Tailoring a Training Set and Benchmarks to Improve Carnatic MSS
3.2 Data Augmentations
3.2.1 Aligning SCNet Data Augmentations with the Carnatic Data Domain
3.2.2 Violin Data Augmentation
3.2.3 Bleeding Augmentations
3.3 Experimental Setup
3.3.1 Finetuning on CMC
3.3.2 Finetuning on CMC plus MUSDB18
3.3.3 Finetuning on CMC plus MUSDB18 with Violin Augmentation
3.3.4 Finetuning on CMC plus MUSDB18 with Violin and Bleeding Augmentations
3.4 Evaluation
3.4.1 SDR
3.4.2 Perceptual Evaluation
3.4.3 Baselines
4 Results
4.1 Improving the MSS quality for carnatic music
4.1.1 SDR Evaluation
4.1.2 On the Effectiveness of the Proposed Data Augmentations
4.2 Perceptual Evaluation: SCNet c,m,v vs HTDemucs ft
4.2.1 Mridangam Separation
4.2.2 Vocal Separation
4.2.3 Violin and Tanpura Separation
4.3 Training Stability: Extending vs Replacing the Fine-tuning Dataset
5 Conclusion and Discussion
List of Figures
List of Tables
Bibliography
A Appendix

Abstract
The computational analysis of Carnatic music from audio remains a field of high
research interest due to the genre's rich melodic and rhythmic complexity. However,
despite the availability of large multitrack collections such as Saraga, the live-recorded
nature of this repertoire leads to a scarcity of truly clean instrument and
vocal stems, posing significant challenges to both musicological and technological
studies. State-of-the-art music source separation (MSS) models perform poorly on
Carnatic music due to a pronounced domain mismatch with their training data.
This work proposes a fine-tuning strategy for improving separation of vocals,
mridangam, and violin plus tanpura stems in Carnatic music. The approach uses a
Sparse Compression U-Net (SCNet) pretrained on MusDB18, extended with a curated
training set combining clean Carnatic multitrack recordings and out-of-domain
data. To further reduce the domain gap, three data augmentations are introduced:
(i) violin sampling augmentation, (ii) microphone-bleeding simulation, and
(iii) room impulse response convolution.
The proposed model achieves substantial SDR improvements over the baselines on
a clean Carnatic benchmark derived from the Sanidha dataset, and a perceptual
evaluation on Saraga confirms significant quality gains on all 3 separated sources. On
the benchmark, the best configuration outperforms all baselines by a large margin
in SDR, while training in under two days on a single 40 GB GPU - making it
considerably less resource-exhaustive than many similar deep learning-based MSS
domain adaptation methods.
All pretrained models, code, a cleaned version of the Saraga dataset, and the Sanidha
benchmark are released alongside this work.
Keywords: Musical Source Separation, Carnatic music
Figure 2: A spectral analysis of the Mridangam strokes conducted by [15].
• Provide a bleeding-reduced version of the Saraga AV dataset to support
future research in Carnatic musicology and audiovisual source separation.
1.2.3 Carnatic Instrumentation
The remainder of this chapter provides a brief introduction to Carnatic instrumentation.
Figure 6 depicts a Carnatic concert from the Saraga Audiovisual dataset.
In the performance, a violin follows the improvised vocal line, while a mridangam
player provides rhythmic accompaniment.
Carnatic instruments can be broadly categorized into three groups: melodic instruments,
rhythmic instruments, and drones. Melodic instruments include the violin
and the Saraswati Veena. Drum instruments such as the Mridangam and Ghatam
provide rhythm and percussive textures. Finally, drone instruments like the Tanpura
create a wall-of-sound drone that typically remains in the background.
Mridangam
The mridangam is the primary percussion instrument, the main instrument used
in Carnatic music concerts to keep the performance in a rhythmic pattern [4]. It
is mentioned in historical manuscripts as far back as 200 B.C. and has gradually
developed into the most prominent percussion instrument played in South Indian
classical music [15]. The mridangam is a double-headed drum, see Figure 1.3(a). It is
played using several different strokes on the treble (right) and bass (left) membrane.
Different finger positions and methods of striking the drumheads produce a variety
of tones, with some strokes producing harmonic sounds with a recognizable pitch,
and some producing tonic-independent sounds [16]. These strokes, many of which
got names over time, create a vocabulary of different sound-types. The authors of
[15] categorize these strokes into 3 classes:
• Ringing string-like tones played on the treble membrane
  – Dhin, Cha or Bheem
• Flat, closed, crisp sounds.
  – Thi, Ta and Num
• Resonant strokes played on the bass membrane
  – Thom
Figure 2 shows spectrogram representations of the different stroke types introduced
above. The presence of horizontal lines in the spectrograms indicates the harmonic-rich
nature of mridangam decays. In particular, the strokes Bheem, Cha, and Dhin
- played on the treble side of the mridangam - exhibit long, sustained harmonic
fade-outs. This stands in contrast to classic western drum kits, whose percussive
elements typically lack discernible harmonics in their spectral representations.
Tanpura
Tanpura is a multi-stringed and fretless accompanying drone instrument extensively
used in classical music in India. Instrumentalists create an underlying drone background
sound by plucking the Tanpura by finger. Jitter, shimmer and complexity
perturbations are found also in tanpura signals [17]. Figure 1.3(b) shows a photo of
a Tanpura.
(a) Mridangam (b) Tanpura [18] (c) Violin
Figure 3: The 3 Carnatic instruments separated in this work
Violin
Violin-like string instruments had been used in Indian classical music before the
western violin was introduced in India around the 18th century. Carnatic musicians
play western violins with different posture and tuning - generally much lower and
with different intervals. For a tonic C the four strings would be tuned to C3 - G3 -
C4 - G4. Furthermore, the instrument is played in gāyaki style (see Section 1.2.1)
and usually follows the vocalist's improvisation. In contrast to western violin, it follows
a different scale with smaller intervals than the semitone. The violin also plays
Gamakas - analogous to vibrato or glissando in western music. Figure 1.3(c) illustrates
that the Carnatic violin is built identically to its Western counterpart.
1.3 Structure of the Report
This report is structured as follows:
The state-of-the-art (Chapter 2) situates this work within prior research I accompanied
at the MTG - Music Technology Group, Barcelona - on Carnatic source
separation and bleeding-aware source separation. Furthermore, important developments
in both areas are discussed, along with related works on source separation
domain adaptation and a presentation of the neural networks used in this project.
The methodology (Chapter 3) presents the proposed fine-tuning strategy. First,
the data-domain mismatch between the Carnatic domain and Western pop/rock
music - represented by MUSDB18 - is analyzed. Next, I introduce a fine-tuning
dataset and data augmentations tailored specifically to bridge these domain gaps.
The experimental setup (Chapter 3.3) describes the four different configurations
of the proposed models, the baselines and the evaluation procedure of the experiments.
Finally, results, Chapter 4, presents the evaluation results followed by conclusion
and discussion (Chapter 5) where they are summarized and critically discussed.
Chapter 2
State of the Art
This chapter provides an overview of the current state of research in Carnatic source
separation by:
• Surveying relevant literature and highlighting the link to research in the field
of bleeding source separation due to the Carnatic MSS datasets available
• Introducing research works on similar source-separation domain adaptation
problems
• Presenting the architectural developments in source separation neural networks
with a focus on models used within this work
• Reviewing available datasets for Carnatic MSS and related collections that
serve as the foundation for this thesis
2.1 Carnatic Music Source Separation
Several studies have noted that publicly available pre-trained source separation models
often exhibit poor generalization to Carnatic music [19, 20, 21, 22]. Due to the
limited availability of source-leakage free Carnatic multistem audios, most literature
as well as my own preceding research focuses on leveraging the bleeding-containing
Saraga stems to improve performance of Carnatic source separation systems. Even
if these systems didn't yet outperform baselines like Hybrid Transformer Demucs
[23], trained on clean out-of-domain data, significant improvements over baselines
trained solely on the Saraga bleeding stems have been made. These results suggest
that source separation models are capable of deriving single-source information from
training data that contains inter-source bleeding. This indicates that, in data domains
where microphone leakage is common, bleeding-aware approaches are likely to
outperform models that do not explicitly account for such interference in the future.
The following chapter introduces microphone-bleeding and different bleeding-aware
and unaware approaches to Carnatic MSS.
Figure 4: Microphone bleeding [24]
2.2 Microphone Bleeding
Microphone bleeding or Microphone leakage describes a phenomenon of live
multi-microphone recordings with multiple instruments/sound sources [24]. Figure 4
provides a simplified illustration of microphone bleeding.
In a recording scenario with $N$ sound sources $s_n(k)$ in a reverberant environment,
$M$ microphones capture signals denoted as $x_m(k)$. Let $h_{mn}(k)$ represent the room
impulse response modeling the acoustic path from source $n$ to microphone $m$. Assuming
that each microphone is primarily intended to capture a single target source,
the signal at microphone $m$ can be expressed as:
$$ x_m(k) = s_m(k) * h_{mm}(k) + \sum_{\substack{n=1 \\ n \neq m}}^{M} s_n(k) * h_{mn}(k) $$
Then, the direct source, the convolution of the target source $s_m(k)$ with its direct
path $h_{mm}(k)$, is defined via the first part of the equation:
$$ \hat{s}_m(k) = s_m(k) * h_{mm}(k) $$
The second term, accounting for the contributions from all other sources due to
room reflections and cross-talk, defines the bleeding component:
$$ u_m(k) = \sum_{\substack{n=1 \\ n \neq m}}^{M} s_n(k) * h_{mn}(k) $$
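The signal model above can be illustrated with a small NumPy sketch. It is not taken from the thesis: the function name `mic_signal`, the list-of-lists layout of the impulse responses, and the truncation of each convolution to the source length are assumptions made only for illustration.

```python
import numpy as np


def mic_signal(sources, rirs, m):
    """Return (x_m, direct, bleed) for microphone m.

    sources: list of 1-D arrays s_n(k), all of the same length
    rirs:    rirs[m][n] is the impulse response h_mn(k) from source n to mic m
    (Hypothetical data layout, chosen for this sketch.)
    """
    length = len(sources[0])
    # Direct component: target source convolved with its own path h_mm.
    direct = np.convolve(sources[m], rirs[m][m])[:length]
    # Bleeding component: every other source convolved with its path to mic m.
    bleed = np.zeros(length)
    for n, s_n in enumerate(sources):
        if n != m:
            bleed += np.convolve(s_n, rirs[m][n])[:length]
    return direct + bleed, direct, bleed


# Toy example: two sources, a perfect direct path, and a weak delayed cross path.
rng = np.random.default_rng(0)
toy_sources = [rng.standard_normal(16), rng.standard_normal(16)]
toy_rirs = [
    [np.array([1.0]), np.array([0.0, 0.1])],
    [np.array([0.0, 0.1]), np.array([1.0])],
]
x0, direct0, bleed0 = mic_signal(toy_sources, toy_rirs, m=0)
```

In this toy setup the bleed at microphone 0 is simply source 1 delayed by one sample and attenuated by a factor of 0.1; real room impulse responses are far longer.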
Neural network based MSS systems commonly minimize reconstruction losses like
L1-distance or mean-square-error in a supervised training way requiring stem-data
as ground truths. If stems contain leakage, the model therefore learns to average a
bleeding component $\hat{u}_m(k)$, derived from all the bleeding components of the target
sources in the dataset.
Additionally, evaluation of the training success becomes increasingly challenging
with the bleeding level of the dataset. Standard MSS evaluation metrics such as
SDR (see Section 3.4.1) and ISR are designed to assess the overall signal quality
and the level of interference between stems. However, these metrics can become
misleading in bleeding scenarios. For example, if the target source is almost silent
during a specific interval in the benchmark, the separation output may still contain
a residual mixture signal due to learned bleeding patterns. As a result, metrics
like SDR and ISR may yield highly negative scores, as they interpret the presence
of any residual energy as distortion or interference, despite the model reproducing
the typical leakage observed during training. The following paragraphs summarize
developments in bleeding-aware approaches to Carnatic MSS that rethink training
objectives to tackle the aforementioned challenges.
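The effect on SDR can be made concrete with a short numerical sketch. It assumes the plain energy-ratio form of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²); the definition actually used in this work is the one given in Section 3.4.1 and may differ. The signal frequencies and amplitudes below are arbitrary illustration values.

```python
import numpy as np


def sdr(reference, estimate, eps=1e-8):
    """Energy-ratio SDR in dB (assumed definition, see lead-in)."""
    num = np.sum(reference ** 2) + eps
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)


t = np.arange(44100) / 44100.0
leak = 0.01 * np.sin(2 * np.pi * 330.0 * t)       # same residual bleed in both cases

active_ref = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # target clearly playing
quiet_ref = 1e-4 * np.sin(2 * np.pi * 220.0 * t)  # target almost silent

sdr_active = sdr(active_ref, active_ref + leak)
sdr_quiet = sdr(quiet_ref, quiet_ref + leak)
```

The residual is identical in both cases, yet the score is strongly positive while the target is active and strongly negative while it is almost silent, because the reference energy in the numerator collapses.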
2.2.1 Bleeding Aware Source Separation
Recent work by Plaja et al. [22] introduces bleeding-aware techniques to improve
source separation performance for Carnatic singing voice using diffusion based approaches.
However, these approaches still face two key limitations: the overall separation
quality does not yet match state-of-the-art results in the broader literature,
and current efforts focus exclusively on separating the singing voice, neglecting other
important sources.
In a following paper Disentangling Overlapping Sources: Improving Vocal and Violin
Source Separation in Carnatic Music [25], written by A. Shankar, myself, G. Plaja
and M. Rocamora, we proposed a bleeding-aware two-stage training procedure targeting
both voice and violin stems. This approach draws inspiration from advances
in speech denoising and enhancement, where learned loss functions that approximate
perceptual metrics - such as PESQ or STOI - have been shown to improve
separation quality [26, 27].
For instance, Wuxuan et al. [28] train a dense convolutional neural network to predict
PESQ and STOI scores, both of which are meant to correlate with the perceived
cleanness of speech signals. During training, a random PESQ target value $y_{pesq}$ is
sampled, and noise $x_{noise}$ is scaled and added to a clean speech signal $x_{speech}$ such
that:
$$ y_{pesq} = \mathrm{PESQ}(x_{speech},\ x_{speech} + s \cdot x_{noise}) $$
The model then learns to map the noisy input $x_{speech} + s \cdot x_{noise}$ to the target
perceptual score $y_{pesq}$.
In [25], we adapted this framework to the bleeding problem by simulating artificial
bleed for Carnatic voice and violin stems. The leakage component is modeled following
an algorithm introduced in the SDX Bleeding Challenge 2023 [29]. The
provided benchmark and training dataset are generated via simulated inter-source
leakage on the multi-stem sources from MUSDB18.
The authors simulated bleeding following these assumptions:
• The amount of bleeding in a recording is usually low
  – Authors define the bleeding level via source separation evaluation, aiming
    for a difference of 1 dB in SDR when models train on the bleeding-containing
    data, compared to the same model, trained on clean MUSDB18
• Every single file contains bleeding
• Every stem bleeds into every other stem in the same song
• The bleeding component of one stem to another stem is obtained by:
  – Apply gain reduction, random between -7 dB and -12 dB
  – Filter: Choose Band or Lowpass filter with p = 0.5
  – Order: between 3 and 10
  – Lowpass: Cutoff frequency between 900 and 9000 Hz
  – Bandpass: Low cutoff between 200 and 600 Hz, high cutoff between 8
    and 10 kHz
All values are sampled from uniform distributions. The ranges were obtained empirically,
compromising between bleeding realism and the desired goal of -1 dB in
SDR [29].
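The listed steps can be sketched as follows. This is an illustration under stated assumptions, not the challenge's reference implementation: the filter family (Butterworth), the integer sampling of the order, the 44.1 kHz sample rate, and the function names `simulate_bleed` and `add_bleeding` are all choices made here.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def simulate_bleed(stem, sr=44100, rng=None):
    """Filtered, attenuated copy of `stem` as it might leak into another stem."""
    rng = np.random.default_rng() if rng is None else rng
    gain_db = rng.uniform(-12.0, -7.0)   # gain reduction between -7 and -12 dB
    order = int(rng.integers(3, 11))     # filter order 3..10 (inclusive)
    if rng.random() < 0.5:               # lowpass or bandpass with p = 0.5
        cutoff = rng.uniform(900.0, 9000.0)
        sos = butter(order, cutoff, btype="lowpass", fs=sr, output="sos")
    else:
        low = rng.uniform(200.0, 600.0)
        high = rng.uniform(8000.0, 10000.0)
        sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return 10.0 ** (gain_db / 20.0) * sosfilt(sos, stem)


def add_bleeding(stems, sr=44100, rng=None):
    """Every stem receives a bleed component from every other stem of the song."""
    rng = np.random.default_rng() if rng is None else rng
    return {
        name: clean + sum(
            simulate_bleed(other, sr, rng)
            for other_name, other in stems.items()
            if other_name != name
        )
        for name, clean in stems.items()
    }


# Toy usage with one second of noise per stem (placeholder audio).
rng = np.random.default_rng(0)
toy_stems = {name: rng.standard_normal(44100) for name in ("vocals", "drums", "other")}
bled_stems = add_bleeding(toy_stems, rng=rng)
single_bleed = simulate_bleed(toy_stems["drums"], rng=rng)
```

Second-order sections are used because higher-order Butterworth filters with low cutoffs can become numerically unstable in transfer-function form.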
Following this procedure, we implemented a PyTorch [30] dataloader that generates
synthetic bleeding mixtures from clean Carnatic multi-stem studio recordings during
each training step. The clean source material is sampled from the CMC dataset (see
Section 2.6.4). Before training, a target stem is selected - either vocal or violin
- while the remaining stems (mridangam left, mridangam right, and tanpura) are
treated as sources of leakage, following the approach described above.
A bleeding level $b \in [0, 1]$ is randomly sampled for each example. Both the target
stem and the combined bleeding stems are loudness-normalized. The final mixture
is then computed as:
$$ x_{train} = (1 - b) \cdot x_{target} + b \cdot x_{bleed} $$
Note that $x_{target}$ is either $x_{vocal}$ or $x_{violin}$ while $x_{bleed}$ is a filtered and gain reduced
sum of all the remaining sources of the CMC dataset.
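A minimal NumPy sketch of this mixing step is given below. It is not the dataloader from [25]: RMS normalization stands in for the unspecified loudness normalization, the uniform sampling of b is an assumption, and the names `rms_normalize` and `make_training_example` are hypothetical.

```python
import numpy as np


def rms_normalize(x, target_rms=0.1, eps=1e-8):
    """Scale x to a fixed RMS level (stand-in for loudness normalization)."""
    return x * (target_rms / (np.sqrt(np.mean(x ** 2)) + eps))


def make_training_example(x_target, x_bleed, rng=None):
    """Return (x_train, b) with x_train = (1 - b) * x_target + b * x_bleed."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.uniform(0.0, 1.0)  # bleeding level, assumed uniform on [0, 1]
    x_train = (1.0 - b) * rms_normalize(x_target) + b * rms_normalize(x_bleed)
    return x_train, b


# Toy usage: noise placeholders for the target stem and the summed bleed stems.
rng = np.random.default_rng(0)
toy_target = rng.standard_normal(8000)
toy_bleed = 0.3 * rng.standard_normal(8000)
x_train, b = make_training_example(toy_target, toy_bleed, rng)
```

The pair `(x_train, b)` is the kind of input and regression target a bleeding-level estimator would be trained on.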
A convolutional neural network, structurally identical to the discriminator used in
MelGAN [31], is trained to learn the mapping from the input $x_{train}$ to the bleeding
level $b$.
We then pre-train a U-Net architecture [32] on the Saraga dataset for 300,000
iterations, minimizing L1 distance computed against bleeding-containing targets.
Through this process, the model learns to suppress background instrumentation to
match the average bleeding level present in the Saraga recordings.
In the second stage, we fine-tune the pre-trained model by replacing the L1 loss with
the learned bleeding estimator introduced earlier.
During the fine-tuning, the model learns to lower the bleeding residuals significantly.
The pre-training was performed on approximately 60 hours of Saraga data, while
fine-tuning required only 300 iterations on a subset of 2 hours of Saraga. This suggests
that the model already learns an internal representation of isolated, bleeding-free
source components during pre-training. However, the discriminator loss introduced
occasionally audible artifacts, steadily increasing during the fine-tuning stage.
Due to the resulting limitation in effective duration of this second training phase the
final separation quality does not yet match that of state-of-the-art systems trained
Figure 6: A Carnatic concert from the Saraga Audiovisual dataset. Instrumentation:
Mridangam (left), Violin (right), Vocalist (center)
2.6 Datasets
In the following section, all datasets used within this project are presented:
2.6.1 Saraga
The Saraga-Audiovisual dataset [14] is the largest open-access, multistem collection
of Carnatic music, comprising 64.8 hours of concert recordings. It consists
exclusively of live Carnatic performances captured with multiple microphones on
stage. Compared to the original Saraga dataset [47], the number of artists and
rāgas has approximately doubled, while the overall duration and number of recordings
remain comparable [14]. Moreover, Saraga reflects the stylistic diversity, musical
beauty, and nuanced performance practices of Carnatic music. This stands
in contrast to other widely used open musical source separation datasets such as
MUSDB18, which primarily feature studio-produced tracks with a commercial or
advertisement-oriented sound.
2.6.2 MUSDB18
MUSDB18 [1] is the most widely used MSS dataset. The stems include: Vocal,
Drums, Other and Bass. Other features many kinds of melodic instruments and
electronic sounds. Bass includes basslines, played by bass-guitars and sub-frequency
synthesizers.
2.6.3 Sanidha
The Sanidha dataset [48] is a multimodal, audiovisual, multistem Carnatic music
collection with little to no leakage between sources. It contains recordings of five
Carnatic concerts performed at the School of Music, Georgia Institute of Technology
(USA), featuring fifteen professional Carnatic musicians from Atlanta: three
male vocalists, two female vocalists, four violinists, and six percussionists. The
performances were captured in four soundproof rooms equipped with acoustic curtains,
significantly reducing room reverberation. For synchronization purposes, each
performer received a personalized monitor mix consisting of the other performers'
microphones, combined with an artificial tanpura drone.
In the initial recording sessions, some performers required higher monitoring levels,
which led to slight leakage from the headphones into their microphones. This issue
was addressed in later sessions by switching to in-ear monitoring. As a result,
concerts 2 and 3 are completely free from leakage, whereas the remaining three
concerts exhibit minor bleed, especially on the vocalist tracks in the form of a
metronome-like click sound.
2.6.4 Carnatic Multi-stem Clean (CMC)
The Carnatic Multi-stem Clean (CMC) dataset is a private collection of 58 multistem
Carnatic music recordings, totaling approximately 5 hours of audio. Each
recording includes one or two lead vocals, violin, mridangam, and tanpura, with
some concerts also featuring tāla and veena. The dataset was provided by Shaale,
a Bangalore-based company specializing in Indian art music education. It was originally
recorded for educational purposes, enabling Carnatic musicians to practice
alongside a complete ensemble. Notably, each instrument was captured in acoustic
isolation, ensuring that no bleed occurs between tracks.
Figure 1 compares the stems included in the four presented MSS datasets.
2.6.5 Bach Violin Dataset
The Bach Violin Dataset [49] contains high-quality public recordings of Johann
Sebastian Bach's sonatas and partitas for solo violin (BWV 1001–1006). It comprises
6.5 hours of performances by 17 professional violinists, recorded in a variety
of sessions. In addition to the audio, the dataset provides reference scores and estimated
alignments between the recordings and the corresponding scores, enabling
score-informed processing and analysis. The dataset is bleeding and noise free.
2.6.6 Deep Noise Suppression Dataset (DNS)
The DNS Challenge [50] is an annual competition that provides benchmarks and
datasets for various speech enhancement tasks. It has been hosted at INTERSPEECH
2020, ICASSP 2021, INTERSPEECH 2021, ICASSP 2022, and ICASSP
2023. In the 2022 edition, the organizers released an additional curated dataset comprising
48 real and approximately 60,000 simulated room impulse responses (RIRs)
to support research on joint speech denoising and dereverberation [51]. The RIRs
were sourced from the OpenSLR26 and OpenSLR28 datasets [52] without further
modification or specification.
Chapter 3
A Finetuning Strategy -
Methodology
Previous research around Carnatic musical source separation shows that, at the time
of this work, MSS models trained solely on 8 hours of the out-of-domain, leakage-free
dataset MUSDB18 outperform models trained in a bleeding-aware manner on
more than 60 hours of in-domain Carnatic data from the Saraga-AV dataset (see
Section 2.2.1).
Furthermore, related research on MSS domain adaptation highlights the superiority
of fine-tuning pre-trained models with in-domain data over training from scratch
using only in-domain data (see Section 2.4).
Based on these findings, this work proposes a strategy for curating datasets and
designing data augmentations to fine-tune a pre-trained SCNet using leakage-free
data. First, the domain characteristics of Carnatic music are analysed, with emphasis
on instrumentation, genre-specific differences compared to Western pop and rock,
and recording/performance conditions. Next, I present a new dataset that draws
from both out-of-domain and in-domain sources to represent the Carnatic music
domain as comprehensively as possible. Additionally, multiple data augmentation
techniques are introduced to further bridge the gap between data domains. Finally,
a dedicated test set is proposed as a benchmark for evaluating separation quality
using standard MSS metrics as well as a perceptual test.
Table 1: Correlating stems per dataset.
Dataset  | Vocals             | Drums         | Other               | Bass | Unused
Saraga   | Vocals             | Mridangam L/R | Violin              | –    | Ghatam; Veena
MUSDB18  | Vocals             | Drums         | Other               | Bass | –
CMC      | Vocals             | Mridangam     | Tanpura; Violin     | –    | Taala; Veena
Sanidha  | Vocals; Vocals 1+2 | Mridangam L/R | Violin 1–2; Tanpura | –    | Ghatam Far; Ghatam Close
3.1 Data Domain Investigation: Carnatic Music vs. Western Pop and Rock
To curate a fine-tuning dataset that accurately represents the Carnatic music domain
and its specific challenges for MSS, this chapter examines the musicological characteristics
of Carnatic music (see Section 1.2.1) in relation to the available datasets
(see Section 2.6). The goal is to identify and analyse the data-domain gap between
Carnatic music and the MUSDB18 dataset, with a particular focus on MSS
performance in the Carnatic context.
The following aspects are investigated:
• Instrumentation characteristics and their domain-specific MSS challenges
• Genre-specific differences and their MSS-related implications
• MSS challenges arising from the live-recorded nature of the performances
3.1.1 Instrumentation Differences and Challenges in Separating Carnatic Music with Out-of-Domain MSS Models
As described in Section 1.2.3, Carnatic music features a partly distinctive set of
instruments, beyond the violin not typically found in Western popular music. In
preliminary experiments, I applied pre-trained MSS models - Hybrid Transformer
Demucs (HTDemucs) and SCNet - both trained on MUSDB18, to separate Carnatic
music recordings.
MUSDB18 consists of four stems for all tracks: vocals, drums, bass, and other (see
Section 2.6.2). When applied to Carnatic music, models trained on that data tend
to produce the following mapping:
• Vocals contain the lead vocal tracks.
• Drums contain mridangam and ghatam.
• Other contains taala, tanpura, veena, and violin.
• Bass typically contains silence or low-frequency residual noise, but no isolated
instrument stems.
All separated stems contain residual bleed and, in some cases, omit portions of
the target instruments. This work focuses on separating vocals, violin, mridangam,
and tanpura, as these four instruments appear in every recording within the CMC,
Sanidha, and Saraga datasets.¹
As discussed in Section 1.2.3, the mridangam produces both harmonic and noise-transient
components. The spectral analysis (Figure 2) shows the long harmonic
decays of strokes such as Bheem, Cha, and Dhin. When separating the mridangam
with a model trained on MUSDB18, the ringing, string-like strokes often lose substantial
portions of their long decay envelopes. Figure 7 illustrates how the harmonic
components of these strokes are largely removed by HTDemucs ft, despite preserving
the initial transients.
3.1.2 Genre-Specific Challenges in Carnatic MSS
The gāyaki style in Carnatic music (see Section 1.2.1) creates a high melodic correlation
and constant frequency overlap between the melodic instruments, particularly
the violin and the vocal stems. In contrast, Western pop and rock arrangements
rarely feature instrumental parts that mimic the lead vocal so closely. State-of-the-art
MSS models such as SCNet and HTDemucs operate entirely or partially in the
frequency domain, using spectrograms as their primary input representation (see
Sections 2.5.1 and 2.5.2). As a result, frequency-domain separation models trained
primarily on Western data often struggle with Carnatic material, as the presence of
violin in the same frequency bins as the vocal leads to significantly more residuals
in the separated vocal stem and vice-versa.
¹ Note: In the Saraga dataset, the tanpura is not provided as an isolated stem, but present as
leakage in every recording.
(a) Clean reference (b) HT_Demucs_ft separation
Figure 7: Comparison of spectrograms: Mridangam audio separated from Sanidha
concert 2 using an HTDemucs model (right) and the clean reference (left). The
HTDemucs model preserves the transient components (vertical lines) but removes
a significant portion of the harmonic decay (horizontal lines).
Furthermore, Carnatic music utilises intervals that are smaller than the semitone,
the smallest interval in Western music [12]. This might further complicate separation
tasks for models trained exclusively on Western datasets with bigger and overall more
discrete pitch scales.
3.1.3 Tailoring a Training Set and Benchmarks to Improve Carnatic MSS
Table 2 compares the four MSS datasets used in this work. Saraga, CMC, and
Sanidha constitute the three Carnatic MSS datasets available at the time of writing.
Table 2: Overview of datasets used in this work.
Dataset  | Length | Genre        | Recording setup | Bleeding | Usage
Saraga   | 64.8 h | Carnatic     | Live on stage   | Yes      | Evaluation
MUSDB18  | 10 h   | Pop and Rock | Studio recorded | No       | Training
CMC      | 8 h    | Carnatic     | Studio recorded | No       | Training
Sanidha  | 4 h    | Carnatic     | Studio recorded | Partial  | Evaluation
Figure 8: Stem mapping between MUSDB18 and the CMC dataset. MUSDB18 is
standardized to four stems for every track. In CMC, additional tracks (tāla, vīna)
are present in some concerts but are not used in this work.
Sanidha contains six concerts, of which concerts 2 and 3 are completely free of
leakage and are therefore used as the primary evaluation benchmark for this work.
MUSDB18 and CMC are leakage-free and thus suitable for training. I fine-tune
models on CMC alone as well as on a combination of CMC and MUSDB18. In the
latter case, Figure 8 shows the mapping between stems, as described in Section 3.1.1.
Of the 58 tracks in the CMC dataset, only 24 include violin. Furthermore, all violin-containing
tracks are recorded with the same Carnatic ensemble - featuring a single
violinist and a single recording setup. Since models often struggle with violin–vocal
separation, I oversample these 24 tracks by including them twice in the training set,
applying the data augmentations described in Section 3.2.1 to increase variety.
The Saraga dataset, with its stylistic diversity and real live-recorded performances,
serves as the best available testing set for Carnatic music. Non-multistem Carnatic
recordings, such as publicly available concert recordings, are not suitable for the
perceptual evaluation used in this work. State-of-the-art MSS models often remove
parts of the target source during separation. To evaluate the completeness of the
separated source, a comparative listening test with a reference is required. If the
reference is the full mixture, assessing degradation becomes more difficult than when
using the partially isolated, bleeding-containing stems of Saraga, where the target
source is clearly in the foreground and perceptually distinct. Furthermore, this work
aims to clean the Saraga AV dataset stems (see Chapter 1.2.2).
3.2 Da a Augmen a ions
3.2.1 Aligning SCNe Da a Augmen a ions wi h he Ca na ic
Da a Domain
The o iginal SCNe aining p ocedu e uses an augmen a ion pipeline consis ing o
ou ope a ions, each applied sequen ially wi h i s own p obabili y. While hese
augmen a ions a e designed o imp o e gene aliza ion, hei e ec i eness depends
hea ily on he ma ch be ween he augmen a ion s a egy and he a ge da a do-
main. In his wo k, I p opose a modi ied pipeline ha e ains h ee o he o iginal
augmen a ions and in oduces h ee new ones speci ically ailo ed o he cha ac e -
is ics o Ca na ic music.
The au ho s o he SCNe p opose he ollowing augmen a ions:
Flip Channels - Swaps the left and right stereo channels of each source with a
probability of p = 0.5.
Flip Sign - Inverts the phase of a source with a probability of p = 0.5.
Scale - Multiplies each source by a random scaling factor s_scale ∈ [0.25, 1.25] with a
probability of p = 0.3.
Remix - Randomly shuffles the same type of source across different mixtures within
a batch.
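The four operations listed above can be sketched in NumPy as follows. This is an illustrative sketch, not the SCNet implementation: the function names, the array layout `(sources, channels, samples)`, and the choice to draw the random decision independently per source are assumptions.

```python
import numpy as np


def flip_channels(sources, rng, p=0.5):
    """Swap left/right channels of each source with probability p.

    sources: array of shape (n_sources, 2, n_samples) (assumed layout).
    """
    out = sources.copy()
    for s in range(out.shape[0]):
        if rng.random() < p:
            out[s] = sources[s, ::-1]
    return out


def flip_sign(sources, rng, p=0.5):
    """Invert the polarity of each source with probability p."""
    out = sources.copy()
    for s in range(out.shape[0]):
        if rng.random() < p:
            out[s] = -out[s]
    return out


def scale(sources, rng, p=0.3, low=0.25, high=1.25):
    """Multiply each source by a factor drawn from [low, high] with probability p."""
    out = sources.copy()
    for s in range(out.shape[0]):
        if rng.random() < p:
            out[s] = out[s] * rng.uniform(low, high)
    return out


def remix(batch, rng):
    """Shuffle each source type independently across the batch.

    batch: array of shape (batch, n_sources, channels, n_samples).
    """
    out = np.empty_like(batch)
    for s in range(batch.shape[1]):
        out[:, s] = batch[rng.permutation(batch.shape[0]), s]
    return out


rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 3, 2, 8))  # toy batch of random "audio"
augmented = np.stack(
    [scale(flip_sign(flip_channels(x, rng), rng), rng) for x in batch]
)
mixtures = augmented.sum(axis=1)  # each mixture is the sum of its augmented sources
```

The mixture is rebuilt from the augmented sources, so input and targets stay consistent. `remix` is shown only to illustrate the original operation.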
The Flip Channels, Flip Sign, and Scale augmentations reduce overfitting and
increase generalization capacities by varying the model input at every training step.
Furthermore, the Scale augmentation is crucial for adapting to different mixdowns.
For example, the CMC dataset contains only a single violin concert with a consistent
loudness ratio between violin, vocal, and mridangam. Scaling these sources prepares
the model for scenarios where the violin is mixed or played louder, or is more
subdued in the background.
The Remix augmentation, originally proposed in the Demucs paper [40] and shown
to improve separation quality and generalization in MUSDB18-trained models, is
excluded from the modified SCNet pipeline. Preliminary experiments indicated
that Remix degraded vocal and violin separation quality in the Carnatic context
by disrupting melodic correlation. When violin stems are shuffled across mixtures,
the number of temporally aligned samples with strong frequency overlap to the
vocal is reduced. This negatively impacts performance for Carnatic violin and vocal
separation.
3.2.2 Violin Data Augmentation
This augmentation addresses the instrumentation- and genre-specific MSS challenges
in Carnatic music, particularly the gāyaki violin style (see Sections 3.1.2 and 3.1.1).
Baseline models struggle with the strong frequency overlap between vocals and
violin in Carnatic recordings, resulting in residual vocals in the violin stem and vice
versa. Even after oversampling the violin-containing tracks in the CMC dataset
(see Section 3.1.3), the models continue to have difficulty cleanly separating the two
sources. This is exacerbated by the fact that CMC features only a single violinist,
recorded in one session under identical conditions.
To improve generalization in Carnatic violin separation, I introduce a violin data
augmentation that adds more varied violin material to the training set. The
augmentation is applied during training with a probability of p = 0.4. If the current
sample does not already contain a violin, a random 11-second segment from the
Bach Violin Dataset (see Section 2.6.5) is selected and added to the other stem of
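A minimal sketch of the violin augmentation described above, under stated assumptions: training samples are dictionaries of stems with the same length as the violin segment, the violin segment is simply summed into the "other" stem, and the names `violin_augment`, `has_violin`, and `violin_tracks` are hypothetical. The sample rate of 44100 Hz is also an assumption.

```python
import numpy as np


def violin_augment(stems, has_violin, violin_tracks, rng,
                   p=0.4, seg_seconds=11, sr=44100):
    """With probability p, add a random violin segment to the 'other' stem.

    stems: dict of arrays shaped (channels, seg_seconds * sr) (assumed layout).
    violin_tracks: list of arrays shaped (channels, n_samples), each assumed
    to be at least seg_seconds * sr samples long.
    """
    # Skip samples that already contain a violin, or when the coin flip fails.
    if has_violin or rng.random() >= p:
        return stems
    seg_len = seg_seconds * sr
    track = violin_tracks[rng.integers(len(violin_tracks))]
    start = rng.integers(0, track.shape[-1] - seg_len + 1)
    segment = track[..., start:start + seg_len]
    out = dict(stems)
    out["other"] = stems["other"] + segment  # assumption: plain summation
    return out


# Toy usage with a tiny "sample rate" so the arrays stay small.
rng = np.random.default_rng(0)
toy_sr = 10
stems = {name: np.zeros((2, 11 * toy_sr)) for name in ("vocal", "other", "mridangam")}
violin_tracks = [np.ones((2, 300))]
augmented = violin_augment(stems, False, violin_tracks, rng, p=1.0, sr=toy_sr)
```

If the mixture is computed as the sum of the stems after augmentation, the added violin material ends up in both the network input and the "other" target.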
Experiment         Training Data        Finetuning Data
Baselines
HDemucs_mmi        MUSDB                800 songs
HTDemucs           MUSDB                -
HTDemucs_ft        MUSDB + 800 songs    MUSDB + 800 songs
SCNet              MUSDB                -
Proposed
SCNet_c            MUSDB                CMC
SCNet_{c,m}        MUSDB                MUSDB, CMC
SCNet_{c,m,v}      MUSDB                MUSDB, CMC, Violin augment
SCNet_{c,m,v,b}    MUSDB                MUSDB, CMC, Violin augment, Bleeding augment
Table 3: List of baselines and proposed models with their training and fine-tuning
datasets.
• Fine-tuned Hybrid Transformer Demucs (HTDemucs_ft, see Section 2.5.1)
- included as the strongest baseline for Carnatic music. This model is trained
on MUSDB18 plus 800 additional songs and averages the outputs of four
different Hybrid-Demucs models during inference to maximize performance. Two
additional Demucs configurations are also included:
• Hybrid Demucs mmi (HDemucs_mmi) - the preceding Demucs generation,
which achieved state-of-the-art separation quality at the time of release. It is
trained on MUSDB18 plus 800 additional songs and is included here due to
its exceptional mridangam separation quality.
• Hybrid Transformer Demucs (HTDemucs) - trained solely on MUSDB18,
included for comparison with the fine-tuned variant.
• Not-fine-tuned SCNet (SCNet, see Section 2.5.2) - serves both as a baseline
and as a direct point of comparison to evaluate the impact of fine-tuning. This
configuration is trained exclusively on MUSDB18.
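The output averaging mentioned for the fine-tuned Demucs baseline can be illustrated generically. The sketch below is not the Demucs API: each "model" is assumed to be a plain callable mapping a mixture array to an array of source estimates, and `ensemble_separate` is a hypothetical name.

```python
import numpy as np


def ensemble_separate(models, mixture):
    """Average the source estimates of several separation models.

    models: callables mapping a mixture of shape (channels, n_samples) to
    estimates of shape (n_sources, channels, n_samples) (assumed interface).
    """
    estimates = [model(mixture) for model in models]
    return np.mean(estimates, axis=0)


# Toy stand-ins for four models: each splits the mixture with a different ratio.
toy_models = [
    (lambda x, k=k: np.stack([x * k, x * (1.0 - k)]))
    for k in (0.2, 0.4, 0.6, 0.8)
]
mixture = np.ones((2, 16))
averaged = ensemble_separate(toy_models, mixture)
```

Averaging tends to cancel errors that are not shared between the individual models, at the cost of running every model at inference time.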
Table 3 gives an overview of the training and fine-tuning dataset configurations
of all baselines and proposed experiments.

Chapter 4
Results
4.1 Improving the MSS quality for Carnatic music
The best checkpoint from each experiment is evaluated on the Sanidha and CMC
benchmarks (see Section 3.4) and compared with the baselines (Section 3.4.3).
Table 4 reports the complete SDR evaluation across all models for the Sanidha
benchmark, Table 5 (Appendix A) for the CMC test set. Audio samples are also
provided for comparative listening¹.
4.1.1 SDR Evaluation
Overall, all four proposed models outperform all baselines across all three sources
and on both test sets. Compared to the SCNet trained solely on MUSDB18,
fine-tuning yields substantial improvements - up to +5.4 dB for vocals, +6.9 dB for
violin + tanpura, and +5.5 dB for the mridangam source on the Sanidha benchmark. In
particular, performance on the other stem (violin + tanpura) is noteworthy, with SDR
values reaching nearly three times those of the best baseline model.
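The gains quoted above can be recomputed from the Sanidha SDR values in Table 4. The snippet below only copies those reported numbers into a dictionary (the variable names are illustrative) and takes, per source, the best proposed model minus the MUSDB18-only SCNet. The results agree with the quoted gains to within about 0.1 dB.

```python
# SDR values (dB) on the Sanidha benchmark, copied from Table 4.
sanidha_sdr = {
    "SCNet":           {"vocal": 12.59, "other": 2.66, "mridangam": 8.64},
    "SCNet_c":         {"vocal": 14.76, "other": 7.13, "mridangam": 13.77},
    "SCNet_{c,m}":     {"vocal": 16.67, "other": 8.08, "mridangam": 13.06},
    "SCNet_{c,m,v}":   {"vocal": 18.02, "other": 9.65, "mridangam": 14.16},
    "SCNet_{c,m,v,b}": {"vocal": 16.30, "other": 8.52, "mridangam": 14.18},
}

baseline = sanidha_sdr["SCNet"]
proposed = {name: sdr for name, sdr in sanidha_sdr.items() if name != "SCNet"}

# Largest improvement over the non-fine-tuned SCNet, per source.
max_gain = {
    source: round(max(sdr[source] for sdr in proposed.values()) - baseline[source], 2)
    for source in baseline
}
print(max_gain)
```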
The CMC evaluation delivers similar results - only the violin + tanpura source is
more than 3 dB higher than in the Sanidha evaluation, suggesting that the model might
overfit on the CMC violin. It should be noted that the CMC benchmark contains
only one song, performed by an ensemble that the proposed models encountered
during training.

Group      Experiment         Vocal    Violin+Tanpura    Mridangam
Baselines  HDemucs_mmi        11.86    3.13              10.07
           HTDemucs           12.27    3.52              9.74
           HTDemucs_ft        12.63    3.92              8.65
           SCNet              12.59    2.66              8.64
Proposed   SCNet_c            14.76    7.13              13.77
           SCNet_{c,m}        16.67    8.08              13.06
           SCNet_{c,m,v}      18.02    9.65              14.16
           SCNet_{c,m,v,b}    16.30    8.52              14.18
Table 4: Signal-to-Distortion Ratio (dB) on the Sanidha benchmark. underline:
strongest baseline, bold: strongest overall.

¹ Google Drive link with evaluation files
4.1.2 On the Effectiveness of the Proposed Data Augmentations
Violin Augmentation
The SCNet_{c,m,v} configuration achieves the highest overall performance, exceeding
the SDR of any other model by more than +1 dB for all sources except drums.
A direct comparison with SCNet_{c,m} - trained without the violin augmentation -
demonstrates the effectiveness of the proposed violin augmentation. Figure 11
illustrates the impact. Vocals, violin + tanpura, and mridangam all show substantial
SDR gains compared to the preceding experiment, confirming the augmentation's
value for addressing the strong vocal-violin overlap in Carnatic music.
Bleeding Augmentation
In contrast, the bleeding augmentation applied in the SCNet_{c,m,v,b} experiment
slightly reduces SDR across all sources except drums. For drums, the improvement over
SCNet_{c,m,v} is marginal at +0.02 dB - well within the range of measurement noise
and thus not statistically meaningful.
[Figure 11 plot: "Sanidha Benchmark SDR - with vs without violin augmentation"; y-axis: SDR (dB).
Vocal / Other / Drums: SCNet_{c,m} 16.67 / 8.08 / 13.06; SCNet_{c,m,v} 18.02 / 9.65 / 14.16.]
Figure 11: Impact of the violin augmentation on the SDR evaluation on the Sanidha
dataset.
Yet, a perceptual comparison between the models with and without bleeding
augmentation, as described in Chapter 3.4.2, shows improved generalization
capacities regarding the Carnatic violin due to the bleeding augmentation. Figure 12
shows that SCNet_{c,m,v,b} yielded perceptually better violin separations in 42% and
worse ones in only 18% of the samples. The bleeding augmentation led to 16 percentage
points fewer samples containing artifacts or interference and 10 percentage points
fewer corrupted separations (see Figure 18). Yet, the perceptual quality decreased
significantly for mridangam and vocal due to the increase of residuals of other
sources in the separations.
[Figure 12 plot: "Saraga Perceptual Evaluation: Overall perceptual quality - with vs without bleeding augmentation".
Panels: Vocals, Mridangam, Violin/Tanpura. Shares in legend order (SCNet_{c,m,v} better / SCNet_{c,m,v,b} better / same quality):
Vocals 60% / 4% / 36%; Mridangam 66% / 2% / 32%; Violin/Tanpura 18% / 42% / 40%.]
Figure 12: Impact of the bleeding augmentation on perceptual separation quality,
computed on Saraga AV.
4.2 Perceptual Evaluation: SCNet_{c,m,v} vs HTDemucs_ft
A second perceptual test, as described in Section 3.4.2, is conducted to evaluate
whether the best proposed method, SCNet_{c,m,v}, (1) generalizes to more realistic,
non-studio-recorded data, and (2) whether perceptual quality correlates with the
SDR measurement. The overall perceptual quality comparison with the strongest
baseline, HTDemucs_ft, over 100 random Saraga AV samples is shown in Figure 15.
Only 13% of the Demucs separations exhibit higher perceptual quality, while 27%
show no perceivable difference. In 60% of the cases, the proposed method is rated
as sounding perceivably better than the strongest baseline.
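The overall percentages quoted above match the unweighted mean of the per-source ratings shown in Figure 15. The sketch below assumes that the three values per source in that figure follow the legend order (proposed model best, baseline best, both same quality); the dictionary keys are illustrative names.

```python
# Per-source rating shares (%) from Figure 15, assumed to be in legend order.
ratings = {
    "vocals":         {"proposed": 51, "baseline": 13, "same": 36},
    "mridangam":      {"proposed": 67, "baseline": 8,  "same": 25},
    "violin_tanpura": {"proposed": 62, "baseline": 18, "same": 20},
}

# Unweighted mean over the three sources (assumes equal sample counts per source).
overall = {
    outcome: sum(r[outcome] for r in ratings.values()) / len(ratings)
    for outcome in ("proposed", "baseline", "same")
}
print(overall)  # {'proposed': 60.0, 'baseline': 13.0, 'same': 27.0}
```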
Figure 13 compares the cleanliness and preservation quality between the two models.
The proposed model removes substantially fewer frequencies than the baseline. Only
2% of the vocal separations and 4% of the mridangam samples are corrupted by
SCNet_{c,m,v}. The violin corruption rate is higher, at 26%, yet still significantly
lower than the baseline's 54%.
4.2.1 Mridangam Separation
The mridangam emerges as the perceptually cleanest separated source. The best
proposed model improves SDR by +4.11 dB over the strongest baseline (HDemucs_mmi).
Figure 14 illustrates how the proposed model preserves significantly more harmonic
content (visible as horizontal lines in the spectrogram) from the resonant, tonal
strokes - an area where all baselines struggled (compare Figure 7). Perceptually,
SCNet_{c,m,v} sounds almost as rich as the original source on nearly all test samples -
there is as good as no source corruption (see Figure 13).
The compared baseline, HTDemucs_ft, produces cleaner separations - 73% contain
no perceivable instrument residuals or artifacts, compared to 61% for the proposed
model. However, this cleanliness is largely due to the heavy gating of the Demucs
model, which results in separations containing only transient sounds. In contrast,
the proposed model's separations preserve the harmonic decays of the mridangam,
although some resonant transients still contain residuals from string instruments or,
more rarely, vocals performed at the same pitch as the drum. Overall, the proposed
model is rated higher than the strongest baseline in 92% of the mridangam samples
in the perceptual test (see Figure 15).
[Figure 13 plot: "Saraga Perceptual Evaluation: Cleanliness and preservation quality - strongest proposed model vs strongest baseline".
Panel "Free of artifacts and interference", clean samples (%), Vocals / Mridangam / Violin/Tanpura:
SCNet_{c,m,v} 32% / 61% / 39%; HTDemucs_ft 21% / 73% / 35%.
Panel "Source preserved", not-corrupted samples (%), Vocals / Mridangam / Violin/Tanpura:
SCNet_{c,m,v} 98% / 96% / 74%; HTDemucs_ft 98% / 25% / 46%.]
Figure 13: Amount of samples without artifacts/residuals and without corruption
per source for HTDemucs_ft and SCNet_{c,m,v}.
[Figure 14 panels: (a) SCNet, (b) SCNet_{c,m,v}]
Figure 14: Comparison of spectrograms: Mridangam audio separated from Sanidha
concert 2 with the pretrained SCNet vs the fine-tuned SCNet_{c,m,v}. The fine-tuned
model preserves the harmonics while the pretrained model mainly preserves transients.
Compare with Figure 7 for the reference and HTDemucs separations.
4.2.2 Vocal Separation
In contrast to the mridangam, the vocal stem is generally well preserved by the
strongest baseline, HTDemucs_ft (see Figure 13). The observed SDR improvement
of +5.39 dB for SCNet_{c,m,v} is primarily due to better removal of interfering sources,
particularly the violin. Separation outputs from SCNet_{c,m,v} contain significantly
fewer violin residuals than those from the baselines. While 68% of the samples in
the perceptual test contain instrument residuals and artifacts, these are considerably
quieter than in the baseline, resulting in an overall substantially higher perceptual
quality (see Figure 15).
4.2.3 Violin and Tanpura Separation
The other stem, consisting of violin and tanpura, shows the largest relative
improvement over the baselines. This is particularly important given the strong melodic
correlation between violin and vocals in Carnatic music. The violin augmentation
appears to play a key role here, as SCNet_{c,m,v} achieves the highest SDR on both
benchmarks for this stem.
Perceptually, the violin and tanpura stem remains the weakest source of the proposed
model on the Saraga dataset: frequency components are removed in 26% of the
samples, and 61% contain artifacts and/or residuals. Loud, slightly distorted violin
sounds in the Saraga dataset are sometimes not correctly identified by the proposed
model. Nevertheless, even for this source, the perceptual test shows a significant
improvement in overall separation quality relative to the strongest baseline (see
Figure 15).
[Figure 15 plot: "Saraga Perceptual Evaluation: Overall perceptual quality - best proposed model vs best baseline".
Panels: Vocals, Mridangam, Violin/Tanpura. Shares in legend order (SCNet_{c,m,v} best / HTDemucs_ft best / both same quality):
Vocals 51% / 13% / 36%; Mridangam 67% / 8% / 25%; Violin/Tanpura 62% / 18% / 20%.]
Figure 15: Perceptual evaluation: Overall best separation quality on Saraga AV.
4.3 Training Stability: Extending vs Replacing the Fine-tuning Dataset
Similar work on MSS domain adaptation has highlighted the advantages of
fine-tuning pre-trained models, even when pre-training is on out-of-domain data
(Section 2.4). This work further suggests that including the out-of-domain dataset in
the fine-tuning process can be beneficial.
Figure 16 compares SDR curves on a subset of Sanidha for SCNet_c (CMC only)
and SCNet_{c,m} (CMC + MUSDB18). Including the pretraining dataset MUSDB18
enables longer, more stable fine-tuning and yields higher SDR results. The SDR
values in this figure differ from Table 4 because only a subset of the Sanidha
benchmark is used.
There are several possible explanations for this behaviour. Training exclusively
on CMC may lead to overfitting, given its smaller size and limited variability. The
hyperparameters (e.g., learning rate) may not be optimally tuned for CMC.
Furthermore, the SCNet checkpoints provided by the authors have been extensively
trained on MUSDB18; removing this data from fine-tuning might destabilize training by
moving the model too far from its pre-trained distribution.
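Picking the best checkpoint from a per-epoch SDR curve, and seeing how long a run has been declining since its peak, can be sketched as below. The curve values are hypothetical and only mimic the shape described for CMC-only fine-tuning (an early peak followed by a decline); they are not measurements from this work.

```python
def best_checkpoint(sdr_per_epoch):
    """Return (epoch index, SDR) of the highest validation SDR (0-based epochs)."""
    best = max(range(len(sdr_per_epoch)), key=lambda e: sdr_per_epoch[e])
    return best, sdr_per_epoch[best]


def epochs_since_best(sdr_per_epoch):
    """Number of epochs trained after the best checkpoint was reached."""
    best, _ = best_checkpoint(sdr_per_epoch)
    return len(sdr_per_epoch) - 1 - best


# Hypothetical validation SDR values (dB), one per fine-tuning epoch.
hypothetical_curve = [10.0, 12.5, 12.1, 11.8, 11.6]
epoch, sdr = best_checkpoint(hypothetical_curve)
stale = epochs_since_best(hypothetical_curve)
```

A large `epochs_since_best` value relative to the run length is one simple indicator that fine-tuning has started to overfit and could be stopped early.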
[Figure 16 plot: "SDR development with fine-tuning epochs". Panels: Vocal, Violin + Tanpura, Mridangam;
x-axis: Epoch (0 to 35); y-axis: SDR (dB). Curves: SCNet_c - fine-tuning on CMC; SCNet_{c,m} - fine-tuning on CMC and MUSDB.]
Figure 16: SDR development per fine-tuning epoch and source on a subset of the
Sanidha benchmark. SCNet_c shows a rapid quality boost after the first epoch
followed by a decline. SCNet_{c,m} declines only after 25 epochs.
Bibliography
[16] Chandramouli, K. & Sethares, W. Automatic transcription of drum strokes
in carnatic music (2022). URL https://arxiv.org/abs/2211.15185.
[17] Sengupta, R., Dey, N., Datta, A. K. & Ghosh, D. Assessment of
musical quality of tanpura by fractal-dimensional analysis. Fractals 13, 245–252
(2005). URL https://doi.org/10.1142/S0218348X05002891.
[18] Salamon, J. Melody Extraction from Polyphonic Music Signals. Ph.D. thesis
(2013).
[19] Clayton, M., Rao, P., Shikarpur, N. N., Roychowdhury, S. & Li, J. Raga
classification from vocal performances using multimodal analysis. In Proc. of the
23rd Int. Society for Music Information Retrieval (ISMIR), Bengaluru, India,
283–290 (2022).
[20] Nuttall, T., Plaja-Roglans, G., Pearson, L. & Serra, X. The matrix profile
for motif discovery in audio-an example application in carnatic music. In Int.
Symposium on Computer Music Multidisciplinary Research, Tokyo, Japan,
228–237 (2021).
[21] Nuttall, T., Plaja-Roglans, G., Pearson, L. & Serra, X. In search of sañcāras:
tradition-informed repeated melodic pattern recognition in carnatic music. In
Proc. of the 23rd Int. Society for Music Information Retrieval Conf. (ISMIR),
Bengaluru, India, 337–344 (2022).
[22] Plaja-Roglans, G., Miron, M., Shankar, A. & Serra, X. Carnatic singing voice
separation using cold diffusion on training data with bleeding. In 24th Int.
Society for Music Information Retrieval Conf. (ISMIR), Milano, Italy (2023).
[23] Rouard, S., Massa, F. & Défossez, A. Hybrid transformers for music source
separation (2022). URL https://arxiv.org/abs/2211.08553.
[24] Kokkinis, E. K., Reiss, J. D. & Mourjopoulos, J. A wiener filter approach to
microphone leakage reduction in close-microphone applications. IEEE
Transactions on Audio, Speech, and Language Processing 20, 767–779 (2012).
[25] Shankar, A., Schweinitz, S., Plaja-Roglans, G., Serra, X. & Rocamora, M.
Disentangling overlapping sources: Improving vocal and violin source separation
in carnatic music. IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) (2025).
[26] Fu, S.-W., Liao, C.-F. & Tsao, Y. Learning with learned loss function:
Speech enhancement with quality-net to improve perceptual evaluation of
speech quality. IEEE Signal Processing Letters 27, 26–30 (2020). URL
http://dx.doi.org/10.1109/LSP.2019.2953810.
[27] Xin Bai, H. Z., Xueliang Zhang & Huang, H. Perceptual loss function for speech
enhancement based on generative adversarial learning. In 2022 Asia-Pacific
Signal and Information Processing Association Annual Summit and Conference
(APSIPA ASC) (2022).
[28] Wuxuan Gong, Y. L., Jing Wang & Yang, H. A no-reference speech quality
assessment method based on neural network with densely connected convolutional
architecture. In INTERSPEECH 2023, Dublin, Ireland (2023).
[29] Fabbro, G. et al. The Sound Demixing Challenge 2023: Music Demixing Track
(2023). URL http://arxiv.org/abs/2308.06979.
[30] Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning
library (2019). URL https://arxiv.org/abs/1912.01703.
[31] Kumar, K. et al. Melgan: Generative adversarial networks for conditional
waveform synthesis (2019).
[32] Kim, M., Choi, W., Chung, J., Lee, D. & Jung, S. KUIELab-MDX-Net: a
two-stream neural network for music demixing (2021). URL
http://arxiv.org/abs/2111.12203.
[33] Kang, D. & Hashimoto, T. B. Improved natural language generation via loss
truncation. In Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. (eds.)
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, 718–731 (Association for Computational Linguistics, Online, 2020).
[34] Choi, W., Kim, M., Chung, J., Lee, D. & Jung, S. Investigating u-nets with
various intermediate blocks for spectrogram-based singing voice separation (2020).
URL https://arxiv.org/abs/1912.02591.
[35] Sivasankar, A. S. Vocal source separation for carnatic music (2023). URL
https://doi.org/10.5281/zenodo.8380379.
[36] Hennequin, R., Khlif, A., Voituret, F. & Moussallam, M. Spleeter: a fast and
efficient music source separation tool with pre-trained models. Journal of Open
Source Software 1–4 (2020).
[37] Chen, Z. Singing vocal source separation on jingju music (2024). URL
https://doi.org/10.5281/zenodo.13863042.
[38] Stoller, D., Ewert, S. & Dixon, S. Wave-u-net: A multi-scale neural network
for end-to-end audio source separation (2018). URL
https://arxiv.org/abs/1806.03185.
[39] Lin, K. W. E., T., B. B., Koh, E., Lui, S. & Herremans, D. Singing voice
separation using a deep convolutional neural network trained by ideal binary
mask and cross entropy (2018). URL https://arxiv.org/abs/1812.01278.
[40] Défossez, A., Usunier, N., Bottou, L. & Bach, F. Demucs: Deep extractor
for music sources with extra unlabeled data remixed (2019). URL
https://arxiv.org/abs/1909.01174.
[41] Tong, W. et al. Tfcnet: Time-frequency domain corrector for speech separation.
In ICASSP, 1–5 (2023). URL https://doi.org/10.1109/ICASSP49357.2023.10096785.
[42] Tong, W. et al. Scnet: Sparse compression network for music source separation
(2024). URL https://arxiv.org/abs/2401.13276.
[43] Su, J., Jin, Z. & Finkelstein, A. Hifi-gan: High-fidelity denoising and
dereverberation based on speech deep features in adversarial networks (2020). URL
https://arxiv.org/abs/2006.05694.
[44] Bahmaninezhad, F. et al. A comprehensive study of speech separation:
spectrogram vs waveform separation (2019). URL https://arxiv.org/abs/1905.07497.
[45] Luo, Y. & Yu, J. Music source separation with band-split rnn. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 31, 1893–1901 (2023).
[46] Shazeer, N. Glu variants improve transformer (2020). URL
https://arxiv.org/abs/2002.05202.
[47] Srinivasamurthy, A., Gulati, S., Repetto, R. C. & Serra, X. Saraga: Open
datasets for research on indian art music. Empirical Musicology Review 16,
85–98 (2021).
[48] Krishnan, V. V., Alben, N., Nair, A. A. & Condit-Schultz, N. Sanidha: A
studio quality multi-modal dataset for carnatic music, San Francisco, United
States. In Proc. of the 25th Int. Society for Music Information Retrieval Conf.
(2024). URL http://arxiv.org/abs/2501.06959.
[49] Dong, H.-W., Zhou, C., Berg-Kirkpatrick, T. & McAuley, J. Deep performer:
Score-to-audio music performance synthesis (2022). URL
https://arxiv.org/abs/2202.06034.
[50] Dubey, H. et al. Icassp 2023 deep noise suppression challenge (2023). URL
https://arxiv.org/abs/2303.11510.
[51] Dubey, H. et al. Icassp 2022 deep noise suppression challenge (2022). URL
https://arxiv.org/abs/2202.13288.
[52] Ko, T., Peddinti, V., Seltzer, M. & Khudanpur, S. A study on data augmentation
of reverberant speech for robust speech recognition. 5220–5224 (2017).
[53] Vincent, E., Gribonval, R. & Fevotte, C. Performance measurement in blind
audio source separation. IEEE Transactions on Audio, Speech, and Language
Processing 14, 1462–1469 (2006).
Appendix A
Appendix

Group      Experiment         Vocal    Violin+Tanpura    Mridangam
Baselines  HDemucs_mmi        8.63     2.92              8.64
           HTDemucs           8.98     3.17              7.71
           HTDemucs_ft        9.89     4.06              7.66
           SCNet              9.38     2.70              4.18
Proposed   SCNet_c            12.93    7.97              10.37
           SCNet_{c,m}        15.68    10.46             13.04
           SCNet_{c,m,v}      17.74    12.76             13.65
           SCNet_{c,m,v,b}    15.95    10.76             12.60
Table 5: Signal-to-Distortion Ratio (dB) on the CMC benchmark. underline:
strongest baseline, bold: strongest overall. The CMC benchmark only contains
one track, not seen during training. Yet, the results are comparable to the Sanidha
benchmark, see Table 4.
[Figure 17 plot: "Sanidha Benchmark SDR - with vs without bleeding augmentation"; y-axis: SDR (dB).
Vocal / Other / Drums: SCNet_{c,m,v} 18.02 / 9.65 / 14.16; SCNet_{c,m,v,b} 16.30 / 8.52 / 14.18.]
Figure 17: Impact of the bleeding augmentation on the SDR evaluation, computed
for the Sanidha benchmark.
[Figure 18 plot: "Saraga Perceptual Evaluation: Cleanliness and preservation quality - with vs without bleeding augmentation".
Panel "Free of artifacts and interference", clean samples (%), Vocals / Mridangam / Violin/Tanpura:
SCNet_{c,m,v} 32% / 64% / 34%; SCNet_{c,m,v,b} 26% / 30% / 50%.
Panel "Source preserved", not-corrupted samples (%), Vocals / Mridangam / Violin/Tanpura:
SCNet_{c,m,v} 98% / 98% / 72%; SCNet_{c,m,v,b} 98% / 94% / 82%.]
Figure 18: Impact of the bleeding augmentation on perceptual separation quality,
computed on Saraga AV. Compared is the amount of samples without
artifacts/residuals and without corruption per source.