Understanding Audio Source Separation in Carnatic Music with Multimodal Data

Author: Fuhrmann, Théo
Publisher: Zenodo
DOI: 10.5281/zenodo.17304988
Source: https://zenodo.org/records/17304988/files/Theo-Fuhrmann_SMC_2025_Master_Thesis.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
Understanding Audio Source Separation
in Carnatic Music with Multimodal Data
Théo Fuhrmann
Supervisor: Martín Rocamora
Co-Supervisor: Gloria Haro
August 2025
Contents
Abstract
Acknowledgement
1 Introduction
1.1 Motivation
1.2 Objectives
2 State Of The Art
2.1 Source Separation
2.1.1 Audio Source Separation
2.1.2 Audio-Visual Source Separation
2.2 Interpretability and Model Analysis
2.3 Datasets
2.4 Carnatic Music in MIR
3 Dataset and Preprocessing
3.1 Dataset Overview
3.2 Pose Estimation and Instrument Labeling
3.3 Feature Extraction
3.3.1 Motion
3.3.2 Audio
3.4 Synchronization and Source Alignment
3.4.1 Data Filtering
3.5 Limitations
4 Correlation and Timing Analysis
4.1 Sliding Window Correlation Analysis
4.1.1 Speed and Acceleration
4.1.2 Vocal-Specific Features
4.1.3 Violin-Specific Features
4.2 Cross-Correlation and Lag Estimation
5 Model Analysis and Interpretability
5.1 Overview of Models
5.2 Vocalist Model
5.2.1 Model Architecture and Training
5.2.2 Gradient-Based Interpretability
5.2.3 Ablation Studies
5.3 Vocal & Violin Model
5.3.1 Model Architecture and Training
5.3.2 Attention Analysis
5.3.3 FiLM Analysis
6 Discussion and Conclusion
6.1 General Discussion
6.2 Conclusions
6.3 Future Work
Bibliography
.1 Temporal Correlation Analysis Plots
.2 Input × Gradient Plots
.3 Integrated Gradients Plots
.4 Attention Analysis Plots
Abstract
This thesis investigates the internal mechanisms of audio-visual source separation
models, focusing on how performer motion guides source separation in Carnatic music.
To move beyond "black box" performance metrics, we employ a dual approach
combining model-independent analysis of audio-visual synchrony with model-specific
interpretability techniques applied to Voice-Vision Transformer (VoViT) architectures.
A cross-correlation and sliding-window analysis on the Saraga dataset first
establishes instrument-specific temporal patterns, revealing that targeted
instrument-specific motion features exhibit stronger and more consistent correlations
with audio dynamics than general body motion. While global linear audio-visual
synchrony is weak, these analyses highlight the importance of localised and
instrument-specific motion cues. Subsequently, we apply gradient-based saliency
methods to a vocal separation model, demonstrating its primary reliance on facial
keypoints. Ablation studies causally confirm that these facial regions are crucial
to vocal separation, while body motion contributes minimally. We further analyze the
attention mechanisms of a vocal & violin model to understand how it disentangles
spectrally overlapping sources, specifically through a revised hybrid fusion
architecture. FiLM (Feature-wise Linear Modulation) analysis reveals that visual
information causally modulates audio features by consistently amplifying or
suppressing them, acting as a dynamic gating mechanism. This research provides a
framework for interpreting gesture-based audio-visual systems, offers novel insights
into audio-visual learning, and contributes to the development of more transparent
and culturally-aware Music Information Retrieval technologies. Code available at:
https://github.com/theofuhrmann/masters-thesis
Keywords: Model Interpretability; Audio-visual; Source Separation; Carnatic Music

Acknowledgement
I would like to express my gratitude to Adithi for the time and for providing me with
the resources to carry out this thesis. I am also grateful to Martín and Gloria for their
helpful feedback, and to my partner Isabelle for the constant support throughout
the process. Finally, I thank Gabriel, a former coworker, for lending me his GPU
during most of the thesis, which allowed me to explore different directions without
compromise.
Chapter 1
Introduction
Isolating a single sound source from a complex auditory scene, known as audio
source separation, is a key task with applications ranging from hearing aids and
speech recognition to music production and remixing. Early approaches were based
on statistical signal processing, but the rise of deep learning has brought major
advances, enabling models to separate complex musical mixtures with impressive
accuracy.
In parallel to advances in audio-only methods, researchers have explored multimodal
approaches that incorporate visual cues (such as lip movements or musical gestures)
to help disambiguate sound. These audio-visual models complement audio-only
systems and have opened new possibilities for improving separation. However, their
growing complexity often comes at the cost of interpretability. As these deep learning
models become harder to analyze, it becomes increasingly difficult to understand how
they make decisions. This opacity limits our ability to trust, debug, and improve
them, and makes it harder to see possible connections between sight and sound.
This thesis tackles the challenge by shifting the focus from performance metrics to
model interpretability. We explore how state-of-the-art audio-visual systems use
visual information to separate sound sources. To ground our analysis, we focus on
Carnatic music, a classical tradition from Southern India known for its rich acoustic
textures and complex performance practices. Its unique gestures and expressive style
Chapter 2. State Of The Art
For instance, Meyes et al. [31] evaluated the contribution of different neural layers
by systematically removing specific units and observing the resulting performance
changes on the MNIST dataset.
In this work, we apply gradient-based saliency methods to structured visual inputs
(2D facial and body landmark coordinates) to identify which landmarks most influence
the model's voice separation output. Unlike typical saliency methods applied
to raw images, this approach provides direct insight into how motion and pose guide
the model's predictions, offering a novel perspective on the role of visual cues in
audio-visual source separation.
2.3 Datasets
Music Information Research (MIR) has traditionally focused on Western music, often
underrepresenting the diverse range of musical traditions worldwide. This imbalance
has motivated recent efforts to incorporate non-Western music styles into MIR
studies. As highlighted by Serra [32], adopting a multicultural approach is crucial
to developing more inclusive and representative computational methodologies.
Notable Western-focused datasets used for multimodal audio source separation include
AudioSet [33], a large collection of 10-second YouTube videos manually labeled
among 632 audio classes; MUSIC [17], which provides 714 labeled YouTube recordings
of musical performances; and URMP [34], a high-quality manually recorded
audiovisual dataset of multi-instrument classical music pieces.
Building on recent efforts to expand MIR research beyond Western music, we use
the Saraga Audiovisual dataset [35], a recently developed large-scale multimodal
collection designed for the study of Carnatic music. Built on the principles of the
original Saraga dataset [36], which included only audio recordings, this dataset
contains 42 recorded Carnatic concerts, totaling over 60 hours of audio-visual data,
with multi-track audio recordings and synchronized video footage.
Additionally, the Sanidha dataset serves as another valuable audio-visual resource
for Carnatic music research [37]. It contains 5 high-definition recordings of Carnatic
concerts, totaling 15.35 hours. Each performer was recorded separately, with
simultaneous video recordings conducted in different rooms to ensure clean, isolated audio
and synchronized visual data. These datasets support ongoing efforts to diversify
MIR and advance methods for non-Western music.
2.4 Carnatic Music in MIR
Despite the dominance of Western music in MIR, there has been a growing recognition
of the need to study non-Western musical traditions. Carnatic music, with its
unique raga-based structure, intricate ornamentation, and complex rhythmic patterns,
presents both significant challenges and exciting opportunities for MIR.
Early studies explored Carnatic music through different MIR techniques. For example,
research on singer identification in Carnatic music [38] leveraged the 22-tone
octave system to develop cepstral coefficients for capturing distinct vocal
characteristics. Later, a study on computational approaches for understanding melody
in Carnatic music [39] emphasized the need for tailored methodologies in music
information processing of Carnatic music. Another example is a study on raga
classification [40], which proposed a new system to classify music into ragas using
different audio features.
More recently, advances in source separation have addressed challenges in Carnatic
singing. Plaja-Roglans et al. [41] trained a cold diffusion model to mitigate audio
bleeding in performance recordings. In another study, they developed a system
to generate vocal pitch annotations using the Saraga dataset, creating the
Saraga-Carnatic-Melody-Synth (SCMS) dataset, which was then used to train a
state-of-the-art pitch extraction model for Carnatic music [42].
Building on this, Shankar et al. [43] leveraged the Saraga audio-visual dataset
to adapt the VoViT architecture for gesture-guided source separation in Carnatic
concert recordings. They explored different audiovisual fusion strategies and showed
that integrating facial and body keypoints improves vocal and violin separation, even
in the presence of source bleeding and partial visual occlusion.
Chapter 3
Dataset and Preprocessing
3.1 Dataset Overview
We focused on performances with a consistent left-to-right layout (mridangam, vocal,
and violin) as this setup provides a more controlled environment for multimodal
analysis. In this configuration, side artists typically face inward toward the vocalist,
causing one side of their body to be partially or fully occluded. This asymmetry
in visibility can affect the reliability of pose estimation, so we limited our selection
to concerts with a consistent layout to ensure that the same side of the mridangam
and violin players remained visible throughout. For this reason, we used only 26
concert recordings from the Saraga Audiovisual collection, totaling around 28 hours
of audio-visual material.
A minority of performances followed an alternate layout (violin, vocal, and
mridangam), which were excluded to ensure pose estimation consistency. However,
due to improved bow visibility, these 3 recordings are revisited in Chapter 4 for
a violin-specific case study involving more specialized motion features. Although
limited to 3 concerts, this violin subset provides nearly 4 hours of music, offering
sufficient within-performance variability to analyze audiovisual correlations, while
generalization to other performers remains a caveat.
3.2 Pose Estimation and Instrument Labeling
Pose estimation was performed using the RTMW-X (384x288) model from MMPose,
following the setup by Rodrigues [44]. Since performers remain seated in fixed
positions throughout each concert, mean centroids were computed across the full
duration of each performance. Detections with low temporal presence were discarded
as likely false positives, such as standing spectators. To filter the remaining false
positives, mainly wall artwork depicting human figures, the estimations above a
certain vertical threshold were excluded. Each valid estimation was then assigned
to one of the three instruments based on a manually generated left-to-right performer
layout annotated for each video. The assignments were verified by visualizing short
clips with color-coded skeleton overlays, one color per instrument, confirming both
spatial consistency and pose estimation quality (see Figure 1).
Figure 1: Color-coded pose estimation visualization.
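The filtering and labeling steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the thesis: the function name, the track format (one array of centroids per detected person, in normalized image coordinates with y growing downward), and both threshold values are hypothetical.

```python
import numpy as np

def assign_instruments(tracks, n_frames, layout,
                       min_presence=0.5, min_mean_y=0.3):
    """Filter person tracks and label the survivors from left to right.

    tracks: dict mapping a track id to an (n_detections, 2) array of (x, y)
            centroids in normalized image coordinates (y grows downward).
    layout: instrument names ordered left to right (manual annotation).
    min_presence, min_mean_y: hypothetical thresholds, not from the thesis.
    """
    kept = []
    for track_id, centroids in tracks.items():
        presence = len(centroids) / n_frames         # fraction of frames with a detection
        mean_x, mean_y = np.mean(centroids, axis=0)  # mean centroid over the performance
        if presence < min_presence:                  # short-lived detection, e.g. a spectator
            continue
        if mean_y < min_mean_y:                      # too high in the frame, e.g. wall artwork
            continue
        kept.append((mean_x, track_id))
    kept.sort()                                      # order surviving tracks by mean x
    return {track_id: name for (_, track_id), name in zip(kept, layout)}
```

Sorting by the mean horizontal position relies on the fact, noted above, that performers stay seated in fixed positions throughout a concert.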
3.3 Feature Extraction
3.3.1 Motion
Keypoints were normalized per performer by centering and rescaling around their
spatio-temporal mean to ensure consistent scale and position. Lower body keypoints
were discarded due to limited movement (as performers remain seated) and inaccurate
pose estimation from clothing and instrument occlusions. For the vocalist, both
arms were used; for side instrumentalists (mridangam and violin), only the visible
arm was retained due to frequent occlusions.
For the initial analysis, speed and acceleration were computed using NumPy's
gradient function over the 2D keypoints extracted using MMPose, and averaged
over the full upper body as well as specific regions: left arm, right arm, head, face,
left hand, and right hand.
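A minimal sketch of this computation is given below. It assumes keypoints stored as an array of shape (frames, keypoints, 2) and takes speed and acceleration as the magnitudes of the first and second time derivatives; the function names and the region index lists are illustrative, not taken from the thesis code.

```python
import numpy as np

def speed_and_acceleration(keypoints, fps):
    """keypoints: (n_frames, n_keypoints, 2) array of 2D positions."""
    dt = 1.0 / fps
    velocity = np.gradient(keypoints, dt, axis=0)      # first derivative over time
    acceleration = np.gradient(velocity, dt, axis=0)   # second derivative over time
    speed = np.linalg.norm(velocity, axis=-1)          # (n_frames, n_keypoints)
    accel_mag = np.linalg.norm(acceleration, axis=-1)  # (n_frames, n_keypoints)
    return speed, accel_mag

def region_average(per_keypoint, indices):
    """Average a per-keypoint feature over the keypoints of one body region."""
    return per_keypoint[:, indices].mean(axis=1)
```

For example, `region_average(speed, left_hand_indices)` would give one speed value per frame for the left hand, where `left_hand_indices` is a hypothetical list of keypoint indices.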
In addition to these general motion descriptors, two case studies were conducted
focusing on more domain-specific movement cues:
Vocal-specific features: To capture gestures directly related to vocal production
[15, 45], we focused on mouth and jaw motion. We extracted 3D face keypoints
using the 3DDFA face pose estimator, which is used in the vovit-fb model analysed
in Chapter 5. The 3D nature of the keypoints allowed us to rotate the face to a
frontal position, minimizing the influence of head pose on the measurements.
During inspection, we found that 3DDFA's face tracking sometimes failed when
the vocalist's face was occluded or turned away, producing inaccurate landmarks.
To address this, we used the more stable MMPose nose keypoint as a reference,
discarding frames where its distance from the 3DDFA face centroid exceeded a set
threshold. This reduced noise in the vocal-specific features.
From the filtered and aligned keypoints, we computed two features to capture facial
movements. The first is mouth area, which combines mouth width and height to
approximate the degree of mouth opening. The second is nose-to-jaw distance,
which tracks the vertical displacement of the jaw.
Violin-specific features: For violinists with non-occluded bowing arms, we
computed additional motion descriptors using the 2D keypoints from MMPose to capture
bowing gestures more directly. We chose these features based on their clear
relevance in prior biomechanical studies [46, 47]. The first, wrist velocity, captures
the speed of the bowing motion. The second, elbow angle, is defined by the angle
between the shoulder, elbow, and wrist, reflecting articulation changes during
bow strokes. Finally, arm extension measures the Euclidean distance between the
shoulder and wrist, indicating the degree of arm reach during performance.
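The three violin descriptors follow directly from their definitions. The sketch below assumes per-frame 2D positions for the shoulder, elbow, and wrist of the bowing arm; the function name is hypothetical, and reporting wrist velocity as a scalar speed and the elbow angle in degrees are assumptions.

```python
import numpy as np

def violin_features(shoulder, elbow, wrist, fps):
    """Each input is an (n_frames, 2) array of 2D positions for the bowing arm."""
    # Wrist velocity, reported here as the magnitude of the time derivative.
    wrist_velocity = np.linalg.norm(np.gradient(wrist, 1.0 / fps, axis=0), axis=1)

    # Elbow angle: angle at the elbow between the upper arm and the forearm.
    upper = shoulder - elbow
    fore = wrist - elbow
    cos = np.sum(upper * fore, axis=1) / (
        np.linalg.norm(upper, axis=1) * np.linalg.norm(fore, axis=1))
    elbow_angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # Arm extension: Euclidean distance between shoulder and wrist.
    arm_extension = np.linalg.norm(wrist - shoulder, axis=1)
    return wrist_velocity, elbow_angle, arm_extension
```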
3.3.2 Audio
Audio features include onset envelope and Root-Mean-Square (RMS) energy, both
extracted using Librosa. The onset envelope highlights sudden changes in energy, often
corresponding to note or syllable attacks, while RMS energy provides a measure
of overall loudness over time. Prior work has shown that these temporal energy
variations are closely coupled with articulatory motion in audiovisual speech [45]. For
the mridangam, stereo tracks were mixed into a single mono signal before calculating
the features.
3.4 Synchronization and Source Alignment
To enable multimodal frame-wise analysis, audio and video data were temporally
aligned. Although most Saraga videos were recorded at 30 frames per second (fps),
some performances had slightly different frame rates. Videos recorded at 29.99 fps
were resampled to exactly 30 fps for consistency. Videos recorded at 24 fps were
left unchanged to preserve their original temporal resolution. Each song's fps was
stored in the metadata. Pose estimation was performed at the original (or adjusted)
video fps, and motion features were computed accordingly. To enable a frame-wise
audiovisual correlation analysis, the 48 kHz audio was resampled using linear
interpolation to match the number of video frames.
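As a rough illustration of the last step, the sketch below linearly interpolates a 1-D audio-derived sequence onto the video frame grid. It assumes both sequences cover the same time span; the function name is hypothetical, and whether the interpolation was applied to the waveform or to the extracted features is not specified here.

```python
import numpy as np

def resample_to_frames(values, n_frames):
    """Linearly interpolate a 1-D sequence onto n_frames evenly spaced points."""
    values = np.asarray(values, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(values))  # normalized time of the input samples
    dst = np.linspace(0.0, 1.0, num=n_frames)     # normalized time of the video frames
    return np.interp(dst, src, values)
```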
The video data was recorded concurrently with the performance, while the audio was
captured separately in four different tracks (one per instrument, with the mridangam
using two tracks). Although these sources originate from the same performance,
they were pieced together manually, which could cause desynchronization
due to editing, recording, or post-processing artifacts. While no obvious
desynchronization was perceived during qualitative inspection, potential misalignment is
formally assessed in Chapter 4.
3.4.1 Data Filtering
Pose estimation confidence varied across instruments and frames. In total, 0.79% of
frames contained NaN values, and 5.5% had low confidence scores¹. Mean confidence
scores per instrument were as follows: mridangam: 7.16, violin: 7.11, vocal: 8.00.
This distribution is expected, as vocalists typically face the camera directly and are
subject to less occlusion. Frames with confidence below the confidence threshold
were discarded, and these missing or unreliable values were masked in all subsequent
analyses. No interpolation or padding was applied given the relatively small amount
of unreliable data. As a result, all subsequent analyses were conducted only on
high-confidence, temporally aligned frames².
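The masking step can be sketched with a NumPy masked array. The threshold of 3 is the value given in the footnote to this section; the function name and the per-frame array layout are assumptions.

```python
import numpy as np

def mask_unreliable(features, confidence, threshold=3.0):
    """Mask frames with NaN values or pose confidence below the threshold.

    features, confidence: (n_frames,) arrays for one performer and one feature.
    Returns a masked array, so later statistics skip the masked frames.
    """
    features = np.asarray(features, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    bad = np.isnan(features) | np.isnan(confidence) | (confidence < threshold)
    return np.ma.masked_array(features, mask=bad)
```

Masking rather than interpolating matches the choice described above: unreliable frames are excluded, not filled in.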
3.5 Limitations
Despite the care taken during preprocessing and annotation, several limitations
affect the dataset, which are considered when interpreting motion–audio relationships
in the subsequent chapters:
• Occlusions: The vocalist is sometimes partially occluded by the microphone,
and the lateral positioning of the violinist and mridangam player leads to
frequent self-occlusion or loss of hand detail.
• Pose Estimation Detail: The resolution of the original recordings limits the
fine-grained accuracy of pose estimations, particularly for fast hand movements
or finger articulation.
• Audio Bleed: Although the audio was recorded in isolated tracks, all
instruments were captured in the same physical space, leading to audio bleeding
across tracks. This may reduce the precision of instrument-specific audio
features and complicate their correlation with motion data.
¹ The confidence threshold was set to 3, with scores ranging up to ≈11.
² Some instruments, such as the mridangam, naturally exhibit a delay between physical motion
and corresponding audio due to the physics of sound production and human performance. No
corrective temporal alignment was applied to account for these expressive lags. Their impact is
analyzed later in Chapter 4 in the context of audiovisual correlations.
Chapter 4
Correlation and Timing Analysis
This chapter explores the temporal relationships between motion and audio features
across different instruments. We analyze how facial and body movements correlate
with sonic events over time and discuss the implications of observed lags or synchrony
between modalities.
4.1 Sliding Window Correlation Analysis
4.1.1 Speed and Acceleration
To capture local temporal dependencies between modalities, we apply a sliding
window correlation approach between motion and audio features. Pearson correlations
are computed over a 0.5-second window with a 0.1-second step size. To focus on
meaningful interactions, only correlations with an absolute value above 0.5 are
retained¹. Strong correlation windows are identified across four feature pairs: motion
speed vs. audio onset, motion speed vs. audio RMS, motion acceleration vs. audio
onset, and motion acceleration vs. audio RMS. The number of high-correlation
windows is then aggregated per body part and per instrument to highlight which
regions contribute most consistently to the audio signal across all feature pairs.
¹ Both strong positive and strong negative correlations between audio and motion features are
treated as equally meaningful, as they each reflect consistent relationships.
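The procedure can be sketched for a single feature pair as follows, using the window, step, and threshold values stated above. The function name and return format are illustrative, and both inputs are assumed to be frame-aligned 1-D arrays at the video frame rate.

```python
import numpy as np

def sliding_correlation(motion, audio, fps, window_s=0.5, step_s=0.1, threshold=0.5):
    """Pearson correlation over sliding windows for one motion/audio feature pair."""
    win = int(round(window_s * fps))
    step = max(1, int(round(step_s * fps)))
    starts, corrs = [], []
    for start in range(0, len(motion) - win + 1, step):
        m = motion[start:start + win]
        a = audio[start:start + win]
        if np.std(m) == 0 or np.std(a) == 0:
            r = np.nan                     # correlation is undefined for a constant window
        else:
            r = np.corrcoef(m, a)[0, 1]
        starts.append(start)
        corrs.append(r)
    corrs = np.array(corrs)
    strong = np.abs(corrs) > threshold     # positive and negative both count; NaN gives False
    return np.array(starts), corrs, strong
```

Counting the `True` entries of `strong` per body part and instrument would give the aggregate counts described above.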
A large portion of the dataset exhibits strong correlations between motion and
audio features. Overall, 95.5% of the total duration contains at least one strongly
correlated window. By instrument, coverage is 60.8% for mridangam, 76.7% for
vocals, and 60.9% for violin. To account for temporal overlap, we also computed
coverage using non-overlapping windows (0.5 s hop), finding 64.6% overall, with
26.3% for mridangam, 37.9% for vocals, and 25.6% for violin. These results suggest
that a notable fraction of the dataset may exhibit audio-motion coupling, with
the effect persisting even when accounting for window overlap, providing a
model-independent perspective before introducing audio-visual source separation analyses.
Figure 2: Number of strong windows per body part and instrument.
In Figure 2, the vocalist shows the highest number of strong correlations, likely due
to longer active performance durations and frontal visibility, which improve keypoint
estimation. For the instrumentalists, correlations are slightly higher in the
arms/hands than in the face or head, consistent with their role in sound production.
The vocalist also shows more hand than face correlations, possibly reflecting expressive
gestures. The mridangam performer shows fewer correlations overall, as only the
non-occluded arm was visible, representing half of the instrument's activity. More
generally, the relatively uniform counts across body parts, especially for side-facing
performers, suggest that occlusions and pose estimation errors dilute motion signals
in active regions like the hands, while more consistently estimated areas such as the
face contain stable but less informative correlations. These aggregate results should
therefore be interpreted with caution given the limitations noted in Section 3.5.
4.1.2 Vocal-Specific Features
We applied the sliding-window correlation method to two vocal-specific facial
features (mouth area and nose-to-jaw distance) extracted from the 3D face keypoints.
To ensure reliability, only frames where the 3DDFA face estimation passed the
accuracy check described in Section 3.3.1 were included.
Case study. To better illustrate the analysis, we examined a short performance
segment from Ameya Karthikeyan – Jalajakshi and plotted the temporal evolution
of correlations between the two vocal-specific motion features (mouth area and
nose-to-jaw distance) and the audio features (RMS energy and onset envelope). Only the
first 30 seconds of the 41-second performance were retained after filtering unreliable
3DDFA frames. As shown in Figure 3, both audio features follow a broadly similar
temporal trend, although RMS energy consistently reaches higher correlation values
than the onset envelope.
Figure 3: Temporal evolution of vocal motion features vs audio features correlation
throughout the performance of Ameya Karthikeyan - Jalajakshi
Both vocal-specific features produced comparable counts of strong correlation
windows: 48 for the onset envelope and 209 for RMS energy (combined over the two
Chapter 5. Model Analysis and Interpretability
raises the question of whether including body keypoints introduces useful cues or
merely noise, motivating our interpretability analysis to examine how each visual
modality contributes to voice separation.
5.2.2 Gradient-Based Interpretability
To understand how our model leverages visual input for voice separation, we apply
gradient-based attribution methods. While gradients are typically used during
training to update model parameters by computing $\partial L / \partial \theta$, in
interpretability contexts, we instead compute the gradient of the loss $L$ with respect
to the input $x$. This allows us to estimate how sensitive the output is to changes in
each input feature, essentially measuring the importance of our audiovisual features.
We apply all attribution methods over 4-second audio-visual chunks, using the Mean
Squared Error (MSE) between predicted and ground-truth audio as the target loss.
Vanilla Gradients
The simplest approach computes the gradient of the loss with respect to each input:
$$\mathrm{Saliency}_i = \frac{\partial L}{\partial x_i}.$$
These raw gradients indicate which inputs would most influence the output if slightly
perturbed. However, this method is often noisy due to gradient saturation or
non-linearity in the model's activations.
In practice, we found that the gradients had very low-magnitude attribution scores,
and even after applying a constant scaling factor of 216 (which was later used for all
methods for consistency), they were hard to interpret.
Input × Gradient
To improve interpretability, we used the Input × Gradient method, defined as:
$$\mathrm{Saliency}_i = x_i \cdot \frac{\partial L}{\partial x_i}.$$
This method approximates a first-order Taylor expansion of the model's output
around a zero baseline, combining the direction of influence (the gradient) with
the input's actual magnitude. This makes attributions more meaningful than raw
gradients, especially when feature scales differ, such as between subtle facial motions
and broad hand gestures.
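The difference between the two attribution rules can be illustrated on a toy differentiable loss. The sketch below is not the separation model: it assumes a stand-in loss L(x) = (tanh(w · x) − target)² with an analytic gradient, and all names are hypothetical.

```python
import numpy as np

def loss_and_grad(x, w, target):
    """Toy stand-in loss L(x) = (tanh(w . x) - target)^2 and its gradient dL/dx."""
    pred = np.tanh(w @ x)
    loss = (pred - target) ** 2
    grad = 2.0 * (pred - target) * (1.0 - pred ** 2) * w  # chain rule
    return loss, grad

def vanilla_gradient(x, w, target):
    """Saliency_i = dL/dx_i."""
    return loss_and_grad(x, w, target)[1]

def input_x_gradient(x, w, target):
    """Saliency_i = x_i * dL/dx_i."""
    return x * vanilla_gradient(x, w, target)
```

In this toy setting, a feature with a large weight but a zero input value receives a large vanilla gradient and a zero Input × Gradient score, which shows how the second rule accounts for the input's actual magnitude.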
The implicit zero baseline serves as a reference point, interpreting each input's
contribution relative to its absence [50]. As with vanilla gradients, we applied a scaling
factor of 216 to amplify the small attribution values and aggregated saliency scores
across time and space to obtain body-part-level insights.
Compared to vanilla gradients, Input × Gradient produced clearer and more stable
results, highlighting facial regions, particularly the nose and mouth. In the
aggregated analysis (using the mean saliency per part), body regions showed lower
importance, with the head standing out slightly among them, reinforcing the model's
reliance on facial cues. For the temporal evolution, we used the total saliency per
part to reflect not only the relative importance but also the absolute contribution
of each region over time. Using a zero baseline for all inputs (especially the face)
remained a limitation, later addressed with a more realistic baseline in the Integrated
Gradients method. Full visualizations are included in Appendix .2.
Integrated Gradients
To address limitations of Input × Gradient, such as incomplete attributions and the
sensitivity to arbitrary baselines, we implemented Integrated Gradients (IG) [51].
IG computes the average gradient along a straight-line path from a baseline input
$\tilde{x}$ to the actual input $x$:
$$\mathrm{IG}_i(x) = (x_i - \tilde{x}_i) \int_0^1 \frac{\partial L(\tilde{x} + \alpha (x - \tilde{x}))}{\partial x_i} \, d\alpha$$
We approximate the integral using 25 steps¹, computing the gradient of the MSE
loss at each interpolated input.
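A minimal sketch of this approximation is given below. It uses a midpoint Riemann sum and a toy loss with an analytic gradient; both are assumptions made for illustration, since the quadrature rule is not specified here and the real gradients come from the separation model.

```python
import numpy as np

def toy_loss(x, w, target):
    """Toy stand-in for the MSE loss: (tanh(w . x) - target)^2."""
    return (np.tanh(w @ x) - target) ** 2

def toy_grad(x, w, target):
    """Analytic gradient of toy_loss with respect to x."""
    pred = np.tanh(w @ x)
    return 2.0 * (pred - target) * (1.0 - pred ** 2) * w

def integrated_gradients(grad_fn, x, baseline, steps=25):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints of `steps` equal sub-intervals
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps          # (x - baseline) times the average gradient
```

A useful sanity check is completeness: the attributions should sum approximately to the difference between the loss at the input and the loss at the baseline.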
Baseline selection was adapted to reflect the model's preprocessing pipeline. The
audio waveform and body keypoints are both normalized and centered around zero,
making a zero baseline appropriate. However, the model centers face keypoints
around a dataset-wide mean during preprocessing, but unlike the other modalities,
these coordinates are not normalized or scaled. This mismatch can introduce a
bias in the model's internal representations and potentially affect how attribution
methods interpret the role of facial inputs. To mitigate this, we used the
dataset-mean face as the baseline for facial keypoints in the Integrated Gradients method.
While this choice improves alignment with the model's actual input space, it does not
fully resolve the interpretability challenges introduced by the lack of normalization,
particularly when comparing visual modalities². To check the impact of this baseline
mismatch, we also compared attributions obtained with Integrated Gradients against
those from the simpler Input × Gradient method. Using the case study performance,
both methods produced highly similar distributions of saliency across regions (e.g.,
face regions, Spearman ρ≈0.86; body regions, ρ≈0.93), suggesting that the
choice of baseline does not substantially alter the relative importance patterns, even
if absolute magnitudes differ.
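The rank-based comparison can be sketched as follows. The helper names are hypothetical and the rank computation assumes no tied values; the inputs would be per-region saliency vectors from two attribution methods.

```python
import numpy as np

def rank(values):
    """Ranks 0..n-1 in ascending order (assumes no ties)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(rank(a), rank(b))[0, 1]
```

Because only ranks enter the statistic, it compares the ordering of regions and ignores differences in absolute magnitude between the two methods.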
Case study. We first illustrate the method in more detail with a single performance.
For each 4-second chunk, we compute a saliency score by averaging gradient
values across time and spatial dimensions. These scores are then used in two
complementary analyses: a global one that averages saliency over the entire performance,
and a temporal one that tracks how saliency evolves throughout the song.
The resulting saliency scores confirm earlier trends: facial regions (especially the
nose, followed by the mouth) dominate the attributions. As shown in Figure 6c,
the more accurate baseline made the body and face scores more comparable, yet
the overall distribution pattern remains. Among body parts, only the head shows a
notable contribution, reinforcing the model's reliance on facial cues for voice
separation.
¹ The authors recommend using between 20–1000 steps. We selected 25 for computational
cost reasons, and verified its adequacy by comparing results with 50 steps: distributions of facial
saliencies were nearly identical (symmetric Kullback-Leibler ≈3.9×10⁻⁶, Spearman ρ = 1.0).
https://github.com/ankurtaly/Integrated-Gradients/blob/master/howto.md
² This limitation should be kept in mind when interpreting results and comparing saliency across
face and body regions.
(a) Body keypoints weighted by IG scores (b) Face keypoints weighted by IG scores
(c) Averaged IG saliency scores for different body and face parts.
Figure 6: Spatial and averaged visualization of Integrated Gradients scores for visual
input modalities in the performance of Abhiram Bode - Entha Bhagyamu using
VoViT-fb. Landmark radius size indicates averaged saliency score.
The facial visualization in Figure 6b reveals that the keypoints around the inner
and outer lips, particularly those near the center of the mouth, exhibit the highest
saliency, which aligns with the more pronounced movement of these points during
mouth opening and closing. There is also a noticeable saliency concentration on
the chin, likely reflecting the role of jaw motion during singing. Another interesting
finding is the strong attribution present in the nose across all its keypoints, especially
along the bridge. As for the body (Figure 6a), although head keypoints remain the
most salient, some fingers on the right hand also show elevated saliency. This may
be linked to rhythmic hand gestures or subtle movements that coincide with vocal
fluctuations, suggesting the model captures not only facial articulation but also
complementary body motion cues³. A more detailed view of individual keypoint
contributions, grouped by body region, is provided in Appendix .3.
Figure 7: Temporal evolution of saliency scores for major body regions throughout
the performance of Abhiram Bode - Entha Bhagyamu using VoViT-fb
Moving on to the temporal analysis, Figure 7 provides a temporal breakdown of
saliency scores for broader regions. While the overall importance of regions fluctuates
over time, their relative rankings remain mostly stable. Facial regions consistently
dominate, followed by the right hand and head. Interestingly, saliency scores across
all regions tend to rise and fall together, suggesting shared influence from more
abstract factors such as vocal activity or model confidence, rather than isolated
spikes in importance for specific regions. This interpretation is supported by the
moderate positive correlations observed between saliency dynamics and separation
quality (using scale-invariant signal-to-distortion ratio, SI-SDR), with values ranging
from ≈0.38 for body regions to ≈0.46 for face-dominated aggregates. These
results indicate that the temporal saliency modulation is not arbitrary but reflects
moments where visual cues contribute more strongly to effective separation.
³ Although hand and finger keypoints occasionally exhibit elevated saliency, masking experiments
(Table 2) show little causal effect on separation quality. These visualizations may therefore
reflect incidental correlations or model noise rather than strong reliance on body cues.
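For reference, SI-SDR and its correlation with per-chunk saliency can be sketched as below. The SI-SDR function follows the standard definition (projection of the estimate onto the reference); removing the mean first, the `eps` constant, and the function names are assumptions made for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference   # part of the estimate explained by the reference
    noise = estimate - target    # residual distortion
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def pearson(a, b):
    """Pearson correlation between two per-chunk sequences."""
    return np.corrcoef(a, b)[0, 1]
```

Given one saliency value and one SI-SDR value per 4-second chunk, `pearson(saliency_per_chunk, si_sdr_per_chunk)` would yield a correlation of the kind reported above; both argument names are hypothetical.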
Figure 8: Temporal evolution of saliency scores with strong window overlay for the
performance of Abhiram Bode - Entha Bhagyamu using VoViT-fb
To bridge the model-independent exploration in Chapter 4 with our current attribution
analysis, we plotted the temporal evolution of aggregated face and body
saliency scores, overlaying the timestamps of strong correlation windows (|ρ| > 0.66
over 0.5 s) for both keypoint speed and acceleration (Figure 8). While acceleration
events showed weak-to-moderate correlations with visual attention (r = 0.17–0.22,
with face regions most responsive), speed correlations were negligible (r ≈ 0).
Acceleration aligned with local saliency variations in some cases, but the effect size was
small. Vocal-specific cues showed slightly higher associations: jaw-to-nose motion
reached r = 0.21–0.24 (p < 0.05) with about 6 events per 4 s chunk on average, and
mouth area motion r = 0.12–0.22 with 5.6 events per chunk. These differences suggest
that articulatory gestures may offer more consistent temporal structure than
general patterns, though the correlations remain modest overall.
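A minimal sketch of how strong correlation windows of this kind could be located is given below. The 0.5 s window and the |ρ| > 0.66 threshold come from the text; the frame rate of 30 fps, the hop of one frame, and the function name `strong_correlation_windows` are assumptions made for illustration only.

```python
import numpy as np

def strong_correlation_windows(motion, audio, fps=30.0, window_s=0.5, threshold=0.66):
    """Return start times (seconds) of windows where |Pearson rho| exceeds threshold."""
    motion = np.asarray(motion, dtype=float)
    audio = np.asarray(audio, dtype=float)
    win = int(round(window_s * fps))  # window length in frames
    starts = []
    for start in range(0, len(motion) - win + 1):
        m = motion[start:start + win]
        a = audio[start:start + win]
        # Pearson correlation is undefined when either window is constant.
        if m.std() == 0 or a.std() == 0:
            continue
        rho = np.corrcoef(m, a)[0, 1]
        if abs(rho) > threshold:
            starts.append(start / fps)
    return starts

# Synthetic example: an audio descriptor that is a scaled, offset copy of the motion.
t = np.arange(60) / 30.0
motion = np.sin(2 * np.pi * 2 * t)
audio = 0.8 * motion + 0.1
hits = strong_correlation_windows(motion, audio)
```

The returned timestamps are what would be overlaid on a saliency curve; counting them per 4 s chunk gives an "events per chunk" figure of the kind reported above.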
Dataset-Wide Analysis. While the case study illustrates local trends, it is unclear
whether these generalize. To address this, we repeated the IG analysis across
the entire dataset, restricted to performances with a vocalist, violinist, and mridangam
player. For each 4-second segment, we computed saliency scores for all visual
keypoints and aggregated them by body region.
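To make the attribution step concrete, the sketch below approximates integrated gradients (IG) with a midpoint Riemann sum and then aggregates absolute attributions by region. It uses a toy differentiable function with an analytic gradient in place of the separation model; the weights `W`, the region-to-keypoint mapping, the all-zero baseline, and the function names are hypothetical. A real analysis would obtain gradients from an automatic differentiation framework.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Approximate IG along the straight path from baseline to x (midpoint rule)."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * total / steps

# Toy "model": F(x) = tanh(sum(W * x)), with x shaped (T, K) = (frames, keypoints).
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.1]])

def F(z):
    return float(np.tanh(np.sum(W * z)))

def grad_F(z):
    return (1.0 - np.tanh(np.sum(W * z)) ** 2) * W

x = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])
baseline = np.zeros_like(x)
ig = integrated_gradients(grad_F, x, baseline)

# Aggregate absolute attributions over time and over the keypoints of each region.
regions = {"mouth": [0], "nose": [1], "body": [2]}  # hypothetical index groups
region_saliency = {name: float(np.abs(ig[:, idx]).sum()) for name, idx in regions.items()}
```

A useful sanity check for any IG implementation is the completeness property: the attributions should sum to approximately `F(x) - F(baseline)`.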
Rather than computing IG over every 4-second segment, we focused on two contrasting
subsets:
• The top 10% of chunks with the highest number of strongly correlated frames
between vocal-specific motion features and audio per performance.
• The bottom 10% of chunks with the lowest number of strong correlations.
By comparing these subsets, we test whether the model’s attributions are more
meaningful when audiovisual synchrony is high.
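The subset selection described above can be sketched as follows for a single performance. The input is assumed to be one count of strongly correlated frames per chunk; the function name `contrast_subsets`, the rounding rule, and the tie-breaking by stable sort are assumptions, not details given in the text.

```python
import numpy as np

def contrast_subsets(strong_counts, fraction=0.10):
    """Return (top, bottom) chunk indices ranked by number of strong correlations."""
    counts = np.asarray(strong_counts)
    n = max(1, int(round(fraction * len(counts))))  # at least one chunk per subset
    order = np.argsort(counts, kind="stable")       # ascending by count
    bottom = sorted(order[:n].tolist())
    top = sorted(order[-n:].tolist())
    return top, bottom

# Hypothetical counts for the 20 chunks of one performance.
counts = [3, 0, 7, 1, 9, 2, 5, 4, 8, 6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
top, bottom = contrast_subsets(counts)
```

Applying the function separately to each performance keeps the selection relative to that performance, which matches the "per performance" wording above.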
Figure 9: Averaged IG saliency scores for different body and face parts.
Figure 9 shows the distribution of average saliency across grouped facial and body
regions for these two subsets. The results reveal that the proportional distribution
of keypoint saliency remains almost identical⁴ across subsets, closely resembling
the pattern seen in the case study. However, the absolute scale of saliency differs
substantially: face regions exhibit saliency values an order of magnitude higher than
body regions, and within each group, the top 10% subset shows attributions 2 to 3
times stronger than the bottom 10%. This indicates that while body keypoints are
used minimally, the model relies heavily on facial features, especially the mouth and
nose, for its predictions.
⁴ The L2 distance of the normalized face and body distributions is 0.03 and 0.05, respectively.
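The comparison of proportional distributions can be sketched as an L2 distance between saliency profiles after each has been normalized to sum to one. The region values below are invented for illustration, and the normalization by the sum is an assumption about how the distributions were normalized.

```python
import numpy as np

def normalized(dist):
    """Scale a non-negative saliency profile so that it sums to one."""
    d = np.asarray(dist, dtype=float)
    return d / d.sum()

def l2_distance(p, q):
    """Euclidean distance between two sum-normalized profiles."""
    return float(np.linalg.norm(normalized(p) - normalized(q)))

# Hypothetical per-region saliency: the bottom subset is the top subset scaled by 0.4.
top_profile = [3.0, 4.5, 1.5, 1.0]
bottom_profile = [1.2, 1.8, 0.6, 0.4]
distance = l2_distance(top_profile, bottom_profile)
```

Because normalization removes the overall scale, two profiles that differ only by a constant factor have a distance of essentially zero, which is the situation the reported small distances point to.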
This effect was not limited to visual inputs. Audio saliency also followed the same
trend, with average values of 0.0467 for the top subset and 0.0168 for the bottom
subset. This suggests that in low-correlation windows the model is overall less
input-dependent, potentially relying more on learned priors or exhibiting saturated
behavior. Importantly, this dataset-wide result mirrors the case study (Figure 8),
where peaks and valleys of face and body saliency coincided with frames of high and
low audiovisual correlation. Taken together, these findings indicate that audiovisual
synchrony acts primarily as a scaling factor for saliency: it modulates the overall
sensitivity of the model to its inputs without altering the relative spatial distribution
of attributions across keypoints.
While the face+body analysis revealed strong reliance on facial landmarks, it
remained unclear whether the body input dilutes or complements the facial contribution.
To find out, we extended the IG analysis to the face-only and body-only models.
Comparison with face-only and body-only models. The same integrated
gradients analysis was applied to the face-only and body-only models under identical
conditions to those used for the face+body model. Figure 10 shows the distributions
of average saliency across facial keypoints for the face+body and face-only
models. While the overall magnitude of face-only saliency is roughly half that of
the face+body model, the relative distribution shifts: the mouth emerges as the
most salient region, followed by the nose, whereas in the face+body model the nose
exhibits the highest attribution. This suggests that when constrained to facial input
alone, the model emphasizes regions most directly linked to vocalization, particularly
the mouth, whereas the face+body model spreads attribution more diffusely,
including toward body keypoints that contribute relatively little (see Figure 9). This
difference may help explain why the face-only model achieves stronger separation
performance.
Figure 10: Averaged IG saliency across facial keypoints for face+body and face-only.
Audio saliency values further support this interpretation. While the face+body
and face-only models exhibit nearly identical reliance on audio input (1.16 vs. 1.17
average saliency in the top 10%), the body-only model shows substantially higher
audio saliency (1.54). This indicates that when deprived of informative facial cues,
the model compensates by over-relying on the audio stream, confirming that body
landmarks alone provide limited useful information for vocal separation.
For completeness, we also examined the saliency distribution of the body-only model
(Figure 18 in Appendix .3). In this case, the head still emerges as the most salient
region, despite the absence of facial landmarks. This suggests that even with coarse
body keypoints, the model attempts to exploit head motion as a weak proxy for
vocal activity, though these signals are insufficient to support effective separation.
5.2.3 Ablation Studies
To test causality rather than correlation, we performed a set of inference-time
ablations that directly intervene on the visual inputs and measure the resulting change
in separation quality. All ablations were run for the vocalists, using the face+body
VoViT variant from the gradient-based analysis. We report changes in SI-SDR
computed between the model’s estimate and the ground-truth target. Two
complementary studies were implemented:
Temporal shuffle. The temporal shuffle script randomly permutes the time
dimension of the face and body keypoint tensors within each 4-s chunk while leaving
the audio unchanged. Concretely, for each sample the tensors shaped (B, T, C, K)
are permuted along T, then forwarded through the frozen model to produce a
separated estimate. For every (artist, song) we record the baseline SI-SDR and the
shuffled SI-SDR, and compute
$\Delta\text{SI-SDR} = \text{SI-SDR}_{\text{baseline}} - \text{SI-SDR}_{\text{shuffled}}$
as the per-song effect. This intervention tests whether the model requires short-term
temporal alignment between gestures and sound.
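The temporal shuffle can be sketched in a few lines. The tensor layout (B, T, C, K) and the definition of ∆SI-SDR come from the text; drawing an independent permutation for each sample in the batch, the NumPy implementation, and the function names are assumptions, since the original script is not shown. Whether face and body tensors share one permutation is likewise not stated.

```python
import numpy as np

def temporal_shuffle(keypoints, rng):
    """Permute the time axis of a (B, T, C, K) keypoint tensor, per sample."""
    keypoints = np.asarray(keypoints)
    shuffled = np.empty_like(keypoints)
    for b in range(keypoints.shape[0]):
        perm = rng.permutation(keypoints.shape[1])  # random order of the T frames
        shuffled[b] = keypoints[b, perm]
    return shuffled

def delta_si_sdr(baseline_db, shuffled_db):
    """Per-song effect: positive values mean the intervention hurt separation."""
    return baseline_db - shuffled_db

# Toy tensor: 2 samples, 5 frames, 2 coordinates, 3 keypoints.
x = np.arange(2 * 5 * 2 * 3).reshape(2, 5, 2, 3)
y = temporal_shuffle(x, np.random.default_rng(0))
```

The shuffled tensor keeps exactly the same frames as the original, so any change in SI-SDR can only come from their order and not from their content.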
Regional masking. The region masking script zeroes or replaces specific subsets
of landmarks before forwarding the sample. We implemented three masks: (i)
mouth, (ii) nose, and (iii) full body. Masks are filled with zeros for the body and with
the respective dataset mean face keypoints for the face regions. For each masked
condition we compute per-song SI-SDR and the corresponding ∆SI-SDR relative to
the unmasked baseline. This directly probes which spatial regions are necessary for
the model’s performance.
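A minimal sketch of the masking step is shown below, assuming the same (B, T, C, K) layout as in the temporal shuffle. The keypoint index lists, the function name `mask_region`, and the use of the batch mean as a stand-in for the dataset mean are all hypothetical.

```python
import numpy as np

def mask_region(keypoints, region_idx, fill=None):
    """Return a copy of (B, T, C, K) keypoints with the given K indices overwritten.

    fill=None writes zeros (body-style mask); otherwise fill is an array of shape
    (C, len(region_idx)), e.g. mean keypoints (face-style mask).
    """
    masked = np.array(keypoints, dtype=float, copy=True)
    if fill is None:
        masked[..., region_idx] = 0.0
    else:
        masked[..., region_idx] = np.asarray(fill, dtype=float)
    return masked

# Toy face tensor: 2 samples, 4 frames, 2 coordinates, 6 keypoints.
face = np.random.default_rng(0).normal(size=(2, 4, 2, 6))
mouth_idx = [3, 4, 5]  # hypothetical mouth keypoint indices
# Stand-in for the dataset mean: average over samples and frames of this batch.
mouth_mean = face[..., mouth_idx].mean(axis=(0, 1))
face_mouth_masked = mask_region(face, mouth_idx, fill=mouth_mean)
zero_masked = mask_region(face, [0, 1])
```

Filling face regions with mean positions rather than zeros keeps the masked input within a plausible coordinate range, which is one likely motivation for the design described above.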
Results. The temporal shuffle produced essentially no effect on separation: across
the dataset the average SI-SDR changed by ∆SI-SDR ≈ −0.03 dB (baseline ≈ 12.32
dB, shuffled ≈ 12.35 dB), indicating that destroying framewise synchrony did not
meaningfully degrade performance.
By contrast, regional masking produced very large, region-specific degradations.
Table 2 summarizes the face+body results.
Interpretation. The tiny effect of temporal shuffling suggests the model doesn’t
actually need frame-by-frame synchrony between gestures and sound, at least not at
the scale we tested it⁵. In contrast, masking the mouth or nose causes a significant
⁵ This intervention shuffled keypoints within 4-second chunks, which destroys fine-grained
synchrony but still preserves slower-scale co-occurrence. It therefore rules out frame-level
dependence but not longer-timescale audiovisual alignment.
Chapter 6. Discussion and Conclusion
whether human-interpretable motion features (e.g., speed, acceleration) contained
any relation to audio descriptors (onset envelope and RMS). The goal was not only
to test for correlations but to better understand the kind of information present in
the data before moving on to model-level analysis. Although overall correlations
were weak, we observed that instrument-specific cues, such as mouth motion for
voice and elbow angle for violin, aligned more consistently with audio activity than
general motion. This finding highlighted the importance of selecting where to look
for visual information depending on the instrument of interest.
Building on this context, Chapter 5 turned to the interpretability of audiovisual
separation models themselves. We began by analyzing the face, body, and face+body
VoViT model variants and then extended the investigation to the fusion module
variants, which added multihead attention and FiLM layers to better exploit
cross-modal cues. Several complementary methods were employed to test the models:
gradient-based saliency analysis to reveal spatial and temporal focus, case studies
of audiovisual alignment, comparisons across different input variants, ablations
to assess the causal impact of visual cues, and an in-depth analysis of the fusion
strategies.
Taken together, these three stages reflect a progression from data preparation, to
exploratory data understanding, to model interpretability. This layered approach
provided both a practical evaluation of audiovisual separation in Carnatic music and
a methodological framework for disentangling how visual cues influence multimodal
models.
6.2 Conclusions
The results of this work clarify how visual information contributes to music source
separation in complex, real-world recordings. Dataset exploration showed that
instrument-specific motion features, such as mouth movements for vocals, align more
closely with audio activity than general motion, highlighting where meaningful visual
cues reside.
Interpretability analyses of VoViT models revealed that facial landmarks dominate
visual contributions to vocal separation, while body motion is largely peripheral.
Ablations confirmed the causal relevance of these visual features. For
higher-performing vocal and violin models, feature fusion through multihead attention
and FiLM layers enabled selective cross-modal modulation, showing that the
models can amplify or suppress audio features based on visual context.
Overall, visual cues can meaningfully support source separation, but their impact is
instrument- and region-specific. The findings underline the importance of targeted
visual feature selection and carefully designed fusion mechanisms in multimodal
music separation systems.
6.3 Future Work
The insights and limitations of this thesis open up several promising paths for future
research. First, the analytical framework developed here could be extended
to other musical traditions and instruments, such as Hindustani music, to evaluate
whether similar visual dominance patterns and cross-modal modulation principles
hold. Insights from our interpretability analyses suggest that models could be made
more efficient by prioritizing the most informative visual features, such as facial
landmarks, and by designing FiLM-like modulatory mechanisms tailored to specific
separation tasks. Another promising direction is connecting keypoint-based analyses
with pixel-level saliency methods, which could reveal whether models trained on raw
video naturally attend to the same informative regions. Finally, the understanding
of audiovisual relationships gained here could inform generative and cross-modal
synthesis, enabling models to predict or synthesize performer gestures from audio,
or vice versa, in a musically meaningful way.
In summary, this thesis has provided an integrated investigation into how visual
information interacts with audio in music source separation, focusing on the
challenging case of Carnatic performance. By combining dataset-driven analysis,
cross-domain evaluation, and interpretability methods, it has offered a clearer picture of
the selective but meaningful role of visual cues. While the findings are necessarily
bounded by data and model constraints, they demonstrate the value of a multimodal
perspective and lay the groundwork for more robust and culturally inclusive
approaches to audiovisual music processing.
Bibliog aphy
[1] Che y, C. Some expe imen s on he ecogni ion o speech wi h one. Jou nal o
he Acous ical Socie y o Ame ica (1953).
[2] Lee, D. D. & Seung, H. S. Lea ning he pa s o objec s by non-nega i e ma ix
ac o iza ion. Na u e 401, 788–791 (1999). URL h ps://www.na u e.com/
a icles/44565.
[3] Sma agdis, P. & B own, J. Non-nega i e ma ix ac o iza ion o polyphonic
music ansc ip ion. In 2003 IEEE Wo kshop on Applica ions o Signal P o-
cessing o Audio and Acous ics (IEEE Ca . No.03TH8684), 177–180 (IEEE,
New Pal z, NY, USA, 2003). URL h p://ieeexplo e.ieee.o g/documen /
1285860/.
[4] Ronnebe ge , O., Fische , P. & B ox, T. U-Ne : Con olu ional Ne wo ks o
Biomedical Image Segmen a ion (2015). URL h p://a xi .o g/abs/1505.
04597. A Xi :1505.04597 [cs].
[5] Jansson, A. e al. SINGING VOICE SEPARATION WITH DEEP U-NET
CONVOLUTIONAL NETWORKS (2017).
[6] S olle , D., Ewe , S. & Dixon, S. Wa e-U-Ne : A Mul i-Scale Neu al Ne wo k
o End- o-End Audio Sou ce Sepa a ion (2018). URL h p://a xi .o g/
abs/1806.03185. A Xi :1806.03185 [cs].
[7] Hennequin, R., Khli , A., Voi u e , F. & Moussallam, M. Splee e : a as and
e icien music sou ce sepa a ion ool wi h p e- ained models. Jou nal o Open
43
44 BIBLIOGRAPHY
Sou ce So wa e 5, 2154 (2020). URL h ps://joss. heoj.o g/pape s/10.
21105/joss.02154.
[8] Dé ossez, A., Usunie , N., Bo ou, L. & Bach, F. Music Sou ce Sepa a ion
in he Wa e o m Domain (2021). URL h p://a xi .o g/abs/1911.13254.
A Xi :1911.13254 [cs].
[9] Luo, Y. & Mesga ani, N. Con -TasNe : Su passing Ideal Time-F equency
Magni ude Masking o Speech Sepa a ion. IEEE/ACM T ansac ions on Au-
dio, Speech, and Language P ocessing 27, 1256–1266 (2019). URL h p:
//a xi .o g/abs/1809.07454. A Xi :1809.07454 [cs].
[10] Mi su uji, Y. e al. Music Demixing Challenge 2021. F on ie s in Signal P o-
cessing 1, 808395 (2022). URL h ps://www. on ie sin.o g/a icles/
10.3389/ sip.2021.808395/ ull.
[11] See ha aman, P., Wiche n, G., Venka a amani, S. & Roux, J. L. Class-
condi ional embeddings o music sou ce sepa a ion (2018). URL h p://
a xi .o g/abs/1811.03076. A Xi :1811.03076 [cs].
[12] S ö e , F.-R., Uhlich, S., Liu kus, A. & Mi su uji, Y. Open-Unmix - A Re -
e ence Implemen a ion o Music Sou ce Sepa a ion. Jou nal o Open Sou ce
So wa e 4, 1667 (2019). URL h ps://joss. heoj.o g/pape s/10.21105/
joss.01667.
[13] Wang, Y., S olle , D., Bi ne , R. M. & Pablo Bello, J. Few-Sho Musical
Sou ce Sepa a ion. In ICASSP 2022 - 2022 IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP), 121–125 (IEEE, Singapo e,
Singapo e, 2022). URL h ps://ieeexplo e.ieee.o g/documen /9747536/.
[14] Tong, W. e al. SCNe : Spa se Comp ession Ne wo k o Music Sou ce Sepa-
a ion. In ICASSP 2024 - 2024 IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing (ICASSP), 1276–1280 (IEEE, Seoul, Ko ea, Re-
public o , 2024). URL h ps://ieeexplo e.ieee.o g/documen /10446651/.
BIBLIOGRAPHY 45
[15] Sodoye , D., Schwa z, J.-L., Gi in, L., Klinkisch, J. & Ju en, C. Sepa a-
ion o Audio-Visual Speech Sou ces: A New App oach Exploi ing he Audio-
Visual Cohe ence o Speech S imuli. EURASIP Jou nal on Ad ances in Sig-
nal P ocessing 2002, 382823 (2002). URL h ps://asp-eu asipjou nals.
sp inge open.com/a icles/10.1155/S1110865702207015.
[16] Lu, R., Duan, Z. & Zhang, C. Lis en and Look: Audio–Visual Ma ching As-
sis ed Speech Sou ce Sepa a ion. IEEE Signal P ocessing Le e s 25, 1315–1319
(2018). URL h ps://ieeexplo e.ieee.o g/documen /8404105/.
[17] Zhao, H. e al. The Sound o Pixels. In Fe a i, V., Hebe , M., Sminchis-
escu, C. & Weiss, Y. (eds.) Compu e Vision – ECCV 2018, ol. 11205, 587–
604 (Sp inge In e na ional Publishing, Cham, 2018). URL h ps://link.
sp inge .com/10.1007/978-3-030-01246-5_35. Se ies Ti le: Lec u e No es
in Compu e Science.
[18] Gao, R. & G auman, K. Co-Sepa a ing Sounds o Visual Objec s. In 2019
IEEE/CVF In e na ional Con e ence on Compu e Vision (ICCV), 3878–3887
(IEEE, Seoul, Ko ea (Sou h), 2019). URL h ps://ieeexplo e.ieee.o g/
documen /9009045/.
[19] Zhu, L. & Rah u, E. Visually Guided Sound Sou ce Sepa a ion Using Cas-
caded Opponen Fil e Ne wo k. In Ishikawa, H., Liu, C.-L., Pajdla, T. & Shi,
J. (eds.) Compu e Vision – ACCV 2020, ol. 12627, 409–426 (Sp inge In e na-
ional Publishing, Cham, 2021). URL h p://link.sp inge .com/10.1007/
978-3-030-69544-6_25. Se ies Ti le: Lec u e No es in Compu e Science.
[20] Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B. & To alba, A. Music Ges u e
o Visual Sound Sepa a ion. In 2020 IEEE/CVF Con e ence on Compu e
Vision and Pa e n Recogni ion (CVPR), 10475–10484 (IEEE, Sea le, WA,
USA, 2020). URL h ps://ieeexplo e.ieee.o g/documen /9157677/.
[21] Tan, R. e al. Language-Guided Audio-Visual Sou ce Sepa a ion ia T imodal
Consis ency. In 2023 IEEE/CVF Con e ence on Compu e Vision and Pa -

46 BIBLIOGRAPHY
e n Recogni ion (CVPR), 10575–10584 (IEEE, Vancou e , BC, Canada, 2023).
URL h ps://ieeexplo e.ieee.o g/documen /10203040/.
[22] Chen, J. e al. iQue y: Ins umen s as Que ies o Audio-Visual Sound Sepa a-
ion. In 2023 IEEE/CVF Con e ence on Compu e Vision and Pa e n Recog-
ni ion (CVPR), 14675–14686 (IEEE, Vancou e , BC, Canada, 2023). URL
h ps://ieeexplo e.ieee.o g/documen /10205441/.
[23] Cha e jee, M., Le Roux, J., Ahuja, N. & Che ian, A. Visual Scene G aphs
o Audio Sou ce Sepa a ion. In 2021 IEEE/CVF In e na ional Con e ence on
Compu e Vision (ICCV), 1184–1193 (IEEE, Mon eal, QC, Canada, 2021).
URL h ps://ieeexplo e.ieee.o g/documen /9710769/.
[24] Mon esinos, J. F., Kadandale, V. S. & Ha o, G. A cappella: Audio- isual
Singing Voice Sepa a ion (2021). URL h p://a xi .o g/abs/2104.09946.
A Xi :2104.09946 [cs].
[25] Mon esinos, J. F., Kadandale, V. S. & Ha o, G. VoViT: Low La ency G aph-
Based Audio-Visual Voice Sepa a ion T ans o me . In A idan, S., B os ow, G.,
Cissé, M., Fa inella, G. M. & Hassne , T. (eds.) Compu e Vision – ECCV
2022, ol. 13697, 310–326 (Sp inge Na u e Swi ze land, Cham, 2022). URL
h ps://link.sp inge .com/10.1007/978-3-031-19836-6_18. Se ies Ti le:
Lec u e No es in Compu e Science.
[26] E han, D., Bengio, Y., Cou ille, A., Vincen , P. & Box, P. O. Visualizing
Highe -Laye Fea u es o a Deep Ne wo k .
[27] Zeile , M. D. & Fe gus, R. Visualizing and Unde s anding Con olu ional Ne -
wo ks (2013). URL h p://a xi .o g/abs/1311.2901. A Xi :1311.2901 [cs].
[28] Baeh ens, D. e al. How o Explain Indi idual Classi ica ion Decisions .
[29] Simonyan, K., Vedaldi, A. & Zisse man, A. Deep Inside Con olu ional Ne -
wo ks: Visualising Image Classi ica ion Models and Saliency Maps (2014). URL
h p://a xi .o g/abs/1312.6034. A Xi :1312.6034 [cs].
BIBLIOGRAPHY 47
[30] Sel a aju, R. R. e al. G ad-CAM: Visual Explana ions om Deep Ne wo ks ia
G adien -based Localiza ion. In e na ional Jou nal o Compu e Vision 128,
336–359 (2020). URL h p://a xi .o g/abs/1610.02391. A Xi :1610.02391
[cs].
[31] Meyes, R., Lu, M., Puiseau, C. W. d. & Meisen, T. Abla ion S udies in A -
i icial Neu al Ne wo ks (2019). URL h p://a xi .o g/abs/1901.08644.
A Xi :1901.08644 [cs].
[32] Se a, X. A MULTICULTURAL APPROACH IN MUSIC INFORMATION
RESEARCH. O al Session (2011).
[33] Gemmeke, J. F. e al. Audio Se : An on ology and human-labeled da ase o
audio e en s. In 2017 IEEE In e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP), 776–780 (IEEE, New O leans, LA, 2017). URL
h p://ieeexplo e.ieee.o g/documen /7952261/.
[34] Li, B., Liu, X., Dinesh, K., Duan, Z. & Sha ma, G. C ea ing a Mul i ack Clas-
sical Music Pe o mance Da ase o Mul imodal Music Analysis: Challenges,
Insigh s, and Applica ions. IEEE T ansac ions on Mul imedia 21, 522–535
(2019). URL h ps://ieeexplo e.ieee.o g/documen /8411155/.
[35] Shanka , A., Plaja-Roglans, G., Nu all, T., Rocamo a, M. & Se a, X. Sa aga
audio isual: a la ge mul imodal open da a collec ion o he analysis o ca na ic
music (2024).
[36] S ini asamu hy, A., Gula i, S., Ca o Repe o, R. & Se a, X. Sa aga: Open
Da ase s o Resea ch on Indian A Music. Empi ical Musicology Re iew
16, 85–98 (2021). URL h ps://emusicology.o g/index.php/EMR/a icle/
iew/7641.
[37] K ishnan, V. V., Alben, N., Nai , A. & Condi -Schul z, N. Sanidha: A S udio
Quali y Mul i-Modal Da ase o Ca na ic Music (2025). URL h p://a xi .
o g/abs/2501.06959. A Xi :2501.06959 [cs].
48 BIBLIOGRAPHY
[38] S idha , R. & Gee ha, T. V. Music In o ma ion Re ie al o Ca na ic Songs
Based on Ca na ic Music Singe Iden i ica ion. In 2008 In e na ional Con e -
ence on Compu e and Elec ical Enginee ing, 407–411 (IEEE, Phuke , Thai-
land, 2008). URL h p://ieeexplo e.ieee.o g/documen /4741017/.
[39] Kodu i, G. K., Mi on, M., Se a, J. & Se a, X. COMPUTATIONAL AP-
PROACHES FOR THE UNDERSTANDING OF MELODY IN CARNATIC
MUSIC .
[40] Ki hika, P. & Cha am elli, R. A e iew o aga based music classi ica ion and
music in o ma ion e ie al (MIR). In 2012 IEEE In e na ional Con e ence on
Enginee ing Educa ion: Inno a i e P ac ices and Fu u e T ends (AICERA),
1–5 (IEEE, Ko ayam, India, 2012). URL h p://ieeexplo e.ieee.o g/
documen /6306752/.
[41] Plaja-Roglans, G., Mi on, M., Shanka , A. & Se a, X. CARNATIC SINGING
VOICE SEPARATION USING COLD DIFFUSION ON TRAINING DATA
WITH BLEEDING (2023).
[42] Plaja-Roglans, G., Nu all, T., Pea son, L., Se a, X. & Mi on, M. Repe oi e-
Speci ic Vocal Pi ch Da a Gene a ion o Imp o ed Melodic Analysis o Ca -
na ic Music. T ansac ions o he In e na ional Socie y o Music In o ma ion
Re ie al 6, 13–26 (2023). URL h p:// ansac ions.ismi .ne /a icles/
10.5334/ ismi .137/.
[43] Shanka , A. Ges u e-guided melodic sou ce sepa a ion in ca na ic music using
c oss-modal usion echniques (2025).
[44] Rod igues, M., Ha o, G., Rocamo a, M. & Si asanka , A. S. Pose Es ima ion
Fo Audio-Visual Singing Voice Sepa a ion In Indian Ca na ic Music .
[45] Chand aseka an, C., T ubano a, A., S illi ano, S., Caplie , A. & Ghazan a ,
A. A. The Na u al S a is ics o Audio isual Speech. PLoS Compu a ional
Biology 5, e1000436 (2009). URL h ps://dx.plos.o g/10.1371/jou nal.
pcbi.1000436.
BIBLIOGRAPHY 49
[46] Tu ne -S okes, L. & Reid, K. Th ee-dimensional mo ion analysis o uppe limb
mo emen in he bowing a m o s ing-playing musicians. Clinical Biomechanics
14, 426–433 (1999).
[47] Ancillao, A., Sa as ano, B., Galli, M. & Albe ini, G. Th ee dimensional mo ion
cap u e applied o iolin playing: A s udy on easibili y and cha ac e iza ion o
he mo o s a egy. Compu e Me hods and P og ams in Biomedicine 149, 19–
27 (2017). URL h ps://www.sciencedi ec .com/science/a icle/pii/
S0169260716312342.
[48] Eph a , A. e al. Looking o Lis en a he Cock ail Pa y: A Speake -
Independen Audio-Visual Model o Speech Sepa a ion. ACM T ansac ions
on G aphics 37, 1–11 (2018). URL h p://a xi .o g/abs/1804.03619.
A Xi :1804.03619 [cs].
[49] Yan, S., Xiong, Y. & Lin, D. Spa ial Tempo al G aph Con olu ional Ne wo ks
o Skele on-Based Ac ion Recogni ion. P oceedings o he AAAI Con e ence
on A i icial In elligence 32 (2018). URL h ps://ojs.aaai.o g/index.php/
AAAI/a icle/ iew/12328. Publishe : Associa ion o he Ad ancemen o
A i icial In elligence (AAAI).
[50] Sh ikuma , A., G eenside, P., Shche bina, A. & Kundaje, A. No Jus a Black
Box: Lea ning Impo an Fea u es Th ough P opaga ing Ac i a ion Di e ences
(2017). URL h p://a xi .o g/abs/1605.01713. A Xi :1605.01713 [cs].
[51] Sunda a ajan, M., Taly, A. & Yan, Q. Axioma ic A ibu ion o Deep Ne wo ks
(2017). URL h p://a xi .o g/abs/1703.01365. A Xi :1703.01365 [cs].