Final Degree Dissertation
Degree in Mathematics
Semi-Supervised Classification:
Mixture models and co-training
Author:
Bruno Álvarez Ortega
Supervisor:
Javier Cárcamo Urtiaga
June 2025
Contents
Introduction
0.1 The case of semi-supervised classification
Reflexion
1 Preliminaries
1.1 Supervised learning
1.2 Unsupervised learning
1.3 Classification error
2 Semi-Supervised Learning
2.1 Generative mixture models
2.2 Expectation maximization algorithm
2.3 Classification rule: log-probability ratio
2.4 Some caveats
2.5 Co-training
2.6 Hypotheses of co-training
3 Computational simulations
3.1 Introduction
3.2 First set of simulations
3.3 Second set of simulations
3.4 Example: effects of an incorrect model
3.5 A co-training simulation
4 Real life applications
4.1 Optical galaxy classification
4.2 Case study: train occupancy data
4.3 Self-training: a final semi-supervised approach
4.4 Conclusion
A R and Python code
Bibliography
Introduction
0.1 The case of semi-supervised classification
Traditionally, classification has been performed by supervised learning algorithms trained on manually labelled data to produce a good discriminator for future, unseen data. However, this data is often expensive to obtain in large quantities. With the advent of ever increasing access to large amounts of data, most of it unlabelled, it has become necessary to leverage this type of information effectively.
In this sense, semi-supervised classification aims to bridge the gap between supervised and unsupervised learning by developing classifiers that can make use of both the labelled and unlabelled data to improve classification performance. Another motivation for the understanding of semi-supervised learning is its relationship with human intelligence. For example, when in the early years of a child parents point to, for instance, a small animal and say “dog”, it is the combination of this labelled data and future passive (unlabelled) observations of a dog that informs a human being of what a dog is. This is studied by cognitive science, where classification algorithms inform how humans think and learn.
In this dissertation, we will explore semi-supervised classification, starting in chapter 1 by reviewing some concepts relevant to this work. In chapter 2, we lay down the theoretical groundwork of the two semi-supervised methods studied: mixture models and co-training.
In the second half, chapters 3 and 4 will put these two approaches into practice, with some simulations being performed in the former to test the limits of theory. Finally, in chapter 4, we apply this knowledge in three diverse case studies: galaxy classification, train occupancy analysis and sentiment recognition.
The goal of this dissertation is to research the evolving field of statistical learning through the perspective of the semi-supervised learning paradigm. All code developed for this work is provided in the folder CODE, with a brief guide in Appendix A.
Reflexion
One of the biggest sins that a young aspiring mathematician can commit is to ask about the ‘usefulness’ of the subject of study. The sometimes harsh response from the professors is not without some sense of truth. Sometimes knowledge by itself is valuable enough. But sooner rather than later every mathematician must reckon with one harsh reality: that the beautiful world of mathematics where everything makes sense, full of beautiful and rewarding theories, has a direct impact on our societies.
Indeed, as will become clear throughout this dissertation, the topic at hand, semi-supervised classification, can build towards automation, which will inevitably lead to more decent jobs and economic development (SDG 8). It also provides improvements in efficiency and understanding of the complex systems that we operate in, aiding in the development of green and sustainable industries and infrastructures (SDGs 9, 11). Classification is also used daily for medical diagnosis, contributing to the betterment of health and welfare of society (SDG 3). Increasingly, automatic classification has taken a more direct role in our lives. In Spain, for instance, classification algorithms are used to determine the level of risk that women victims of gender violence face, to provide them with the resources they need, with the intent of eradicating this type of symptom of gender inequality (SDGs 5, 10).
History is often written for us because we shrink from our burden as citizens. We must take responsibility for our knowledge and use it to pursue the goals outlined by the UN simply because it is the right thing to do. For sometimes, the risk of doing nothing becomes the greatest risk of all.
Chapter 1
Preliminaries
We begin by discussing some concepts related to statistical learning, in particular those concerning classification.
Definition 1.0.1. A data point or instance $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is the multivariate representation of each individual in a sample of size $n \in \mathbb{N}$. Instances may also be accompanied by a label, $y \in \mathcal{Y}$, that represents the class to which $x$ belongs. In general, we may also use the notation $\mathcal{X}$ to represent the set of instances.
The sample of input data of our models, whether it is labelled, unlabelled or partially labelled, is often referred to as the training sample, or $S$.
From the often vast amounts of data contained in these training samples, one would like to extract valuable knowledge. The means and ends of statistical learning vary greatly. For now, we discuss the two main paradigms in this field: supervised and unsupervised learning.
1.1 Supervised learning
The main task of supervised learning is classification: given a set $\mathcal{Y}$ of classes and an instance $x$, our goal is to classify this data point in one of those classes (also called populations), since we believe the features in $x$ have an influence on its class. In this context, all instances are labelled. The pairs $(x, y)$ represent the fundamental characteristic of this paradigm: it is supervised in the sense that labels are given by a “supervisor”. A typical example may be the following.
Example 1.1.1. All banks have at their disposal the records of past loans, that is, the data corresponding to the client: sex, age, marital status, job, income, deposits, etc. And, on the other hand, the result of the transaction: whether the bank made a profit (i.e. the loan plus interest were paid back) or if the client went bankrupt. Given a new client, from whom we know all the above-mentioned features, we are interested in classifying it in one of the two categories: low-risk, if the transaction will likely be successful,
2.1 Generative mixture models
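Due to the likelihood function being positive and the properties of the logarithm (continuous and monotonic) we always consider the log-likelihood.
Recall that in the context of semi-supervised learning, the training sample is $S = \{(x_1, y_1), \dots, (x_l, y_l), x_{l+1}, \dots, x_{l+u}\}$. Therefore, the log-likelihood function is defined as
\[
\begin{aligned}
\log L(S \mid \theta) &= \log\left( \prod_{i=1}^{l} p(x_i, y_i \mid \theta) \prod_{i=l+1}^{l+u} p(x_i \mid \theta) \right) \\
&= \sum_{i=1}^{l} \log p(x_i, y_i \mid \theta) + \sum_{i=l+1}^{l+u} \log p(x_i \mid \theta) \\
&= \sum_{i=1}^{l} \log\big( p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \big) + \sum_{i=l+1}^{l+u} \log p(x_i \mid \theta),
\end{aligned}
\tag{2.5}
\]
where the last equality is justified by the definition of the conditional probability. For the terms on the right, corresponding to the unlabelled data points, the marginal probabilities $p(x \mid \theta)$ are considered, which are, by the total probability theorem:
\[
p(x_i \mid \theta) = \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, p(x_i \mid y, \theta). \tag{2.6}
\]
Hence, combining (2.5) and (2.6), the log-likelihood function is as follows.
\[
\log L(S \mid \theta) = \sum_{i=1}^{l} \log\big( p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \big) + \sum_{i=l+1}^{l+u} \log\left( \sum_{y \in \mathcal{Y}} p(y \mid \theta)\, p(x_i \mid y, \theta) \right). \tag{2.7}
\]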
It is easy to see that the log-likelihood that we are trying to maximize differs from that of supervised learning only in the terms of the second sum. This is a critical difference, since unlike in such a learning paradigm, the optimization problem in semi-supervised learning is not necessarily a convex problem, which makes it more challenging to solve. In addition, the solution to the equation $\nabla \log L(S \mid \theta) = 0$ cannot, in general, be explicitly calculated, and therefore a numerical method is required.
In this context, the standard method for solving the MLE problem is the Expectation Maximization algorithm (EM), which finds a local maximum of the objective function presented in (2.7).
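To make the two sums in (2.7) concrete, the following is a minimal sketch that evaluates the semi-supervised log-likelihood. It assumes, purely for illustration, a univariate model with one Gaussian per class; the function names and the toy data are hypothetical and not taken from the dissertation's code.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def semi_supervised_log_likelihood(labelled, unlabelled, priors, means, sigmas):
    """Evaluate (2.7) for a univariate mixture with one Gaussian per class.

    labelled   : list of (x, y) pairs, y being the class index
    unlabelled : list of x values
    priors, means, sigmas : per-class parameters, indexed by class
    """
    total = 0.0
    # First sum: log of p(y_i | theta) p(x_i | y_i, theta) over labelled points.
    for x, y in labelled:
        total += math.log(priors[y] * normal_pdf(x, means[y], sigmas[y]))
    # Second sum: log of the marginal p(x_i | theta), as in (2.6).
    for x in unlabelled:
        marginal = sum(p * normal_pdf(x, m, s)
                       for p, m, s in zip(priors, means, sigmas))
        total += math.log(marginal)
    return total

# Hypothetical toy sample: two labelled and five unlabelled points.
labelled = [(-2.1, 0), (1.9, 1)]
unlabelled = [-1.7, -2.4, 2.2, 1.5, 0.1]
good = semi_supervised_log_likelihood(labelled, unlabelled, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
bad = semi_supervised_log_likelihood(labelled, unlabelled, [0.5, 0.5], [2.0, -2.0], [1.0, 1.0])
```

In this toy setting, swapping the two means leaves the unlabelled sum unchanged (the marginal is symmetric) and only the labelled terms tell the two parameter choices apart.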
2.2 Expectation maximization algorithm
Definition 2.2.1. Given a training sample $S = \{(x_1, y_1), \dots, (x_l, y_l), x_{l+1}, \dots, x_{l+u}\}$, we call the unknown labels of the unlabelled instances, $H = \{y_{l+1}, \dots, y_{l+u}\}$, hidden variables.
The expectation maximization algorithm is an iterative method which finds a local optimum of $\theta$, the model parameters. Given some initial parameter values, $\hat\theta^{(0)}$, the method repeats the following two steps until the variation in $\log L(S \mid \theta)$ is below a certain $\varepsilon$.
(i) Expectation step: The expected value of the hidden variables in $H$ is calculated. This can be thought of as “soft assignments” of the labels to instances $x_{l+1}, \dots, x_{l+u}$. Formally, we compute $p^{(t)}(H)$, which by definition is $p(H \mid S, \hat\theta^{(t)})$. In Algorithm 1, step (ii)(a), a formula for the computation of these probabilities is presented for the case of a two-component GMM.
(ii) Maximization step: These “soft labels”, $\hat H^{(t)}$, are used to update the parameters, by calculating the MLE of $\{(x_1, y_1), \dots, (x_l, y_l), (x_{l+1}, \hat y^{(t)}_{l+1}), \dots, (x_{l+u}, \hat y^{(t)}_{l+u})\}$. In other words, $\hat\theta^{(t+1)}$ is calculated such that it maximizes $Q(\theta \mid \hat\theta^{(t)}) = \mathbb{E}\left[\log p(S, \hat H^{(t)} \mid \theta)\right]$.
Depending on the initial conditions, $\hat\theta^{(0)}$, the local maximum that EM finds may vary. Usually, $\hat\theta^{(0)}$ is chosen as the MLE of the labelled data, called $S_{\text{labelled}}$. Equivalently,
\[
\hat\theta^{(0)} = \arg\max_{\theta} \log L(S_{\text{labelled}} \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{l} \log\big( p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \big). \tag{2.8}
\]
We now give a lemma necessary to prove an important result of the EM algorithm.
Lemma 2.2.1. (Gibbs' inequality) Let $\{p(x)\}_{x \in \mathcal{X}}$ and $\{q(x)\}_{x \in \mathcal{X}}$ be two discrete probability distributions. Then,
\[
-\sum_{x \in \mathcal{X}} p(x) \log q(x) \geq -\sum_{x \in \mathcal{X}} p(x) \log p(x). \tag{2.9}
\]
Gibbs' inequality is often stated this way because of its relation to the concept of entropy or quantity of information of a random variable. See [3, pp. 287 - 289] for more details and the corresponding proof.
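As a quick numerical illustration of Lemma 2.2.1, the sketch below compares both sides of (2.9) for two hand-picked discrete distributions. The distributions `p` and `q` are arbitrary assumptions chosen for the example; the gap between the two sides is the Kullback-Leibler divergence, which is non-negative.

```python
import math

def cross_entropy(p, q):
    """Left-hand side of (2.9): -sum_x p(x) log q(x), skipping terms with p(x) = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Right-hand side of (2.9): -sum_x p(x) log p(x)."""
    return cross_entropy(p, p)

# Two arbitrary distributions on a three-point space (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

# Gap between both sides of (2.9); it vanishes only when q equals p.
gap = cross_entropy(p, q) - entropy(p)
```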
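Theorem 2.2.2. The expectation maximization method improves the log-likelihood, $\log L(S \mid \theta)$, in every successive iteration.
Proof. By definition of conditional probability,
\[
p(H \mid S, \theta) = \frac{p(S, H \mid \theta)}{p(S \mid \theta)} \implies \log p(S \mid \theta) = \log p(S, H \mid \theta) - \log p(H \mid S, \theta).
\]
We consider the expected value over all possible $H$ using the estimation of the parameters of the current iteration. This is done through multiplication by $p(H \mid S, \hat\theta^{(t)})$ on both sides, and taking the sum,
\[
\log p(S \mid \theta) = \sum_{H} p(H \mid S, \hat\theta^{(t)}) \log p(S, H \mid \theta) - \sum_{H} p(H \mid S, \hat\theta^{(t)}) \log p(H \mid S, \theta).
\]
Note that the left-hand side stays the same since it does not depend on $H$ and the weights $p(H \mid S, \hat\theta^{(t)})$ sum to one. In the right-hand side of the equation, the first sum is what we have defined as $Q(\theta \mid \hat\theta^{(t)})$ (the expected value of the log-likelihood as a function of $\theta$, given the current estimate, $\hat\theta^{(t)}$) and minus the second sum, which we denote as $H(\theta \mid \hat\theta^{(t)})$. In this way, we have the following identity,
\[
\log p(S \mid \theta) = Q(\theta \mid \hat\theta^{(t)}) + H(\theta \mid \hat\theta^{(t)}). \tag{2.10}
\]
Since equality (2.10) is true for any value of $\theta$, we can make $\theta = \hat\theta^{(t+1)}$ and $\theta = \hat\theta^{(t)}$ to obtain, respectively,
\[
\log p(S \mid \hat\theta^{(t+1)}) = Q(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) + H(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) \quad \text{and} \tag{2.11}
\]
\[
\log p(S \mid \hat\theta^{(t)}) = Q(\hat\theta^{(t)} \mid \hat\theta^{(t)}) + H(\hat\theta^{(t)} \mid \hat\theta^{(t)}). \tag{2.12}
\]
Subtracting equation (2.12) from (2.11) yields
\[
\begin{aligned}
\log p(S \mid \hat\theta^{(t+1)}) - \log p(S \mid \hat\theta^{(t)}) ={}& Q(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) - Q(\hat\theta^{(t)} \mid \hat\theta^{(t)}) \\
&+ H(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) - H(\hat\theta^{(t)} \mid \hat\theta^{(t)}).
\end{aligned}
\tag{2.13}
\]
Using Lemma 2.2.1 (Gibbs' inequality) for the probability distributions $P = \{p(H \mid S, \hat\theta^{(t)})\}_{H}$ and $Q = \{p(H \mid S, \hat\theta^{(t+1)})\}_{H}$ we have that $H(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) \geq H(\hat\theta^{(t)} \mid \hat\theta^{(t)})$ and thus,
\[
\log p(S \mid \hat\theta^{(t+1)}) - \log p(S \mid \hat\theta^{(t)}) \geq Q(\hat\theta^{(t+1)} \mid \hat\theta^{(t)}) - Q(\hat\theta^{(t)} \mid \hat\theta^{(t)}). \tag{2.14}
\]
Finally, since in every iteration the maximization step calculates $\hat\theta^{(t+1)}$ to maximize $Q(\theta \mid \hat\theta^{(t)})$, the right-hand side of (2.14) is non-negative. Knowing that $\log p(S \mid \theta)$ is $\log L(S \mid \theta)$, we obtain the desired result.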
Below, we present the formulation of the EM algorithm for a two-component Gaussian mixture model (from now on GMM), as in Example 2.1.1.
We aim to estimate the prior probabilities, $\pi_0$, $\pi_1$, and the parameters of the two normal populations, $(\mu_0, \Sigma_0)$ and $(\mu_1, \Sigma_1)$. All in all, our model parameters are
\[
\theta = \{\pi_j, \mu_j, \Sigma_j\}_{j \in \{0,1\}}.
\]
Algorithm 1. EM for GMM:
Input: Sample $S = \{(x_1, y_1), \dots, (x_l, y_l), x_{l+1}, \dots, x_{l+u}\}$, and tolerance $\varepsilon > 0$.
(i) Initialization: Make $t = 0$ and $\hat\theta^{(0)} = \{\hat\pi^{(0)}_j, \hat\mu^{(0)}_j, \hat\Sigma^{(0)}_j\}_{j \in \{0,1\}}$, the MLE of the labelled data. For instance, $\hat\pi_j = \dfrac{|\{x_i \in S_{\text{labelled}} : y_i = j\}|}{|S_{\text{labelled}}|}$.
(ii) Iterate the following steps until convergence of $\log L(S \mid \theta)$ is achieved. Equivalently, the process is stopped if
\[
\log L(S \mid \hat\theta^{(t+1)}) - \log L(S \mid \hat\theta^{(t)}) \leq \varepsilon.
\]
(a) Expectation step: For all the unlabelled instances, $i \in \{l+1, \dots, l+u\}$, calculate using Bayes' rule,
\[
\gamma_{ij} := p(y_j \mid x_i, \hat\theta^{(t)}) = \frac{\hat\pi^{(t)}_j \, N(x_i; \hat\mu^{(t)}_j, \hat\Sigma^{(t)}_j)}{\sum_{k=0}^{1} \hat\pi^{(t)}_k \, N(x_i; \hat\mu^{(t)}_k, \hat\Sigma^{(t)}_k)}, \quad j = 0, 1.
\]
These values can be thought of as fractional labels estimated for the unlabelled data points.
On the other hand, for the labelled instances, consider the true assignments, that is, for $i \in \{1, \dots, l\}$,
\[
\gamma_{ij} = \begin{cases} 1, & \text{if } y_i = j, \\ 0, & \text{otherwise.} \end{cases}
\]
(b) Maximization step: Calculate $\hat\theta^{(t+1)}$, for $j \in \{0,1\}$, as the MLE of the training sample $S$ with the fractional labels $\gamma$.
\[
\begin{aligned}
l_j &= \sum_{i=1}^{l+u} \gamma_{ij}, \\
\hat\mu^{(t+1)}_j &= \frac{1}{l_j} \sum_{i=1}^{l+u} \gamma_{ij} x_i, \\
\hat\Sigma^{(t+1)}_j &= \frac{1}{l_j} \sum_{i=1}^{l+u} \gamma_{ij} (x_i - \hat\mu^{(t+1)}_j)(x_i - \hat\mu^{(t+1)}_j)^{\top}, \\
\hat\pi^{(t+1)}_j &= \frac{l_j}{l+u}.
\end{aligned}
\]
(iii) Update $t := t + 1$. Return to step (ii).
Output: MLE $\{\hat\pi_j, \hat\mu_j, \hat\Sigma_j\}_{j \in \{0,1\}}$.
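The steps of Algorithm 1 can be sketched in a few lines of code. The version below is a simplified illustration, not the dissertation's implementation: it assumes univariate data (so each $\Sigma_j$ reduces to a variance), adds a small variance floor `var_floor` as a numerical safeguard, and runs on synthetic data. All names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a univariate normal N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(labelled, unlabelled, pi, mu, var):
    """Semi-supervised log-likelihood (2.7) for the two-component model."""
    ll = sum(math.log(pi[y] * normal_pdf(x, mu[y], var[y])) for x, y in labelled)
    for x in unlabelled:
        ll += math.log(sum(pi[j] * normal_pdf(x, mu[j], var[j]) for j in (0, 1)))
    return ll

def em_gmm(labelled, unlabelled, eps=1e-8, max_iter=500, var_floor=1e-6):
    """Univariate sketch of Algorithm 1; needs labelled points from both classes."""
    xs = [x for x, _ in labelled] + list(unlabelled)
    # (i) Initialization: MLE computed from the labelled data only.
    pi, mu, var = [], [], []
    for j in (0, 1):
        pts = [x for x, y in labelled if y == j]
        m = sum(pts) / len(pts)
        pi.append(len(pts) / len(labelled))
        mu.append(m)
        var.append(max(sum((x - m) ** 2 for x in pts) / len(pts), var_floor))
    # True assignments for the labelled instances stay fixed.
    gamma_labelled = [[1.0 if y == j else 0.0 for j in (0, 1)] for _, y in labelled]
    history = [log_likelihood(labelled, unlabelled, pi, mu, var)]
    for _ in range(max_iter):
        # (a) Expectation step: fractional labels via Bayes' rule.
        gamma = list(gamma_labelled)
        for x in unlabelled:
            w = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in (0, 1)]
            gamma.append([w[0] / (w[0] + w[1]), w[1] / (w[0] + w[1])])
        # (b) Maximization step: weighted MLE over the whole sample.
        for j in (0, 1):
            l_j = sum(g[j] for g in gamma)
            mu[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / l_j
            var[j] = max(sum(g[j] * (x - mu[j]) ** 2 for g, x in zip(gamma, xs)) / l_j,
                         var_floor)
            pi[j] = l_j / len(xs)
        history.append(log_likelihood(labelled, unlabelled, pi, mu, var))
        # Stopping rule of step (ii).
        if history[-1] - history[-2] <= eps:
            break
    return pi, mu, var, history

# Synthetic example: four labelled points and 200 unlabelled ones.
rng = random.Random(1)
labelled = [(-2.5, 0), (-1.4, 0), (1.6, 1), (2.7, 1)]
unlabelled = ([rng.gauss(-2.0, 1.0) for _ in range(100)]
              + [rng.gauss(2.0, 1.0) for _ in range(100)])
pi_hat, mu_hat, var_hat, history = em_gmm(labelled, unlabelled)
```

The returned `history` stores the log-likelihood after each iteration, which makes it easy to check empirically the monotone behaviour stated in Theorem 2.2.2.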
The EM algorithm is an example, in the field of computer science, of a self-training algorithm, or an algorithm that teaches itself. This is because for the unlabelled instances, an estimate of their labels is calculated, which is then used to augment the estimation of the model parameters, using MLE as if the whole sample was labelled.
The problem of the EM algorithm not finding the global optimum of the log-likelihood but rather a local maximum can be dealt with in a number of different ways. One approach is what is often referred to as a random start, where the initial values for EM, $\hat\theta^{(0)}$, are randomly chosen. This process is repeated, and only the biggest log-likelihood achieved is considered. It is clear that this is just a heuristic approach and does not guarantee the optimal solution. Another approach would be utilizing numerical methods for solving unconstrained optimization problems, such as gradient descent, which again does not guarantee the global optimum if the solution space is not convex and would also require the random start methodology.
2.3 Classification rule: log-probability ratio
We recall that our original objective for generative mixture models was classification. Since any classification task where there are more than two classes can be reduced to the problem of binary classification, let us assume that $\mathcal{Y} = \{0, 1\}$, which we refer to as the set of the 0 and 1 class.
In order to classify each $x \in S$ or any new instance $x \in \mathbb{R}^d$, so as to assess the accuracy of classification, by the total probability theorem we have,
\[
p(x \mid y = 0)\, p(y = 0) + p(x \mid y = 1)\, p(y = 1) = p(x). \tag{2.15}
\]
This expression allows us to estimate the probability of an instance coming from the 1 class:
\[
p(y = 1 \mid x) = \frac{1}{\dfrac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1)} + 1}. \tag{2.16}
\]
Note that having estimated the model parameters using the expectation maximization method, both the prior probabilities and the class conditional distributions are known. It is also interesting to mention that formula (2.16) is similar to that of the logistic regression or logit model.
This is why, under a GMM, where $p(x \mid y = 0) = N(x \mid \mu_0, \Sigma_0)$ and $p(x \mid y = 1) = N(x \mid \mu_1, \Sigma_1)$, we can use formula (2.3) to expand (2.16) and achieve the following expression for the log-probability ratio:
\[
\begin{aligned}
\log\left( \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \right) ={}& \frac{1}{2}\left[ (x - \mu_0)^{\top} \Sigma_0^{-1} (x - \mu_0) - (x - \mu_1)^{\top} \Sigma_1^{-1} (x - \mu_1) \right] \\
&+ \frac{1}{2}\big( \log|\Sigma_0| - \log|\Sigma_1| \big) + \log p(y = 1) - \log p(y = 0).
\end{aligned}
\tag{2.17}
\]
The log-probability ratio allows us to give the following classification rule:
\[
\hat y = 1 \iff \log\left( \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \right) > 0. \tag{2.18}
\]
Furthermore, expression (2.17) gives a measure of the confidence of classification: the further the ratio is from 0, the more likely it is for the classification to be correct. We can justify this by considering the three main components of the log-probability ratio:
(i) $(x - \mu_0)^{\top} \Sigma_0^{-1} (x - \mu_0) - (x - \mu_1)^{\top} \Sigma_1^{-1} (x - \mu_1)$ is the difference of the squares of the Mahalanobis distance between the instance and each of the means. The greater the absolute value of this difference, the closer the instance is to one of the distributions compared to the other.
(ii) $\log|\Sigma_0| - \log|\Sigma_1|$, the difference of the generalized log-variances.
(iii) $\log p(y = 1) - \log p(y = 0)$, the difference between the log prior probabilities.
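The three components listed above translate directly into code. The sketch below assumes the univariate case, in which $\Sigma_j$ is a scalar variance, so (2.17) needs no matrix algebra; the function names and parameter values are illustrative assumptions.

```python
import math

def log_probability_ratio(x, pi, mu, var):
    """Univariate version of (2.17): log p(y=1|x) - log p(y=0|x)."""
    # (i) Difference of squared Mahalanobis distances to each mean.
    mahalanobis = (x - mu[0]) ** 2 / var[0] - (x - mu[1]) ** 2 / var[1]
    # (ii) Difference of the log-variances.
    log_variances = math.log(var[0]) - math.log(var[1])
    # (iii) Difference of the log prior probabilities.
    log_priors = math.log(pi[1]) - math.log(pi[0])
    return 0.5 * mahalanobis + 0.5 * log_variances + log_priors

def classify(x, pi, mu, var):
    """Classification rule (2.18): predict class 1 when the ratio is positive."""
    return 1 if log_probability_ratio(x, pi, mu, var) > 0 else 0

def posterior_class_one(x, pi, mu, var):
    """p(y = 1 | x) as in (2.16), written through the log-ratio."""
    return 1.0 / (1.0 + math.exp(-log_probability_ratio(x, pi, mu, var)))

# Illustrative parameters: equal priors, unit variances, means at -2 and 2.
pi, mu, var = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
ratio_at_half = log_probability_ratio(0.5, pi, mu, var)
```

With these symmetric parameters the ratio reduces to $4x$, so the decision boundary sits at $x = 0$ and the absolute value of the ratio grows linearly with the distance to it.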
2.4 Some caveats
We now have a clear semi-supervised method that uses the unlabelled data to improve the accuracy of classification. Nevertheless, one must be cautious about the correctness of the model or the underlying hypothesis: that the data is actually generated by the mixture model that is being considered. In other words, should the number of components (which may not necessarily be $|\mathcal{Y}|$), the prior probabilities, or the conditional probability distributions $p(x \mid y)$ be incorrect, the accuracy of the predictor might be less than if only labelled data was used in a traditional supervised learning context. In chapter 3, some simulations will be presented to illustrate this point more precisely.
On the other hand, domain knowledge is useful in order to consider a generative model. For example, image analysis or medical trials are fields where a simple statistical analysis shows populations typically follow a Gaussian distribution. In these contexts, a Gaussian Mixture Model with the appropriate number of components would be adequate.
2.5 Co-training
Co-training [11] is another important semi-supervised classification method. It is specially suitable for what is broadly referred to as natural language processing. In particular, we will concentrate on named entity classification, which is a task that involves classifying a proper name into one of multiple classes depending on its meaning. Consider the following illustrative example.
Example 2.5.1. Suppose we are interested in classifying Wikipedia articles into one of the two following categories: humans or places, the first ones being biographical accounts of some person, and the latter being about geographical spaces. Consider that we are given as instances $x = (x^{(1)}, x^{(2)})$, where $x^{(1)}$ is the title of the article and $x^{(2)}$ is an excerpt of the abstract. One training sample $S$ could be,
Instance   x(1)                 x(2)                                   y
1          Joseph Fourier       ...mathematician and physicist...      person
2          Corsica              ...island in the Mediterranean Sea...  place
3          Leonhard Euler       ...mathematician, ..., astronomer...   ???
4          Madagascar           ...is an island country...             ???
5          Leonardo da Vinci    ...astronomer and architect...         ???
...        ...                  ...                                    ...
As always, annotated data is difficult to obtain because it requires manual labour (in this case we only have two labelled instances) but we have plenty of unlabelled instances at our disposal. A simple co-training classifier that uses both the name of the article ($x^{(1)}$) and the context (given by $x^{(2)}$) to learn named entity classification would proceed as follows:
(i) From instance 1 we learn that “mathematician” and “physicist” appear in the context of the label “person”. The same idea applies to Corsica: the classifier learns that “island” corresponds to a place.
(ii) Knowing this, we are able to classify Madagascar as a place, as it has “island” in its context.
(iii) Similarly, we can assign the class “person” to Leonhard Euler, and this in turn allows us to learn “astronomer” to be associated with “person”.
(iv) Finally, this would enable the classification of Leonardo da Vinci as a “person” even though neither the name nor the context were present in the annotated data.
As in the example, we consider each instance $x$ to be described by two feature sets, also called views, $(x^{(1)}, x^{(2)}) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. This is often the case for real-world data. Take, for instance, content moderation in a social media platform like YouTube, where each video is described by its meta-data (title, description, etc.) and the content of the video itself. YouTube's algorithm may decide whether a video is suitable for recommendation based on these two views.
Formally, a co-training algorithm is an ensemble method that uses two distinct classifiers, $f^{(1)}$ and $f^{(2)}$, which are initially trained only with the labelled instances, taking into account solely the views $x^{(1)}$ and $x^{(2)}$, respectively. The most confident predictions of each classifier are added to the labelled data of the other. In this way, both classifiers teach one another.
Algorithm 2. Co-Training
Input data:
\[
S = \{(x^{(1)}_1, y_1), (x^{(2)}_1, y_1), \dots, (x^{(1)}_l, y_l), (x^{(2)}_l, y_l), x^{(1)}_{l+1}, x^{(2)}_{l+1}, \dots, x^{(1)}_{l+u}, x^{(2)}_{l+u}\},
\]
and $k \in \mathbb{N}$, with $k \leq l + u$, the learning speed.
(i) Consider the training data sets of classifiers $f^{(1)}$ and $f^{(2)}$ to be, respectively,
\[
L_1 = \{(x^{(1)}_1, y_1), \dots, (x^{(1)}_l, y_l)\}, \quad L_2 = \{(x^{(2)}_1, y_1), \dots, (x^{(2)}_l, y_l)\}.
\]
(ii) While $S \setminus (L_1 \cup L_2) \neq \emptyset$ perform the following steps:
(a) Train $f^{(1)}$ from $L_1$ and $f^{(2)}$ from $L_2$.
(b) Use both $f^{(1)}$ and $f^{(2)}$ to classify instances from $S \setminus (L_1 \cup L_2)$.
(c) For each classification of the previous instances made by $f^{(1)}$ and $f^{(2)}$, a confidence value is assigned. Consider $\{x^{(1)}_{i_j}, \hat y^{(1)}_{i_j}\}_{j=1}^{k}$ and $\{x^{(2)}_{i'_j}, \hat y^{(2)}_{i'_j}\}_{j=1}^{k}$ to be the $k$ most confident predictions of $f^{(1)}$ and $f^{(2)}$, respectively. Add these to the training sample of the other classifier:
• $L_1 = L_1 \cup \{(x^{(1)}_{i'_j}, \hat y^{(2)}_{i'_j})\}_{j=1}^{k}$.
• $L_2 = L_2 \cup \{(x^{(2)}_{i_j}, \hat y^{(1)}_{i_j})\}_{j=1}^{k}$.
Classifiers $f^{(1)}$ and $f^{(2)}$ only pay attention to their corresponding views, but the training instances may be given by the other classifier, hence the notation $\hat y^{(2)}_{i'_j}$ and $\hat y^{(1)}_{i_j}$, respectively, in the updated training samples. Notice also that we are not interested in the specific nature of each supervised learner, but in the way in which they can be combined to improve classification in the context of semi-supervised learning, when only a few annotated instances with two views are available. The only requirement for $f^{(1)}$ and $f^{(2)}$ is to be able to assign a confidence or accuracy to their predictions. $f^{(1)}$ and $f^{(2)}$ are often referred to as view-1 and view-2 classifiers, respectively.
Co-training is one of many semi-supervised learning methods that utilize the “disagreements” between the two classifiers trained on a smaller, fully labelled data set, and re-train them until they agree on a larger sample, using the unlabelled data.
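The loop of Algorithm 2 can be sketched with two deliberately simple view classifiers. Everything below is a hypothetical illustration: `KeywordClassifier` is a toy word-count learner, words of three letters or fewer are ignored as a crude stop-word filter, and the articles are made up so that each view has something to teach the other. As a deviation from Algorithm 2, the sketch only adds predictions with positive confidence and stops when neither classifier has any, which avoids feeding arbitrary labels to the other learner.

```python
from collections import Counter

def tokens(text):
    """Lower-case words longer than three letters (crude stop-word filter)."""
    return [w for w in text.lower().split() if len(w) > 3]

class KeywordClassifier:
    """Toy single-view learner: counts how often each word appears with each label."""
    def __init__(self):
        self.counts = {}

    def fit(self, examples):
        self.counts = {}
        for text, label in examples:
            self.counts.setdefault(label, Counter()).update(tokens(text))
        return self

    def predict(self, text):
        """Return (label, confidence); confidence is the winning share of word matches."""
        words = tokens(text)
        scores = {label: sum(c[w] for w in words) for label, c in self.counts.items()}
        total = sum(scores.values())
        if total == 0:
            return None, 0.0
        label = max(scores, key=scores.get)
        return label, scores[label] / total

def top_k(clf, pool, view, k):
    """The k most confident predictions of clf on the given view (confidence > 0)."""
    scored = []
    for inst in pool:
        label, conf = clf.predict(inst[view])
        if conf > 0:
            scored.append((conf, inst, label))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(inst, label) for _, inst, label in scored[:k]]

def co_train(labelled, unlabelled, k=1):
    """labelled: list of ((x1, x2), y); unlabelled: list of (x1, x2)."""
    L1 = [(x1, y) for (x1, _), y in labelled]
    L2 = [(x2, y) for (_, x2), y in labelled]
    pool = list(unlabelled)
    f1, f2 = KeywordClassifier(), KeywordClassifier()
    while pool:
        f1.fit(L1)                                   # step (a)
        f2.fit(L2)
        picks1 = top_k(f1, pool, view=0, k=k)        # steps (b) and (c)
        picks2 = top_k(f2, pool, view=1, k=k)
        if not picks1 and not picks2:
            break                                    # nothing confident left
        L1.extend((inst[0], label) for inst, label in picks2)  # f2 teaches f1
        L2.extend((inst[1], label) for inst, label in picks1)  # f1 teaches f2
        chosen = {inst for inst, _ in picks1 + picks2}
        pool = [inst for inst in pool if inst not in chosen]
    return f1.fit(L1), f2.fit(L2), pool

# Made-up articles: (title, excerpt of the abstract).
labelled = [(("Joseph Fourier", "french mathematician physicist"), "person"),
            (("Corsica Island", "territory in the mediterranean"), "place")]
unlabelled = [("Leonhard Euler", "swiss mathematician astronomer"),
              ("Madagascar Island", "large country africa"),
              ("Iceland", "nordic country")]
f1, f2, remaining = co_train(labelled, unlabelled, k=1)
```

In this run the title classifier recognises “island” in “Madagascar Island” and passes its abstract to the context classifier, which thereby learns “country” and can then label “Iceland”; the title classifier alone could not have done so.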
2.6 Hypotheses of co-training
The first necessary condition in order to apply co-training is for the instances to have two views. Besides the contexts in which this happens naturally, instances can be arbitrarily split into two views. Nevertheless, for co-training to be successful in any of these cases, two main hypotheses are usually considered: views $x^{(1)}$ and $x^{(2)}$ must be sufficient and redundant (for the classification task) and conditionally independent [12].
The sufficient and redundant hypothesis asks for both views to be sufficiently informative. In other words, that a good classifier can be trained solely on each of $x^{(1)}$ and $x^{(2)}$. On the other hand, views must be conditionally independent. That is,
\[
\begin{aligned}
p(x^{(1)} \mid y, x^{(2)}) &= p(x^{(1)} \mid y), \\
p(x^{(2)} \mid y, x^{(1)}) &= p(x^{(2)} \mid y).
\end{aligned}
\tag{2.19}
\]
This means that given the label $y$, knowledge of one of the views does not affect the probability of the other. To illustrate this hypothesis, let us recall our example, in which view $x^{(1)}$ was the title of a Wikipedia article, and $x^{(2)}$ a fragment of its abstract. Fix the label $y$ = “person” and consider for example the context $x^{(2)}$ = “mathematician and physicist”. The conditional independence hypothesis implies that this context does not favour any particular title or name, given the true label $y$ = “person”.
This hypothesis, proposed by Blum and Mitchell in 1998 (see [5, pp. 92 - 100]), is generally overly strict and usually does not hold, even with large data sets. This is obvious in our example, since $x^{(2)}$ = “...French emperor...” heavily influences the article name. Nonetheless, it is intuitive to understand why this hypothesis is considered, even if it is not always true. If conditional independence did not generally hold and some $x^{(2)}$ were learnt by $f^{(2)}$ to be associated with a particular class, since this classifier teaches $f^{(1)}$ by adding the corresponding view to its training data, we would risk adding less and less informative instances to $f^{(1)}$, as a result of these being overly “similar” to each other, hence defeating the purpose of improving classification.
Therefore, a relaxation of this hypothesis is proposed by Abney in [1], in which an upper bound of the classification error is given in terms of the rate of disagreement of the two learners, as we have mentioned before. To understand this, we need to introduce the following concepts: Let $\mathcal{X}_1$ be the space where $x^{(1)}$ belongs. Consider $\mathcal{H}_1$ to be all possible classifiers from $\mathcal{X}_1$ to $\mathcal{Y}$. Analogously, for the second view we have $\mathcal{X}_2$ and $\mathcal{H}_2$.
Definition 2.6.1. Given $y \in \mathcal{Y} = \{0, 1\}$, $f^{(1)} \in \mathcal{H}_1$ and $f^{(2)} \in \mathcal{H}_2$, we say that classifiers $f^{(1)}$ and $f^{(2)}$ are conditionally independent if for all $u, v \in \mathcal{Y}$:
\[
\begin{aligned}
p(f^{(1)} = u \mid f^{(2)} = v, y) &= p(f^{(1)} = u \mid y), \\
p(f^{(2)} = u \mid f^{(1)} = v, y) &= p(f^{(2)} = u \mid y).
\end{aligned}
\tag{2.20}
\]
Remember that $f^{(1)} \in \mathcal{H}_1$ is a view-1 only classifier, i.e., $f^{(1)} : \mathcal{X}_1 \to \mathcal{Y}$, and analogously for $f^{(2)}$. Both concepts of conditional independence are related to each other by the following result.
Proposition 2.6.1. If views $x^{(1)}$, $x^{(2)}$ are conditionally independent, then $f^{(1)}$ and $f^{(2)}$ are conditionally independent.
Proof. By definition of conditional independence of views $x^{(1)}$ and $x^{(2)}$, given $y, u, v \in \mathcal{Y}$,
\[
\begin{aligned}
p(f^{(1)} = u \mid f^{(2)} = v, y) &= p\big( \{x^{(1)} : f^{(1)}(x^{(1)}) = u\} \mid \{x^{(2)} : f^{(2)}(x^{(2)}) = v\}, y \big) \\
&= p\big( \{x^{(1)} : f^{(1)}(x^{(1)}) = u\} \mid y \big) \\
&= p(f^{(1)} = u \mid y).
\end{aligned}
\]
As we have observed, this hypothesis is generally unreasonable, though useful. We now introduce a measure of how our classifiers deviate from the conditional independence hypothesis.
Definition 2.6.2. The conditional dependence of $f^{(1)}$ and $f^{(2)}$ given $y$ is
\[
d_y = \frac{1}{2} \sum_{u, v \in \{0,1\}} \left| p(f^{(2)} = v \mid y, f^{(1)} = u) - p(f^{(2)} = v \mid y) \right|. \tag{2.21}
\]
Therefore, if $f^{(1)}$ and $f^{(2)}$ are conditionally independent then $d_y = 0$. This notion will allow us to relax the conditional independence hypothesis by allowing $d_y$ to be bounded.
Definition 2.6.3. For $y \in \mathcal{Y} = \{0, 1\}$, let $p_1 = \min_{u \in \mathcal{Y}} \{p(f^{(1)} = u \mid y)\}$, $p_2 = \min_{u \in \mathcal{Y}} \{p(f^{(2)} = u \mid y)\}$ and $q_1 = 1 - p_1$. Then, $f^{(1)}$ and $f^{(2)}$ satisfy weak dependence if:
\[
d_y \leq p_2 \, \frac{q_1 - p_1}{2 p_1 q_1}. \tag{2.22}
\]
Note that if $p_1 = 1/2$, then $d_y = 0$ and we recover conditional independence. Indeed, $p_1$ and $p_2$ cannot be greater than $1/2$ by definition. In particular, $p_1$ serves as a measure of how much conditional dependence we can have and is often referred to as the minority probability of classifier $f^{(1)}$: the lower $p_1$ is, the greater $d_y$ can be. Let the minority value of a classifier be the label where the minority probability is achieved.
Another useful concept that will become important later on is the rate of disagreement of classifiers $f^{(1)}$ and $f^{(2)}$, which is simply $p(f^{(1)} \neq f^{(2)}) = p(\{(x^{(1)}, x^{(2)}) \mid f^{(1)}(x^{(1)}) \neq f^{(2)}(x^{(2)})\})$.
Definition 2.6.4. Predictors $f^{(1)}$ and $f^{(2)}$ are said to be non-trivial if $\min_{u \in \mathcal{Y}} \{p(f^{(1)} = u)\} > p(f^{(1)} \neq f^{(2)})$.
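To get a feeling for Definitions 2.6.2 and 2.6.3, the sketch below computes $d_y$, the weak dependence bound (2.22) and a disagreement rate from a $2 \times 2$ table of joint prediction probabilities. The table layout `joint[u][v]` $= p(f^{(1)} = u, f^{(2)} = v \mid y)$, the function names and the numeric tables are assumptions made for this illustration; every row of the table is assumed to have positive mass.

```python
def conditional_dependence(joint):
    """d_y of (2.21), with joint[u][v] = p(f1 = u, f2 = v | y)."""
    total = 0.0
    for u in (0, 1):
        p_u = joint[u][0] + joint[u][1]           # p(f1 = u | y), assumed > 0
        for v in (0, 1):
            p_v = joint[0][v] + joint[1][v]       # p(f2 = v | y)
            total += abs(joint[u][v] / p_u - p_v)  # |p(f2=v | y, f1=u) - p(f2=v | y)|
    return 0.5 * total

def weak_dependence_bound(joint):
    """Right-hand side of (2.22): p2 (q1 - p1) / (2 p1 q1)."""
    p1 = min(joint[0][0] + joint[0][1], joint[1][0] + joint[1][1])
    p2 = min(joint[0][0] + joint[1][0], joint[0][1] + joint[1][1])
    q1 = 1.0 - p1
    return p2 * (q1 - p1) / (2.0 * p1 * q1)

def disagreement(joint):
    """Probability that both classifiers differ, read off the same kind of table."""
    return joint[0][1] + joint[1][0]

# Independent predictions: the table is the product of its marginals.
independent = [[0.06, 0.14], [0.24, 0.56]]
# Dependent predictions with the same marginals (0.2, 0.8) and (0.3, 0.7).
dependent = [[0.10, 0.10], [0.20, 0.60]]

d_independent = conditional_dependence(independent)
d_dependent = conditional_dependence(dependent)
bound = weak_dependence_bound(dependent)
```

For the dependent table, $d_y = 0.25$ stays below the bound $0.5625$, so this pair of classifiers would satisfy weak dependence even though conditional independence fails.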
These simulations underscore the importance of semi-supervised learning techniques, in particular the mixture model approach, to enhance classification performance when plenty of unlabelled data is available and a basic knowledge of the underlying assumptions can be leveraged effectively.
3.3 Second set of simulations
We now test the robustness of the semi-supervised methodology when the underlying mixture model assumptions are incorrect, to analyse how its performance decays and compares to the supervised learning techniques.
Figure 3.3
Consider the data represented in the histogram in Figure 3.3. It is a mixture model of two distinct distributions with weight 1/2 each: the left-most component is a Student's t with 10 degrees of freedom centred at $\mu_0 = -2$, while the second component is a skew normal distribution with parameters $\xi = 2$, $\omega = 1.5$ and $\alpha = 2$, meaning it is a slightly asymmetric (controlled by parameter $\alpha$) normal distribution centred around 3.
Even though the model assumptions of a two-component Gaussian mixture model are not correct, we would be forgiven for trying to fit such a model to the data. Indeed, if we estimate the model parameters using the EM algorithm for a two-component Gaussian mixture model and use them to perform the standard Kolmogorov-Smirnov normality test on our data, we obtain a p-value of 0.9471. Thus, we are unable to reject normality.
As in the previous section, in order to assess the behaviour of the semi-supervised approach under failure of the model hypotheses, we perform 100 simulations where the data was randomly generated from the mixture model presented above. One hundred points of each class were generated in each trial, for a total sample size of $n = 200$, along with two randomly selected labelled instances from each component.
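One trial of this design can be generated along the following lines. This is a sketch under stated assumptions rather than the code used for the reported results: it relies only on Python's standard `random` module, draws the Student's t variate as a normal divided by the square root of a scaled chi-squared, and draws the skew normal through its standard two-normal representation. Function names and the seed are hypothetical.

```python
import math
import random

def sample_student_t(df, loc, rng):
    """Shifted Student's t: loc + Z / sqrt(V / df), Z ~ N(0,1), V ~ chi-squared(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    return loc + z / math.sqrt(v / df)

def sample_skew_normal(xi, omega, alpha, rng):
    """Skew normal via xi + omega * (delta |Z0| + sqrt(1 - delta^2) Z1)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    z0 = abs(rng.gauss(0.0, 1.0))
    z1 = rng.gauss(0.0, 1.0)
    return xi + omega * (delta * z0 + math.sqrt(1.0 - delta * delta) * z1)

def simulate_trial(n_per_class=100, n_labelled_per_class=2, seed=None):
    """Return (labelled pairs, unlabelled values, true labels of the unlabelled)."""
    rng = random.Random(seed)
    classes = [
        [sample_student_t(10, -2.0, rng) for _ in range(n_per_class)],
        [sample_skew_normal(2.0, 1.5, 2.0, rng) for _ in range(n_per_class)],
    ]
    labelled, unlabelled, truth = [], [], []
    for y, points in enumerate(classes):
        # Indices of the instances whose label is revealed.
        chosen = set(rng.sample(range(n_per_class), n_labelled_per_class))
        for i, x in enumerate(points):
            if i in chosen:
                labelled.append((x, y))
            else:
                unlabelled.append(x)
                truth.append(y)  # kept aside only to compute error rates
    return labelled, unlabelled, truth

labelled, unlabelled, truth = simulate_trial(seed=0)
```

Repeating `simulate_trial` with 100 different seeds would reproduce the structure of the experiment (4 labelled and 196 unlabelled points per trial), though not necessarily the exact figures reported here.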
The results were as follows. Using LDA trained on the four labelled instances (the rest is considered unlabelled), the average decision boundary of the 100 trials was 0.45, with a standard deviation of 0.66, indicating, as commented before, that classification with LDA is highly dependent on the specific labelled instances chosen for training. Nonetheless, the average classification error was 4.68%†. SVM yielded similar results when trained on the very same labelled points, with the mean boundary at −0.051. This method was also more stable, for the standard deviation of the decision boundary was just 0.12. Contrary to the previous simulations, classification error with SVM was slightly worse at 5.68%.
Considering now the aforementioned semi-supervised approach based on the EM algorithm, a two-component GMM has been assumed to have generated the data. Furthermore, in all trials EM was initialized with the means at −1 and 1, a standard deviation of 1 for both components, and equal prior probabilities. The summary of the average results found by EM is as follows.
$\hat\mu_0$    $\hat\mu_1$    $\hat\sigma_0$    $\hat\sigma_1$    $\hat\pi_0$    $\hat\pi_1$
−2.309         2.852          1.332             1.332             0.452          0.548
Table 3.2: Average estimated parameters of the GMM over all simulations.
The already discussed cluster-then-label strategy provided an average classification error rate of 4.55%, marginally better than the LDA approach. This indicates that the probability distributions in these simulations are close enough to normality that, even under the failure of the model correctness hypothesis, the method still boosted classification performance.
In conclusion, failure of the model correctness hypothesis does not necessarily hurt, in general, the model's performance. Nevertheless, let us see an example where this does indeed happen.
3.4 Example: effects of an incorrect model
As mentioned, incorrect assumptions in the generative model need not be catastrophic, but can actually, under certain circumstances, hurt its performance when compared to traditional supervised classification methods. Let us understand this phenomenon with the following example.
Below, in Figure 3.4a, a sample of 200 unlabelled data points is represented. Clearly, two clusters in the form of two ellipses are visible. To the right, in Figure 3.4b, the true class of each point is displayed with blue dots for the zero class and with red triangles for the one class.
†Error rates in this section were calculated by labelling the data based on the estimated decision boundary of each method and comparing the result to the true labels.
(a) Sample data. (b) Two classes in four clusters.
Figure 3.4
Knowledge of the true underlying model allows us to identify four distinct bi-dimensional normal distributions that have generated the points from the two populations (red and blue) in the two clusters. Without this prior insight into the model, though, we could reasonably assume that the data comes from a two-component normal mixture model, when in reality there are four normal distributions that have generated it, as in Figure 3.4b.
In spite of the model not being correct, it is the most reasonable because it is the one with the highest log-likelihood, even when compared to the two-component model corresponding to the true populations. This is shown in Figure 3.5, where the ellipses in the left diagram represent the assumed model, which indeed has a greater log-likelihood compared to that of the actual model represented in the right-hand side diagram.
(a) Assumed model. Log-likelihood: −849.0514. (b) Correct model. Log-likelihood: −1029.025.
Figure 3.5
The assumed model, a two-component GMM, was estimated using a purely unsupervised EM algorithm. This model fits the unlabelled data reasonably well, with an estimated log-likelihood of -849.051. The two ellipses visible in Figure 3.5a are centred at the estimated means of each of the populations and correspond to the calculated covariance matrices. On the contrary, the correct model in Figure 3.5b actually adapts worse to the data, and hence the log-likelihood is lower, at -1029.025.
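The comparison above can be reproduced in spirit with a small, standard-library sketch. It is not the author's simulation: the four component centres, the spread `sd` and the seed are hypothetical, and instead of running EM the two candidate models are fitted directly by maximum likelihood on the cluster grouping (assumed model) and on the class grouping (true populations).

```python
import math
import random

def sample_points(n_per_component=50, sd=0.5, seed=0):
    """Draw points from four spherical Gaussians (hypothetical centres)."""
    rng = random.Random(seed)
    # (centre, class label, cluster id): each cluster mixes both classes.
    components = [((-2.5, -1.5), 1, 0), ((-1.5, -2.5), 0, 0),
                  ((1.5, 2.5), 1, 1), ((2.5, 1.5), 0, 1)]
    data = []
    for (mx, my), label, cluster in components:
        for _ in range(n_per_component):
            data.append((rng.gauss(mx, sd), rng.gauss(my, sd), label, cluster))
    return data

def fit_gaussian(points):
    """Maximum-likelihood mean and 2x2 covariance of a list of (x, y)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def log_density(x, y, mean, cov):
    """Log of the bivariate normal density at (x, y)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x - mean[0], y - mean[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def mixture_log_likelihood(data, group_index):
    """Log-likelihood of a two-component GMM whose components are fitted
    to the two groups defined by data[i][group_index]."""
    params = []
    for g in (0, 1):
        pts = [(p[0], p[1]) for p in data if p[group_index] == g]
        params.append((len(pts) / len(data), *fit_gaussian(pts)))
    total = 0.0
    for x, y, *_ in data:
        total += math.log(sum(w * math.exp(log_density(x, y, m, c))
                              for w, m, c in params))
    return total

data = sample_points()
ll_clusters = mixture_log_likelihood(data, group_index=3)  # assumed model
ll_classes = mixture_log_likelihood(data, group_index=2)   # true populations
print(f"cluster-based GMM log-likelihood: {ll_clusters:.1f}")
print(f"class-based GMM log-likelihood:   {ll_classes:.1f}")
```

Under these assumptions the cluster-based model attains the higher log-likelihood, because each of its components is much more compact than the elongated components needed to cover one class across both clusters.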
In actuality, however, depending on the labelled data provided to our classifier (the assumed model), the estimated decision boundary will approximately be y = -x, which would produce an approximate classification error of 0.5, making this classifier virtually useless. This phenomenon, where the semi-supervised approach would approximately learn such a linear classifier, is represented in Figure 3.6a. To the right, the true decision boundary based on the correct model assumptions is displayed.
Figure 3.6: (a) Decision boundary at y = -x, for an approximate error of 0.5. (b) Correct decision boundary at y = x. This is the Bayes classifier.
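The error of roughly 0.5 can be checked with a short sketch. The data layout is an assumption made for illustration (four spherical Gaussians with hypothetical centres, arranged so that each cluster contains both classes); it is not the sample used for the figures. The two rules compared are the cluster-separating boundary y = -x and the boundary y = x.

```python
import random

def sample_points(n_per_component=50, sd=0.5, seed=0):
    """Four spherical Gaussians: two per cluster, one per class in each cluster."""
    rng = random.Random(seed)
    # Hypothetical centres, each paired with its true class.
    components = [((-2.5, -1.5), 1), ((-1.5, -2.5), 0),
                  ((1.5, 2.5), 1), ((2.5, 1.5), 0)]
    return [(rng.gauss(mx, sd), rng.gauss(my, sd), label)
            for (mx, my), label in components
            for _ in range(n_per_component)]

def error_rate(data, rule):
    """Fraction of points whose predicted class differs from the true one."""
    return sum(rule(x, y) != label for x, y, label in data) / len(data)

def cluster_rule(x, y):
    """Boundary y = -x: separates the two clusters, not the two classes."""
    return int(x + y > 0)

def diagonal_rule(x, y):
    """Boundary y = x: class 1 lies above the diagonal in this layout."""
    return int(y > x)

data = sample_points()
err_cluster = error_rate(data, cluster_rule)
err_diagonal = error_rate(data, diagonal_rule)
print(f"error with boundary y = -x: {err_cluster:.3f}")
print(f"error with boundary y =  x: {err_diagonal:.3f}")
```

Swapping the labels assigned by `cluster_rule` does not help: every cluster holds half of each class, so either orientation misclassifies about half of the points.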
It is therefore evident through this example that a typical supervised classifier, like an SVM, even when utilizing the same labelled instances, would have a similar or even better classification performance. This is especially true if a few more labelled instances were provided. In conclusion, the semi-supervised approach does not guarantee greater classification accuracy than its supervised peers, but requires domain knowledge or a basic understanding of the generative model to boost performance.
Remark 3.4.1. When domain knowledge is not available or the generative model is unclear, it is useful to employ a semi-supervised non-parametric approach to estimate the density functions of the different sub-populations.
Indeed, if we use the function mvnpEM from the R package mixtools, we can apply this new non-parametric EM algorithm to our original unlabelled data set to better identify the true sub-populations. This kernel-based EM algorithm originally searches for 6 sub-populations in the data. The results are shown below, in Figure 3.7.
Figure 3.7: Four mixture components are found by the non-parametric EM.
When the process is completed, the algorithm is able to assign every point to one of four sub-populations, which clearly overlap with the true classes in Figure 3.4b. A few centrally located labelled data points in each of the mixture components would then allow us to classify with an accuracy close to that of the Bayes classifier.
3.5 A co-training simulation
Let us now analyse how to enhance classification performance using the co-training framework studied in Section 2.5. In this context, instances have two views, or sets of features. Consider supervised classifiers $h^{(1)}$ and $h^{(2)}$ that use view-1 and view-2 only, respectively. We will refer to them as our "base" classifiers, since they are only trained on the original labelled samples, say $L_1 = \{(x_1^{(1)}, y_1), \ldots, (x_l^{(1)}, y_l)\}$ and $L_2 = \{(x_1^{(2)}, y_1), \ldots, (x_l^{(2)}, y_l)\}$, respectively. Our goal is to make use of all the initially unlabelled data in our sample, $\{x_{l+1}^{(1)}, x_{l+1}^{(2)}, \ldots, x_{l+u}^{(1)}, x_{l+u}^{(2)}\}$, to co-train two new, better classifiers, $f^{(1)}$ and $f^{(2)}$.
In our simulation, 1000 normally distributed data points with two views were generated, 500 for each of the two classes. For the 0 class, the points from the first view follow the random variable $X_{y=0,\,\text{view-1}} \sim N(-1.5, 1)$, while those of the second view are drawn from $X_{y=0,\,\text{view-2}} \sim N(-1, 1.2)$. Similarly, data from the 1 class is such that $X_{y=1,\,\text{view-1}} \sim N(1.5, 1)$ and $X_{y=1,\,\text{view-2}} \sim N(1, 1.2)$. All in all, the bi-dimensional points are centred at $(-1.5, -1)$ for the 0 class and at $(1.5, 1)$ for the 1 class.
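A minimal sketch of how such a two-view sample could be generated is given below. It is only an illustration: the function names and the seed are hypothetical, and the second parameter of each normal distribution is treated as a standard deviation, which is an assumption since the text does not say whether 1 and 1.2 are standard deviations or variances.

```python
import random

def generate_two_view_data(n_per_class=500, seed=0):
    """Return a shuffled list of (view1, view2, label) tuples."""
    rng = random.Random(seed)
    # Class -> (mean of view-1, mean of view-2).
    means = {0: (-1.5, -1.0), 1: (1.5, 1.0)}
    data = []
    for label, (m1, m2) in means.items():
        for _ in range(n_per_class):
            # Spreads 1 and 1.2 are used as standard deviations (assumption).
            data.append((rng.gauss(m1, 1.0), rng.gauss(m2, 1.2), label))
    rng.shuffle(data)
    return data

def class_centroid(data, label):
    """Empirical centre of the bi-dimensional points of one class."""
    pts = [(x1, x2) for x1, x2, y in data if y == label]
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

data = generate_two_view_data()
c0 = class_centroid(data, 0)
c1 = class_centroid(data, 1)
print(f"class 0 centroid: ({c0[0]:.2f}, {c0[1]:.2f})")
print(f"class 1 centroid: ({c1[0]:.2f}, {c1[1]:.2f})")
```

The empirical centroids should lie close to (-1.5, -1) and (1.5, 1), matching the class centres stated above.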
Figure 3.8: Generated synthetic data.
Figure 3.8 above displays the 1000 data points with their two univariate views acting as features. From this plot, it is clear that the two classes can be separated by a linear classifier. For each of the two views, the density function of the generated data is shown in the margins, separated by class; these densities correspond to the normal distributions considered earlier.
In order to better assess classification performance, 20% of the data was set apart as the test sample. Of the remaining data, 5% was considered labelled, with the other 95% of training instances being unlabelled. Both the "base" classifiers $h^{(1)}, h^{(2)}$ and the co-trained classifiers $f^{(1)}, f^{(2)}$ were chosen to use the usual LDA supervised classification technique. Indeed, initially, $h^{(1)} = f^{(1)}$ and $h^{(2)} = f^{(2)}$.
Co-training was performed with a learning speed k = 10, as described in Algorithm 2, where the k most confident predictions of each classifier are added (with their projected labels) to the training sample of the other. Then, each classifier is re-trained. The algorithm stops when there are no longer any unlabelled instances or, additionally, if the classification accuracy of $f^{(1)}$ and $f^{(2)}$ degrades or does not improve for 3 iterations. The final results, obtained after only 4 iterations, are summarized in Table 3.3.
Classifier           Accuracy
Base view-1          0.885
Base view-2          0.775
Co-trained view-1    0.890
Co-trained view-2    0.800
Table 3.3: Final results of the four classifiers.
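The procedure described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the author's R code nor a reproduction of Algorithm 2: the class `LDA1D`, the helper names, the seed and the reading of 1 and 1.2 as standard deviations are all assumptions, and the accuracy-based stopping rule is replaced by a fixed number of iterations (`n_iter=4`). Its output will therefore not match Table 3.3 exactly.

```python
import math
import random

def make_data(n_per_class=500, seed=0):
    """Shuffled (view1, view2, label) tuples; spreads treated as SDs (assumption)."""
    rng = random.Random(seed)
    means = {0: (-1.5, -1.0), 1: (1.5, 1.0)}
    data = [(rng.gauss(m1, 1.0), rng.gauss(m2, 1.2), y)
            for y, (m1, m2) in means.items() for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

class LDA1D:
    """Two-class LDA on one feature: class means, priors, pooled variance."""
    def fit(self, xs, ys):
        n = len(xs)
        self.mean, self.prior = {}, {}
        for c in (0, 1):
            vals = [x for x, y in zip(xs, ys) if y == c]
            self.mean[c] = sum(vals) / len(vals)
            self.prior[c] = len(vals) / n
        ss = sum((x - self.mean[y]) ** 2 for x, y in zip(xs, ys))
        self.var = ss / (n - 2)
        return self

    def prob1(self, x):
        """Posterior probability of class 1 (logistic of the LDA score)."""
        m0, m1 = self.mean[0], self.mean[1]
        logit = ((m1 - m0) * x - (m1 ** 2 - m0 ** 2) / 2) / self.var
        logit += math.log(self.prior[1] / self.prior[0])
        logit = max(-50.0, min(50.0, logit))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-logit))

    def predict(self, x):
        return int(self.prob1(x) > 0.5)

def fit_view(sample, view):
    return LDA1D().fit([p[view] for p in sample], [p[2] for p in sample])

def accuracy(model, view, sample):
    return sum(model.predict(p[view]) == p[2] for p in sample) / len(sample)

def co_train(labelled, pool, k=10, n_iter=4):
    """Each classifier labels its k most confident pool points for the other."""
    train = [list(labelled), list(labelled)]
    pool = list(pool)
    models = [fit_view(train[v], v) for v in (0, 1)]
    for _ in range(n_iter):
        for v in (0, 1):
            if not pool:
                break
            pool.sort(key=lambda p: abs(models[v].prob1(p[v]) - 0.5), reverse=True)
            chosen, pool = pool[:k], pool[k:]
            train[1 - v] += [(p[0], p[1], models[v].predict(p[v])) for p in chosen]
        models = [fit_view(train[v], v) for v in (0, 1)]
    return models

data = make_data()
test, rest = data[:200], data[200:]                  # 20% held out for testing
labelled = rest[:40]                                 # 5% of the remaining 800
pool = [(p[0], p[1]) for p in rest[40:]]             # true labels are discarded
base = [fit_view(labelled, v) for v in (0, 1)]
co = co_train(labelled, pool)
base_acc = [accuracy(base[v], v, test) for v in (0, 1)]
co_acc = [accuracy(co[v], v, test) for v in (0, 1)]
for v in (0, 1):
    print(f"view-{v + 1}: base {base_acc[v]:.3f}, co-trained {co_acc[v]:.3f}")
```

Selecting the k most confident points regardless of class keeps the sketch short; a variant that picks a fixed number of confident points per class would guard against drifting class proportions in the training samples.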
The following diagram, Figure 3.9, shows the progress in accuracy of the classifiers throughout the four iterations. The dotted lines correspond to the base classifiers, which remain unchanged, while the co-trained classifiers are represented by solid lines.
Figure 3.9: Accuracy of the four classifiers from iteration 0, where $h^{(1)} = f^{(1)}$ and $h^{(2)} = f^{(2)}$, to iteration four. Base classifiers $h^{(1)}$ and $h^{(2)}$ remain unchanged because they are trained on the original labelled sample.
Both the view-1 and view-2 classifiers trained with co-training progressively improve their accuracy, by 0.5 and 2.5 percentage points, respectively. The unimpressive improvements in classification accuracy are due to the fact that the two views, by themselves, are informative enough to train a good base classifier. In addition, neither of the two views is more informative than the other and they are not conditionally independent, so co-training is only able to move performance slightly closer to the theoretical maximum. Nevertheless, this demonstrates the power of co-training to boost classification performance under the right assumptions and framework.
Chapter 4
Real life applications
4.1 Optical galaxy classification
Gravity binds galaxies together to form what astronomers call galaxy clusters. The number of galaxies in these clusters can range from hundreds to thousands. Nevertheless, apart from clustering galaxies in physical space, they can also be clustered in colour space based on their observed frequency in the electromagnetic spectrum. This is related to the concept of "redshift", a gravitational phenomenon predicted by the theory of general relativity that states that photons emitted from the centre of galaxies lose energy and therefore become redder in the spectrum.
This optical clustering is of interest to our dissertation, since the colour of a galaxy can be used to understand its nature. Most galaxies can be classified in one of the two following categories: red sequence (RS) galaxies, which generally have low star formation activity and are usually elliptical, and blue cloud (BC) galaxies, which are often of spiral form and are producing new stars. Galaxies that do not belong to either of the two aforementioned categories - like ours, the Milky Way - live in the "green valley".
Star formation activity is measured via the specific star formation rate, or sSFR. Under low redshift (meaning for galaxies that are relatively close), the colour of a galaxy is a good proxy for the sSFR. As discussed, RS galaxies usually have low sSFR, while BC galaxies generally have high sSFR. What is interesting is that the distribution of the specific star formation rate across galaxies is of a bi-modal nature, which suggests the use of a two-component Gaussian mixture model to discriminate between RS and BC galaxies. In this way, we can use the colour of galaxies to classify them into the RS class, and thus with low sSFR, or the BC class, associated with high sSFR.
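To make the bimodality argument concrete, the following sketch fits a two-component, one-dimensional Gaussian mixture by EM and classifies a value by its posterior probability. Everything numerical here is hypothetical: the data are synthetic draws standing in for a log-sSFR-like quantity, not SDSS measurements, and the function names are illustrative only.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, n_iter=200):
    """EM for a two-component 1-D GMM; returns (weights, means, sds)."""
    ordered = sorted(xs)
    # Crude initialisation: lower and upper quartiles as starting means.
    means = [ordered[len(xs) // 4], ordered[3 * len(xs) // 4]]
    centre = sum(xs) / len(xs)
    sd0 = math.sqrt(sum((x - centre) ** 2 for x in xs) / len(xs))
    sds, weights = [sd0, sd0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            d = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = d[0] + d[1]
            resp.append((d[0] / total, d[1] / total))
        # M-step: weighted updates of weights, means and standard deviations.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds

def posterior_low(x, weights, means, sds):
    """Posterior probability that x belongs to the low-mean component."""
    d = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
    return d[0] / (d[0] + d[1])

rng = random.Random(0)
# Synthetic bimodal sample: a "low" mode and a "high" mode (illustrative values).
xs = [rng.gauss(-12.0, 0.4) for _ in range(300)] + \
     [rng.gauss(-10.0, 0.4) for _ in range(700)]
weights, means, sds = em_two_gaussians(xs)
print("weights:", [round(w, 2) for w in weights])
print("means:  ", [round(m, 2) for m in means])
print("P(low component | x = -11.8):", round(posterior_low(-11.8, weights, means, sds), 3))
```

A point would then be assigned to the RS-like (low) component when this posterior exceeds 0.5, or some other threshold chosen for the application.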
Indeed, astronomers observe this phenomenon in space, where small components of the RS, low-sSFR galaxies exist between wide components of the high-sSFR, BC galaxies.
Let us illustrate these concepts with real-life observations of galaxies collected from SkyServer's Sloan Digital Sky Survey, or SDSS. A data set containing 44,829 galaxies in the low redshift environment was obtained with precise spectroscopic (colour) measurements along with an estimation of their specific star formation rate, sSFR. Each individual galaxy was represented as a bi-dimensional data point in Figure 4.1 using the colour indexes g-r and r-i as features. The colouring of each point was calculated using the sSFR estimation in its appropriate astronomical form (see [4]). Galaxies with a low specific star formation rate were coloured in red, while those with a high sSFR are in blue.
Figure 4.1: Low redshift sample of 44,829 galaxies. Two clusters are clearly visible.
Figure 4.1 demonstrates that the colour of galaxies is indeed correlated with their star formation activity. The two distinct clusters, corresponding to the blue cloud (BC) and the red sequence (RS) galaxies, indicate that classifying galaxies into one of the above classes with a two-component GMM, using the colours as features, is useful to determine the sSFR level.
The reference paper of this section [4] outlines a new algorithm, called Red Dragon. This algorithm slices the low redshift environments into further subgroups, applies a GMM to obtain a parametrization of the components using the colour information of each galaxy, and then uses interpolation over the discrete mixture model parameters to obtain a continuous GMM across redshift. Classification of each individual galaxy is done, as discussed in Chapter 2, using the posterior probabilities of each of the components, usually assigning it to the component with the greatest probability or based on some threshold.
However, Red Dragon is a much more sophisticated algorithm since it uses an additional two colour indexes, for a feature space of four dimensions. Each galaxy is therefore represented by a colour vector denoted as $c_i$.
In addition, it considers a "noise" covariance matrix for each galaxy, $\Delta_i$, to account for the intrinsic errors in the measurements of the different colour bands. All in all, the likelihood of the K-component Gaussian mixture model is
$$
L(\theta \mid S) = \prod_{i=1}^{N_{\text{gal}}} \sum_{k=1}^{K} L_k(\theta_k \mid c_i), \tag{4.1}
$$
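where $N_{\text{gal}}$ is the total number of galaxies and $L_k(\theta_k \mid c_i)$ is the likelihood of the $i$-th galaxy in the $k$-th component, which is equal to
$$
L_k(\theta_k \mid c_i) = \frac{\pi_k}{\sqrt{(2\pi)^{d}\,|\Sigma_k + \Delta_i|}} \exp\left(-\frac{1}{2}(c_i - \mu_k)^{\top}(\Sigma_k + \Delta_i)^{-1}(c_i - \mu_k)\right), \tag{4.2}
$$
where $d$ is the dimension of the colour vector $c_i$.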
After Red Dragon has estimated the model's parameters continuously across redshift, galaxy classification is performed by assigning each galaxy to the class that maximizes the membership probability, i.e.,
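$$
\hat{\alpha} = \underset{\alpha = 1, \ldots, K}{\arg\max}\; \frac{L_\alpha(\theta_\alpha \mid c_i)}{\sum_{k=1}^{K} L_k(\theta_k \mid c_i)}. \tag{4.3}
$$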
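The membership rule can be illustrated with a small sketch that evaluates the weighted component likelihoods with a per-galaxy noise covariance added to each component covariance, then normalizes them. It is not Red Dragon's implementation: it uses two colour indexes instead of four so that the 2x2 matrix algebra can be written out explicitly, and every parameter value and function name is hypothetical.

```python
import math

def component_likelihood(c, weight, mean, cov, noise):
    """Weighted bivariate normal likelihood of colour vector c,
    with covariance cov + noise (2x2 matrices as nested tuples)."""
    a = cov[0][0] + noise[0][0]
    b = cov[0][1] + noise[0][1]
    d = cov[1][1] + noise[1][1]
    det = a * d - b * b
    dx, dy = c[0] - mean[0], c[1] - mean[1]
    # Quadratic form (c - mu)^T (Sigma + Delta)^{-1} (c - mu) for a 2x2 matrix.
    quad = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return weight * math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def membership(c, noise, components):
    """Return (index of the most probable component, membership probabilities)."""
    likes = [component_likelihood(c, w, m, s, noise) for w, m, s in components]
    total = sum(likes)
    probs = [value / total for value in likes]
    return probs.index(max(probs)), probs

# Hypothetical two-component model: index 0 plays the role of RS, index 1 of BC.
components = [
    (0.4, (0.90, 0.40), ((0.004, 0.001), (0.001, 0.003))),
    (0.6, (0.60, 0.30), ((0.020, 0.005), (0.005, 0.010))),
]
noise = ((0.001, 0.0), (0.0, 0.001))  # per-galaxy measurement covariance

label_a, probs_a = membership((0.88, 0.39), noise, components)
label_b, probs_b = membership((0.55, 0.28), noise, components)
print("galaxy A -> component", label_a, [round(p, 3) for p in probs_a])
print("galaxy B -> component", label_b, [round(p, 3) for p in probs_b])
```

Because the noise covariance enters every component, a galaxy with larger measurement errors tends to receive less extreme membership probabilities than a precisely measured one at the same position.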
Red Dragon was tested by the authors on the data set represented in Figure 4.1 to estimate the model parameters of a two-component GMM in the 4-dimensional colour space already discussed. Figure 4.2 shows the results, as in the previous figure, on the g-r and r-i axes, where the colour represents the estimated probability of each galaxy being from the RS (red sequence) class of galaxies, with those with very low probability (and hence likely to be from the BC class) in blue.
Figure 4.2: From "Red Dragon: a redshift-evolving Gaussian mixture model for galaxies" [4].
Because the clusters in the diagram overlap with those in Figure 4.1 corresponding to the low and high specific star formation rates, we can conclude that classification based on the above GMM is a good predictor, as suggested earlier, of the level of sSFR.
4.4 Conclusion
Throughout this dissertation we have studied the theoretical framework of semi-supervised classification, with particular attention to the mixture model and co-training approaches. After performing a number of simulations to test the behaviour of these techniques, we have seen a number of real life use cases, ranging from galaxy classification to the analysis of train occupancy data.
Semi-supervised learning is still an under-developed field which would benefit from more rigorous theoretical foundations. In addition, this learning paradigm is subject to the same challenge as all others in machine learning, namely the pace of advancement in the field, which demands constant innovation and adaptation.
Nevertheless, semi-supervised classification has proven to be an extremely useful learning paradigm in the current context, where large quantities of unlabelled data are available to enhance classical supervised classifiers.
The process of developing this work has been extremely rewarding, as it has been a great excuse to dive deep into topics that were out of the scope of this degree, and to read sources that were of great interest both for this dissertation and, in a broader sense, for my development as a scientist and as a human.
Appendix A
R and Python code
R and Python were used extensively, especially for chapters 3 and 4. To maintain brevity, the code is not included in this dissertation and is provided separately in the CODE folder. We provide a basic guide to navigate through these files.
• 3.1 Introduction - gMM.R
• 3.2 First set of simulations - Simulation 1.R
• 3.3 Second set of simulations - Simulation 2.R
• 3.4 Example: effects of an incorrect model - Simulation 3.R
• 3.5 A co-training simulation - Simulation CoTraining.R
• 4.1 Optical galaxy classification - Galaxies (folder)
  – galaxies.R
  – SDSS low z sample.csv (data base)
• 4.2 Case study: train occupancy data - Train occupancy analysis (folder)
  – train data.R
  – Loading Data.csv (data base)
• 4.3 Self-training: a final semi-supervised approach - Self-Training (folder)
  – main.py
  – plot.py
  – dataset.csv (data base)
Bibliography
[1] Abney, S. (2002) Bootstrapping. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002, pp. 360-367.
[2] Benaglia, T., Chauveau, D., Hunter, D.R., Young, D.S. (2009) "mixtools: An R Package for Analyzing Finite Mixture Models." Journal of Statistical Software, 32(6), 1–29. http://www.jstatsoft.org/v32/i06/.
[3] Brémaud, P. (2017) Discrete Probability Models and Methods, Springer.
[4] Black, W.K., Evrard, A. (2022) Red Dragon: a redshift-evolving Gaussian mixture model for galaxies, Monthly Notices of the Royal Astronomical Society, Volume 516, Issue 1, October 2022, Pages 1170–1182, https://doi.org/10.1093/mnras/stac2052.
[5] Blum, A., Mitchell, T. (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, Association for Computing Machinery.
[6] Bouguila, N., Fan, W. (Eds.) (2020) Mixture Models and Applications, Springer Nature Switzerland.
[7] Helmbold, D., Williamson, B. (Eds.) (2001) 14th and 5th Annual Conference on Computational Learning Theory, Amsterdam, The Netherlands, July 16-19, 2001, Proceedings.
[8] Jo, T. (2021) Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning, Springer.
[9] Presno, M. A. (2023) Policía predictiva y prevención de la violencia de género: el sistema VioGén. Revista de los Estudios de Derecho y Ciencia Política, UOC, November 2023.
[10] Sasaki, T. (2022) Semi-supervised classification on a text dataset, Kaggle, www.kaggle.com/code/sasakitetsuya/semi-supervised-classification-on-a-text-dataset
[11] Schwenker, F., Trentin, E. (Eds.) (2011) Partially Supervised Learning, First IAPR TC3 Workshop, PSL 2011, Ulm, Germany, September 15-16, Revised Selected Papers, Springer.
[12] Zhou, Z.-H. (2013) Unlabeled Data and Multiple Views. In: Partially Supervised Learning: Second IAPR International Workshop, PSL 2013, Nanjing, China, May 13-14, 2013.
[13] Zhu, X., Goldberg, A. B. (2009) Introduction to Semi-Supervised Learning, Morgan & Claypool.