
Hierarchical information representation and efficient classification of gene expression microarray data

Author: Bosio, Mattia
Publisher: Universitat Politècnica de Catalunya
Year: 2014
DOI: 10.5821/dissertation-2117-95353
Source: https://upcommons.upc.edu/bitstream/2117/95353/1/TMB1de1.pdf
Universitat Politècnica de Catalunya
Hierarchical information representation and efficient
classification of gene expression microarray data
PhD Thesis
Student:
Mattia Bosio
Thesis advisors:
Philippe Salembier
Albert Oliveras Vergés
2014
Doctoral thesis assessment record
Academic year: 2013/2014
Full name
MATTIA BOSIO
To all my family.
"I have always thirsted for knowledge, I have always been full of questions"
Hermann Hesse,
Siddhartha
Summary
In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and create a global picture of cellular function. Microarrays allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. Even though microarrays are a consolidated research technology nowadays and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS), an optimum method for sample classification has not been found yet.
Mic oa ay classica ion is a complica ed ask, no only due o he high dimensionali y o
he ea u e se , bu also o an appa en lack o da a s uc u e. This cha ac e is ic limi s he
applicabili y o p ocessing echniques, such as wa ele l e ing o o he l e ing echniques ha
ake ad an age o known s uc u al ela ion. On he o he hand, i is well known ha genes a e
no exp essed independen ly om o he each o he : genes ha e a high in e dependence ela ed
o he in ol ed egula ing biological p ocess.
This thesis aims to improve the current state of the art in microarray classification and to contribute to understanding how signal processing techniques can be developed and applied to analyze microarray data. Building a classification framework requires exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The algorithms and classification frameworks developed in this thesis tackle the problem with two essential building blocks. The first one deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second key element is a proper feature selection tool to produce a precise classifier as an output and to reduce the overfitting risk.
The main focus of this thesis is binary data classification, a field in which we obtained relevant improvements to the state of the art. The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm from the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric used to merge the features or the way two features are merged. Moreover, the possibility of including external sources of information from publicly available biological knowledge and ontologies to improve the structure generation has been studied too. Regarding feature selection, two alternative approaches have been studied: the first one is a modification of the IFFS algorithm as a wrapper feature selection, while the second approach involves an ensemble learning focus. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements into the selection process, like a reliability measure and a scoring system to better select the best feature at each iteration. The second feature selection approach is based on ensemble learning, taking advantage of the feature abundance of microarrays to implement a different selection scheme. New algorithms have been studied in this field, adapting state of the art algorithms to the small sample size and high feature number characteristic of microarray data.
In addition to the binary classification problem, the multiclass case has been addressed too. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions.
All the algorithms studied throughout this thesis have been evaluated using high quality publicly available data, following established testing protocols from the literature to offer a proper benchmarking against the state of the art. Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results.
Resumen
In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and to produce a global representation of cellular function. Microarrays make it possible to analyze the expression of many genes in a single experiment, quickly and efficiently. Although microarrays are a consolidated research technology today and the trend is towards newer technologies such as Next Generation Sequencing (NGS), an optimal method for sample classification has not yet been found.
La clasicación de mues as de mic oa ay es una a ea complicada, debido al al o núme o
de a iables y a la al a de es uc u a en e los da os. Es a ca ac e ís ica impide la aplicación
de écnicas de p ocesado que se basan en elaciones es uc u ales, como el l ado con wa ele
u o as écnicas de l ado. Po o o lado, los genes no se exp esen independien emen e unos de
o os: los genes es án in e - elacionados según el p oceso biológico que les egula.
The objective of this thesis is to improve the state of the art in microarray classification and to contribute to understanding how signal processing techniques can be designed and applied to analyze microarrays. Building a classification algorithm requires a study in which existing algorithms are tested and adapted to the analyzed data. The algorithms developed in this thesis approach the problem with two essential blocks. The first one tackles the lack of structure by deriving a binary tree with unsupervised clustering tools. The second fundamental element, needed to obtain precise classifiers while reducing the risk of overfitting, is a feature selection step.
The main task in this thesis is binary data classification, in which we have obtained relevant improvements over the state of the art. The first step is the generation of a structure, for which the Treelets algorithm available in the literature has been used. Multiple alternatives to this original algorithm have been proposed and evaluated, changing the similarity metrics or the merging rules used during the process. In addition, the possibility of using external information sources, such as ontologies of biological information, to improve the inference of the structure has been studied. Two different approaches to feature selection have been studied: the first is a modification of the IFFS algorithm and the second uses an ensemble learning scheme. The IFFS algorithm has been adapted to the characteristics of microarrays to obtain better results, adding elements such as a reliability measure and a scoring system to select the best feature at each iteration. The ensemble-based method takes advantage of the abundance of features in microarrays to implement a different selection. Different algorithms have been studied in this field, adapting existing alternatives to the small number of samples and the high number of variables that are typical of microarrays.
The classification problem with more than two classes has also been addressed by studying a new algorithm that combines multiple binary classifiers. The proposed algorithm takes advantage of the redundancy offered by multiple classifiers to obtain more reliable predictions.
All the algorithms proposed in this thesis have been evaluated with high quality public data, following protocols established in the literature in order to offer a reliable comparison with the state of the art. Whenever possible, Monte Carlo simulations have been applied to improve the robustness of the results.
Acknowledgments
I would like to thank my thesis advisors, Philippe Salembier and Albert Oliveras Vergés, who accompanied me through these years with encouragement, advice and guidance in my growth as a researcher. I also want to thank all the guys from the GPI group, thanks to whom I felt among friends. Special thanks go to Pau Bellot, with whom I shared several meetings and who gave relevant feedback on the research process.
4-12 Pseudocode of the AID algorithm.
4-13 Mean MCC results comparison with state of the art results from [112] and from Section 4.2.2.
4-14 Mean MCC results comparison among all the tested alternatives of classifier and nonexpert condition. The values are the mean across the MAQC datasets.
5-1 Toy example of a small knowledge database matrix where each row is a different gene while columns are attributes. Black dots represent that a gene has a specific attribute.
5-2 Toy example of the adopted ranking scheme using only two biological relevance analysis tools combined with Borda count.
5-3 Score comparison with results from [84] on datasets D and E from MAQC datasets. All the algorithms are sorted by increasing final score, the black line. The best result is the one with the smallest overall score, which is G-pd , consistently with the obtained results over a wide selection of datasets.
6-1 Example of OAO and OAA in a three classes problem with their associated classification boundaries.
List of Tables
4.1 Microarray datasets used for classification.
4.2 MAQC mean MCC and mean Accuracy results.
4.3 Mean results adopting the lexicographic scoring scheme.
4.4 Mean results adopting the exponential penalization scoring scheme.
4.5 Mean results adopting the linear combination scoring scheme.
4.6 Statistical properties of the Monte Carlo simulation.
4.7 Results of the study based on synthetic data. The three subtables correspond to the three different data distributions. Each subtable is organized showing the values depending on the skewness value and the different size of the training set. The Train column contains the size of the training set, the MCC column shows the mean MCC value across the different experimental conditions and Monte Carlo iterations, while the Std and #F columns contain the MCC standard deviation and the mean number of selected features respectively.
4.8 Mean MCC results from Monte Carlo simulation on MAQC datasets. The two algorithms differ in the metagene generation rule, PCA versus Haar basis decomposition.
4.9 MCC results comparing the studied AID and Kun algorithms.
4.10 Mean MCC results comparing the alternatives in terms of nonexpert notation and adopted classifier.
5.1 Biological similarity measures formulas. For each measure the original formula and its adapted version for continuous variables are presented.
5.2 Comparison of the obtained MCC statistics on MAQC datasets.
5.3 Results from the biological evaluation of the gene signatures and the global ranking results.
6.1 Example of the ECOC representation of One Against All (OAA) classification in a 4 class case. Each bit is the output of a classifier separating one class from the rest.
6.2 Code table for the OAA+PAA approach in a four classes scenario. There are four codewords of 10 bits, corresponding to the OAA case plus one bit for each class pair.
6.3 Brief microarray datasets description.
6.4 Experimental prediction error rates over the seven datasets.
Chapter 1
Introduction
The work developed in this thesis lies in the field of automatic microarray data analysis and fits well the National Institutes of Health (NIH) definition of bioinformatics:
"Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data."
More in detail, this research work consists in developing a novel, global approach, with which high-throughput data like microarrays can be classified. To this end, signal processing techniques have been developed, applied and evaluated to improve the current results within the microarray analysis field. In [110], the usefulness of signal processing techniques in the bioinformatics field is well described:
"The recent development of high-throughput molecular genetics technologies has brought a major impact to bioinformatics and systems biology. These technologies have made possible the measurement of the expression profiles of genes and proteins in a highly parallel and integrated fashion. The examination of the huge amounts of genomic and proteomic data holds the promise of understanding the complex interactions between genes and proteins, the functional processes of a cell, and the impact of various factors on a cell, and ultimately, of enabling the design of new technologies for intelligent management of diseases.
. . .
The importance of signal processing techniques is due to their important role in extracting, processing, and interpreting the information contained in genomic and proteomic data. It is our hope that signal processing methods will lead to new advances and insights in uncovering the structure, functioning and evolution of biological systems."
Signal processing techniques are key in the analysis process, since the problems to solve in microarray data analysis are similar to problems already faced in the telecommunication-related signal processing field (e.g. analysis and compression of large data, noise cancellation, pattern detection, feature selection and classification). Moreover, a vast literature already exists in which a whole plethora of algorithms from the signal processing world are taken, modified and adapted to the analysis of high-throughput data such as microarrays. This thesis work aims to further improve the application of signal processing techniques to the analysis of a widely adopted tool like microarrays.
The main tasks treated in this thesis are the classification of incoming samples (e.g. to determine whether a microarray sample represents a person with a certain disease type or not), the relevant feature extraction of a microarray set (e.g. to identify the most discriminating genes between two classes) and the improvement of results interpretability from a biological point of view.
The developed techniques and tools focus on building a hierarchical data representation of the gene expression data able to produce useful features for classification, either using only the numerical information from microarrays, or by including previous biological knowledge to ease the interpretation of results and to increase the biological coherence of the generated structure. Algorithms have been developed for the binary classification problem, which is by far the most studied task in classification. In this area, properly tuned feature selection algorithms have been developed and tested to take into account the microarray data characteristics. Multiclass classification has also been considered by developing a novel ensemble classification technique combining multiple binary classifiers to obtain a more robust sample classification.
Microarrays are an important and well established technology in the biomedical research field, developed to allow researchers to gather a very large number of gene expressions simultaneously. By measuring the mRNA level, the state of a cell can be determined and inferences about phenomena inside the cell can be made [138]. In each microarray experiment, a large number of gene expressions are measured, typically tens of thousands, with a relatively small sample number. Microarrays are an extreme example of sample scarcity, or of high dimensionality of the feature set, and this is a critical issue during the data analysis step.
The first publication using microarrays for cancer classification is from Golub in 1999 [52], where a gene subset with large mean value difference between classes and small variance within each class was selected from the initial dataset and used as a predictor classifier. Since then, a wide variety of learning approaches have been proposed for microarray data analysis, for example data normalization and correction, classification or regulatory network identification.
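
The selection criterion described above (large difference between class means, small within-class variance) can be illustrated with a short sketch. It assumes the criterion is expressed as a signal-to-noise score, the difference of the class means divided by the sum of the class standard deviations; the function names, the toy gene names and the toy values are hypothetical and are not taken from [52] or from this thesis.

```python
from statistics import mean, stdev


def signal_to_noise(class_a_values, class_b_values):
    """Score one gene: difference of class means over the sum of class spreads."""
    spread = stdev(class_a_values) + stdev(class_b_values)
    if spread == 0:
        raise ValueError("both classes are constant; the score is undefined")
    return (mean(class_a_values) - mean(class_b_values)) / spread


def rank_genes(expression, labels, top_k):
    """Rank genes by absolute score; return the top_k names and all scores.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, y in zip(values, labels) if y == 0]
        class_b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = signal_to_noise(class_a, class_b)
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:top_k], scores


# Hypothetical toy data: three genes measured on six samples (three per class).
toy_expression = {
    "gene_up": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],
    "gene_flat": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
    "gene_noisy": [1.0, 9.0, 5.0, 2.0, 8.0, 5.0],
}
toy_labels = [0, 0, 0, 1, 1, 1]
top_genes, toy_scores = rank_genes(toy_expression, toy_labels, top_k=1)
print(top_genes)
```

Only the gene whose class means are well separated relative to its within-class spread receives a large score; the flat and the noisy gene score close to zero.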
1.1 Microarray Data
In he eld o compu a ional biology, mic oa ays a e used o gene exp ession p oling,
which is he measu emen o he ac i i y ( he exp ession) o housands o genes a once,
o c ea e a global pic u e o cellula unc ion.
Microarrays allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. They represent a major methodological advance and are a powerful research tool, used by scientists to try to understand fundamental aspects of growth and development as well as to explore the underlying genetic causes of many human diseases.
Microarray data are usually visualized with the help of a heat map, like the example shown in Figure 1-1, in which genes are arranged as columns, while each row represents a sample. In Figure 1-1, the samples are sorted by their classes: the first 32 rows are from one class while the last 30 are from another. In the adopted color scheme, red values indicate a high gene expression level, while blue values indicate a low gene expression level. The heat map gives a visual summary of the collected genetic information and, at the same time, illustrates well the problem to be faced: there is too much information without any associated knowledge to easily discriminate between classes.
Mic oa ay classica ion is a complica ed ask, no only due o he high dimensionali y
o he ea u e se , bu also o an appa en lack o da a s uc u e. E en i da a a e p esen ed
3
Figu e 1-1:
Mic oa ay da a isualiza ion wi h hea map. Each columns ep esen s a single
gene, while each ow ep esen s a sample and i isualizes he lack o appa en egula i y in a
mic oa ay da ase .
as a ma ix, no a p io i ela ion exis s om he geome ical p oximi y, see o example
in Figu e 1-1 whe e he e is no local uni o mi y ac oss he columns. This cha ac e is ic
limi s he applicabili y o p ocessing echniques, such as wa ele l e ing o o he l e ing
echniques ha ake ad an age o known s uc u al ela ion. On he o he hand, i is well
known ha genes a e no exp essed independen ly om each o he [50]: genes ha e a high
in e dependence depending on he in ol ed egula ing biological p ocess. The e o e, e en
i gene exp essions ha e no geome ical s uc u e in he mic oa ay da a, he measu ed
alues hemsel es do ha e an unknown s uc u e, which could be used o p ocess he da a.
An additional issue when analyzing microarray data is the measurement noise. In microarray experiments, fluorescent intensities related to gene expression levels are measured with sophisticated image processing algorithms. Even so, an issue many researchers find compelling to solve is how to effectively discern the actual values from experimental noise [68]. This remains an issue even though recent studies like [112] state that the actual technical noise is low. It is still not zero, and data suffer from random gene expression fluctuation which can alter the real expression value. To address the main noise effect due to systematic error, various normalization and batch effect correction techniques have been developed throughout the literature [50, 107]. With some differences, all of them manage to obtain comparable data across various microarray samples (even if not noise free). To address the residual noise fluctuation, benefits would be obtained if the underlying data structure of the gene expression was found.
1.2 Problem statement
As anticipated in Section 1.1, microarray data characteristics can add complexity to the classification task:
• High feature set dimension with respect to the sample number, also known as the curse of dimensionality [11];
• Lack of a priori known data structural relations;
• Residual measurement noise even after applying normalization techniques.
The main problem to be solved is how to develop an algorithm able to output a precise and reliable classifier with repeatable results, considering the microarray data characteristics. Even though microarrays are a consolidated research technology nowadays and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS) [102], an optimum method for sample classification has not yet been found.
In recent studies from the microarray quality control study consortium, MAQC, [112], an extensive evaluation of classification algorithms has been performed. From the MAQC results in [112], no individual method proved to be always the best on all datasets. Furthermore, from the published results in [112], it can be observed that there is still a lot of room for improvement of the classification predictive properties. Moreover, research on a better microarray classification algorithm is interesting for current and future sequencing techniques like NGS. With NGS, the output data are basically affected by the same problems as microarrays, with the added inconvenience of having neither consolidated data normalization and correction techniques, nor a wide availability of data or previous works to compare with. On the other hand, a new algorithm analyzing microarray data can be compared with a large amount of preexisting literature. Moreover, there is the possibility to analyze many public datasets from, for example, Gene Expression Omnibus, GEO, [41]; thus it is possible to focus on the algorithmic aspect without being too conditioned by data quality control, as is the case with the current state of NGS data analysis. In this way, algorithms can be developed for microarrays, compared with the best alternatives and later straightforwardly adapted to the next high-throughput sequencing technology with a good chance of maintaining their performance.
In the literature a plethora of microarray classification methods have been developed, and a review of the most popular alternatives is presented in Chapter 2. In almost every case, feature selection algorithms have been applied to reduce the impact of the feature number. The aim of the feature selection task is to choose a subset of relevant features for building robust learning models. By removing the most irrelevant and redundant features from the data, feature selection helps to improve the predictive performance. In this way, the generalization capability and the model interpretability are enhanced.
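
As an illustration of how a feature selection step can discard irrelevant and redundant features, the following minimal sketch implements a plain greedy forward selection driven by a subset scoring function. It is not the IFFS variant studied in this thesis: the scoring function, the relevance values and the redundancy penalty are hypothetical stand-ins for what would normally be a classifier performance estimate.

```python
def forward_selection(candidates, score_subset, max_features):
    """Greedy wrapper: at each step add the feature that most improves the score."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        best_candidate = None
        for feature in candidates:
            if feature in selected:
                continue
            score = score_subset(selected + [feature])
            if score > best_score:
                best_score = score
                best_candidate = feature
        if best_candidate is None:
            break  # no remaining feature improves the current score
        selected.append(best_candidate)
    return selected, best_score


# Hypothetical toy problem: g1 and g2 are both relevant but redundant with
# each other, g3 is nearly irrelevant, g4 is moderately relevant.
TOY_RELEVANCE = {"g1": 0.9, "g2": 0.85, "g3": 0.1, "g4": 0.5}
TOY_REDUNDANT_PAIRS = {frozenset(("g1", "g2"))}


def toy_score(subset):
    """Sum of relevances, minus a redundancy penalty and a per-feature cost."""
    gain = sum(TOY_RELEVANCE[f] for f in subset)
    redundancy = sum(
        1
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
        if frozenset((a, b)) in TOY_REDUNDANT_PAIRS
    )
    return gain - 0.8 * redundancy - 0.2 * len(subset)


toy_selected, toy_best = forward_selection(list(TOY_RELEVANCE), toy_score, 4)
print(toy_selected)
```

In this toy setting the redundant feature g2 and the nearly irrelevant feature g3 are left out, because adding either of them lowers the subset score.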
The lack of structure affects the possibility of applying a whole set of learning techniques based on some proximity measure, be it spatial, spectral or functional. The lack of structure is also an issue for noise reduction techniques based on low-pass filtering: the lack of knowledge about which features are supposed to have a similar behavior limits the applicability of low-pass operators. In order to extract a structure from the numerical data, unsupervised learning techniques have also been proposed in the literature, among which an important subset are the clustering techniques. The clustering operation defines sets of related genes by some similarity measure. A whole universe of alternatives exists, and a review of them is included in Chapter 2.
Finally, a determinant role in the classification process is played by the classification rule itself. A review of existing classification techniques applied to microarray data analysis is included in Section 2.3, but more complete information can be found in [58, 38, 50]. The panorama of classification techniques is extremely diverse, from very simple classification rules to highly complex network systems. The focus of this thesis is to produce a general purpose classification methodology for microarray data. Different classification rules have been studied and compared, like the Linear Discriminant Analysis classifier (LDA) [58], Support Vector Machines (SVM) [109], or k-Nearest Neighbors (kNN) [58], but the proposed scheme can work with almost any existing classifier.
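
To make the notion of a classification rule concrete, here is a minimal sketch of one of the rules named above, k-Nearest Neighbors, using Euclidean distance and majority voting. The choice of distance, the function name and the toy samples are assumptions made for illustration only; they do not reproduce the experimental setup of this thesis.

```python
import math
from collections import Counter


def knn_predict(train_samples, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest samples."""
    distances = sorted(
        (math.dist(sample, query), label)
        for sample, label in zip(train_samples, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two selected features per sample, two classes.
toy_train = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (4.0, 4.2), (4.1, 3.9), (3.8, 4.0),
]
toy_classes = ["healthy"] * 3 + ["disease"] * 3

print(knn_predict(toy_train, toy_classes, (1.1, 1.0)))
print(knn_predict(toy_train, toy_classes, (4.0, 4.0)))
```

Any rule exposing the same kind of interface (training samples and labels in, predicted label out) could be swapped in at this point of the processing chain.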
Chapter 2
State of the art
Over the last years, many methods have tackled the high-throughput biological data classification problem from different angles, addressing the most relevant issues to produce an efficient classifier which combines high prediction performance with robustness to overfitting and with an interpretable biological meaning. In this thesis, the classification task is addressed by implementing a system composed of three main parts: the hierarchical data representation, the feature selection and the classification rule. The state of the art for these three main parts is summarized here to offer a panoramic view of the available techniques, with their strengths and limitations.
2.1 Hierarchical data representation
Microarrays do not have a known data structure that can be used to implement efficient filtering techniques for noise reduction. They provide unordered data which are considerably hard to read and interpret, due to the enormous amount of available variables. A large number of algorithms have been developed to make order from the unstructured gene expression data without using any previous information about the sample categories; these are called unsupervised learning algorithms.
2.1.1 Unsupervised learning
Unsupervised learning refers to the problem of trying to find a hidden structure in unlabeled data. In the proposed framework, unsupervised learning techniques are implemented to find a hierarchical structure of the gene expression data and to generate a new set of features called metagenes. Unsupervised learning encompasses many techniques that seek to summarize and to explain key features from the data. Approaches to unsupervised learning include clustering algorithms (e.g. k-means, mixture models, hierarchical clustering) or blind signal separation using feature extraction techniques for dimensionality reduction (e.g. Principal component analysis, Independent component analysis, Non-negative matrix factorization, Singular value decomposition). For a detailed survey about these and more techniques refer to [38, 58, 93].
The goal of clustering is, roughly speaking, to assign a set of objects into groups called clusters so that objects in the same cluster are more similar to each other than to those in other clusters. Clustering algorithms differ in the adopted similarity metric, which defines when two objects are close to each other, and in the procedure to define the cluster number and their composition (i.e. the actual clustering algorithm). Popular clustering algorithms applied in microarray analysis are hierarchical clustering [42], k-means [85], partitioning around medoids (PAM) [120], self-organizing maps (SOM) [70], or bi-clustering methods [91]. A detailed explanation of these algorithms can be found in [38] and information about their utilization in microarray analysis is presented in [99, 63, 93].
Among the most popular clustering algorithms, the closest to the objective of this thesis is hierarchical clustering. It was the first algorithm to be used in microarray research to group genes [42]. It is an iterative process in which, at first, each object is assigned to its own cluster; then, the two most similar clusters are joined, representing a new node of the clustering tree. This process is repeated until only a single cluster remains, including all the data. Variants of this algorithm exist, among which the simple process inversion is called top-down hierarchical clustering: the process starts from one cluster only, which is iteratively split into two clusters until one cluster for each feature is obtained. Hierarchical clustering outputs a tree of nested clusters. Each node in the tree represents a group of similar genes (i.e. the group composition depends on the chosen similarity metric). Taking advantage of the tree resulting from hierarchical clustering, Lee's work in [78] presents a multi-resolution representation and eigen-analysis of the original data through an iterative pairwise hierarchical clustering algorithm called Treelets. This method produces a tree in which, at each level, the two most similar features are chosen and replaced by a coarse-grained approximation feature and a residual detail feature. This characteristic of Treelets will be used in the metagene creation process because it allows a local representation of the common behavior of a gene cluster; more details are provided in Section 3.1.
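
The pairwise merging process described above can be sketched as follows. This is a simplified illustration in the spirit of Treelets, not the implementation used in this thesis: it assumes absolute Pearson correlation as the similarity metric and a two-dimensional PCA rotation to produce the coarse and detail features, and all function names and toy values are hypothetical.

```python
import math
from statistics import mean


def covariance(x, y):
    """Sample covariance of two equally long value lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation; returns 0.0 if either feature is constant."""
    denom = math.sqrt(covariance(x, x) * covariance(y, y))
    return covariance(x, y) / denom if denom > 0 else 0.0


def local_pca(x, y):
    """Rotate a feature pair into a coarse (high-variance) and a detail feature."""
    theta = 0.5 * math.atan2(2 * covariance(x, y),
                             covariance(x, x) - covariance(y, y))
    c, s = math.cos(theta), math.sin(theta)
    coarse = [c * a + s * b for a, b in zip(x, y)]
    detail = [-s * a + c * b for a, b in zip(x, y)]
    return coarse, detail


def build_tree(features):
    """Merge the two most similar features until a single root remains.

    features: dict mapping feature name -> list of values, one per sample.
    Returns the list of merges (left, right, new node) and the detail features.
    """
    active = dict(features)
    merges, details = [], {}
    while len(active) > 1:
        names = list(active)
        pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
        left, right = max(
            pairs, key=lambda pq: abs(correlation(active[pq[0]], active[pq[1]]))
        )
        coarse, detail = local_pca(active.pop(left), active.pop(right))
        node = f"({left}+{right})"
        active[node] = coarse   # the coarse feature stays available for merging
        details[node] = detail  # the detail feature is stored and set aside
        merges.append((left, right, node))
    return merges, details


# Hypothetical toy data: g1 and g2 behave alike, g3 does not.
toy_features = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 2.1, 2.9, 4.2, 4.9],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_merges, toy_details = build_tree(toy_features)
print(toy_merges)
```

Each coarse feature summarizes the common behavior of the features merged below it, which is the property that motivates its use as a metagene candidate.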
2.1.2 Knowledge integration for clustering
A relevant theme addressed in this thesis, within the hierarchical data representation and metagene generation, is the opportunity to include prior biological knowledge to drive the hierarchical clustering process. A relevant issue with high-throughput biological data is how to extract reliable knowledge from the vast amount of available data [3]. A whole set of analysis tools has been developed to help the interpretation task and to infer relationships between the gene signatures and biological knowledge databases [115, 82, 27, 67, 30].
Including and integrating prior biological knowledge has gained importance in the omics data analysis field throughout the years [3, 30]. Knowledge databases have been used in many directions, for example, to identify biologically relevant activated pathways by integrating Gene Ontology (GO) in the analysis process [105], or to integrate a gene ranking tool in the analysis [127]. Moreover, biological knowledge is also used in tools like Hanalyzer [77] to identify gene-to-gene relationships and facilitate the data interpretation.
Knowledge integration for microarray classification has recently been applied in modifications of classification methods like Nearest shrunken centroids [122] and Penalized partial least squares (PPLS) [133], called mPAM and mPLS, respectively [117]. Both methods implicitly contain a mechanism for selecting genes based on a penalty applied according to the discriminatory power of the gene. In [134, 98, 49] too, the biological information has been used to improve the gene ranking and the filtering feature selection, increasing the interpretability and robustness of the classification results.
Prior knowledge has already been used to analyze microarray data. In [28] the prior information has been used for patient survival prediction rather than for classification. The prior information, in the form of gene sets representing metabolic pathways, has been used to summarize functionally related genes in a single variable called supergene by means of Supervised Principal Component Analysis (SPCA) [29]. In [25] the biological information is used to extract the common behavior of functionally related groups, generating supergenes like in [28] to be used for feature selection as substitutes of the original gene expressions and applied to microarray classification rather than regression.
A common trait of all these works is that including some prior biological knowledge led to more interpretable results from a biological viewpoint, easing the scientist's task of formulating new hypotheses.
In this thesis, the integration of biological information has been studied in a more extensive model than in [25] or [77]. The information has been used to generate a whole hierarchical structure and, from it, a new set of features that do not substitute the original gene expressions. Moreover, in this work, the tested algorithms have been compared to a wide variety of state of the art classification algorithms on multiple publicly available datasets with a repeatable evaluation procedure recommended in [112].
Two key elemen s mus be conside ed in including p io biological knowledge in a
clus e ing p ocess. The  s on is he knowledge da abase and he second is how o
de e mine he concep o biological simila i y, so o include i in he ac ual clus e ing
algo i hm.
Concerning the knowledge database, in the last years many online and publicly accessible repositories have been implemented and maintained. Some relevant examples are the Gene Ontology database, GO [6], which annotates genes by three categories: Biological Process, Cellular Component, and Molecular Function; the KEGG database [65], which is a database resource for understanding high-level functions and utilities of the biological system; the Molecular Signature Database [115]; and the DAVID knowledge base [60]. The last two datasets are collections of external knowledge databases, processed and ordered in a computer friendly form, easier to use for data mining applications. For a more complete and thorough list of knowledge databases and analysis tools, refer to [3, 13].
The biological similarity definition for the inclusion in the clustering process reduces to finding an appropriate similarity measure for the biological data, which usually are in a binary or categorical form. The fundamental issue is then to find an appropriate categorical data similarity measure that considers the characteristics of a knowledge database, like sparsity and incompleteness of the available data. Examples of categorical measures used to evaluate the similarity in microarrays can be found in [14, 77].
2.2 Feature selection
Feature selection is the process of choosing relevant features from the data set with respect to the task to be performed. In addition to the main goal of obtaining predictive and generalizable classifiers, two additional goals are pursued by feature selection: overcoming the curse of dimensionality and increasing the interpretability. The former is a concept introduced in [10] which is related to the relative amount of available training points and data dimensions. When there are too many dimensions compared to the available sample points, it is easy to find discriminative patterns in the data which are accidental and not generalizable, falling into data overfitting. The latter concept is related to making sense out of the data. A classification rule involving fewer features is easier to interpret and understand than a classifier using thousands of genes.
Selecting the best feature subset would be a solved problem if it were not computationally unfeasible. Optimum subset selection algorithms already exist [48, 95, 58], which consist in testing every possible feature subset and finally choosing the best one in terms of some cost function.
This being unfeasible, less computationally expensive methods must be considered. Some of the existing methods are introduced in the following Section using a commonly adopted taxonomy from [54, 108], which divides the algorithms in three classes: filters, wrappers and embedded. In Section 4.3, methods adopting a different feature selection strategy are described. They are called Ensemble methods and are introduced since some of them are used within this thesis.
Filters are defined by a preprocessing step completely disconnected from the learning phase. Representative examples are the ranking criteria such as [46, 129, 54], in which correlation, mutual information or other univariate criteria are used to assign a score to each feature. Statistical tests like the Student t-test [38] or the Wilcoxon rank-sum test [131, 86] are commonly used as filters for feature selection. Filter methods typically have a short execution time because they are easy to calculate. The calculation speed is high because no classifier needs to be trained in the filtering phase. The filtering operation usually follows a univariate paradigm: the feature score is determined by the feature values without analyzing possible multivariate interactions. This independent feature evaluation leads to a feature ranking list, from which the top scoring features are chosen to train the classifier. Such a univariate paradigm limits the interaction analysis in the classification phase, precluding a posterior interaction discovery by a multivariate classifier. The feature preselection restricts the classifier to features that are usually correlated, due to the univariate nature of the filtering phase selection. Numerous filter methods exist in the literature; for more details, refer to the exhaustive review in [75].
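The univariate ranking described above can be illustrated with a minimal sketch. It assumes a two-class problem and uses the absolute Welch t-statistic as the score; the function names `t_score` and `rank_features` and the toy data are illustrative choices, not part of the cited works.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values_a, values_b):
    """Absolute Welch t-statistic of one feature between two classes."""
    se = sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    if se == 0.0:  # feature is constant in both classes: no evidence either way
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / se

def rank_features(X, y):
    """Return feature indices sorted from highest to lowest t-score.

    X is a list of samples (each a list of feature values), y holds 0/1 labels.
    Every feature is scored on its own: no classifier is trained and no
    interaction between features is examined.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(t_score(a, b))
    return sorted(range(n_features), key=lambda j: -scores[j])

# Toy data (hypothetical): feature 1 separates the classes, feature 0 does not.
X = [[0.1, 1.0], [0.3, 1.2], [0.2, 0.9], [0.2, 3.1], [0.1, 2.9], [0.3, 3.0]]
y = [0, 0, 0, 1, 1, 1]
top_features = rank_features(X, y)[:1]
```

Because each score depends on a single feature, two highly correlated genes receive similar scores and can both end up in the selected list, which is the redundancy issue mentioned in the text.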
Wrapper methods include the classifier results in the selection process. They search through the possible feature subsets and use the learning algorithm (i.e. the classifier) to evaluate the suitability of each candidate [69]. Wrappers have an advantage over filters, because they can identify multivariate interactions. However, when dealing with high dimensional data, this processing can be computationally expensive. Different families of wrapper algorithms exist, mainly divided into optimal and suboptimal. Optimal methods like exhaustive search or the branch and bound algorithm are infeasible for microarray data [47]. The suboptimal family is then divided into deterministic and stochastic methods.
The stochastic group includes evolutionary search algorithms like genetic algorithms [66], genetic programming [43] or NSGA-II [33]. These algorithms have shown good predictive ability [33] thanks to the possibility of mutating the selected feature subset during the search process. A typical framework for the search strategy implies evolutionary steps. At the beginning, many individual solutions are randomly generated to form an initial population, and each solution is a feature subset. Each solution is evaluated and, afterwards, the best part of the population is more likely to be used to breed a new generation. In the generation process, the solutions can mutate and mix with some defined probabilities [126]. The process mimics the natural selection process, aiming at a final population well fitted to the classification task. This process is, by its own nature, random and strongly depends on the initial population, which can limit the solution space. That is why, usually, many parallel runs are needed to obtain a final solution. Furthermore, as noticed in [104], the performance of evolutionary algorithms tends to degrade when the feature number increases.
The deterministic algorithms group includes many commonly used algorithms like the Sequential Forward Selection (SFS) [130] or the Sequential Backward Selection (SBS) [87]. The SFS algorithm starts from an empty set of selected features $Y_0 = \emptyset$, and sequentially adds the feature $f_x$ that results in the highest objective function $J(f_x, Y_k)$ when combined with the features $Y_k = \{f_i \mid i \text{ selected before}\}$ that have already been selected. In this way $Y_k$ is a set composed of $k$ sequentially selected features. The SBS algorithm is the opposite of SFS: it starts by selecting all the $p$ available features, $Y_0 = \{f_1, \ldots, f_p\}$, and sequentially removes the worst feature from the subset $Y_k$. The worst feature is the one whose removal from $Y_k$ yields the highest objective function $J(Y_k \setminus \{f_x\})$.
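The two greedy searches can be sketched as follows. This is a minimal illustration under stated assumptions: `objective` is a hypothetical callable standing in for $J$ (for instance a cross-validated accuracy), and the stopping sizes `k_max` and `k_min` are parameters introduced here for the example.

```python
def sequential_forward_selection(features, objective, k_max):
    """Greedy SFS: grow Y_k one feature at a time.

    features : iterable of candidate feature identifiers
    objective: callable J(subset) -> float, higher is better (placeholder)
    k_max    : number of features to select
    """
    selected = []                      # Y_0 = empty set
    remaining = list(features)
    while remaining and len(selected) < k_max:
        # Pick the feature f_x that maximises J when added to Y_k.
        best = max(remaining, key=lambda f: objective(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def sequential_backward_selection(features, objective, k_min):
    """Greedy SBS: start from all features and drop the worst one each step."""
    selected = list(features)          # Y_0 = {f_1, ..., f_p}
    while len(selected) > k_min:
        # The worst feature is the one whose removal gives the highest J.
        worst = max(selected,
                    key=lambda f: objective([g for g in selected if g != f]))
        selected.remove(worst)
    return selected

# Toy objective (hypothetical): individual merit minus a redundancy penalty
# when the two correlated features "g1" and "g2" are used together.
merit = {"g1": 0.5, "g2": 0.3, "g3": 0.1}

def toy_objective(subset):
    penalty = 0.4 if "g1" in subset and "g2" in subset else 0.0
    return sum(merit[f] for f in subset) - penalty

sfs_result = sequential_forward_selection(merit, toy_objective, k_max=2)
sbs_result = sequential_backward_selection(merit, toy_objective, k_min=2)
```

Neither search can revise an earlier decision: once SFS adds a feature, or SBS removes one, the choice is final.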
Deterministic search strategies like SFS or SBS always choose the same feature set if the starting conditions do not change, thus ensuring that the result is replicated in successive tests. Within this group, algorithms introducing flexibility in the search have led to very competitive results [112, 39]. Common examples are the Sequential Floating Forward Selection algorithm (SFFS) [104], which is an evolution of SFS allowing a backward correction stage in the search process, and the Improved Sequential Floating Forward Selection (IFFS) [94], which additionally includes a replacing step. Details about SFFS and IFFS are included in Chapter 4, since they are the reference wrapper algorithms adopted for feature selection.
Finally, embedded methods incorporate feature selection as part of the training phase. Examples are decision trees [24], LASSO (Least Absolute Shrinkage and Selection Operator) [121, 135] and random forests [22, 23, 32]. These feature selection methods are strictly dependent on the chosen classifier and are not suited to the aim of this thesis, which is to propose a more general framework, applicable to more than one classifier. More details about embedded methods can be found in [38, 121, 58].
2.2.1 Ensemble learning for feature selection
In statistics and machine learning, ensemble methods use multiple experts to obtain better predictive performance than could be obtained from any of the constituent experts [103]. Ensemble techniques have been used in the literature to improve the stability and performance of feature selection and classification results [8, 136, 72]. In this thesis, a branch of ensemble techniques for classification has been studied to select a proper subset of classifiers to merge and produce a global classification outcome for microarray samples.
The idea is to merge the predictions of a set of experts to produce a final outcome with improved generalization and precision [72], so that the prediction ability of the ensemble improves on that of the single classifiers. Many ensemble methods exist and they are applied in many research fields; for a review of ensemble methods and their applications in bioinformatics refer to [72, 97, 36]. The approaches adopted in the literature to produce expert ensembles can be categorized as follows, from [97]:
• Using different feature subsets for different experts
• Using different sample subsets for the different experts
• Using different types of classifiers to produce the different experts
• Using different parameters for the same classifier type
• Any combination of the above methods
The ensemble selection methods studied in this thesis pertain to the first category in the list. A set of experts is produced by applying the same classifier trained on different subsets. In Section 4.3, the details about the implemented algorithms are presented. As a general rule, the key elements in determining the expert selection are a fitness function (e.g. training error) and the notion of diversity [73, 72]. Many diversity measures have been developed to capture how much an expert produces different decisions compared to another. Examples are the k-measure, yuleQ and PCDM [72]. Depending on the chosen diversity measure and its integration with the fitness function, a plethora of ensemble selection algorithms have been evaluated and reviewed in [72]. Relevant examples are the Pareto-optimal search [72], the Convex-Hull search in a properly defined search space [72] and the accuracy in diversity algorithm (AID) [8]. Among these, the AID algorithm will be detailed in Section 4.3, because it is the base of all the ensemble selection algorithms developed in this thesis, thanks both to its good results in [72] and to its computational cost, which eases the implementation [8].
For a deeper discussion on the other ensemble generation categories, refer to [72, 97], which also describe popular ensemble methods to improve feature selection stability like bootstrapping [58], boosting [58] and many other variants that have been developed in the literature.
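As an illustration of a pairwise diversity measure, the sketch below computes Yule's Q statistic in its standard textbook form. Whether [72] uses exactly this variant is an assumption; the function name `yule_q` and the convention of returning 0 when the statistic is undefined are choices made only for this example.

```python
def yule_q(correct_a, correct_b):
    """Yule's Q statistic between two experts.

    correct_a, correct_b: sequences of booleans, True where the expert
    classified the sample correctly. Q lies in [-1, 1]; lower values mean
    that the experts tend to err on different samples (more diversity).
    """
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(1 for a, b in pairs if a and b)            # both correct
    n00 = sum(1 for a, b in pairs if not a and not b)    # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)        # only A correct
    n01 = sum(1 for a, b in pairs if not a and b)        # only B correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0  # convention assumed here when Q is undefined
    return (n11 * n00 - n01 * n10) / denom

# Two experts failing on the same samples versus on complementary samples.
same_errors = yule_q([True, True, False, False], [True, True, False, False])
opposite_errors = yule_q([True, True, False, False], [False, False, True, True])
```

A selection algorithm can then trade such a diversity value against a fitness term like the training error when deciding which experts to merge.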
2.3 Classie s
Sample classica ion assigns a class label o incoming samples ollowing a p ecise ule.
Such ule is ob ained om a lea ning phase in which he classie is ained on known da a
wi h p e iously assigned labels. The high dimensionali y o he ea u e se o mic oa ays
is an issue since he as majo i y o classie s a e hough o cases in which he sample
numbe is g ea e han he ea u e dimension. This p oblem is usually add essed h ough
a ea u e selec ion ope a ion and, some imes, in de eloping new classie s as adap ed
e sions o he new scena io. Some s anda d algo i hms ha e been mo e commonly
adop ed among all he possible echniques [138] and o mo e de ailed su eys e e o
[38, 58, 74]. These echniques include om simple classica ion ules like K nea es
neighbo (KNN) o disc iminan analysis, o mo e complex sys ems like suppo ec o
machines (SVM) o a icial neu al ne wo ks.
Simple algorithms like KNN assign a class label depending on the classes of the K known samples closest to the current sample. Usually K is odd, and the classification boundaries are not robust to small training set variations [19]. KNN has been used in many works for microarray analysis [34, 112, 101] with some success. Nevertheless, KNN is a nonlinear classifier whose boundaries can change considerably depending on the training set, making KNN more sensitive to training set differences than other, more regularized, classification rules. The reduced robustness of KNN in a small sample scenario like microarray classification results in classifiers that are harder to replicate, thus making its performance less stable [19, 112].
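A minimal sketch of the KNN rule follows, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy samples are illustrative only.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, sample, k=3):
    """Label `sample` by majority vote among its k nearest training samples."""
    # Sort the training samples by Euclidean distance to the query sample.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: dist(pair[0], sample))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical): two well separated groups.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]
label_near_a = knn_predict(train_X, train_y, [0.2, 0.2], k=3)
label_near_b = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

With two classes, an odd `k` rules out tied votes. Since the prediction depends directly on which training samples happen to be nearby, adding or removing a few samples can move the decision boundary.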
Another class of classifiers are discriminant analysis methods, which assume that different classes generate data based on different Gaussian distributions. The most popular algorithm among them is the Linear Discriminant Analysis (LDA). Linear discriminant analysis is also known as the Fisher discriminant, named for its inventor, Sir R. A. Fisher [58]. It is a statistical learning method which finds the best linear combination of features to separate two or more classes, under the Gaussian distribution assumption for the sample classes; moreover, it assumes that all classes have the same covariance matrix [58]. This classifier usually obtains good predictive results with a stable classification boundary and reliable performance estimation [112, 19, 15], and for these reasons it has been chosen as a reference classifier throughout this thesis.
Another relevant example of discriminant analysis classifiers is the Quadratic Discriminant Analysis (QDA) [38], which removes the identical covariance matrix assumption and includes quadratic components in the classifier training. It has also been used in microarray classification [19]. It produces more flexible classifiers than LDA at the price of a higher computational cost. A whole algorithm family, born to overcome the LDA limitation when the sample number is smaller than the classifier dimension, also deserves mention. To do so, regularization, shrinkage or diagonalization techniques have been applied to evolve the original LDA and QDA. Some relevant examples are the regularized LDA introduced in Friedman's work in [96], the diagonalized LDA (DLDA) [39, 137] and the Shrinkage-based DLDA [123], with some applications of these methods in microarray analysis [53, 111]. Further LDA evolutions are known as generalized discriminant analysis [9] and kernel discriminant analysis [81], a kernelized version of linear discriminant analysis. Using the kernel trick, LDA is implicitly performed in a new feature space, which allows non-linear mappings to be learned to produce more complex classification boundaries. Such nonlinear classifiers can be very powerful, but there is an increased risk of overfitting in a small sample scenario, and it may be particularly tricky to obtain generalizable classifiers.
The support vector machine classifier was first proposed by Vapnik and Chervonenkis in [125]. The goal of the algorithm, in the case of linearly separable data, is to find the hyperplane which maximizes the shortest distance from a sample point. When data are not linearly
Component Analysis decomposition. Such a modification simplifies the generation process by using constant combination weights to generate the metagene expression.
All the studied modifications to the original Treelets algorithm have been tested and compared, to analyze possible benefits for the predictive ability. The experimental setup and the results are included in Chapter 4.
3.1 Treelets clustering
The first studied technique to infer a hierarchical structure from gene expression data is the original Treelets algorithm; thus it has been chosen to call it Treelets clustering. The clustering tree is produced in a bottom-up pairwise approach. At each level, the two most similar features are chosen and replaced by two features, a coarse-grained approximation feature and a residual detail feature. Taking advantage of this multi-scale data representation, with Treelets clustering, at each iteration, the two features are replaced by one feature only, the approximation one, while the residual detail feature is discarded because it represents what is different between the two merged features. This new approximation feature is called metagene and it is obtained as a linear combination of the two joined features. Afterwards, the newly created metagene is used as a feature to be compared in the next iterations. If the initial condition is a feature set of $p$ individual genes, the final outcome of the feature set enhancement process is a metagene set of $p-1$ metagenes, one for each node in the hierarchical tree. This metagene set is then added to the initial feature set.
In Figure 3-2, a pseudo code of the hierarchical clustering and the metagene generation process is detailed. It is a general algorithm, which can be used to describe any of the implemented algorithm variants. What differentiates one clustering algorithm from another in this framework is either the similarity distance $d(f_a, f_b)$ or the metagene generation process $g(f_a, f_b)$.

Original feature set $G_0 = \{g_1, \ldots, g_p\}$
Active feature set $F = G_0$
Metagene set $M = \emptyset$
For $i = 1 : p-1$
  1. Calculate the pairwise similarity metric $d(f_a, f_b)$ for all features in $F$
  2. Find $a, b$ : $d(f_a, f_b) = \max(d(\cdot,\cdot))$
  3. New metagene $m_i = g(f_a, f_b)$ generation:
     $m_i = \alpha_a f_a + \alpha_b f_b = \sum_{j=1}^{p} \beta_j g_j; \quad \alpha \in \mathbb{R}^2, \ \beta \in \mathbb{R}^p$
     Each metagene can be seen either as a combination of its two child features $\{f_a, f_b\}$ or as a linear combination of all the original features $g_j$
  4. Add the new metagene to the active feature set: $F := F \cup \{m_i\}$
  5. Remove the two features $f_a, f_b$ from the active feature set: $F := F \setminus \{f_a, f_b\}$
  6. Join the metagene $m_i$ to the metagene set: $M := M \cup \{m_i\}$
end
Define the new expanded feature set $F = G_0 \cup M$ as the union of metagenes and original gene expression profiles.
Figure 3-2: General hierarchical clustering algorithm adopted in this thesis.

The Pearson correlation is the similarity metric used to evaluate pairwise relations between features in the original Treelets clustering. It is a normalized correlation measure between two features and it is defined as in Eq. 3.1 for generic feature vectors $f_a$ and $f_b$. Each feature vector represents the samples of a specific feature, that is, a gene or metagene.
$$d(f_a, f_b) = \frac{\langle f_a, f_b \rangle}{\lVert f_a \rVert \, \lVert f_b \rVert} \tag{3.1}$$
The Pearson correlation $d(f_a, f_b) \in [-1, 1]$ measures the scalar product between two features (i.e. the numerator in Eq. 3.1), divided by the product of the $\ell_2$ norms of the two involved features. This criterion measures the profile-shape similarity of two features, so that it is invariant to a positive scaling factor: $d(f_a, f_b) = d(k f_a, f_b)$ for any $k > 0$. The Pearson correlation assumes a value equal to 1 when two features have the exact same pattern, while a correlation value of $-1$ implies a perfect profile anticorrelation, defining the farthest possible point in the similarity space spanned by the Pearson correlation.
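The scale invariance and the range of Eq. 3.1 can be checked numerically with a short sketch. It assumes that each profile has already been centred (zero mean), so that the normalized scalar product coincides with the Pearson correlation; the function name `similarity` and the toy profiles are illustrative.

```python
from math import sqrt

def similarity(fa, fb):
    """Eq. 3.1: scalar product divided by the product of the l2 norms.

    Assumes both profiles are already centred (zero mean) and non-constant.
    """
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = sqrt(sum(x * x for x in fa))
    norm_b = sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b)

# Toy centred profiles (hypothetical values).
fa = [-1.0, 0.0, 1.0]
fb = [-2.0, 0.5, 1.5]
fa_scaled = [3.0 * x for x in fa]      # same shape, larger amplitude
fa_flipped = [-x for x in fa]          # mirrored profile

d_original = similarity(fa, fb)
d_scaled = similarity(fa_scaled, fb)   # equal to d_original
d_same = similarity(fa, fa)            # 1: identical pattern
d_opposite = similarity(fa, fa_flipped)  # -1: perfect anticorrelation
```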
Concerning the metagene generation process with Treelets, the clustering process produces metagenes taking advantage of the multi-scale representation introduced in [78]. PCA can be described as a data representation and it is mathematically described as a change of basis in a vector space. It has been demonstrated that PCA can achieve a compact representation of the analyzed data. In its original formulation, PCA is a global feature transformation where each new feature is obtained as a linear combination of all the original components (in the microarray case it would be a combination of all the thousands of gene expressions). In Treelets clustering, PCA is instead used locally, inside each clustering step, to produce a local data transformation, thus combining only two features at a time. In detail, for each node in the tree, a local Principal Component Analysis (PCA) [64] is applied on the child features. By this process, a hierarchical tree with multi-scale data representation is obtained. In each iteration, the local PCA calculates a Jacobi rotation on the two features $f_a, f_b$ [51] as in Eq. 3.2.
$$m = f_a \cos\theta_L + f_b \sin\theta_L, \qquad d = -f_a \sin\theta_L + f_b \cos\theta_L \tag{3.2}$$
In Eq. 3.2, $\theta_L$ is the rotation angle which decorrelates the two features $f_a$ and $f_b$, so that the two output features $m$ and $d$ will have 0 correlation. The $m$ feature is the coarse-grained approximation feature in [78] (i.e. the first principal component) and it is chosen as metagene in the Treelets clustering. On the other hand, the $d$ feature is the residual detail feature, which is not taken into account for further processing. The fact that the local PCA can be seen as a Jacobi rotation is visualized in Figure 3-3 in a case of two very similar features. On the left, the initial two-dimensional space formed by the original features $f_a$ and $f_b$ is visualized. On the right hand side, instead, data are visualized in the coordinate system of the two principal components. As can be seen, in this case, the first principal component (the $m$ feature chosen as metagene) represents well the common behavior of the two analyzed features.
Figure 3-3: Example of how local PCA can be represented as a coordinate system rotation and how the first component well represents two similar features.
A note about the linear coefficients calculated with PCA in the metagene creation algorithm in Figure 3-2: each metagene can be seen as a linear combination of all the individual genes, and PCA is a unitary transform, so that $\lVert \beta \rVert_2 = 1$. This $\ell_2$ norm equal to 1 states that PCA is an energy conservative transformation, and this effect translates into producing metagenes of growing dynamic range as the number of represented genes grows.
The final output of the Treelets clustering is a hierarchical tree with a metagene for each node. The original feature set is enhanced by the addition of new features able to summarize the common behavior of gene clusters. This characteristic can reduce the noise thanks to the low-pass filtering effect of the linear combination of similar features.
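The whole Treelets clustering procedure can be sketched in a few lines. This is an illustrative reading of the description above, not the thesis implementation: the rotation angle is obtained from the 2×2 covariance matrix of the pair with the usual Jacobi formula, the detail feature is taken as the orthogonal complement of the metagene, all profiles are assumed non-constant, and every function name is hypothetical.

```python
from math import atan2, cos, sin, sqrt

def cov(fa, fb):
    """Covariance of two profiles (population normalisation)."""
    mu_a, mu_b = sum(fa) / len(fa), sum(fb) / len(fb)
    return sum((x - mu_a) * (y - mu_b) for x, y in zip(fa, fb)) / len(fa)

def pearson(fa, fb):
    return cov(fa, fb) / sqrt(cov(fa, fa) * cov(fb, fb))

def local_pca(fa, fb):
    """Jacobi rotation of two features: returns (approximation m, detail d)."""
    theta = 0.5 * atan2(2.0 * cov(fa, fb), cov(fa, fa) - cov(fb, fb))
    c, s = cos(theta), sin(theta)
    m = [c * x + s * y for x, y in zip(fa, fb)]
    d = [-s * x + c * y for x, y in zip(fa, fb)]
    return m, d

def treelets_clustering(genes):
    """Return the p-1 metagenes and the merge order for p gene profiles."""
    active = {i: g for i, g in enumerate(genes)}    # active feature set F
    metagenes, merges = [], []
    next_id = len(genes)
    while len(active) > 1:
        ids = sorted(active)
        # Steps 1-2: most similar pair under the Pearson correlation.
        a, b = max(((i, j) for n, i in enumerate(ids) for j in ids[n + 1:]),
                   key=lambda pair: pearson(active[pair[0]], active[pair[1]]))
        m, _detail = local_pca(active[a], active[b])  # step 3, detail discarded
        del active[a], active[b]                      # step 5
        active[next_id] = m                           # step 4
        metagenes.append(m)                           # step 6
        merges.append((a, b))
        next_id += 1
    return metagenes, merges

# Toy data (hypothetical): genes 0 and 1 share the same profile shape.
genes = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5], [4.0, 1.0, 3.0, 2.0]]
metagenes, merges = treelets_clustering(genes)
```

Each pass recomputes all pairwise similarities, which keeps the sketch short but is quadratic in the number of active features; caching the similarity matrix would be the natural refinement for thousands of genes.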
3.2 Euclidean clustering
The second metagene creation technique is called Euclidean clustering. It adopts an iterative process like the one explained in Figure 3-2, but it introduces changes in the similarity metric $d(\cdot,\cdot)$ and in the metagene generation rule $g(f_a, f_b)$ with respect to the Treelets clustering technique.
The similarity metric adopted in the Euclidean clustering is the negative Euclidean distance between features, defined in Eq. 3.3. The negative Euclidean distance has a maximum in zero, when two features are equal. It has been chosen as an alternative to the Pearson correlation because the Euclidean distance can measure the point-wise closeness rather than the profile-shape similarity.
$$d(f_a, f_b) = -\lVert f_a - f_b \rVert_2 \tag{3.3}$$

Initial feature set of three equal genes $F_0 = \{f_1, f_2, f_3\}$ with $f_1 = f_2 = f_3$
Two metagenes will be created:
  1. metagene $m_1$ joining $f_1$ and $f_2$
     $m_1 = \sqrt{1/2}\, f_1 + \sqrt{1/2}\, f_2$
     $m_1^{scaled} = 1/2\, f_1 + 1/2\, f_2$
  2. metagene $m_2$ joining $m_1$ and $f_3$
     $m_2 = \sqrt{2/3}\, m_1 + \sqrt{1/3}\, f_3$
     $m_2 = \sqrt{1/3}\, f_1 + \sqrt{1/3}\, f_2 + \sqrt{1/3}\, f_3$
     $m_2^{scaled} = 1/3\, f_1 + 1/3\, f_2 + 1/3\, f_3$
The scaled versions $m_1^{scaled}$ and $m_2^{scaled}$ are used to define the similarity with the Euclidean distance because they preserve the components dynamics. These versions are then used as metagenes, enhancing the original feature set.
The non scaled versions, $m_1$ and $m_2$, are used to compute the metagene from the two child features with PCA as they preserve the energy distribution among the elementary components.
Figure 3-4: Example of metagene creation with Euclidean clustering.
The Euclidean distance has a different point of view with respect to the correlation measure adopted in Treelets clustering and might be able to extract similarity related to the actual gene expressions rather than to their pattern.
The change in the similarity measure implies a modification in the metagene generation rule $g(f_a, f_b)$. Due to the PCA transformation, which is energy conservative, a scaling factor is introduced on the produced metagenes. The metagenes obtained with Treelets clustering are scaled weighted averages of the genes, with a scale factor greater than 1.
To properly compare gene expression values (and not their shape, as with the Pearson correlation) with metagenes, the latter must be a pure weighted average of the genes. An illustrative example of how the metagene creation process is performed with Euclidean clustering is presented in Figure 3-4. In this figure, a toy example with an initial feature set of three equal genes is shown. It can be seen how the metagenes obtained with the sole PCA transformation are scaled weighted averages of the genes, with a scale factor that grows with the number of genes ($\sqrt{2}$ for $m_1$ and $\sqrt{3}$ for $m_2$ in the example). This scaling factor is not an issue when the Pearson correlation is concerned, but it affects the Euclidean distance measurement.
To obtain a proper comparison between genes and metagenes, when a metagene $m_x$ is created, two versions of it are used. The first one is the same as in the Treelets case, from the PCA transformation, while the second is a scaled version of the former, $m_x^{scaled} = m_x / \lVert \beta \rVert_1$. The scaled version $m_x^{scaled}$ is a pure weighted average of the genes and it is used as metagene in the pairwise similarity measurement. The non scaled version, instead, is maintained and it is used when a new metagene is built from $m_x$, to preserve the energy distribution among the individual components, as can be observed in Figure 3-4.
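The bookkeeping of Figure 3-4 can be reproduced by tracking the coefficient vector $\beta$ of each metagene over the original genes. The sketch below takes the PCA weights directly from the figure (they hold for three equal genes); the helper names `merge` and `scaled` are hypothetical.

```python
from math import sqrt

def merge(beta_a, beta_b, w_a, w_b):
    """Combine two coefficient vectors expressed over the original genes."""
    return [w_a * x + w_b * y for x, y in zip(beta_a, beta_b)]

def scaled(beta):
    """Divide by the l1 norm so that the weights sum to one (pure average)."""
    l1 = sum(abs(b) for b in beta)
    return [b / l1 for b in beta]

# Coefficients of the three original genes f_1, f_2, f_3 (one-hot vectors).
f1, f2, f3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# PCA weights as listed in Figure 3-4.
m1 = merge(f1, f2, sqrt(1 / 2), sqrt(1 / 2))
m2 = merge(m1, f3, sqrt(2 / 3), sqrt(1 / 3))    # built from the NON scaled m1
m1_scaled, m2_scaled = scaled(m1), scaled(m2)   # used for distances

# What would happen if the scaled m1 were merged instead: m1_scaled and f_3
# are then identical profiles, so PCA would weight them equally.
biased = scaled(merge(m1_scaled, f3, sqrt(1 / 2), sqrt(1 / 2)))
```

In `biased` the third gene receives weight 0.5 while the first two receive 0.25 each, which shows why the non scaled version is kept for building new metagenes: it remembers how many genes are already represented.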
The differences in the similarity measure and in the generation rule lead to a different metagene set with respect to the Treelets clustering. To better visualize the differences between the Treelets clustering and the Euclidean clustering, Figure 3-5 is introduced. There, it can be observed how the dendrograms are quite different even if only 4 initial features are considered. As expected, in Treelets clustering, the profile shape prevails in defining the merging features, while in Euclidean clustering, the point-wise distance rules the process. It can be observed how, out of the three metagenes $m_1$, $m_2$ and $m_3$, only $m_3$ has the same profile in both clustering techniques. This is an expected result because the final combination includes only three genes and the energy distribution among the individual components is determined in the same way by the two algorithms.
Figure 3-5: Example of metagene construction process differences between Treelets and Euclidean clustering. The vertical axis represents the gene expression value, while the bullets in the horizontal axis are the different samples. In the first row the original data and the two obtained clustering trees are shown. In the second and third rows, the metagenes created with Treelets or Euclidean clustering are represented.
3.3 Haar wavelet for clustering
The possibility of simplifying the metagene generation process inside the hierarchical clustering process has been evaluated by introducing the Haar wavelet decomposition [55] to define the metagene generation criterion $g(\cdot,\cdot)$. In the original Treelets version, each metagene is produced with a local PCA on the two merged features [78]. The studied alternative proposes to substitute the PCA with a Haar transformation on the two merged features.
The main difference between the two rules is in the linear combination weight assignment. Whereas with PCA the linear weights can be anything constrained to $\lVert \alpha \rVert_2 = 1$, $\alpha$ being the two dimensional coefficient vector, with the Haar wavelet transformation the weights are fixed and equal to $\sqrt{2}/2$. Such a weighting difference eases the storage and retrieval of the structure information, because the only needed information is the merging order, without caring about the coefficient values. A side effect of the Haar basis transform is the generation of a completely different metagene set.
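The storage advantage can be made concrete with a small sketch: with fixed Haar weights, the list of merged pairs is enough to rebuild every metagene. The indexing convention (index `p + i` denotes the metagene created at step `i`) and the function names are assumptions of this example.

```python
from math import sqrt

HAAR_WEIGHT = sqrt(2) / 2

def haar_metagene(fa, fb):
    """Metagene as the Haar approximation of the two merged features."""
    return [HAAR_WEIGHT * (x + y) for x, y in zip(fa, fb)]

def rebuild_metagenes(genes, merge_order):
    """Recreate every metagene from the merge order alone.

    merge_order lists pairs of feature indices; indices 0..p-1 are the
    original genes and index p + i is the metagene created at step i.
    """
    features = list(genes)
    for a, b in merge_order:
        features.append(haar_metagene(features[a], features[b]))
    return features[len(genes):]

# Toy data (hypothetical): three genes, two samples each.
genes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merge_order = [(0, 1), (3, 2)]     # m_1 joins genes 0 and 1, m_2 joins m_1 and gene 2
metagenes = rebuild_metagenes(genes, merge_order)
```

With local PCA, the same reconstruction would additionally need the two coefficients stored at every node of the tree.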
3.4 Discussion
In this Chapter, techniques to infer a hierarchical structure from microarray data have been described. The produced outputs are binary trees associating genes in different orders and producing different sets of metagenes.
This processing step is done to obtain new features more able to summarize the behavior of related genes. To evaluate whether this metagene generation process is useful and to decide which of the proposed alternative algorithms is the best, the metagenes must be included in a classification framework.
In the following Chapter, the proposed microarray classification framework is introduced with all the details needed to adapt the process to the microarray data characteristics. Moreover, all the metagene generation algorithms have been uniformly compared among them and with relevant state of the art alternatives. The results are measured in terms of predictive ability and robustness.
Chapter 4
Feature selection for binary classification
In Chapter 3, the metagene generation process has been used to enrich the gene expressions with a whole new set of features called metagenes. Metagenes can improve the classification ability since they expand the available feature space and because they can extract common traits of gene clusters, filtering out the residual noise. After the feature set enrichment, the main problem is to deal with the high dimensionality of the feature set, choosing an appropriate subset for classification. This task is even more compelling due to the increased sample scarcity condition, since the total feature number has almost doubled. The feature selection task is needed to overcome the curse of dimensionality [10] by selecting a small amount of relevant features or, at least, by excluding a vast majority of irrelevant features, thereby improving the generalization properties and the interpretability of the output prediction model.
The objective of this chapter is to present the studied classification frameworks for microarray classification and to compare them with the state of the art. As a general overview, to produce a final prediction model for new samples it has been chosen to use two fundamental building blocks: the metagene generation step and a subsequent feature selection stage, whose output is a prediction model for classification. The metagene generation process has been covered in Chapter 3, while in this chapter the two developed feature selection approaches are detailed in Sections 4.1 and 4.3. The first one aims at developing and tuning a wrapper feature selection algorithm allowing mutation of previous choices, with the good stability and good scalability deriving from a deterministic approach. Several alternatives have been studied by introducing specific elements in the search process to deal with the small-sample scenario in microarray datasets. The second studied feature selection strategy is described in Section 4.3 and it consists in applying ensemble feature selection techniques to the specific case of microarray data.
In both cases, wrapper and ensemble feature selection, the opportunity to include metagenes in the selection process is evaluated, and a comparison among the different studied alternatives and with state of the art techniques is performed.
4.1 Wrapper feature selection
Wrapper feature selection has been chosen because of its flexibility in choosing features considering also multivariate relationships among them [46, 58]. Among the plethora of existing methods, we focus on evolutions of the sequential forward selection method (SFS) [130] because they add flexibility in the search process. In particular, one algorithm has been implemented:
• Improved Sequential Floating Forward Selection (IFFS) [94]: it is a sequential algorithm that allows backtracking after each sequential step to identify a better subset: after adding a feature to the subset, the algorithm looks for the possible benefits of eliminating one or more features. Furthermore, it introduces a replacing stage in case backtracking does not improve the classification performance. The price to pay is a considerable increase of execution time in the replacing phase, which does not grow linearly with the feature subset dimension. In Figure 4-1, the flowchart of the IFFS algorithm is presented.
4.1.1 The IFFS algorithm
The IFFS algorithm starts with an empty set and ends the search when a threshold value $\theta$ is reached. This threshold value is reached either because the selected number of features is equal to the desired maximum or because the algorithm is in a loop and has overcome
tial behavior depending on the error rate value. The $-\mathrm{sign}(r)$ factor in the exponent has been included to highly penalize features with negative reliability, while the $\delta$ parameter defines the steepness of the penalization. The $\delta$ value defines the $e^{-1}$ penalization interval: between two features with equal reliability value, a $\delta\%$ difference in the error rate induces an $e^{-1}$ penalization in the final score. So, when $\delta$ is small, the dominant parameter is the error rate (in the extreme case $\delta \to 0$ the reliability has no influence at all), while when $\delta$ is large the dominant parameter becomes the reliability (when $\delta \to \infty$ the error rate is not taken into account).
The second scoring rule is a linear combination of error rate and normalized reliability. The linear combination score is defined by Eq. 4.3. It is a weighted sum of the error rate $e$ and the normalized reliability value $r_n = (r - \min(r)) / \max(r)$. The $\delta$ parameter is bounded between 0 and 1 and it defines the relative weight of the reliability with respect to the error rate.
$$J = \delta \cdot r_n + (1 - \delta) \cdot (1 - e), \qquad \delta \in [0, 1] \tag{4.3}$$
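Eq. 4.3 translates directly into code. The sketch below is a minimal transcription of the formula and of the normalization given in the text; the function names are illustrative, and the normalization assumes that the largest reliability value is positive.

```python
def normalised_reliability(r, all_r):
    """r_n = (r - min(r)) / max(r), as defined in the text.

    Assumes max(all_r) > 0 so that the division is well defined.
    """
    return (r - min(all_r)) / max(all_r)

def linear_score(error_rate, r_n, delta):
    """Eq. 4.3: J = delta * r_n + (1 - delta) * (1 - e)."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * r_n + (1.0 - delta) * (1.0 - error_rate)

# Two candidate features (hypothetical values): A has the lower error rate,
# B is more reliable. The preferred one depends on delta.
score_a_low_delta = linear_score(error_rate=0.10, r_n=0.2, delta=0.1)
score_b_low_delta = linear_score(error_rate=0.15, r_n=0.9, delta=0.1)
score_a_error_only = linear_score(error_rate=0.10, r_n=0.2, delta=0.0)
score_b_error_only = linear_score(error_rate=0.15, r_n=0.9, delta=0.0)
```

With `delta=0.0` only the error rate counts and feature A wins; already at `delta=0.1` the higher reliability of B compensates for its slightly larger error, which a strict error-first ordering would never allow.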
This simple scoring rule allows a more flexible comparison of reliability values among features with different error rates. It shows a linear trend both in the error rate and in the reliability direction. The main change with respect to the former exponential penalization scoring is that, here, a constant penalization is added (not multiplied) for a constant error rate increase. Figure 4-3 illustrates the three score functions. It shows the score value assigned to points in the Error-Reliability space for the exponential combination, the linear combination and the lexicographic sorting case of [16].
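As a rough illustration of Eq. 4.3, the following Python sketch computes the normalized reliability and the linear combination score for two candidate features. The function names and the sample numbers are hypothetical, and the normalization follows the formula in the text literally, which assumes that max(r) is nonzero.

```python
import numpy as np

def normalized_reliability(r):
    """r_n = (r - min(r)) / max(r), as written in the text (assumes max(r) != 0)."""
    r = np.asarray(r, dtype=float)
    return (r - r.min()) / r.max()

def linear_score(error_rate, r_n, delta):
    """Eq. 4.3: J = delta * r_n + (1 - delta) * (1 - e), with delta in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    e = np.asarray(error_rate, dtype=float)
    return delta * np.asarray(r_n, dtype=float) + (1.0 - delta) * (1.0 - e)

# Hypothetical candidates: error rates and raw reliability values.
errors = np.array([0.10, 0.20])
reliability = np.array([1.0, 3.0])
r_n = normalized_reliability(reliability)
scores = linear_score(errors, r_n, delta=0.10)
```

With delta = 0 the score reduces to 1 − e, and with delta = 1 it reduces to the normalized reliability alone, which matches the role of δ as a relative weight.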
From Figure 4-3 it can be observed how, in the exponential combination case, the score has an exponential decrease along the Error dimension, while it has a linear trend in the Reliability dimension. For the linear combination case, the scores lie on a rotated plane in the space, with the rotation axis passing through the (0,1) and (1,0) points. It shows linear trends in both dimensions (Error and Reliability) with slopes equal to 1−δ and δ respectively. The lexicographic scoring is here visualized in a very coarse scenario in which only 10 different error values are allowed (imagine a test set composed of 10 samples only) in order to visualize its behavior. It is a stairway-like surface showing how the main dimension is the Error value. Only if two features share the same error value is the reliability taken into account (linear trend in the reliability direction); otherwise the score of a feature with smaller error rate is higher, regardless of the reliability value. From Figure 4-3 it can also be observed how both scoring rules combining reliability and error rate radically change the score surface. From a stairway-like surface (with discontinuities among error rate values), the score surface is transformed into a continuous surface in which the reliability values have more decisional power. This change is more important in a case with many test samples, because in such a scenario the lexicographic scoring would be like a stairway with many small steps, thus making the reliability parameter almost useless. Furthermore, it would be extremely sensitive to small error rate changes, while the new scoring methods are able to mix error rate and reliability in a more flexible form.

Figure 4-3: Score surfaces in the error-reliability space depending on the three scoring rules.
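The lexicographic behavior described above (error rate first, reliability only as a tie-breaker) can be sketched as a two-key sort. This is a minimal illustration with hypothetical names and values, not the implementation of [16].

```python
def lexicographic_order(errors, reliabilities):
    """Return feature indices from best to worst.

    Lower error always wins; reliability (higher is better) is consulted
    only when two features share exactly the same error value.
    """
    return sorted(range(len(errors)),
                  key=lambda i: (errors[i], -reliabilities[i]))

# Hypothetical candidates: feature 0 has the highest reliability but a worse error.
errors = [0.2, 0.1, 0.1]
reliabilities = [5.0, 1.0, 2.0]
order = lexicographic_order(errors, reliabilities)
```

Feature 0 is ranked last despite its high reliability, which is the stairway effect: any error difference, however small, dominates the ordering.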
As can be observed, the definitions of the exponential combination and of the linear combination in Eq. 4.2 and Eq. 4.3 both depend on a parameter (i.e. δ) that must be chosen beforehand. This parameter dependence implies an optimization study to choose the best δ value for classification. Thus, both the linear combination and the exponential penalization rules define a whole set of alternatives. It will be shown in Section 4.2 how the predictive ability also depends on the chosen parameter value.
4.2 Experimental results of wrapper feature selection
In this section, the classification framework adopting the wrapper feature selection process is evaluated to determine the best setup considering all the introduced elements. The evaluation has a twofold purpose: on one side, the usefulness of introducing the hierarchical structure and the metagenes is assessed and, on the other side, an evaluation protocol is defined to find the best setup in terms of clustering distance (·,·) (i.e. to compare Treelets and Euclidean clustering from Chapter 3) and fitness measure (i.e. the ranking score rules from 4.1.2).
The evaluation is performed by applying all the algorithms to a cohort of publicly available data from the MAQC study [112], and the algorithms are compared by means of their predictive ability. Once the best alternative is chosen by defining the clustering type (Treelets or Euclidean) and the ranking score rule (lexicographic sorting, linear combination or exponential penalization), additional studies are performed to assess the statistical robustness of the obtained results with Monte Carlo simulations and analyses on synthetic datasets.
In 4.2.3 and 4.2.4, different sources of variation are studied and the top performing algorithm is compared to alternatives that change the metagene generation rule g(·,·) or the wrapper classifier. In 4.2.3, the Haar wavelet is used instead of PCA to generate the metagenes, while in 4.2.4 a linear SVM is chosen as classifier rather than LDA. In both cases, results on publicly available data are compared to their counterparts applying PCA with the LDA classifier, chosen from the analysis in 4.2.2.
4.2.1 Dataset cohort
The analyzed data are a set of high quality datasets, provided by the MicroArray Quality Control study phase II as a common ground to test classification algorithms [112]. The analyzed data are a subset of the datasets provided by the MAQC II consortium: six datasets containing 13 preclinical and clinical endpoints coded A through M; for more information refer to [112]. Each endpoint corresponds to a different sample classification, so that the same dataset can be classified following different criteria (e.g. treatment, outcome, sex, random, etc.). Four out of six datasets have been used, corresponding to endpoints A and C to I of [112], available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16716. A detailed explanation of the endpoint composition is included in Table 4.1. These data have been chosen because they are highly reliable, selected after a quality control process in order to provide a common test ground, and because for each endpoint both a training set and an independent validation set are provided [112]. Furthermore, many different laboratories have tested their algorithms on the same datasets with the same evaluation protocol (i.e. training the classifiers on the training set and assessing performance on the validation dataset) and published their final outcome [112, 100, 83]. Thus, an accurate benchmark can be performed to understand how well a proposed algorithm performs with respect to a large number of state of the art alternatives. Results are compared in terms of the Matthews Correlation Coefficient (MCC) [89] since, as stated in [112], it is informative when the distribution of the two classes is highly skewed, it is simple to calculate, and it is available for all models with which the proposed method has been compared. It is defined by:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \qquad (4.4)$$
where TP is the number of true positives identified by the classifier, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. A true positive is a sample categorized as positive, P, in Table 4.1 and correctly classified as positive by the classifier. The remaining values TN, FP and FN are defined accordingly. The MCC can assume values from 1 (perfect classification) to −1 (perfect inverse classification).
4.2.2 Clustering distance & scoring measure comparison
The aim of this section is to assess the usefulness of the metagenes, comparing the predictive results of classifiers built with features obtained with Treelets clustering, with Euclidean clustering and without adding any metagenes. In all cases, the IFFS algorithm introduced in Section 4.1.1 has been consistently adopted. Meanwhile, the three scoring systems for IFFS feature selection presented in Section 4.1.2 (lexicographic sorting, exponential penalization and linear combination) are also evaluated in parallel analyses.
The experimental setup is a sequence of four main steps: data preprocessing, metagene creation, full-data analysis with some chosen δ values when the scoring depends on a parameter, and a final performance assessment in terms of MCC and predictive accuracy, comparing the obtained results with state of the art alternatives.

Table 4.1: Microarray datasets used for classification.

| Dataset | Endpoint description | Endpoint | Microarray platform | Training samples | Training P | Training N | Validation samples | Validation P | Validation N |
|---|---|---|---|---|---|---|---|---|---|
| Hamner | Lung tumorigen vs. non tumorigen | A | Affymetrix Mouse 430.2.0 | 70 | 26 | 44 | 88 | 28 | 60 |
| NIEHS | Liver toxicant vs. non toxicant | C | Affymetrix Rat 230.2.0 | 214 | 79 | 135 | 204 | 78 | 126 |
| Breast cancer | Pre-operative treatment response | D | Affymetrix Human U133A | 130 | 33 | 97 | 100 | 15 | 85 |
| Breast cancer | Estrogen receptor status | E | Affymetrix Human U133A | 130 | 80 | 50 | 100 | 61 | 39 |
| Multiple Myeloma | Overall survival milestone outcome | F | Affymetrix Human U133Plus 2.0 | 340 | 51 | 289 | 214 | 27 | 187 |
| Multiple Myeloma | Event-free survival milestone outcome | G | Affymetrix Human U133Plus 2.0 | 340 | 84 | 256 | 214 | 34 | 180 |
| Multiple Myeloma | Sex of the patient | H | Affymetrix Human U133Plus 2.0 | 340 | 194 | 146 | 214 | 140 | 74 |
| Multiple Myeloma | Negative control, random assignation | I | Affymetrix Human U133Plus 2.0 | 340 | 200 | 140 | 214 | 122 | 92 |
The data preprocessing step for all the datasets (except the Hamner one) consists in setting the minimum value to log2 10, in order to avoid considering small valued probe sets, followed by a log2(·) transformation and a mean removal operation along the samples direction (i.e. each feature is set to have zero mean), as is common practice in microarray analysis. The Hamner dataset, instead, needs to be normalized first because an important batch effect has been shown to worsen the performance of the validation analysis ([112], supplementary material). For this reason, data are first normalized using the robust multi-array normalization (RMA) procedure on the whole data space, training and validation sets. Subsequently they are processed exactly like the other datasets.
The metagene creation phase is performed as explained in Chapter 3, applying Treelets clustering, Euclidean clustering, or no clustering at all, in order to assess the usefulness of the metagenes for classification.
In the following step, the predictive performance of the alternatives is measured. As shown in subsection 4.1.2, both the exponential penalization and the linear combination depend on a δ parameter, so the algorithm has been tested on multiple δ values, chosen after a small study on a reduced version of the available data. For the linear combination rule, a range of δ values between 0.05 and 1 with a 0.05 interval has been tested. The best selected values are [0.05, 0.10, 0.15]. For the exponential combination, a range of δ values from 5 to 100 with an interval of 5 has been tested, choosing δ = [5, 10, 15] for further evaluation.
Once the δ values have been chosen, the analysis is performed on the complete datasets (genes and metagenes), applying the feature selection algorithm to train classifiers with up to five dimensions. In order to have a rigorous validation assessment, validation data are properly processed by setting the minimum to log2(10), subtracting the gene means calculated on the training set, and then producing the necessary metagenes using the coefficients from the hierarchical tree built on the training set. Results are collected for each δ, and the classifier obtaining the best MCC value is considered as the measure of the prediction potential of the method.
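The preprocessing protocol above (flooring, log2 transform, and centering with training-set means only) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, data are arranged as samples by genes, and the floor is assumed to be applied to raw intensities at 10 so that the minimum after the transform is log2(10). RMA normalization and metagene generation are not covered.

```python
import numpy as np

FLOOR = 10.0  # assumed raw-intensity floor, giving a minimum of log2(10) after the transform

def fit_preprocess(train_raw):
    """Floor, log2-transform and center the training set; return data and gene means."""
    logged = np.log2(np.maximum(np.asarray(train_raw, dtype=float), FLOOR))
    gene_means = logged.mean(axis=0)  # one mean per gene (column)
    return logged - gene_means, gene_means

def apply_preprocess(raw, gene_means):
    """Process validation data with the means estimated on the training set."""
    logged = np.log2(np.maximum(np.asarray(raw, dtype=float), FLOOR))
    return logged - gene_means

# Hypothetical toy data: 2 training samples and 1 validation sample, 2 genes.
train = [[5.0, 40.0], [20.0, 160.0]]
valid = [[10.0, 80.0]]
train_proc, means = fit_preprocess(train)
valid_proc = apply_preprocess(valid, means)
```

Reusing the training means on the validation set keeps the validation samples from influencing any quantity estimated during training.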
Results analysis and assessment
The experimental results obtained following the experimental protocol are presented and discussed here. In Table 4.2, the mean MCC and accuracy results across the analyzed endpoints from [112] (A, C, D, E, F, G, H, I) are shown. Each datXX expression identifies a different classifier developed by a different research group involved in the MAQC study. The datXX values are those whose results are reported in [112]. As can be observed, the MCC results in Table 4.2 span a range from 0.284, corresponding to dat3, to the 0.490 obtained by the dat24 group, while the accuracy values span from 65.43% for dat3 to 83.86% for dat20. The best alternative is different depending on the chosen measure. This variation is linked to the class distribution skewness, which can lead an algorithm to have a high accuracy but a very low or null MCC value. This is exactly what happens to dat20 when analyzing endpoint F: it has 87.38% accuracy while MCC = 0, because it assigns all the samples to a single class, which corresponds to 87.38% of the validation set samples.
Table 4.2: MAQC mean MCC and mean Accuracy results

| Group | MCC | Accuracy | Group | MCC | Accuracy |
|---|---|---|---|---|---|
| dat3 | 0.284 | 65.43% | dat11 | 0.453 | 75.59% |
| dat33 | 0.300 | 66.04% | dat36 | 0.457 | 79.18% |
| dat7 | 0.307 | 71.04% | dat10 | 0.458 | 78.39% |
| dat19 | 0.384 | 79.52% | dat4 | 0.468 | 81.49% |
| dat29 | 0.397 | 81.78% | dat12 | 0.476 | 82.54% |
| dat35 | 0.419 | 77.69% | dat25 | 0.477 | 80.81% |
| dat18 | 0.428 | 77.29% | dat13 | 0.488 | 80.67% |
| dat32 | 0.431 | 78.89% | dat24 | 0.490 | 81.13% |
| dat20 | 0.443 | 83.86% | | | |
The MCC value better evaluates the performance of the scheme, particularly in cases of uninformative classification. The I endpoint is not considered in the mean calculations because it is a negative control dataset on which algorithms should produce bad results, since class memberships have been randomly defined (see Table 4.1). Results in Table 4.2 are organized by increasing MCC value along each column.
In Tables 4.3, 4.4 and 4.5, the results of applying the proposed framework to the datasets from Table 4.1 are presented. Each table includes the results pertaining to a different scoring rule: the lexicographic sorting, the exponential penalization or the linear combination.
In Table 4.3, the mean MCC and accuracy values with the lexicographic scoring are shown. Each column reports the results corresponding to a different metagene generation method: Treelets clustering, Euclidean clustering, or None. The None column corresponds to the results when no metagene has been considered. As for the methods reported in [112], the I endpoint is not considered in the mean calculation due to its random nature. As can be seen, the introduction of metagenes allows obtaining higher mean MCC and accuracy values, thus producing better classifiers. With the lexicographic sorting, the best MCC result is 0.423, with 77.46% accuracy, when Treelets clustering is chosen as the metagene generation method.
Table 4.4 contains the values collected applying the exponential penalization scoring rule. Results are organized in four columns. The left column specifies the δ parameter, while the remaining three columns are organized as in Table 4.3. Changing the scoring rule leads to remarkably better results than those in Table 4.3. The simultaneous use of both the error rate and the reliability allows reaching better performance. Here also, results with metagenes are better than without, and the best result is obtained when Treelets clustering is adopted and δ is equal to 10. Finally, the best mean MCC value is even higher than the best one of Table 4.2, from dat24. There, the best MCC is 0.490, while here 0.495 is reached, supporting the proposed framework as an excellent alternative to state of the art methods. Concerning the accuracy values, with Treelets clustering and δ = [5, 10], better results than those in Table 4.2 are obtained. The highest accuracy value is 84.02%, obtained with δ = 5.
In Table 4.5, the results relative to the linear combination score are shown. The organization is the same as in Table 4.4. In this case too, the metagenes have proved to be useful for classification because the results obtained with Treelets or Euclidean clustering are better than without. A comparison with the lexicographic sorting shows how, generally, the mean results are higher. In this case, the best mean MCC is 0.486 when Euclidean clustering is adopted and the δ parameter is between 0.1 and 0.15, while the highest accuracy value is 83.60% when δ is set to 0.10. Observing the results using both the linear combination and the exponential penalization rule, the MCC values are quite stable to small variations of the δ parameter. This is a good property because there is no need to precisely optimize the δ value.
To visualize the performance of the proposed algorithm in comparison with the state of the art alternatives from [112], Figures 4-4 and 4-5 are introduced. In Figure 4-4, the results are sorted by increasing mean MCC value and are represented as columns. The MCC value of each alternative is printed above each column, and the corresponding method is indicated below. In Figure 4-5, the accuracy values are presented, sorted by increasing values. All the results from Table 4.2 are included and painted as uniform light gray bars. For space and clarity reasons, not all the results obtained with the proposed framework are included. A selection of them is proposed, representing only the best three results for the exponential penalization and for the linear combination rule, and the overall best result with the lexicographic sorting. The result from the lexicographic sorting scheme is painted as a black bar and is identified by the Lexicographic label. Results applying the linear combination scheme are highlighted by a pattern of black and white horizontal lines. The labels start with lin_xx_E, where _xx is the δ value multiplied by 100 and _E indicates that the Euclidean clustering has been used. The values corresponding to the exponential penalization scoring rules are coded as dark gray columns. The labels are coded by exp_xx_T, where _xx is the δ value and _T indicates that the Treelets clustering has been adopted. As can be observed in Figure 4-4, the proposed framework obtains results comparable to the best state of the art alternatives when the linear combination scoring or the exponential penalization rule are used. Furthermore, exp_10_T obtains the best overall mean MCC value. From Figure 4-5 it can be observed how both exp_10_T and exp_5_T obtain better values than the compared state of the art alternatives. Furthermore, it is shown how the accuracy value too is robust to small δ variations.

Figure 4-4: Mean MCC values comparison between MAQC results and the best alternatives of the different scoring techniques adopted. (Bar chart; vertical axis: mean MCC value; legend: MAQC results, Lexicographic, Linear combination score, Exponential penalization score. Bar values in increasing order: 0.284, 0.300, 0.307, 0.384, 0.397, 0.419, 0.423, 0.428, 0.431, 0.435, 0.444, 0.453, 0.457, 0.458, 0.468, 0.474, 0.476, 0.476, 0.483, 0.486, 0.486, 0.488, 0.490, 0.495.)

Figure 4-5: Mean accuracy values comparison between MAQC results and the best alternatives of the different scoring techniques adopted. (Bar chart; vertical axis: mean accuracy, from 50% to 85%; same legend as Figure 4-4.)
The mean number of features chosen by the presented alternatives ranges from 2.14 for exp_10_T to 3.43 for lin_10_E. As can be seen, the metagene creation process has almost doubled the number of features compared to the original number of genes, but the final classifier actually uses a very low number of features to perform the classification.
Statistical analysis of the top performing algorithm
The proposed framework provides competitive performance with respect to the state of the art alternatives. To validate this result, a further study has been performed to assess the robustness of the obtained performance. The study consists in a 50-run Monte Carlo analysis of the classification endpoints. This 50-run setup has been proposed to have a broader range of experiments to assess the performance stability linked to the use of cross validation as performance estimation method, which is known to have a large variance [21]. In each run, the framework setup is the same as the best alternative: Treelets clustering as metagene generation method and exponential penalization with δ = 10 as scoring rule for feature selection.
The results are shown in Figure 4-6 as a boxplot where the gray box represents the interval between the 25th and the 75th percentiles and the black crosses are values considered outliers. In Table 4.6 some statistical results are presented. Each column in Figure 4-6 corresponds to a different endpoint, labeled along the x axis. For each column, an asterisk identifies the MCC value obtained in the previous study (the values used to obtain the mean MCC value in Figure 4-4); these values are included in the last columns of Table 4.6 under the label run 0.
The values are collected in the same way as in the run 0 iteration: for each endpoint, classifiers have been built with up to five features and the best one is then considered in the mean calculation. Results for each endpoint are presented separately to better identify how the algorithm performance can change depending on the analyzed data.
What can be observed from both Figure 4-6 and Table 4.6 is that the results show a high robustness in the analysis of most endpoints. The obtained values are tight around their mean value for the endpoints A, C, D, E, H, G and I. The mean values are very close to the run 0 results. The mean values in all these endpoints are slightly higher than the
Figure 4-10: Mean MCC results comparison between PCA and Haar metagene generation rules.
[...] applying the original PCA transform are represented by orange columns, while the current results applying the Haar basis transform are coded by blue bars. In Table 4.8, the same mean MCC values from the Monte Carlo simulation are reported. It can be observed how the Haar alternative obtains an overall MCC mean higher than the original PCA version.
Analyzing the results, it can be observed how there is little to no difference between the mean performance obtained applying either the Haar or the PCA transform to generate a metagene. The difference is relevant in the F endpoint and, strictly speaking, using the Haar basis to produce metagenes leads to better results in 5 out of 7 datasets. As a general conclusion, Haar basis decomposition can be a valid alternative to PCA as metagene generation rule, since the mean results are slightly better and the metagene generation process is easier than the original PCA implementation.
4.2.4 Classie compa ison: LDA and linea SVM
SVMs are very powerful tools for learning and classification tasks. They are used in a very broad spectrum of applications, including microarray classification with very simple kernels [125, 58]. Since SVMs are commonly used in machine learning and for microarray classification [112, 114], they have been considered as a possible alternative to the reference classifier, LDA.
The predictive results applying LDA have been compared with new results obtained by replacing LDA with a linear SVM. The reason for choosing a linear SVM rather than nonlinear kernel SVMs, such as radial basis functions or polynomial kernels [125], lies in the reliability measure formulation. The reliability has been designed for linear boundary classifiers which work in the space covered by the gene expressions. Its behavior when the decision space is augmented with a kernel is not known or well understood, but there is a relevant probability that it may be biased by the nonlinear components, which would reduce the discriminative effect of the current reliability formulation.

Figure 4-11: Mean MCC values on MAQC datasets comparing the LDA classifier and the linear SVM implementation.
The linear SVM classifier has then been used with the default parameters of the libsvm implementation [26], because the sample size is too small to effectively perform a parameter estimation through internal cross validation, and because such a parameter estimation process would imply an enormous increase in computation time.
The experimental process is the same as in Section 4.2, in which the MAQC datasets are analyzed with a 50-run Monte Carlo simulation, but for the SVM case the number of iterations has been limited to 10 due to the much longer computation time compared to the LDA implementation.
The results in Figure 4-11 correspond to the mean Matthews Correlation Coefficient (MCC) values [89]. What can be observed in Figure 4-11 is that the results obtained with LDA are significantly better than those obtained using linear SVM. In 6 out of 7 datasets the mean MCC value is higher using LDA than using SVM.
Overall, the choice of LDA instead of SVM with a linear kernel appears to be the right one for the proposed feature selection algorithm. SVM classification can probably be improved with proper parameter tuning, but that would require more samples to be effective and would surely imply an increase in computation time (e.g. 10-fold for a 10-fold cross validation tuning).
4.3 Ensemble feature selection
Ensemble learning combines multiple learning algorithms, called experts, to improve the overall prediction accuracy, and it has been extensively adopted in the literature [136]. A plethora of ensemble methods has been developed to analyze biological data, and many alternatives exist, reviewed for example in [136, 72]. They became popular because they improve the classification by aggregating multiple experts to make decisions over unseen data in a consensus way. In order to effectively improve the ensemble performance, the experts should be accurate (i.e. better than random) and diverse from each other [136].
An approach to ensemble learning called overproduce and select is described in [72] as a method to obtain good ensemble learners. It consists in producing a big set of experts and then selecting a subset which will be used for classification via majority voting. Several criteria for expert selection are studied and compared in [72]. Among the considered algorithms, the one called Accuracy in diversity, AID [8], was able to reach the best prediction accuracy when compared to several alternatives [72].
In this thesis, two versions of the AID algorithm from [8] have been implemented and studied as reference ensemble algorithms. One is the original AID implementation and the other is a simplified version from Kuncheva's book [72], which will be named Kun.
To produce a huge and diverse set of experts, we decided to exploit the overabundance of features in microarray data. For each one of the available features, a Linear Discriminant Analysis classifier, LDA, is built and used as an expert. The available feature set is composed not only of the genes, but also of the metagenes built as explained in Section 3.1 with the Treelets algorithm.
The microarray characteristics of small sample size and large feature number have been considered as possible issues for the ensemble search process; therefore, novel elements have been introduced to adapt the original thinning algorithm to the microarray scenario. In addition to including metagenes as experts, the notion of nonexperts, which represent a set of experts excluded from the thinning process due to their poor properties, has been introduced, as well as a rule to break ties in the thinning process.
4.3.1 The reference ensemble algorithms
The principle on which the AID algorithm is based is to retain the most diverse and accurate classifiers by eliminating classifiers that are most often incorrect on examples that are misclassified by many experts. A pseudocode of the AID algorithm is shown in Figure 4-12. It is an iterative process in which, at each iteration, one expert is removed from the ensemble. At each iteration we consider a set of n samples and p experts [8]. To determine which expert E_i must be removed, some elements are calculated. The first one is an ensemble diversity measure called Percentage Correct Diversity Measure d [8], which is the percentage of samples which are correctly classified by a percentage of individual experts between 10% and 90%. The d measure is then combined with other parameters, µ and β, defined in Figure 4-12, which are used to identify a set of relevant points S_p for the current iteration. The S_p set is composed of all samples which are correctly classified by a percentage of experts between the two calculated boundaries. Finally, the expert E_i to be removed from the ensemble is the one with the lowest accuracy on the S_p set.
The rationale behind this is that the samples in S_p are those on which the ensemble is most uncertain, and thus those for which the elimination of an expert can be more relevant because it can change the ensemble majority voting. Therefore, excluding the expert that performs most poorly on these samples affects the ensemble accuracy more positively than simply excluding the expert with the overall lowest accuracy. Since the ensemble changes throughout the iterations, the d value changes, as well as the boundaries, meaning that the set of relevant samples adapts to the changing characteristics of the ensemble.
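The thinning iteration described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of [8]: the function name is hypothetical, the input is assumed to be a boolean matrix telling which expert classifies which sample correctly, the bounds are taken as transcribed in Figure 4-12, µ is assumed to be recomputed on the current ensemble at each iteration, and falling back to all samples when S_p is empty is an added assumption. Ties are resolved by taking the first expert found.

```python
import numpy as np

def aid_thinning(correct, beta=0.9):
    """Remove one expert per iteration until a single expert is left.

    correct: boolean array of shape (p experts, n samples); True where the
    expert classifies the sample correctly.
    Returns (removed, alive): removal order and the surviving expert indices.
    """
    correct = np.asarray(correct, dtype=bool)
    n = correct.shape[1]
    alive = list(range(correct.shape[0]))
    removed = []
    while len(alive) > 1:
        sub = correct[alive]
        f = sub.mean(axis=0)                  # fraction of experts correct on each sample
        d = np.mean((f >= 0.1) & (f <= 0.9))  # Percentage Correct Diversity Measure
        mu = sub.mean()                       # mean expert accuracy (assumed per iteration)
        lower = mu * d + (1.0 - d) / n        # bounds as transcribed from Figure 4-12
        upper = beta * d + mu * (1.0 - d)
        relevant = (f >= lower) & (f <= upper)
        if not relevant.any():                # assumption: use every sample if S_p is empty
            relevant = np.ones(n, dtype=bool)
        accuracy = sub[:, relevant].mean(axis=1)
        worst = alive[int(np.argmin(accuracy))]  # first index wins ties
        alive.remove(worst)
        removed.append(worst)
    return removed, alive

# Hypothetical toy ensemble: 3 experts, 4 samples.
toy = [[1, 1, 1, 1],
       [1, 1, 0, 1],
       [1, 0, 0, 0]]
removed, alive = aid_thinning(toy)
```

Each iteration yields one nested ensemble, so the sequence of removals can be replayed to evaluate ensembles of every size.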
In [8] it is stated that the adaptive boundaries that define the S_p set are obtained by considering the known relationship between the mean accuracy of the experts and the ensemble diversity [72]. On the other side, in [72] it is remarked how the AID algorithm could have equivalent performance with fixed boundary values, suggesting the use of the ones in the calculation of the d measure: 10% and 90%. Since we could not find any works comparing the two alternatives, we chose to apply both and keep the one with better performance.

Table 4.3: Mean results adopting the lexicographic scoring scheme

Lexicographic sorting
| Treelets MCC | Treelets Accuracy | Euclidean MCC | Euclidean Accuracy | None MCC | None Accuracy |
|---|---|---|---|---|---|
| 0.423 | 77.46% | 0.418 | 76.18% | 0.381 | 75.48% |

Table 4.4: Mean results adopting the exponential penalization scoring scheme

Exponential penalization
| δ | Treelets MCC | Treelets Accuracy | Euclidean MCC | Euclidean Accuracy | None MCC | None Accuracy |
|---|---|---|---|---|---|---|
| 5 | 0.475 | 84.02% | 0.457 | 81.57% | 0.442 | 82.99% |
| 10 | 0.495 | 83.95% | 0.460 | 83.61% | 0.421 | 82.66% |
| 15 | 0.483 | 83.67% | 0.451 | 83.187% | 0.457 | 83.30% |

Table 4.5: Mean results adopting the linear combination scoring scheme.

Linear combination
| δ | Treelets MCC | Treelets Accuracy | Euclidean MCC | Euclidean Accuracy | None MCC | None Accuracy |
|---|---|---|---|---|---|---|
| 0.05 | 0.483 | 83.45% | 0.437 | 81.58% | 0.444 | 81.46% |
| 0.10 | 0.468 | 83.31% | 0.486 | 83.60% | 0.444 | 81.46% |
| 0.15 | 0.469 | 83.25% | 0.486 | 83.19% | 0.444 | 81.46% |

    Samples S = s_1 ... s_n
    Experts E = E_1 ... E_p
    while #E > 1
        Calculate S_d = {s_i} : 0.1 ≤ f(s_i) ≤ 0.9,
            where f(s_i) is the fraction of experts in the ensemble correctly classifying the i-th sample.
        Calculate d = #S_d / n
        Lower bound: lb = µ·d + (1 − d)/n
        Upper bound: Ub = β·d + µ·(1 − d)
        Define the set of relevant samples: S_p = {s_i} : lb ≤ f(s_i) ≤ Ub
        E_i = expert with lowest accuracy over the S_p set
        E := E − E_i    (remove E_i from E)
    end
    µ = mean experts accuracy
    β = 0.9

Figure 4-12: Pseudocode of the AID algorithm.

Table 4.6: Statistical properties of the Monte Carlo simulation.

| Endpoint | MCC | Accuracy | Run 0 MCC | Run 0 Accuracy |
|---|---|---|---|---|
| A | 0.2176 | 67.37% | 0.2750 | 65.91% |
| C | 0.7949 | 90.25% | 0.7700 | 89.22% |
| D | 0.3869 | 80.49% | 0.3690 | 80.00% |
| E | 0.7732 | 89.17% | 0.7680 | 89.00% |
| F | 0.1147 | 86.3% | 0.1800 | 87.85% |
| G | 0.1723 | 79.57% | 0.2430 | 82.71% |
| H | 0.8609 | 93.21% | 0.8550 | 92.99% |
| I | 0.0564 | 55.14% | 0.0510 | 52.68% |
4.3.2 Microarray adaptations to thinning
Considering the characteristics of microarray data, we propose some key points to obtain a good ensemble system:
Experts cohort
We chose to build thousands of experts, defining each expert as an LDA classifier trained on a different feature. Both genes and metagenes, obtained with the algorithm from Chapter 3, are considered as individual features, since metagenes helped in finding better classifiers than with genes only.
Nonexperts
We introduce the notion of nonexpert to remove a whole set of "experts" with poor training characteristics. We decided to exclude from the thinning process all those experts that classify all the training samples with the same label. Since such an expert is unable to distinguish the two classes, it is not considered a useful ensemble component. The number of nonexperts can vary depending on the data type, and it increases when the class distribution is highly skewed. Furthermore, the idea of nonexperts responds to the feature overabundance that characterizes microarray data: most of the available features are useless for prediction purposes since they are not related to the classified phenomenon. Thus, we included this simple criterion in the thinning process.
Tie b eaking
Conside ing he ypical case o small sample numbe o mic oa ays and
conside ing ha he
Sp
sample is smalle o equal o he whole aining sample numbe ,
66

Table 4.7:
Resul s o he s udy based on syn he ic da a. The h ee sub ables co espond o
he h ee die en da a dis ibu ions. Each sub able is o ganized showing he alues depending
on he skewness alue and he die en size o he aining se . The
T ain
column con ains
he size o he aining se , he
MCC
columns shows he mean MCC alue ac oss he die en
expe imen al condi ions and Mon e Ca lo i e a ions while
S d
and
#F
columns con ain he MCC
s anda d de ia ion and he mean numbe o selec ed ea u es espec i ely.
Skewness - Class 1 pe cen age -
50% 70% 90%
Redundan model
T ain MCC S d # F MCC S d # F MCC S d # F
60 0.509 0.120 4.50 0.431 0.140 4.83 0.319 0.193 3.58
120 0.532 0.086 2.58 0.468 0.117 4.67 0.323 0.143 3.50
180 0.545 0.071 2.75 0.492 0.086 3.33 0.346 0.120 7.33
Syne ge ic model
T ain MCC S d # F MCC S d # F MCC S d # F
60 0.343 0.184 4.58 0.315 0.187 2.92 0.325 0.239 1.83
120 0.431 0.133 5.42 0.351 0.143 5.83 0.266 0.221 4.75
180 0.475 0.108 5.50 0.407 0.109 5.92 0.257 0.189 6.50
Ma ginal model
T ain MCC S d # F MCC S d # F MCC S d # F
60 0.509 0.159 6.58 0.555 0.150 3.25 0.490 0.193 2.17
120 0.549 0.148 7.50 0.610 0.135 4.92 0.542 0.211 2.92
180 0.570 0.139 7.75 0.631 0.137 8.17 0.572 0.193 4.00
A C D E F G H AVG
PCA 0.253 0.788 0.291 0.769 0.088 0.155 0.863 0.458
Haa 0.251 0.797 0.337 0.769 0.095 0.156 0.871 0.468
Table 4.8:
Mean MCC esul s om Mon e Ca lo simula ion on MAQC da ase s. The wo
algo i hms die om he me agene gene a ion ule, PCA e sus Haa basis decomposi ion.
he e is a ele an p obabili y o ha e ies when compa ing expe s accu acies. To deal
wi h his p oblem and in oduce a ule, he me agene gene a ion p ocess is conside ed.
When ies occu , he excluded expe is he one which has been gene a ed a a highe le el
in he hie a chical ee, so ha me agenes composed o many child en wi h low simila i y
will be elimina ed ins ead o ano he me agene wi h mo e co ela ed componen s. Indeed
i is mo e likely ha a me agene wi h mo e co ela ed child en will eplica e i s beha io
han ano he one me ging many die en indi idual genes. Finally, he ies be ween indi-
idual genes a e andomly esol ed since hey all a e on he same le el o he hie a chical
ee.
The use ulness o hese h ee elemen s is assessed by expe imen s compa ing he com-
ple e algo i hm wi h h ee modied algo i hms, each o which does no use one o he
67
Table 4.9: MCC results comparing the studied AID and Kun algorithms. Entries marked n/a are undetermined MCC values.
             A      C      D      E      F      G      H      MEAN
AID          0.293  0.793  0.459  0.789  0.221  0.231  0.813  0.514
Kun          0.407  0.812  0.459  0.789  0.221  0.236  0.828  0.533
Kun_tie      0.303  0.804  0.451  0.789  0.221  0.236  0.828  0.519
Kun_genes    0.346  0.781  0.366  0.773  n/a    0.313  0.817  0.485
Kun_all      n/a    0.792  n/a    0.789  n/a    n/a    0.031  0.230
proposed key elements.
4.3.3 Ensemble algorithms comparison
Two experiments are performed to evaluate the ensemble reference algorithms and to evaluate the usefulness of the introduced microarray adaptation elements. The first experiment evaluates whether the original AID algorithm [8] or the simplified version in [72] has better performance. They will be identified by AID and Kun respectively. Both algorithms are trained on the seven datasets. For each dataset they produce thousands of nested ensembles, one for each iteration. These ensembles are then applied on independent validation datasets and the best performing ensemble is taken as representative of the predictive potential of the algorithm, as in [84, 15]. In order to avoid voting artifacts, only ensembles with an odd number of experts are considered.
The chosen performance metric is the Matthews Correlation Coefficient (MCC) [89], since, as stated in [112], it is informative when the distribution of the two classes is highly skewed, it is simple to calculate, and it is available for all the models with which the proposed method has been compared. MCC values range from -1 (i.e. perfect inverse prediction) to 1 (perfect prediction).
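For reference, the MCC can be computed from the four entries of a binary confusion matrix with its standard formula. The sketch below is only an illustration; returning `None` when the denominator is zero is a choice made here to flag the undetermined case that arises when every sample is assigned to a single class.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns None when the denominator is zero (undetermined MCC).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=5, tn=5, fp=0, fn=0)         # every sample correct
inverse = mcc(tp=0, tn=0, fp=5, fn=5)         # every sample wrong
undetermined = mcc(tp=0, tn=10, fp=0, fn=0)   # everything predicted as one class
```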
The second experiment has the same setup as the first one, but it evaluates the usefulness of the elements introduced in Section 4.3.2: the nonexpert notation, the metagene inclusion and the tie breaking rule. Three algorithms are compared to the original one. Each one applies two of the three elements and is identified, for the Kun algorithm, by:
• Kun_all: This algorithm does not exclude the nonexperts from the thinning process.
• Kun_genes: This algorithm excludes the nonexperts but does not use any metagene.
• Kun_tie: This algorithm resolves each tie without considering the tree structure, thus eliminating the first expert it encounters with the lowest accuracy on the S_p set.
Figure 4-13: Mean MCC results comparison with state of the art results from [112] and from Section 4.2.2.
Finally, the best performing algorithm is compared to state of the art alternatives from the MAQC study [112] and to the best results in Section 4.2.2. In this way it is also possible to compare the differences introduced by the ensemble thinning algorithm with respect to the algorithm from Section 4.2.2, which uses the same features but with a different feature selection algorithm.
4.3.4 Comparison with state of the art
Table 4.9 shows the MCC results of all the algorithms studied in this work. Each dataset corresponds to a column and the last column is the mean MCC value across the datasets; the higher the value, the better the algorithm is considered for prediction. The comparison between the AID and the simplified Kun algorithm can be made by observing the first two lines in Table 4.9. The Kun algorithm obtains a better overall mean MCC value and, in every single dataset, it obtains better or equal MCC values. It can be stated that the simpler Kun algorithm achieves better prediction results and should be preferred to the AID algorithm.
In the last four rows of Table 4.9, the main proposed innovations are analyzed by comparing the full Kun algorithm with three algorithms, each one excluding a different aspect. They are organized by decreasing mean MCC, so that it can be straightforwardly seen which algorithm obtains the best performance and how much each of the key elements affects the final results. Globally, the Kun algorithm obtains better results, with an overall MCC of 0.533, and the introduced elements have different impacts. The tie breaking rule is the least influential factor, since Kun_tie obtains a mean MCC of 0.519. The inclusion of metagenes as individual features has a significant influence on the predictive ability, as an MCC of 0.485 is obtained without it. Here too, the metagenes are useful for classification as in Section 4.2.2, and not using them can lead to undesirable MCC values, since the missing values represent an undetermined MCC due to a null denominator. This happens when all the validation samples are assigned to one class [15]. Finally, the most important of the introduced elements is the nonexpert definition. Not including this concept leads to very poor results and, more importantly, to undetermined MCC values in many of the analyzed datasets. This is due to the fact that all nonexperts agree on every sample, thus strongly biasing the ensemble vote.
From the results in Table 4.9, the best performing algorithm is the full Kun, and all the introduced adaptations helped in obtaining such results. In Figure 4-13, the mean MCC value of the Kun algorithm is compared with state of the art alternatives. The vast majority, all the datxx columns, correspond to the mean MCC values from the MAQC study [112]. In addition to them, the column labeled IFFS is the mean MCC value from the best results of Section 4.2.2, which makes use of the same features, genes and metagenes, but adopts the IFFS feature selection algorithm. The state of the art algorithms are represented as solid gray columns, while the Kun mean MCC value is represented by a black and white straight lines pattern.
It can be observed that the Kun algorithm obtains a remarkable improvement when compared to state of the art alternatives and, comparing the shown results with the mean values in Table 4.9, that several of the tested algorithms would have obtained better than state of the art results. This confirms the value of ensemble thinning as an approach to combine multiple experts for classification [72].
4.3.5 Tuning the ensemble
From the results in Section 4.3.4, it appears that the Kun algorithm is a valid alternative for the feature selection task, and that the microarray adaptation elements have helped in obtaining the final result. To further explore the potential of the Kun feature selection
Figure 5-1: Toy example of a small knowledge database matrix where each row is a different gene while columns are attributes. A black dot represents that a gene has a specific attribute.
domain experts. C3 is composed of motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes. Finally, C5 collects gene sets sharing the same Gene Ontology term [115].
These data are publicly available and can be represented as a binary matrix M whose rows are the different genes, while the columns represent the MSigDB gene sets. A toy example of a possible knowledge matrix is shown in Figure 5-1, where each black dot represents the presence of a gene/gene-set correspondence. As can be observed, the matrix is sparse, and this is a characteristic of the real knowledge matrix built from the MSigDB data. The actual information from the MSigDB C2, C3 and C5 gene sets is coded in a knowledge matrix M composed of 22680 unique gene identifiers and 5607 MSigDB gene sets. The M matrix is then used as knowledge database for the clustering process.
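A binary knowledge matrix of this kind can be built from a mapping of gene sets to their member genes. The sketch below is only illustrative: the set names `SET_A` and `SET_B` are hypothetical placeholders, not actual MSigDB identifiers, and a dense list of lists is used for readability even though the real matrix is sparse.

```python
def build_knowledge_matrix(gene_sets):
    """Build a binary gene-by-gene-set matrix.

    gene_sets maps a gene-set name to the set of genes it contains.
    Returns (genes, set_names, matrix) where matrix[i][k] is 1 when
    gene i belongs to gene set k and 0 otherwise.
    """
    genes = sorted({g for members in gene_sets.values() for g in members})
    set_names = sorted(gene_sets)
    matrix = [[1 if g in gene_sets[s] else 0 for s in set_names] for g in genes]
    return genes, set_names, matrix


toy_sets = {"SET_A": {"ESR1", "HRAS"}, "SET_B": {"HRAS"}}  # hypothetical sets
genes, set_names, M = build_knowledge_matrix(toy_sets)
```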
5.1.1 The hierarchical clustering process
The hierarchical clustering process is a pair-wise iterative process merging feature pairs to produce new hierarchical levels and new features called metagenes, as described in Section 3.1 and illustrated in Figure 3-2.
In Section 3.1, features are merged by measuring the Pearson correlation between two gene expressions, and each metagene is built as the first principal component of the local Principal Component Analysis (PCA) over the two features to be merged [78]. Starting from the second merging step, metagenes and genes are both considered as features, so the similarity must be calculated for all genes and metagenes.
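One merging step can be sketched in a few lines. This is a simplified illustration rather than the thesis code: it assumes the local PCA is computed on the sample covariance of the two expression profiles, uses the closed-form principal axis of a 2x2 symmetric matrix, and applies the resulting unit-norm weights to the raw profiles. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def first_pc_weights(x, y):
    """Unit-norm weights of the first principal component of two features."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((u - mx) ** 2 for u in x) / (n - 1)                   # var(x)
    c = sum((v - my) ** 2 for v in y) / (n - 1)                   # var(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (n - 1)  # cov(x, y)
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of the leading eigenvector
    return math.cos(theta), math.sin(theta)


def merge(x, y):
    """Metagene as the weighted sum of the two merged features."""
    a1, a2 = first_pc_weights(x, y)
    return [a1 * u + a2 * v for u, v in zip(x, y)]


gene_i = [1.0, 2.0, 3.0, 4.0]
gene_j = [2.0, 4.0, 6.0, 8.0]
a1, a2 = first_pc_weights(gene_i, gene_j)
metagene = merge(gene_i, gene_j)
```

For these two perfectly correlated toy profiles the weights are proportional to (1, 2), as expected for the direction of largest variance.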
In order to incorporate the information from the knowledge matrix M, changes to the similarity metric have been studied and are discussed in Sections 5.2 and 5.3. To this end, for each feature pair $(f_i, f_j)$, two quantities are calculated: $d_n(f_i, f_j)$, which is the numerical similarity as in Section 3.1, and $d_k(f_i, f_j)$, the knowledge similarity. The global pairwise similarity is then defined as a combination of these two measures:
$$d(f_i, f_j) = f\big(d_n(f_i, f_j),\, d_k(f_i, f_j)\big) \qquad (5.1)$$
In Section 5.2, the studied similarity measures to define $d_k(f_i, f_j)$ are presented and discussed, while in Section 5.3 the combination of $d_n$ and $d_k$ is analyzed, proposing two alternatives to define the final pairwise similarity.
5.2 Biological similarity measures
The introduction of the biological similarity in the clustering process brings to light some questions. The first one is which measure should be adopted to quantify how much two genes are alike. The second one is related to the clustering process and regards how the similarity measure can be integrated within it when generating metagenes as linear combinations of genes.
In the literature, plenty of similarity measures have been proposed to work with binary, categorical or continuous data [14, 4, 132]. Since the knowledge matrix format is binary, we chose to search the literature for suitable measures in the binary and categorical field. From our research, we chose four different measures, considering also the sparsity of the knowledge matrix and the computational feasibility of the measures. The chosen measures are the following:
• Anderberg: This measure has been proposed in [4] and assigns more importance to rare matches and rare mismatches. It ranges in [0, 1]; the minimum value is attained when there are no matches, while the maximum value is reached when all attributes coincide between the compared features.
• Goodall: This is an adaptation of the original measure proposed by Goodall in [40], as presented in [14] under the name of Goodall3 to reduce the computational burden. This measure assigns a higher similarity to a match if the value is infrequent than if the value is frequent. Matches can be either of ones or of zeros. Its range is [0, 1] and it reaches one when all the attributes are the same.
• NoisyOR: This measure has been adopted in different works on microarray data analysis [77, 76]. It assumes the independence of the attributes and computes the integrated likelihood of each feature pair through a noisy-OR function over the reliability of each common attribute. It is calculated with the consensus reliability estimate from [76]. This measure ranges from 0 to infinity, so it is normalized to [0, 1] by dividing by the maximum attribute reliability value as in [77].
• Smirnov: Smirnov [113] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of this measure is [0, 2N], where N is the number of attributes, so it is divided by 2N to be bounded in [0, 1].
All four measures adopt different criteria to define the importance of each attribute in the definition of a global similarity measure between two features. An additional concern arises when the metagene generation process is considered and, precisely, when the new metagene is generated via PCA. This step in the clustering process has not been touched, and the new metagene is obtained as a linear combination of the two merged features considering only the expression values. This step has been preserved to maintain the noise reduction benefit of the metagenes. As far as biological similarity is concerned, in order for the clustering process to progress, a knowledge profile must be assigned to the newly created metagene. The knowledge profile is represented by adding a new row to the knowledge matrix M with the attributes corresponding to the metagene, which are not necessarily binary values. The metagene generation formula for the numerical data is presented in Eq. (5.2), for the case in which two features $(f_i, f_j)$ are merged to build the metagene $m_k$.
$$m_k = \alpha_1 f_i + \alpha_2 f_j \quad \text{with} \quad \alpha_1^2 + \alpha_2^2 = 1 \qquad (5.2)$$
For the knowledge profile of the metagene $m_k$ we chose to preserve as much as possible the linear combination from PCA while forbidding negative values, which may occur in a PCA. The result is shown in Eq. (5.3), where $M_i$ and $M_j$ are the knowledge matrix rows of the i-th and j-th features and $M_{m_k}$ is the new row assigned to the metagene.
$$M_{m_k} = \frac{|\alpha_1| M_i + |\alpha_2| M_j}{|\alpha_1| + |\alpha_2|} \qquad (5.3)$$
In this way the generated knowledge profile has non-negative values bounded in [0, 1], which allows the chosen similarity metrics to be used after a slight adaptation to accept continuous values instead of binary ones.
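Eq. (5.3) amounts to a convex combination of the two knowledge rows, weighted by the absolute PCA coefficients. A minimal sketch, with a hypothetical function name and toy rows:

```python
def metagene_profile(row_i, row_j, alpha1, alpha2):
    """Knowledge profile of a metagene as in Eq. (5.3).

    The absolute values of the PCA weights are normalised to sum to one,
    so every entry of the result stays in [0, 1].
    """
    w1, w2 = abs(alpha1), abs(alpha2)
    total = w1 + w2
    return [(w1 * a + w2 * b) / total for a, b in zip(row_i, row_j)]


# Toy rows: attribute 0 held only by feature i, attribute 2 shared by both.
profile = metagene_profile([1, 0, 1], [0, 0, 1], alpha1=0.6, alpha2=-0.8)
```

A shared attribute keeps the value 1, an attribute held by neither feature stays at 0, and an attribute held by only one feature receives a fractional value.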
Table 5.1 shows the mathematical expression of the four studied similarity metrics. For each similarity metric, two formulas are shown: the first one is the original definition, while the second one is the continuous value adaptation. Some notation has to be introduced to properly read Table 5.1. First of all, the knowledge matrix M is formed by N features (genes) and d attributes. Each column is a different attribute, while each row is a different feature. The notation $M_i$ denotes the i-th row of matrix M, which includes the attributes of an individual feature. $M_{i,k}$ identifies the element in the i-th row and k-th column of the knowledge matrix. Table 5.1 presents the equations to measure the similarity between the i-th and j-th features, thus meaning the i-th and j-th rows of the matrix M. With $K_{\cap ij}$ the subset of attributes shared between features i and j is defined: $K_{\cap ij} = \{k : 1 \le k \le d,\ M_{i,k} = M_{j,k}\}$, while with $K^c_{\cap ij}$ the complementary subset of $K_{\cap ij}$ is defined: $K^c_{\cap ij} = \{k : 1 \le k \le d,\ M_{i,k} \ne M_{j,k}\}$. We also define the notions of $f_k(x)$, $\hat{p}_k(x)$, $p^2_k(x)$ and $r_k$ as in [14, 77]:
• $f_k(x)$ is the number of times that the k-th attribute assumes the value $x \in \{0, 1\}$.
• $\hat{p}_k(x)$ is the sample probability of the value x of the k-th attribute:
$$\hat{p}_k(x) = \frac{f_k(x)}{N}$$
• $p^2_k(x)$ is another probability estimate of the value x within the k-th attribute:
$$p^2_k(x) = \frac{f_k(x)\,(f_k(x) - 1)}{N(N - 1)}$$
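• $r_k$ is the normalized consensus reliability estimate $\hat{r}_k$ of the k-th attribute, calculated as in [77, 76] in order to bound its value between 0 and 1:
$$r_k = \frac{\hat{r}_k}{\max_k(\hat{r}_k)}$$

Table 5.1: Biological similarity measure formulas $S_k(M_i, M_j)$. For each measure, the original formula and its adapted version for continuous variables are presented.

Anderberg
Original:
$$\frac{1}{N}\,\frac{\sum_{k \in K_{\cap ij}} \left(\frac{1}{\hat{p}_k(M_{i,k})}\right)^2}{\sum_{k \in K_{\cap ij}} \left(\frac{1}{\hat{p}_k(M_{i,k})}\right)^2 + \sum_{k \in K^c_{\cap ij}} \frac{1}{2\hat{p}_k(M_{i,k})\,\hat{p}_k(M_{j,k})}}$$
Continuous:
$$\frac{1}{N}\,\frac{\sum_{k=1}^{d} \left[\left(\frac{1}{\hat{p}_k(0)}\right)^2 (1 - M_{i,k})(1 - M_{j,k}) + \left(\frac{1}{\hat{p}_k(1)}\right)^2 M_{i,k} M_{j,k}\right]}{\sum_{k=1}^{d} \left[\left(\frac{1}{\hat{p}_k(0)}\right)^2 (1 - M_{i,k})(1 - M_{j,k}) + \left(\frac{1}{\hat{p}_k(1)}\right)^2 M_{i,k} M_{j,k} + \frac{1}{2\hat{p}_k(0)\,\hat{p}_k(1)} \left(M_{i,k} + M_{j,k} - 2 M_{i,k} M_{j,k}\right)\right]}$$

Goodall
Original:
$$\frac{1}{N} \sum_{k \in K_{\cap ij}} \left(1 - p^2_k(M_{i,k})\right)$$
Continuous:
$$\frac{1}{N} \sum_{k=1}^{d} \left[\left(1 - p^2_k(0)\right)(1 - M_{i,k})(1 - M_{j,k}) + \left(1 - p^2_k(1)\right) M_{i,k} M_{j,k}\right]$$

NoisyOR
Original:
$$1 - \prod_{k \in K_{\cap ij}} (1 - r_k)$$
Continuous:
$$1 - \prod_{k=1}^{d} (1 - r_k)^{M_{i,k} M_{j,k}}$$

Smirnov
Original:
$$\frac{1}{2N} \sum_{k \in K_{\cap ij}} \left[2 + \frac{N - f_k(M_{i,k})}{f_k(M_{i,k})} + \sum_{q \ne M_{i,k}} \frac{f_k(q)}{N - f_k(q)}\right]$$
Continuous:
$$\frac{1}{2N} \sum_{k=1}^{d} \left[\left(\frac{N - f_k(0)}{f_k(0)} + \frac{f_k(1)}{N - f_k(1)}\right)(1 - M_{i,k})(1 - M_{j,k}) + \left(\frac{N - f_k(1)}{f_k(1)} + \frac{f_k(0)}{N - f_k(0)}\right) M_{i,k} M_{j,k}\right]$$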
The parameters $f_k(x)$, $\hat{p}_k(x)$, $p^2_k(x)$ and $r_k$ are calculated over the initial knowledge matrix M, containing only the individual gene information. The information from the metagenes is not considered because these parameters must have a fixed value before starting to measure the feature similarity.
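As an illustration of how these fixed parameters feed a similarity measure, the sketch below computes the per-attribute statistics once on a toy binary matrix and then evaluates the continuous Goodall adaptation. It is a sketch under assumptions: the function names are hypothetical, and the normalising constant is taken to be the number of attributes, so that the value stays in [0, 1].

```python
def attribute_stats(matrix):
    """Per-attribute counts and probability estimates on the binary matrix."""
    n = len(matrix)       # number of features (genes)
    d = len(matrix[0])    # number of attributes
    stats = []
    for k in range(d):
        ones = sum(row[k] for row in matrix)
        f = {0: n - ones, 1: ones}
        p_hat = {x: f[x] / n for x in (0, 1)}
        p2 = {x: f[x] * (f[x] - 1) / (n * (n - 1)) for x in (0, 1)}
        stats.append({"f": f, "p_hat": p_hat, "p2": p2})
    return stats


def goodall_continuous(row_i, row_j, stats):
    """Continuous Goodall similarity between two knowledge profiles."""
    total = 0.0
    for k, s in enumerate(stats):
        a, b = row_i[k], row_j[k]
        total += (1 - s["p2"][0]) * (1 - a) * (1 - b) + (1 - s["p2"][1]) * a * b
    return total / len(stats)  # assumption: normalise by the attribute count


toy = [[1, 0], [1, 1], [0, 0], [1, 0]]
stats = attribute_stats(toy)           # computed once, on individual genes only
sim_01 = goodall_continuous(toy[0], toy[1], stats)
sim_00 = goodall_continuous(toy[0], toy[0], stats)
```

Because `stats` is computed once on the gene-only matrix, the same values can later be reused for metagene rows with fractional entries.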
5.3 Combination of numerical and biological similarities
Once the different similarity metrics for the biological information are defined, the focus is on how to combine the two sources of information: numerical and biological. We have studied two different ways to combine the numerical correlation $d_n(f_i, f_j)$ and the biological similarity $d_k(f_i, f_j)$.
The first and easiest combination rule is a simple average of the two values, so that the overall similarity is defined as in Eq. (5.4).
$$d(f_i, f_j) = \frac{1}{2}\big(d_n(f_i, f_j) + d_k(f_i, f_j)\big) \qquad (5.4)$$
In addition to the average combination, a more complex way to combine the two measures has been studied, based on the work from [77, 56], where the original similarity value is mapped to the range [0, 1] using a probability density function estimation assuming a logistic distribution with mean µ and variance ν = 6/µ. In [77, 56], it is highlighted how such a mapping can be beneficial for the discovery of important relationships between features. The underlying idea is to equalize the distribution of the calculated similarity values between 0 and 1, making a more uniform combination of the values from the two sources of information. The equalization step works if the assumption on the underlying distribution is correct, or if it suits the actual data. From the results we gathered, the logistic assumption does not hold, in particular for the biological similarity data and especially with the imposed fixed ratio between µ and ν.
In order to perform an equalization step, we chose not to limit ourselves to a specific distribution type, but to estimate the density function on the real data. For this, 17 different parametric distributions are compared over a set of $10^5$ pairwise similarity data, as detailed in [90]. The best fitting distribution is then chosen in terms of the Bayesian Information Criterion (BIC), and the equalization function is then obtained. After equalization, a new similarity $\tilde{d}_x(f_i, f_j)$ is obtained, with a more uniform distribution of the values between 0 and 1. This work is done independently for the numerical correlation and the biological similarity so as to equalize both distributions properly. The global similarity value is then defined as in Eq. (5.5), as an average of the two equalized similarities.
$$d(f_i, f_j) = \frac{1}{2}\big(\tilde{d}_n(f_i, f_j) + \tilde{d}_k(f_i, f_j)\big) \qquad (5.5)$$
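The equalization idea can be illustrated by mapping each similarity through a cumulative distribution function, which spreads the mapped values more uniformly over [0, 1]. Note that the sketch below swaps in an empirical CDF estimated from a reference sample instead of the BIC-selected parametric fit described above; the function names and the toy values are hypothetical.

```python
from bisect import bisect_right


def make_equalizer(reference_values):
    """Return a function mapping a similarity to its empirical CDF value."""
    ordered = sorted(reference_values)
    n = len(ordered)

    def equalize(value):
        # Fraction of reference similarities that are <= value.
        return bisect_right(ordered, value) / n

    return equalize


def combined_similarity(d_n, d_k, eq_n, eq_k):
    """Average of the two equalized similarities, as in Eq. (5.5)."""
    return 0.5 * (eq_n(d_n) + eq_k(d_k))


# Separate equalizers, one per information source (toy reference samples).
eq_num = make_equalizer([0.10, 0.20, 0.90, 0.95])
eq_bio = make_equalizer([0.01, 0.02, 0.03, 0.50])
d = combined_similarity(0.20, 0.03, eq_num, eq_bio)
```

Building one equalizer per source mirrors the independent treatment of the numerical and biological similarities.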
5.4 Knowledge integration evaluation for classification
The usefulness of the knowledge integration framework for microarray classification is assessed in this section using the IFFS feature selection algorithm described in Section 4.1.1. The knowledge integration algorithms have been described in Sections 5.2 and 5.3, and combine the numerical correlation with four biological similarity measures (Anderberg, Goodall, Noisy-OR and Smirnov) and with two combination schemes (average and distribution equalization).
The experimental protocol to evaluate both the predictive ability and the biological relevance of the different knowledge integration schemes when classifying microarrays is detailed here. All algorithms have been analyzed in terms of predictive power and in terms of the biological relevance of the found signatures. The objective of the study is to compare the different schemes and to evaluate whether introducing biological knowledge helps in obtaining better results, more robust and interpretable than those of the original Treelets implementation using numerical data only.
5.4.1 Predictive power evaluation
To evaluate the predictive power of the different algorithms, a 50 run Monte Carlo simulation has been performed. For each run, the same protocol as in Section 4.2.2 has been followed: classifiers of up to 5 dimensions have been built for each dataset. The best classifier in each case has been chosen by evaluating the Matthews Correlation Coefficient (MCC) [89] obtained when classifying the independent validation set.
Statistical properties of the predictive power of the algorithms have been extracted from the population of 50 simulation runs in order to draw conclusions about the general behavior of each tested algorithm. The mean MCC value across the 50 iterations has been considered, as well as the standard deviation of the results. The comparison among all the variants considered in this Chapter and the algorithm presented in Section 4.2.2 combines both the mean value and the standard deviation, in order to consider also the stability and repeatability of the prediction results throughout the iterations.
A score, S, is extracted as defined by Eq. (5.6) and all the algorithms are then sorted according to the S value.
$$S = \frac{\mu_{MCC}}{\sigma_{MCC} + \epsilon}, \qquad \epsilon := 0.02 = \frac{1}{50} \qquad (5.6)$$
The score is proportional to the mean MCC value, so that higher means obtain higher scores, but it is also inversely proportional to the MCC standard deviation, so that more robust and stable results obtain higher scores. The $\epsilon$ value in the denominator has been chosen to reduce the risk of giving too much relevance to the robustness of the results, which could make the mean value irrelevant. The value has been chosen as the inverse of the number of Monte Carlo runs (i.e. $\epsilon = 0.02 = 1/50$) and it is comparable to the standard deviation values collected in Section 5.5.
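The score of Eq. (5.6) is straightforward to compute from the per-run MCC values. The sketch below assumes the sample standard deviation (the text does not say whether the sample or the population estimate is used) and defaults the regularising constant to the inverse of the number of runs; the function name is hypothetical.

```python
from statistics import mean, stdev


def robustness_score(mcc_values, eps=None):
    """Score S = mean / (std + eps) over the Monte Carlo runs."""
    if eps is None:
        eps = 1.0 / len(mcc_values)  # inverse of the number of runs
    return mean(mcc_values) / (stdev(mcc_values) + eps)


stable = robustness_score([0.5, 0.5, 0.5, 0.5])    # zero spread
unstable = robustness_score([0.2, 0.8, 0.2, 0.8])  # same mean, large spread
```

Two algorithms with the same mean MCC are therefore separated by their stability across runs, while the constant in the denominator keeps a zero standard deviation from producing an unbounded score.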
5.4.2 Biological relevance evaluation
Besides the prediction ability, an additional evaluation is performed, studying the differences among the found gene signatures in terms of their biological usefulness. This kind of analysis aims at assessing the interpretability of the different solutions.
The aim is to see whether the biological knowledge integration helps in selecting genes which are good for classification and also useful for biological interpretation. The assessment of biological usefulness is an extremely complicated task. It is related to the specific problem under study and depends on the scientists' experience. Nevertheless, an established practice in the literature is to evaluate the different gene signatures with automatic analysis tools, for example to find enriched functions or to find genes related to an investigation topic from the literature. For each of the considered alternatives, the union of the genes used to build the classifiers throughout the Monte Carlo iterations is used as gene signature. When a metagene is chosen to be part of a classifier, all the genes composing it are included in the signature.
Four publicly available tools have been used to quantify the biological relevance of the gene lists. They assess different characteristics of a gene list using different databases and references. The adopted tools are the following:
GSEA The first tool uses the Gene Set Enrichment Analysis resources [115] http://www.broadinstitute.org/gsea/msigdb/annotate.jsp. For each gene list, it calculates an output p-value for each one of the selected MSigDB gene sets [115]. The p-values are calculated from the hypergeometric distribution of the overlapping genes between the analyzed gene signature and the MSigDB gene set. A low p-value indicates a high probability that the MSigDB gene set is represented in the gene signature and therefore that the genes used for classification have something in common from a biological viewpoint (function, position, disease, etc.). For this analysis, the subsets C2, C4 and C5 from MSigDB have been used. Gene sets can be collected from various sources such as on-line pathway databases, publications in PubMed, knowledge of domain experts or Gene Ontology databases.
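A hypergeometric overlap p-value of this kind can be sketched as the upper tail of the hypergeometric distribution. This is an illustration of the general test, not the exact computation performed by the web tool: the background size (`universe`) and the use of the upper tail are assumptions, and the function name is hypothetical.

```python
from math import comb


def hypergeom_pvalue(universe, set_size, signature_size, overlap):
    """P(X >= overlap) when drawing signature_size genes without replacement
    from a universe containing set_size genes of the tested gene set."""
    upper = min(set_size, signature_size)
    total = comb(universe, signature_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, signature_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, a gene set of 3, a signature of 3 genes.
p_full_overlap = hypergeom_pvalue(10, 3, 3, overlap=3)
p_any_overlap = hypergeom_pvalue(10, 3, 3, overlap=0)
```

The smaller the p-value, the less likely it is that the observed overlap arises by chance.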
Biograph The second tool is called Biograph [82] http://www.biograph.be/. It quantifies relationships between individual genes and a key term (e.g. the studied disease). Biograph analyzes each gene individually and quantifies its relationship with the key term based on a knowledge database. The output score is proportional to the gene/key-term relationship. The method is based on the integration of heterogeneous biomedical knowledge bases and yields intelligible and literature-supported indirect functional relations. By assessing the plausibility and specificity of these hypothetical functional paths within a user-provided research context, the unsupervised methodology is capable of appraising and ranking research targets without requiring prior domain knowledge from the user. Since this method analyzes the relations between each gene and a relevant key-term, different key-terms, related to the studied phenomenon, have been chosen when analyzing the 7 MAQC datasets: A dataset: lung neoplasms; C dataset: liver neoplasms; D and E datasets: malignant breast neoplasms; F dataset: Multiple Myeloma; G dataset: Survival Analysis; and H dataset: sex differentiation.
Enrichr The third tool is Enrichr [27] http://amp.pharm.mssm.edu/Enrichr/index.html, which is an integrative web-based and mobile software application that includes gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results. Enrichr contains 35 gene-set libraries, where some libraries are borrowed from other tools while many others are newly created and only available in Enrichr. It has been used to analyze the enrichment of the gene lists in terms of KEGG pathways. The chosen output for each different pathway is the combined score presented in [27].
Génie The fourth tool is Génie [45] http://cbdm.mdc-berlin.de/~medlineranker/cms/genie. With Génie, genes are ranked using a text-mining approach based on gene-related scientific abstracts. It prioritizes all of the genes from a species according to their relation to a biomedical topic using all available scientific abstracts and ontology information. Génie takes advantage of literature, gene and homology information from the MEDLINE, NCBI Gene and HomoloGene databases. This tool, like Biograph, analyzes each gene independently, and its output is a p-value assessing the relevance of each gene to a search term. The search term/dataset pairs used are: A dataset: lung cancer; C dataset: liver cancer; D and E datasets: breast cancer; F and G datasets: multiple myeloma; and H dataset: sex.
The four analysis tools evaluate different characteristics of the gene lists. Tools like GSEA or Enrichr perform a gene-set analysis as a whole, while Biograph and Génie evaluate the individual gene relevance. To quantify the collected results in a simple way, an evaluation protocol is proposed. For each analysis tool, the first five outputs are averaged: for GSEA and Génie it is an average of the negative logarithm of the p-values, while for Biograph and Enrichr it is an average of the first five output scores. In this way, the method with the highest average is the one obtaining the best result. The next step is to average all the values across the 7 datasets to obtain an average score ranking all the algorithms over multiple results.
Afterwards, to combine all the results obtained in terms of biological relevance and of predictive ability, a voting scheme is adopted. We chose to combine the biological analysis results into a single score computed as the average of the Borda count of the four analysis tools. In Figure 5-2, a toy example is shown with only two analysis tools for biological information. For each tool, each method is assigned points depending on its ranking.
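An averaged Borda count of this kind can be sketched as follows. The point convention is an assumption made for illustration: each method receives its rank position per tool (1 for the best), so a smaller average is better. Ties are not handled, and the tool and method names are hypothetical.

```python
def rank_points(scores):
    """Points per method for one tool: 1 for the highest score, 2 for the next."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: pos for pos, method in enumerate(ordered, start=1)}


def average_borda(tool_scores):
    """Average the rank points of every method across all tools."""
    per_tool = [rank_points(scores) for scores in tool_scores.values()]
    methods = list(next(iter(tool_scores.values())))
    return {m: sum(p[m] for p in per_tool) / len(per_tool) for m in methods}


tool_scores = {
    "toolA": {"m1": 3.0, "m2": 2.0, "m3": 1.0},
    "toolB": {"m1": 5.0, "m2": 9.0, "m3": 1.0},
}
combined = average_borda(tool_scores)
```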
Figure 5-3: Score comparison with results from [84] on datasets D and E from the MAQC datasets. All the algorithms are sorted by increasing final score, the black line. The best result is the one with the smallest overall score, which is G-pdf, consistently with the results obtained over a wide selection of datasets.
worst global score among the studied alternatives due to its prediction performance. An observation must be made about the Génie data, because they are almost all the same. This is due to the fact that almost all algorithms are able to identify genes with zero p-value for both datasets (for example ESR1, IRS1, PHB or HRAS for dataset D and ESR1 for dataset E, which is a known gene related to breast cancer as it is an estrogen receptor), thus obtaining an ideally infinite value. This has been considered when evaluating all the datasets, and a maximum threshold of 1000 has been set to avoid having infinite values in the algorithms comparison.
Looking at the global results, we observe that G-pdf is still the best scoring algorithm, even if the global score order has changed with respect to Table 5.3.
5.6 Summary
In this Chapter, the studied techniques to infer a hierarchical structure from microarray data combining both numerical information and prior biological information have been described and evaluated. They have been compared to state of the art alternatives and to the numerical-information-only solution from Section 4.2, which showed good and robust predictive properties.
The knowledge integration framework has been studied with different implementations, comparing four similarity metrics and two combination rules to merge the numerical correlation and the biological similarity.
The algorithms have been compared with Monte Carlo experiments on public datasets in terms of their predictive ability and of the biological interpretability of the chosen gene signatures. The knowledge integration has been shown to be beneficial, increasing the robustness of the predictive power without losing mean performance when compared to the numerical-correlation-only alternative, as well as producing more biologically interpretable gene signatures.
Among the studied alternatives, the G-pdf algorithm, combining the Goodall similarity measure with the probability density function equalization, is the best one. It consistently obtained the best performance when compared to the other knowledge integration alternatives as well as when compared to state of the art algorithms.
As a general observation, a proper knowledge integration framework like G-pdf should be preferred to the bare numerical treelets, when possible, since it obtains more robust and interpretable results for classification.
Chapter 6
Multiclass classification
Machine learning techniques have been extensively applied to microarray data for cancer classification, obtaining interesting prediction performances [112, 16, 138]. Most of the work in the field is focused on binary classification, considering the multiclass case as a straightforward generalization. Different studies suggest, however, that in the multiclass case it is more complicated to obtain good prediction rates, especially when the number of classes is high and the class distribution is skewed [80, 114, 119, 128].
A novel multiclass approach has been studied in this thesis as a combination of multiple binary classifiers. It is an example of Error Correction Output Coding (ECOC) algorithms [37] applied to microarray analysis. The ECOC algorithms obtained interesting results by applying built-in coding algorithms from the data transmission field [119]. Their direct application to biological data like microarrays has some drawbacks, like the error independence assumption, the code matrix generation or the allowed binary partitions, which will be detailed in the following sections. A new approach is introduced in this chapter to take advantage of the data transmission framework of ECOC algorithms without forgetting that what is decoded are biological data. To do so, the redundancy is used to reduce the error rate, but the binary classifiers are bounded to class partitions that are more likely to be significant than with other ECOC approaches.
The proposed ECOC scheme adds to the classical One Against All (OAA) approach a group of binary classifiers called Pair Against All (PAA), each of which focuses on separating a class pair from the rest of the samples. The PAA choice is made because class pairs are more likely to have common biological features than large class groups, and it is common to find couples of variants of the same disease inside a microarray experiment.
Figure 6-1: Example of OAO and OAA in a three-class problem with their associated classification boundaries (see footnote 1).
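The resulting code table can be sketched by listing, for every class, one bit per OAA classifier followed by one bit per class pair. This is only an illustration of the structure: the function name is hypothetical, all pairs are enumerated, and no attempt is made to remove pair columns that are complementary or redundant for small numbers of classes.

```python
from itertools import combinations


def oaa_paa_code_matrix(n_classes):
    """Code matrix with one row per class and one column per binary classifier.

    The first n_classes columns are the OAA classifiers; the remaining
    columns are the PAA classifiers, one per class pair. A bit is 1 when
    the class belongs to the positive side of that classifier.
    """
    positives = [frozenset([c]) for c in range(n_classes)]
    positives += [frozenset(pair) for pair in combinations(range(n_classes), 2)]
    return [[1 if c in pos else 0 for pos in positives] for c in range(n_classes)]


code = oaa_paa_code_matrix(4)
```

With four classes this gives 4 OAA columns plus 6 PAA columns, and every class is on the positive side of its own OAA classifier and of the three pairs that contain it.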
The OAA+PAA algorithm has been tested on seven publicly available datasets through 50 run Monte Carlo simulations. Its performance has been compared with the state of the art alternatives shown in [119], where both the OAA approach and a state of the art ECOC algorithm applying Low Density Parity Check (LDPC) codes are studied, with linear SVM as the classification algorithm.
6.1 ECOC algorithms and the OAA + PAA algorithm
In this section, the application of Error Correcting Output Coding to microarray multiclass classification is discussed and the proposed OAA+PAA algorithm is detailed. The multiclass problem is addressed as a generalization of the two-class scenario in which multiple binary classifiers are used to obtain a final estimation. The two most common approaches are One Against All (OAA) and One Against One (OAO) [80, 38, 106]. In the OAA approach, M binary classifiers are trained, each one separating the samples of one class from the rest of the samples. The final decision on the assignment of each sample is determined by a combination of the M outputs. In the OAO approach, M(M−1)/2 classifiers are trained, one for each possible class pair, without considering samples from the other classes. The class assignment is done on the basis of the partition of the decision space resulting from the combination of the M(M−1)/2 produced boundaries. These two approaches are commonly used for multiclass classification with fairly good performance [128, 119], and a graphical representation of the difference between OAO and OAA is shown in Figure 6-1. It can be observed how the classification boundaries differ between the two cases and how OAO considers only a class pair to define a boundary, rather than the whole sample population.
Table 6.1: Example of the ECOC representation of One Against All (OAA) classification in a 4 class case. Each bit is the output of a classifier separating one class from the rest.
Codeword:   OAA1  OAA2  OAA3  OAA4
Class 1     1     0     0     0
Class 2     0     1     0     0
Class 3     0     0     1     0
Class 4     0     0     0     1
Footnote 1: Images from: http://courses.media.mit.edu/2006fall/mas622j/Projects/aisen-project/
An interesting branch of multiclass classification approaches applies data transmission algorithms to the sample classification [119, 37]. These algorithms are called Error Correcting Output Codes (ECOC) algorithms.
The general approach compares the sample classification using N binary classifiers to the transmission of an N-bit codeword over a noisy channel. Each binary classifier is the receiver of one of the N bits of the codeword. The sample class is then assigned depending on the received bits. With this parallelism, data transmission solutions, such as error correcting codes, can be adopted to improve the "bit error rate".
In [119], recursive Low Density Parity Check (LDPC) codes have been implemented to code the M classes in N-bit codewords. The application of LDPC codes to multiclass classification is motivated by their outstanding performance in the data transmission field [119], where they can approximate the Shannon limit. These codes showed a very low bit error rate when used in actual data transmission and are a great choice for that task, but their application to sample classification needs to take some issues into account. First of all, there is the error independence assumption, which assumes that errors on different bits are independent. This assumption does not hold because here the bits are connected to the sample classification [118]. Furthermore, LDPC codes are block codes which showed good results for long codewords [118], thus a direct LDPC application to the microarray classification task would imply the training of thousands of classifiers, making their use impractical. Another LDPC related issue is the code table generation, because there is no unique and fast way to obtain such tables. These aspects are addressed in [118], where a recursive way to produce LDPC codes is studied and applied to the multiclass case.
Here, an alternative ECOC approach is presented, dealing with an additional issue of error correcting block codes: the equality of the binary classifier partitions. The common ECOC approach consists in building a code table relating each of the M sample classes to an N-bit codeword so as to produce a suitable binary matrix (e.g. under Hamming code restrictions or LDPC restrictions). This approach works well for data transmission but it does not take into account the aim of the classification task, which is to distinguish among elements pertaining to different classes. In the code matrix generation, all the class partitions are equally suitable. A binary classifier separating one class from the rest can be chosen in the code table generation with the same probability as a classifier separating three classes with scarce biological relation from the rest. This feature can lead to very interesting numerical code tables but it does not translate into the expected error correcting improvements when classifying microarray samples [118, 106].
In the proposed approach, a simple error correcting scheme is obtained by adding redundancy to the OAA approach, which is the simplest ECOC approach and whose code table is represented in Table 6.1. The redundancy is obtained through multiple binary classifiers with class partitions more likely to be significant from a biological point of view than those obtained with LDPC codes or other more elaborate algorithms. The presented algorithm adds to the OAA approach a group of binary classifiers called Pair Against All (PAA), each of which focuses on separating a class pair from the rest. The advantages of such a choice are the simplicity of the code table generation, in contrast with LDPC codes where the table generation is a complex process, and the choice of possibly more significant class partitions. Limiting the possible binary partitions to single classes or pairs of classes reduces the risk of choosing meaningless partitions, which should result in the development of more reliable classifiers.
The PAA choice is made because class pairs are more likely to have common biological features than large class groups, and it is common to find pairs of variants of the same disease inside a microarray experiment. Binary partitions grouping unrelated classes can lead to the production of poor classifiers because the two partitions are not well separable, thus reducing the effectiveness of the code table redundancy. If some of the codeword bits are not trustworthy, the class assignment is less likely to produce correct outcomes. The OAA+PAA approach for M different classes produces a code table with M rows, one for each class, formed of M+M(M−1)/2 bits. The codeword length is determined by the M bits deriving from the OAA approach, plus one bit for each possible class pair (M(M−1)/2). An example of how the code table is formed in a four class case is shown in Table 6.2. As can be observed, the code table from Table 6.2 includes the OAA code table represented in Table 6.1.

Table 6.2: Code table of the OAA+PAA approach in a four class scenario. There are four codewords of 10 bits, corresponding to the OAA case plus one bit for each class pair.

Bits    OAA1  OAA2  OAA3  OAA4  PAA1,2  PAA1,3  PAA1,4  PAA2,3  PAA2,4  PAA3,4
Cl. 1   1     0     0     0     1       1       1       0       0       0
Cl. 2   0     1     0     0     1       0       0       1       1       0
Cl. 3   0     0     1     0     0       1       0       1       0       1
Cl. 4   0     0     0     1     0       0       1       0       1       1
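
As an illustration of the construction described above, the following Python sketch builds the OAA+PAA code table for an arbitrary number of classes. The function name `oaa_paa_code_table` and the lexicographic ordering of the pair columns (chosen to match Table 6.2) are assumptions made for this example, not details taken from the thesis implementation.

```python
from itertools import combinations


def oaa_paa_code_table(num_classes):
    """Return the OAA+PAA code table as a list of codewords, one per class.

    Hypothetical helper: each codeword holds the M OAA bits followed by one
    PAA bit per class pair, with pairs in lexicographic order (1,2), (1,3), ...
    """
    classes = range(1, num_classes + 1)
    pairs = list(combinations(classes, 2))  # the M(M-1)/2 class pairs
    table = []
    for c in classes:
        oaa_bits = [1 if c == k else 0 for k in classes]      # one class vs the rest
        paa_bits = [1 if c in pair else 0 for pair in pairs]  # class pair vs the rest
        table.append(oaa_bits + paa_bits)
    return table


if __name__ == "__main__":
    for label, codeword in enumerate(oaa_paa_code_table(4), start=1):
        print(f"Cl. {label}:", " ".join(map(str, codeword)))
```

For four classes the sketch yields the ten-bit codewords of Table 6.2. With this construction any two codewords differ in 2(M−1) positions (6 for M = 4), against 2 positions for the plain OAA table, which is where the added redundancy comes from.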
In the proposed approach, each bit is received by a different classifier, built with the algorithm introduced in Section 4.1.1 with the IFFS feature selection algorithm. Each classifier can output either a hard decision (i.e. a binary output of 1 or 0) or a soft estimation, in which case each bit is a real value in [0,1] representing the a posteriori probability estimated by the LDA classifier. The code table represents the coding step of each sample before the transmission, while the decoding phase consists in receiving each one of the transmitted bits and in assigning an estimated codeword to each received block of bits. For each classified sample, an N-dimensional word is received and the final class assignment depends on the distance of the received word from each one of the codewords in the code table.

In practice, assume that $x_i$ is the produced word corresponding to the classification of the i-th sample, whose actual class is $Y(i) \in \{1, \dots, M\}$. The decoding process can be seen as a function $x_i \rightarrow \hat{Y}(i)$ assigning an estimated class to the sample. The classification is correct if $Y(i) = \hat{Y}(i)$, otherwise an error is produced. The class estimation is obtained by assigning the class whose codeword has the smallest distance from the received word:

$$\hat{Y}(i) = \arg\min_{j \in \{1, \dots, M\}} \lVert x_i - c_j \rVert_1$$

where $c_j$ is the codeword corresponding to the j-th class. If the classifier output is a hard decision, the distance is a Hamming distance. Otherwise, if the output is a soft value such as an a posteriori probability, the distances can be measured with the L1 or L2 norm. More precisely, any N-dimensional distance can be adopted to see whether it introduces some changes in the final output. In this work the hard decision has been paired with the Hamming distance and the soft decision case has been studied applying the L1 and L2 distances.
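
The decoding rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the example received words and the tie-breaking rule (lowest class index wins) are assumptions for this example, and the code table is the four class table of Table 6.2.

```python
import math

# OAA+PAA code table for four classes, copied from Table 6.2.
CODE_TABLE = [
    [1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
]


def hamming(word, codeword):
    """Number of differing bits; used with hard decisions."""
    return sum(1 for w, c in zip(word, codeword) if w != c)


def l1(word, codeword):
    """L1 distance; used with soft outputs in [0, 1]."""
    return sum(abs(w - c) for w, c in zip(word, codeword))


def l2(word, codeword):
    """L2 (Euclidean) distance; used with soft outputs in [0, 1]."""
    return math.sqrt(sum((w - c) ** 2 for w, c in zip(word, codeword)))


def decode(word, code_table, distance):
    """Return the 1-based class whose codeword is closest to `word`.

    Ties are resolved in favour of the lowest class index (an assumption).
    """
    distances = [distance(word, codeword) for codeword in code_table]
    return distances.index(min(distances)) + 1


# Hard decision: the class 2 codeword with two flipped bits (PAA1,4 and PAA2,4).
hard_word = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
# Soft decision: hypothetical posterior estimates from the ten bit classifiers.
soft_word = [0.2, 0.6, 0.3, 0.1, 0.7, 0.2, 0.1, 0.8, 0.6, 0.3]

if __name__ == "__main__":
    print(decode(hard_word, CODE_TABLE, hamming))
    print(decode(soft_word, CODE_TABLE, l1), decode(soft_word, CODE_TABLE, l2))
```

In the hard decision example two bit classifiers are wrong, yet the received word is still closest to the class 2 codeword; with the plain OAA table a single wrong bit can already make the assignment ambiguous.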
6.2 Experimental Protocol

To assess the classification performance of the proposed algorithm in its two variants, Hard and Soft decision, the multiclass classifiers have been evaluated by means of a Monte Carlo simulation over 7 publicly available datasets described in Section 6.2.1. The results are then compared to those presented in [119]. Based on [39, 7] and similarly to what has been done in [119], 50 Monte Carlo 4:1 partitions (4/5 for training and 1/5 for testing) of the available data were considered. For each iteration, the single bit classifiers have been built with up to 15 features as in Section 4.2. Afterwards, the mean values for each number of features are measured and the best result is kept as the potential performance level.

The performance is measured as the mean error rate in predicting the independent test set along the 50 iterations of the Monte Carlo simulation. Inside each binary classifier training, a 10-fold cross validation process has been adopted.
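
The partitioning scheme can be sketched as follows. This is a hedged illustration only: the function name, the fixed seed, the rounding of the test set size and the use of plain (non-stratified) random shuffling are assumptions, since the text does not specify how the 4:1 partitions were drawn.

```python
import random


def monte_carlo_partitions(num_samples, num_iterations=50, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random 4:1 splits.

    Assumptions: plain random shuffling without class stratification, and a
    test set size of round(num_samples * test_fraction).
    """
    rng = random.Random(seed)
    num_test = round(num_samples * test_fraction)
    indices = list(range(num_samples))
    for _ in range(num_iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[num_test:]), sorted(shuffled[:num_test])


if __name__ == "__main__":
    # Example with the 63 samples of the SRBCT dataset.
    for train, test in monte_carlo_partitions(63, num_iterations=3):
        print(len(train), len(test))
```

Each iteration would train the bit classifiers on the training indices only (including the inner 10-fold cross validation) and measure the error rate on the held-out test indices; the reported figure is the mean over the 50 iterations.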
The OAA+PAA algorithm has been studied in three variants: the Hard decision version with Hamming distance (OAA+PAA_Hard), the Soft decision version adopting the L1 distance in the class assignment task (OAA+PAA_L1) and the Soft decision version adopting the L2 distance in the class assignment task (OAA+PAA_L2). The simple OAA and OAO approaches have been tested too, since the focus of the experimental evaluation is to compare the performance of OAA+PAA with baseline methods.

Table 6.3: Brief microarray datasets description.

Name      Samples  Genes  Classes  Link
SRBCT     63       2308   4        http://research.nhgri.nih.gov/microarray/Supplement/index.html
Brain     42       5597   5        http://www.broadinstitute.org/mpr/CNS/
NCI60     61       5244   5        http://genome-www.stanford.edu/nci60
Staunton  60       5726   9        http://www.gems-system.org/
Su        174      12533  11       http://www.gems-system.org/
GCM       190      16063  14       http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
GCM RM    123      7129   11       http://expression.washington.edu/publications/kayee/shrunken_centroid/

6.2.1 The analyzed microarray datasets

Seven cancer microarray datasets were used in the evaluation of the analyzed multiclass algorithms: the Small Round Blue Cell Tumor dataset (SRBCT), the Brain dataset, the NCI60 dataset, the Staunton dataset, the Su dataset, the GCM dataset and the GCM RM dataset, which is derived from the GCM dataset with the purpose of improving multiclass classification with variability estimates of repeated gene expression measurements.

For a more detailed description of the datasets, the class distribution and the data preprocessing steps, refer to [119]. A basic description of the dataset composition, including the number of samples, the number of adopted genes, the number of classes and a public link to access the dataset, is given in Table 6.3.
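
To give a sense of the training effort involved, the short sketch below counts the binary classifiers required by each scheme for the class numbers listed in Table 6.3, using the formulas of Section 6.1 (M for OAA, M(M−1)/2 for OAO, M+M(M−1)/2 for OAA+PAA). The dictionary and function names are hypothetical and only serve this example.

```python
# Number of classes per dataset, as listed in Table 6.3.
DATASET_CLASSES = {
    "SRBCT": 4, "Brain": 5, "NCI60": 5, "Staunton": 9,
    "Su": 11, "GCM": 14, "GCM RM": 11,
}


def classifier_counts(num_classes):
    """Return the number of binary classifiers needed by each scheme."""
    pairs = num_classes * (num_classes - 1) // 2
    return {"OAA": num_classes, "OAO": pairs, "OAA+PAA": num_classes + pairs}


if __name__ == "__main__":
    for name, m in DATASET_CLASSES.items():
        print(name, classifier_counts(m))
```

Even for the 14 class GCM dataset the OAA+PAA code table stays at 105 bit classifiers, far from the thousands that a direct LDPC application would imply.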
6.3 Results

In this section, the experimental results are shown and discussed. Table 6.4 presents the mean Monte Carlo results over the seven datasets of the alternatives studied in this work, OAA, OAO, OAA+PAA_Hard, OAA+PAA_L1 and OAA+PAA_L2, compared with the results obtained in [119]. The results from [119] are divided into those obtained with an OAA approach and those obtained adopting a recursive LDPC scheme for microarray classification. The differences between the OAA based method from [119] and the OAA baseline method tested in this work lie in the feature set, which does not include metagenes in the [119] OAA algorithm, in the classifier (LDA vs SVM) and in the feature selection algorithm used to build the classifier, making the two OAA based algorithms significantly different. The algorithms proposed in [119] are recent state of the art alternatives with good prediction results over a wide variety of datasets. Furthermore, the validation procedure is clear and detailed, allowing a realistic error estimation thanks to the Monte Carlo simulation on independent test sets.

Table 6.4: Experimental prediction error rates over the seven datasets.

Method        Brain   NCI60   SRBCT  Su      Staunton  GCM RM  GCM     Mean
OAA           25.16%  41.37%  1.73%  9.22%   56.75%    5.91%   38.78%  25.56%
OAO           21.33%  38.50%  2.27%  12.29%  45.88%    7.16%   34.24%  23.10%
OAA+PAA_Hard  19.83%  29.87%  2.00%  6.48%   41.25%    0.75%   24.26%  17.78%
OAA+PAA_L1    18.67%  28.37%  0.55%  4.19%   37.75%    0.54%   20.17%  15.75%
OAA+PAA_L2    18.67%  30.38%  0.67%  4.14%   37.75%    0.40%   20.17%  16.03%
[119] OAA     12.5%   23.08%  0.00%  8.57%   46.15%    0.00%   28.63%  16.82%
[119] LDPC    12.5%   30.77%  0.00%  8.57%   46.15%    0.00%   36.24%  19.07%

Table 6.4 indicates the mean prediction error rate for each dataset. In the last column, the mean error rate across all datasets is given to have a global indicator of the prediction ability of the different algorithms.
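
As a small worked check of how the global indicator is obtained, the sketch below averages the per-dataset error rates of the five alternatives evaluated in this work, with the values copied from Table 6.4. The variable and function names are chosen for this example only; the rows taken from [119] are not included.

```python
# Per-dataset error rates (%) from Table 6.4, in the order
# Brain, NCI60, SRBCT, Su, Staunton, GCM RM, GCM.
ERROR_RATES = {
    "OAA":          [25.16, 41.37, 1.73, 9.22, 56.75, 5.91, 38.78],
    "OAO":          [21.33, 38.50, 2.27, 12.29, 45.88, 7.16, 34.24],
    "OAA+PAA_Hard": [19.83, 29.87, 2.00, 6.48, 41.25, 0.75, 24.26],
    "OAA+PAA_L1":   [18.67, 28.37, 0.55, 4.19, 37.75, 0.54, 20.17],
    "OAA+PAA_L2":   [18.67, 30.38, 0.67, 4.14, 37.75, 0.40, 20.17],
}


def mean_error(rates):
    """Unweighted mean of the per-dataset error rates."""
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for method, rates in ERROR_RATES.items():
        print(f"{method}: {mean_error(rates):.2f}%")
```

The mean is unweighted, so each dataset contributes equally regardless of its number of samples or classes.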
From Table 6.4 it can be observed that the proposed algorithm OAA+PAA_L1 obtains the smallest mean prediction error. The best mean error rate is 15.75%, the second best result comes from OAA+PAA_L2, and the state of the art alternative from [119] implementing the OAA algorithm is third in terms of average error rate. As previously mentioned, the differences in error rates between the current OAA results and the [119] OAA results are due to the different feature selection algorithms, feature sets and classification algorithms used.
The proposed ECOC algorithm, OAA+PAA, is useful as a general method for multiclass classification since it performs better than the OAA alternative on average, reducing the mean error rate by almost ten percentage points (from 25.56% for OAA to 15.75% for OAA+PAA_L1). An improvement in mean error rate is obtained with both the OAA+PAA implementations, adopting Hard and Soft decision. Furthermore, it can be observed how using soft decision helps the class assignment, since the OAA+PAA_Hard
[...]
classica ion algo i hms like Linea Disc iminan Analysis and linea -ke nel SVM. F om
his las compa ison, i showed how LDA should be p e e ed o SVM since i ob ained
be e p edic ion pe o mances analyzing MAQC da ase s. As a global esul , when IFFS
is used join ly wi h T eele s clus e ing and LDA classie , i ob ained be e esul s ha
he al e na i es classi ying MAQC da a [112], imp o ing he cu en s a e o he a .
The o he s udied ea u e selec ion app oach is based on Ensemble lea ning echniques.
In his di ec ion, die en al e na i es ha e been es ed as a p oo o concep e y in e es -
ing esul s. The pe o med single un expe imen wi h die en congu a ions highligh ed
how he ensemble ea u e selec ion app oach allows o signican ly imp o e he s a e o he
a p edic i e abili y when compa ed o he IFFS esul s in e ms o p edic i e accu acy.
The s udied algo i hm has been en iched wi h key elemen s like he nonexpe no ion ha
allows boos ing he pe o mance. O e all, he bes esul s o ensemble ea u e selec ion
ha e been ob ained wi h SVM classie combined wi h he nonexpe no ion in oduced
in Sec ion 4.3.2.
Publica ions:
Publica ions: The ollowing publica ions in in e na ional con e ences
and jou nals a e ela ed o he a o emen ioned opics [17, 15, 1, 18].
7.2.3 Knowledge integration model for metagene generation

Techniques to infer a hierarchical structure from microarray data combining both numerical information and prior biological information have been described and evaluated with the aim of producing better metagenes and improving the stability and interpretability of the results. They have been compared to state of the art alternatives and to the numerical information only solution from Section 4.2.2, which had already shown good and robust predictive properties when implementing IFFS as feature selection.

The knowledge integration framework has been studied with different implementations, comparing four similarity metrics and two combination rules to merge the numerical correlation and the biological similarity. The rationale behind it is to gather more high-quality external information about the genes so that the hierarchical clustering process can be more meaningful from a biological standpoint. In this way, the metagenes can summarize the behavior of genes that are similar both in numerical expression and in biological function.

Monte Carlo experiments on MAQC [112] datasets have been performed to evaluate the resulting algorithms in terms of their predictive ability and of the biological interpretability of the chosen gene signatures. When compared to the IFFS alternative using numerical data only, including prior knowledge in the metagene generation allows to obtain more stable prediction results and more biologically relevant signatures, all this without reducing the overall mean predictive performance.

Among the studied alternatives, the G-pdf algorithm, combining the Goodall similarity measure with the probability density function equalization, consistently obtained the best performances, also when compared to the state of the art alternative from [84].

As a general observation, a proper knowledge integration framework should be preferred to the bare numerical treelets when possible, since it obtains more robust and interpretable results for classification.

Publications: The result of this work has been published in [62].
7.2.4 Multiclass classification

Due to the good results for binary classification, the IFFS based framework has been generalized to work with multiclass problems. For that, a new algorithm for multiclass classification within the Error Correcting Output Coding framework [37] has been introduced. It addresses the issue of ECOC algorithms considering all the possible class partitions equally probable by limiting the partitions to single classes and class pairs, and it has been named One Against All + Pair Against All: OAA+PAA.

The OAA+PAA algorithm has been tested on seven publicly available datasets and it has been compared with the results obtained with the baseline OAA and OAO approaches, as well as with state of the art algorithms from [119] applying LDPC codes to multiclass classification.

The proposed algorithm outperformed the baseline alternatives, showing how it can improve simple algorithms. Such an improvement is due to the redundancy provided by adding the Pair Against All part. This is a key difference with respect to other ECOC approaches that did not manage to substantially improve the performances when compared to the OAA approach [118, 106]. When compared to [119], the OAA+PAA algorithm obtained better results, showing that it is a valid alternative for multiclass classification.

Publications: The result of this work has been published in [2].
7.3 Overview and Next steps

In this section, the overall conclusions from this thesis are detailed, together with intuitions and ideas about future research directions from this work.

The most important element for the prediction performance of the whole framework is the generation of metagenes from gene expression data. In all the performed experiments, introducing metagenes consistently led to improved classification performances. These newly introduced features have more reproducible behaviors than single genes between training and validation sets, supporting the statement that metagenes can reduce the residual noise on gene expression. As a desirable development from the metagenes introduction, it is probably worth trying to further exploit the obtained hierarchical structure, because in this thesis all the metagenes are considered equal, regardless of how many genes they merge. Making a better use of the inferred tree, for example to eliminate some metagenes early because of their unreliability (e.g. the tree root or the highest level metagenes that combine thousands of features), or to drive the exploration of candidate regulatory genes for certain problems, could benefit the results. The tree structure is an asset that has not been used and that may help in making more sense out of the data.

About metagenes and how it is possible to improve them as well as the inferred structure, it has been shown how including prior biological information led to an overall improvement of the results. The predictive accuracy remained unaltered, but both the predictive stability and the interpretability are better than without it. These results can be interpreted as showing that including data sources external to the gene expression helps in gaining more insight about the hidden data structure. A future work direction is therefore to build systems integrating more and more information in an automatic way to better define the tree construction. From the automatic processing of different information sources, like the gene ontology databases used here but also natural language processing tools, it could be possible to extract meaning that would otherwise not be accessible, and it is an opportunity to integrate different signal processing areas.
Going from the structure to feature selection, feature selection algorithms are a key factor for the performances. Two alternative methods proved to reach good results from very different perspectives. On one side, the wrapper algorithm led to predictive performances that are comparable with the best state of the art alternatives, with good performance stability and results interpretability. On the other side, applying ensemble feature selection led to a remarkable performance improvement, at the price of less interpretable results. Nothing can be said yet about its stability because of the difficulty of designing a proper experiment. Even if different, both methods share one common feature that should be remarked: in the development phase, both algorithms have been tailored to the data rather than simply being applied as is. For the feature selection task, future research could be dedicated to deepening the knowledge of the ensemble learning potential, to the creation of experiments to assess the stability, and to how to incorporate the notion of stability in the selection process. Between ensemble learning and IFFS, the former has more potential to grow and explore.

In this thesis, the multiclass scenario has been considered too, with the study of a new algorithm with interesting performances. Although we have proposed an improved algorithm compared to the state of the art, binary classification should be preferred for further studies. The main reasons are that there still is a lot of room for improvement in that field and that the multiclass case could be reduced to multiple binary comparisons.
Finally, as a global conclusion, the application of signal processing techniques to the analysis of biological data like microarrays proved to be useful and interesting. It has been possible to develop tools comparable to, and even better than, state of the art alternatives in both the binary and multiclass cases. The proposed frameworks led to good results in terms of predictive ability, predictive stability and results interpretability, meeting the original thesis goal. Even though the theoretical optimum is still far from being reached, it has been possible to test some key elements, like metagenes and feature selection, that are worth studying further.

Bibliography

[1] Microarray classification with hierarchical data representation and novel feature selection criteria, Larnaca, Cyprus, 11/2012 2012.
[2] Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy, Washington, DC, USA, 12/2012 2012.
[3] G. Alterovitz and M. F. Ramoni. Knowledge based bioinformatics: from analysis to interpretation. John Wiley & Sons, Chichester, West Sussex, U.K., 2010.
[4] M. Anderberg. Cluster analysis for applications. Probability and mathematical statistics. Academic Press, 1973.
[5] J. A. Anderson. An Introduction to Neural Networks. The MIT Press, Mar. 1995.
[6] M. Ashburner. Gene ontology: Tool for the unification of biology. Nature Genetics, 25:25-29, 2000.
[7] F. Azuaje. Genomic data sampling and its effect on classification performance assessment. BMC Bioinformatics, 4:5, 2003.
[8] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A new ensemble diversity measure applied to thinning ensembles. In T. Windeatt and F. Roli, editors, Multiple Classifier Systems, volume 2709 of Lecture Notes in Computer Science, pages 306-316. Springer, 2003.
[9] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000.
[10] R. Bellman and R. Kalaba. On adaptive control processes. Automatic Control, IRE Transactions on, 4(2):1-9, 1959.
[11] R. E. Bellman. Dynamic Programming. Dover Publications, Incorporated, 2003.
[12] P. Bellot Pujalte. Study of gene expression representation with treelets and hierarchical clustering algorithms. 2011.
[13] D. M. Bolser, P.-Y. Chibon, N. Palopoli, S. Gong, D. Jacob, V. D. D. Angel, D. Swan, S. Bassi, V. González, P. Suravajhala, S. Hwang, P. Romano, R. Edwards, B. Bishop, J. Eargle, T. Shtatland, N. J. Provart, D. Clements, D. P. Renfro, D. Bhak, and J. Bhak. Metabase - the wiki-database of biological databases. Nucleic Acids Research, 40(Database-Issue):1250-1254, 2012.
[14] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254. SIAM, 2008.
[15] M. Bosio, P. Bellot, P. Salembier, and A. Oliveras-Vergés. Gene expression data classification combining hierarchical representation and efficient feature selection. Journal of Biological Systems, 20(04):349-375, 2012.
[16] M. Bosio, P. Bellot Pujalte, P. Salembier, and A. Oliveras. Feature set enhancement via hierarchical clustering for microarray classification. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), San Antonio TX, USA, December 2011. IEEE.
[17] M. Bosio, P. Bellot Pujalte, P. Salembier, and A. Oliveras-Verges. Feature set enhancement via hierarchical clustering for microarray classification. In Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on, pages 226-229. IEEE, 2011.
[18] M. Bosio, P. Salembier, A. Oliveras, and P. Bellot Pujalte. Ensemble feature selection and hierarchical data representation for microarray classification. In 13th IEEE International Conference on BioInformatics and BioEngineering BIBE, Chania, Crete, 11/2013 2013.
[19] U. Braga-Neto. Fads and fallacies in the name of small-sample microarray classification - a highlight of misunderstanding and erroneous usage in the applications of genomic signal processing. Signal Processing Magazine, IEEE, 24(1):91-99, Jan. 2007.
[20] U. Braga-Neto and E. Dougherty. Bolstered error estimation. Pattern Recognition, 37(6):1267-1281, 2004.
[21] U. M. Braga-Neto and E. R. Dougherty. Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3):374-380, 2004.
[22] L. Breiman. Random forests. Machine learning, 45(1):5-32, 2001.
[23] L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-215, 2001.
[24] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont, California, U.S.A., 1984.
[25] D. Calvo-Dmgz, J. F. Gálvez, D. Glez-Peña, S. G. Meire, and F. Fdez-Riverola. Using variable precision rough set for selection and classification of biological knowledge integrated in DNA gene expression. J. Integrative Bioinformatics, 9(3), 2012.
[26] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[27] E. Chen, C. Tan, Y. Kou, Q. Duan, Z. Wang, G. Meirelles, N. Clark, and A. Ma'ayan. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 14(1):128, 2013.
[28] X. Chen and L. Wang. Integrating biological knowledge with gene expression profiles for survival prediction of cancer. Journal of Computational Biology, 16(2):265-278, 2009.
[29] X. Chen, L. Wang, J. D. Smith, and B. Zhang. Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics, 24(21):2474-2481, 2008.
[30] L. Chin, W. C. Hahn, G. Getz, and M. Meyerson. Making sense of cancer genomic data. Genes & Development, 25(6):534-555, 2011.
[31] D. Coppola, M. Nebozhyn, F. Khalil, H. Dai, T. Yeatman, A. Loboda, and J. J. Mulé. Unique ectopic lymph node-like structures present in human primary colorectal carcinoma are identified by immune gene array profiling. The American Journal of Pathology, 179(1):37-45, 2011.
[32] R. Díaz-Uriarte and S. A. de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7:3, 2006.
[33] K. Deb and A. Reddy. Reliable classification of two-class cancer data using evolutionary algorithms. BioSystems, 2003.
[34] S. Deegalla and H. Boström. Classification of microarrays with kNN: comparison of dimensionality reduction methods. In Proceedings of the 8th international conference on Intelligent data engineering and automated learning, IDEAL'07, pages 800-809, Berlin, Heidelberg, 2007. Springer-Verlag.
[35] M. Dettling and P. Bühlmann. Finding predictive gene groups from microarray data. J. Multivar. Anal., 90:106-131, July 2004.
[36] T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1-15, London, UK, 2000. Springer-Verlag.
[37] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
[102] E. Pettersson, J. Lundeberg, and A. Ahmadian. Generations of sequencing technologies. Genomics, 93(2):105-111, 2009.
[103] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21-45, 2006.
[104] P. Pudil, J. Novovičová, and J. Kittler. Floating search methods in feature selection. Pattern Recogn. Lett., 15(11):1119-1125, Nov. 1994.
[105] A. Rao and A. Hero. Biological pathway inference using manifold embedding. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5992-5995, 2011.
[106] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
[107] D. M. Rocke, T. Ideker, O. G. Troyanskaya, J. Quackenbush, and J. Dopazo. Papers on normalization, variable selection, classification or clustering of microarray data. Bioinformatics, 25(6):701-702, 2009.
[108] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, Sept. 2007.
[109] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, 2002.
[110] E. Serpedin, J. Garcia-Frias, Y. Huang, and U. Braga-Neto. Applications of signal processing techniques to bioinformatics, genomics, and proteomics. EURASIP Journal on Bioinformatics and Systems Biology, 2009(1):250306, 2009.
[111] L. Sheng, R. Pique-Regi, S. Asgharzadeh, and A. Ortega. Microarray classification using block diagonal linear discriminant analysis with embedded feature selection. In ICASSP, pages 1757-1760. IEEE, 2009.
[112] L. Shi, G. Campbell, W. D. Jones, F. Campagne, and Z. Wen. The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature biotechnology, 28:827-38, Aug 2010.
[113] E. Smirnov. On exact methods in systematics, volume 17. Systematic Zoology, 1968.
[114] A. R. Statnikov, L. Wang, and C. F. Aliferis. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9, 2008.
[115] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545-15550, 2005.
[116] S. Suster and C. A. Moran. Applications and limitations of immunohistochemistry in the diagnosis of malignant mesothelioma. Adv Anat Pathol, 13(6):316-29, 2006.
[117] F. Tai and W. Pan. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics, 23(14):1775-1782, 2007.
[118] E. Tapia, P. Bulacio, and L. Angelone. Recursive ECOC classification. Pattern Recogn. Lett., 31(3):210-215, Feb. 2010.
[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics, 12:59, 2011.
[120] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Elsevier Science, 2008.
[121] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267-288, 1996.
[122] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays. Statistical Science, 18(1):104-117, 2003.
[123] T. Tong and Y. Wang. Optimal Shrinkage Estimation of Variances With Applications to Microarray Data Analysis. Journal of the American Statistical Association, 102(477):113-122, Mar. 2007.
[124] S. Vantini, V. Vitelli, É. de France, and P. Zanini. Treelet analysis and independent component analysis of Milan mobile-network data: Investigating population mobility and behavior. Analysis and Modeling of Complex Data in Behavioural and Social Sciences, page 87, 2012.
[125] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.
[126] M. D. Vose. The Simple Genetic Algorithm: Foundations and Theory. MIT Press, Cambridge, MA, USA, 1998.
[127] C. Wang, J. Xuan, H. Li, Y. J. Wang, M. Zhan, E. P. Hoffman, and R. Clarke. Knowledge-guided gene ranking by coordinative component analysis. BMC Bioinformatics, 11:162, 2010.
[128] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(4):1119-1130, Aug. 2012.
[129] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm with linear models and kernel methods. J. Mach. Learn. Res., 3:1439-1461, Mar. 2003.
[130] A. Whitney. A direct method of nonparametric measurement selection. Computers, IEEE Transactions on, 1971.
[131] F. Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80-83, 1945.
[132] D. R. Wilson and T. R. Martinez. Improved heterogeneous distance functions. CoRR, cs.AI/9701101, 1997.
[133] S. Wold, M. Sjostrom, and L. Eriksson. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58:109-130, 2001.
[134] X. Xu and A. Zhang. Selecting informative genes from microarray dataset by incorporating gene ontology. In Bioinformatics and Bioengineering, 2005. BIBE 2005. Fifth IEEE Symposium on, pages 241-245, 2005.
[135] A. Y. Yang, S. S. Sastry, A. Ganesh, and Y. Ma. Fast l1-minimization algorithms and an application in robust face recognition: A review. In ICIP, pages 1849-1852. IEEE, 2010.
[136] P. Yang, Y. H. Yang, B. B. Zhou, and A. Y. Zomaya. A review of ensemble methods in bioinformatics. Current Bioinformatics, (5):296-308, 2010.
[137] J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. Comput. Biology Bioinform., 1(4):181-190, 2004.
[138] W.-K. Yip, S. B. Amin, and C. Li. A survey of classification techniques for microarray data analysis. In H. H.-S. Lu, B. Schölkopf, and H. Zhao, editors, Handbook of Statistical Bioinformatics, Springer Handbooks of Computational Statistics, pages 193-223. Springer Berlin Heidelberg, 2011. 10.1007/978-3-642-16345-6_10.