König et al. BMC Bioinformatics (2015) 16:314
DOI 10.1186/s12859-015-0731-9
RESEARCH ARTICLE Open Access
Label noise in subtype discrimination of
class C G protein-coupled receptors: A
systematic approach to the analysis of
classification errors
Caroline König1*, Martha I Cárdenas1,2, Jesús Giraldo2, René Alquézar1,3 and Alfredo Vellido1,4
Abstract
Background: The characterization of proteins in families and subfamilies, at different levels, entails the definition and
use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance
of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any
quantitative analysis of proteins that depends on label information. This study investigates class C of G
protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and
pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence
data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the
lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations
of the protein primary sequences.
Results: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of
G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of
labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the
extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor
subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with
phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information,
Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories.
Conclusions: In quantitative classification problems, class labels are often by default assumed to be correct. Label
noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through
complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of
singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize
this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose.
The proposed method enables a filtering approach to the label noise problem and might become a support tool for
database curators in proteomics.
Keywords: G Protein-coupled receptors, Label noise, Support vector machines, Phylogenetic trees
*Correspondence: [email protected]
1Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona,
1-3, 08034 Barcelona, Spain
Full list of author information is available at the end of the article
© 2015 König et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
Proteins have a rich taxonomy of families and subfamilies,
for which the definition and use of class labels is necessary.
The adscription of a protein to a family may be uncertain,
or even wrong, thus becoming an instance of what has
come to be known as a label noise (LN) problem. Label
noise, which is commonplace in many scientific domains
[1], has a potentially negative effect on any quantitative
analysis of proteins that requires the use of label information.
In fact, there are few domains in which the effects of
LN are so pervasive as in biomedicine and bioinformatics
[2]. The problem of LN may take many forms: from the
human expert subjectivity in the labelling process, which
is difficult to avoid, to bounds on the available information
and communication noise [3].
In medicine, for instance, the reliability of diagnostic
labels is often bounded by the natural limitations of the
specialists' expertise [4], or even by the formal requirements
of majority-based decision-making procedures, or
consensus guidelines (for the latter see, for instance,
[5]). In bioinformatics, protein subtype characterization
is a task that is riddled with this problem, despite
good practices in curation of genomic and proteomic
databases [6].
In the specific field of G protein-coupled receptors
(GPCRs), which are the target of the current study, this
problem is magnified by the fact that subtyping can be
performed at up to seven levels of detail [7]. GPCRs
are cell membrane proteins of relevance both to biology,
due to their role in transducing extracellular signals and
regulating signaling pathways, and to the pharmaceutical
industry, for being the target of many new therapies
for pathologies related to the cardiovascular,
neural, endocrine, and immune systems, as well as for
cancer [8].
The current study concerns class C of these receptors,
which has become an increasingly important target
for new therapies, particularly in various central nervous
system disorders such as Alzheimer's disease, anxiety,
drug addiction, epilepsy, pain, Parkinson's disease
and schizophrenia [9]. Whereas all GPCRs are characterized
by sharing a common seven transmembrane helices
(7TM) domain, responsible for G protein activation, most
class C GPCRs include, in addition, an extracellular large
domain, the Venus Flytrap (VFT) and a cysteine rich
domain (CRD) connecting both [10]. The VFT comprises
two opposing lobes with a cleft where endogenous ligands
bind. Significant synthetic efforts are currently devoted by
academia and pharmaceutical companies to the design of
compounds that, by binding to the 7TM domain, modulate
the function of endogenous ligands allosterically.
This multi-domain structural and functional complexity
makes class C GPCRs an attractive target for both
basic and applied (drug discovery) research. It is worth
noting that, although no GPCR allosteric modulators have
yet been approved for psychiatric or neurological disorders,
a number of GPCR allosteric modulators including,
particularly, some from class C, are under clinical development
[11]. Allosteric modulators are of especial interest
in comparison to orthosteric ligands due to their reduced
desensitization, tolerance and side effects as well as higher
selectivity among receptor subtypes and activity depending
on the spatial and temporal presence of endogenous
agonist [11].
Class C has been further subdivided into seven subtypes
[12]: Metabotropic glutamate (mG), Calcium sensing
(CS), GABAB (GB), Vomeronasal (VN), Pheromone (Ph),
Odorant (Od) and Taste (Ta) receptors. mG receptors
are activated by the glutamate amino acid (AA), which is
the major excitatory neurotransmitter in the brain; they
comprise eight subtypes (mGlu1 to mGlu8) in turn separated
into three groups: Group I (mGlu1 and mGlu5),
Group II (mGlu2 and mGlu3) and Group III (mGlu4,
mGlu6, mGlu7 and mGlu8). Group I mGs signal through
Gq whereas Groups II and III signal through Gi/Go
signaling pathways. The mG receptors are involved in
major neurological disorders such as Alzheimer's and
Parkinson's diseases, Fragile X syndrome, depression,
schizophrenia, anxiety, and pain [13]. It is noteworthy
that, although development programs related to the mG
drugs Pomaglumetad (Lilly), Mavoglurant (Novartis) and
Basimglurant (Roche) for the treatment of schizophrenia,
Parkinson's disease and Fragile X syndrome have recently
been discontinued, some of these drugs are still expected
to be beneficial for targeted patient sub-populations with
neurological and psychiatric disorders [14].
The CS receptor is activated by the calcium ion and
plays a key role in the regulation of extracellular calcium
homeostasis. Abnormalities of the extracellular calcium
sensing system lead to a disease exhibiting abnormal
secretion of parathyroid hormone and hypo- or hypercalcemia.
Cinacalcet is a marketed positive allosteric modulator
of the CS receptor that has proved useful for primary
or secondary hyperparathyroidism [9].
The metabotropic GB receptor is activated by GABA, a
neurotransmitter which mediates most inhibitory actions
in the nervous system. From a structural point of view,
the GB receptor distinguishes itself from other class C
GPCRs for its lack of CRD. The GB receptor is involved in
chronic pain, anxiety, depression and addiction. Baclofen
is an orthosteric agonist of the GB receptor that is commonly
used as a muscle relaxant in multiple sclerosis and
as analgesic. Because of their recognized pharmacological
advantages, a number of positive allosteric modulators of
the GB receptor are currently the goal of programs under
development [9].
The investigation of protein functionality and signalling
mechanisms is often based on the knowledge of crystal
3-D structures. In eukaryotic cell membrane proteins such
as GPCRs, this knowledge is partial and fairly recent: The
first GPCR crystal 3-D structure was fully-determined in
2000 [15] and only over the last decade, the structures of
some other GPCRs, most belonging to class A, have been
solved [16].
No class C full 3-D structure has yet been solved.
Up until the time of writing, only two transmembrane
(TM) domains and several extracellular domains of class
C GPCRs have been independently determined [17, 18].
This means that, in the absence of tertiary structure information,
investigations on their functionality from the primary
structure (that is, from the AA sequences), in this
case publicly available from several curated databases, can
be particularly useful.
As mentioned in previous paragraphs, Class C GPCRs
belong to different subtypes, with their corresponding
labels. The occurrence of LN is unavoidable in this context
because the assignment of individual sequences to one of
these subtypes is itself, in most cases, a model-based process,
which follows a complex many-step procedure that
can only guarantee limited success [19].
GPCR subtype discrimination, as a computer-based
automated classification procedure, may use aligned
(through Multiple Sequence Alignment, or MSA [20]) or
unaligned [21] versions of the sequences. One approach
to sequence alignment-free analysis is the transformation
of the primary sequences according to the physicochemical
properties of their constituent AAs. Transformation
methods based on the sequence composition information
result in data feature vectors whose processing entails
comparatively low computational costs. A review of several
such methods can be found in [22].
Building from preliminary results presented in [23], we
focus our investigation on the classification of data resulting
from several alignment-free transformations of class C
GPCR sequences, using Support Vector Machines (SVM).
The sequences with the most consistent misclassification
patterns are further analyzed to discover non-random
LN effects, as a way to explore their possible biological
explanation.
The candidate class C GPCR mislabelings detected
using such approach are further validated through
sequence visualization with phylogenetic trees (PT),
dendrogram-like graphical representations of the evolutionary
relationship between taxonomic groups which
share a set of homologous sequence segments [24]. The
visualization of the evolutionary relationship through PTs
helped in this study to confirm the correctness of the
detected persistent mislabelings.
The reported experiments using data from a curated
GPCR database are meant to be the proof of concept for
a systematic approach to assist the discovery of GPCR
database labelling quality problems, which would in turn
become the core of a label filtering decision support system
[3], a useful tool for database curators in proteomics.
The remainder of the paper is structured as follows: The
next section describes the analyzed class C GPCR data
and the data transformation and classification methods,
including the validation procedure. This is followed by a
report of the experimental results and their discussion.
The study wraps up with some conclusions.
Methods
This section starts with a brief description of the data
analyzed in our experiments, which is followed by an
explanation of the machine learning-based classification
procedure used in their analysis.
Materials
As described in the previous section, GPCRs are cell
membrane proteins with the main role of signal transmission
between the intracellular and extracellular spaces.
The GPCRDB is a curated, publicly accessible "molecular-class
information system that collects, combines, validates
and stores [...] data on G protein-coupled receptors" [12].
This database divides the GPCR superfamily into several
major classes, namely A (rhodopsin like), B (secretin like),
C (metabotropic glutamate/pheromone), cAMP receptors,
vomeronasal receptors (V1R and V2R) and Taste
receptors T2R, based on the ligand types, functions and
sequence similarities. Also as previously mentioned, the
current study focuses on class C GPCRs.
The primary sequence data analyzed in this study were
extracted from GPCRDB version 11.3.4, as of March 2011,
and comprise a total of 1,510 class C GPCR sequences,
belonging to the seven aforementioned subtypes (mG, CS,
GB, VN, Ph, Od and Ta), including: 351 mG, 48 CS, 208
GB, 344 VN, 392 Ph, 102 Od and 65 Ta receptors. The
lengths of these sequences varied from 250 to 1,995 AAs.
Methods - alignment-free data transformations
As the AA primary sequences have a variable length, it
is necessary to transform the sequence data to fixed-size
vectors in order to use them with supervised classifiers.
Here, we describe the different transformation methods
applied to the analyzed class C GPCR dataset.
In this study, four different transformations were used,
where we distinguish between those based on the N-gram
representation built on the AA alphabet and those based
on the physicochemical properties of the AAs. The use
of the N-gram representation is common in protein characterization
and has been investigated in, for instance,
[25–27]. Here, we use the AAC and Digram methods,
which transform the data according to the frequency of
appearance of N-grams of, in turn, length one and length
two in the sequence. On the other hand, we decided
to use more complex transformations based on the
physicochemical properties of the AAs and the sequencing
information such as Auto-Cross Covariance (ACC)
[28] and Physicochemical Distance-Based Transformation
(PDBT [22]). Beyond computational convenience, the
use of transformations based on the physico-chemical
properties of the AAs is justified by the fact that, as
stated in [22], "because protein structure and function are
more conserved during evolutionary process, the similarity
between two distantly related proteins may lie in the
physicochemical properties of the AAs rather than the
sequence identities". In the following, we describe each of
the transformations in some detail:
• N-gram representations: These transformations
partially disregard sequential information to reflect
only the relative frequency of appearance of AA
subsequences. In the case of AAC, the frequencies of
appearance of the 20 AAs (1-gram) are calculated for
each sequence (i.e., an $N \times 20$ matrix is obtained,
where $N$ is the number of items in the dataset). In the
case of the Digram (2-gram) method, we calculate the
frequency of each of the 400 possible AA pair
combinations from the AA alphabet (i.e., an $N \times 400$
matrix is obtained).
• Auto cross covariance transformation: The ACC
[28, 29] is a more sophisticated transformation,
capturing the correlation of the physico-chemical
descriptors along the sequence. First, the
physico-chemical properties are represented by
means of the five $z$-scores of each AA, as described in
[30]. Then, the Auto Covariance (AC) and Cross
Covariance (CC) variables are computed on this first
transformation. These variables measure, in turn, the
correlation of the same descriptor (AC) and the
correlation of two different descriptors (CC) between
two residues separated by a lag along the sequence.
From these, the ACC fixed length vectors can be
obtained by concatenating the AC and CC terms for
each lag value up to a maximum lag, $l$. This
transformation generates an $N \times (z^2 \cdot l)$ matrix, where
$z = 5$ is the number of descriptors. In this work we
use the ACC transformation for a maximal lag value
of $l = 13$, which was found in [31] to provide the best
accuracy for the analyzed data set.
• Physico-chemical distance-based transformation:
The PDBT transformation [22] is a complex
transformation that uses a large set of
physicochemical properties: 531 values representing
physicochemical and biochemical properties of AAs
are taken into account. Furthermore, sequence-order
information is incorporated in the representation in
the form of the correlation of each property between
two AAs separated by a maximal lag $l$. In the current
study, we use the PDBT transformation for a
maximal lag of 8, which yields an $N \times 4248$ matrix that
was previously analyzed in [32].
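As an illustration of the fixed-size vectors described above, the following minimal sketch computes the AAC and Digram frequencies and an ACC-style vector for one sequence. It is not the authors' implementation: the function names, the toy sequence, the placeholder descriptor table (which does not contain the z-scores of [30]) and the choice of centring each descriptor on its per-sequence mean are all assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def ngram_frequencies(sequence, n):
    """Relative frequency of every possible n-gram over the 20-letter alphabet."""
    grams = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(grams, 0)
    total = 0
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if gram in counts:  # windows with non-standard symbols are skipped
            counts[gram] += 1
            total += 1
    return [counts[g] / total if total else 0.0 for g in grams]


def acc_transform(sequence, descriptors, max_lag):
    """AC and CC terms for lags 1..max_lag (sequence must be longer than max_lag).

    `descriptors` maps each amino acid to a list of z numeric descriptors.
    Centring on the per-sequence mean is an assumption of this sketch.
    """
    values = [descriptors[aa] for aa in sequence if aa in descriptors]
    n = len(values)
    z = len(next(iter(descriptors.values())))
    means = [sum(v[d] for v in values) / n for d in range(z)]
    centred = [[v[d] - means[d] for d in range(z)] for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(z):
            for b in range(z):  # a == b gives an AC term, a != b a CC term
                total = sum(centred[i][a] * centred[i + lag][b]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features  # z * z * max_lag values


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # toy sequence, not taken from GPCRDB
aac = ngram_frequencies(seq, 1)      # 20 values (one row of the N x 20 matrix)
digram = ngram_frequencies(seq, 2)   # 400 values (one row of the N x 400 matrix)
# Placeholder descriptors with z = 2; NOT the five z-scores of [30].
toy = {aa: [float(i), float(i % 3)] for i, aa in enumerate(AMINO_ACIDS)}
acc = acc_transform(seq, toy, max_lag=3)  # 2 * 2 * 3 = 12 values
```

With five descriptors and a maximal lag of 13, the same function would return 5 × 5 × 13 = 325 values per sequence, which matches the $N \times (z^2 \cdot l)$ shape stated in the text.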
Methods - SVM-based classification
SVMs have become commonplace in different problems
related to the classification of proteins from their primary
sequences. A non-exhaustive list of examples includes
SVM-HUSTLE [33], SVM-I-sites [34], SVM-n-peptide
[35], and SVM-BALSA [36]. In [22, 37], SVMs were
reported to be top-performing techniques for the classification
of sequences from similar transformations to those
used in the current study.
These methods have their foundations on statistical
learning theory and were first introduced in [38]. They
map the $D$-dimensional vectors $x_i$, $i = 1, \ldots, N$, where
$x_i \in \mathbb{R}^D$ and $N$ is the number of instances, into possibly
higher-dimensional feature spaces by means of a function
$\phi$. The goal is finding a linearly-separating hyperplane that
discriminates the feature vectors according to class label
with a maximal margin, while minimizing the classification
error $\xi$.
The simplest version is the linear SVM, where a
linear hyperplane that separates the examples from two
classes is assumed to exist. Such a hyperplane is defined
by the set of points $x$ that satisfy $w \cdot x + b = 0$, where
$w$ is a normal vector to the hyperplane and $|b| / \|w\|$ is the
perpendicular distance from the hyperplane to the origin.
In consequence, the SVM algorithm, when searching
for the hyperplane with largest margin, assumes that
$y_i (x_i \cdot w + b) - 1 \geq 0, \ \forall i$, where $y_i$ are the class labels. The
objective of the SVM algorithm is finding the separating
hyperplane that satisfies this expression while minimizing
$\|w\|^2$. This problem can be translated to a Lagrange
formulation in which the following objective function
$L_P$ (primal Lagrangian) must be minimized with respect
to $w$, $b$, where $\alpha_i \geq 0$ are the Lagrange multipliers and $l$ is
here the number of training instances:
$$L_P \equiv \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \qquad (1)$$
This is equivalent to the maximization of the dual
Lagrangian form $L_D$:
$$L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (2)$$
subject to the restriction that the gradient of $L_P$ with respect
to $w$ and $b$ vanishes (and all $\alpha_i \geq 0$), which leads to a closed solution.
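The step from the primal form (1) to the dual form (2) can be made explicit. The following is the standard textbook derivation written with the symbols used above; it is included as a reading aid and is not taken from the article itself.

```latex
% Stationarity of L_P with respect to w and b:
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_i y_i x_i ,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0 .

% Substituting both conditions into (1):
\frac{1}{2}\|w\|^2 = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j ,
\qquad
\sum_{i} \alpha_i y_i \, (x_i \cdot w) = \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j ,
\qquad
b \sum_{i} \alpha_i y_i = 0 ,

% so that only the multipliers and the dot products remain:
L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j .
```

Because the data enter $L_D$ only through the dot products $x_i \cdot x_j$, these can later be replaced by a kernel function.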
A modification of the algorithm was introduced in [39],
allowing a so-called "soft-margin" to account for mislabeled
data when a linear separating hyperplane could not
be found. A classification error $\xi$ is admitted and a parameter
$C$ controlling the trade-off between those errors and
margin maximization is defined (Note that, for $C \to \infty$,
the model becomes equivalent to a hard-margin SVM).
The SVM can be extended to nonlinear classification
[40] by applying the so-called kernel trick [41]. The use
of nonlinear kernel functions allows SVMs to separate
input data in higher-dimensional feature spaces in a way
they would not be separable with linear classifiers in the
original input space. Kernel functions make it possible
to solve the problem without explicitly calculating the
mapping $\phi$ (that is, without calculating data coordinates
in the implicit feature space). This is possible due to
the following property: $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, which
means that any dot product in the optimization procedure
can be replaced by a nonlinear kernel function $k$. In
this study we use the radial basis function (RBF) kernel,
specified as $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, which is a popular
nonlinear choice for SVM and has been used in the
experiments reported in the following sections. With it,
the model requires adjusting two parameters through grid
search: the error penalty parameter $C$ and the $\gamma$ parameter
of the RBF function, which regulates the "space of
influence" of the model support vectors and, therefore,
controls overfitting.
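The effect of the $\gamma$ parameter on the "space of influence" can be seen in a small numerical sketch of the RBF kernel. The function names, the three toy points and the two $\gamma$ values are illustrative assumptions; they are not values used in the study.

```python
import math


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, gamma):
    """Kernel value for every pair of points (symmetric, ones on the diagonal)."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]  # toy 2-D vectors
narrow = gram_matrix(points, gamma=1.0)    # large gamma: similarity decays quickly
wide = gram_matrix(points, gamma=0.01)     # small gamma: distant points stay similar
```

With $\gamma = 1$ the first and third points (squared distance 25) are essentially unrelated, whereas with $\gamma = 0.01$ their kernel value is still $e^{-0.25}$. This is why a grid search over $\gamma$ and $C$ acts as a control on overfitting.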
The discrimination of the seven subtypes of class C
GPCRs requires extending the original binary (two-class)
classification approach of SVMs to a multi-class one. To
that end, we chose the "one-against-one" approach to
build the global classification model, implemented as part
of the LIBSVM library [42].
This approach performs class prediction according to
the results of a voting scheme applied to the binary classifiers,
i.e., according to the number of times a class is
predicted in each binary classifier. Therefore, this multi-class
classifier internally uses $K(K-1)/2$ binary classifiers
for distinguishing $K$ classes. A total of 21 binary classifiers
were thus built for the 7 class C GPCR subtypes in our
study.
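The voting scheme can be sketched in a few lines. This is an illustration of one-against-one voting in general, not a reproduction of LIBSVM internals: the `decision` callback, the toy scores standing in for trained binary SVMs, and the rule that ties go to the class listed first are assumptions of the example.

```python
from itertools import combinations


def one_vs_one_predict(decision, classes):
    """Predict by majority vote over all K(K-1)/2 pairwise classifiers.

    `decision(i, j)` is a hypothetical callback returning a positive value
    when the binary classifier for the pair (i, j) prefers class i.
    """
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        winner = i if decision(i, j) > 0 else j
        votes[winner] += 1
    # max() keeps the first class in case of a tie (assumption of this sketch)
    predicted = max(classes, key=lambda c: votes[c])
    return predicted, votes


classes = ["mG", "CS", "GB", "VN", "Ph", "Od", "Ta"]
# Toy scores standing in for trained binary SVMs; purely illustrative.
toy_score = {"mG": 0.1, "CS": -0.4, "GB": 0.0, "VN": 0.3,
             "Ph": 0.9, "Od": -0.2, "Ta": -0.7}
predicted, votes = one_vs_one_predict(
    lambda i, j: toy_score[i] - toy_score[j], classes)
n_classifiers = len(classes) * (len(classes) - 1) // 2  # 21 for K = 7
```

Each class takes part in $K - 1 = 6$ pairwise contests, so a class can collect at most 6 votes per test pattern.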
Classification performance measures
Two different figures of merit were used to evaluate the
test performance of the multi-class trained classifiers:
the Accuracy (Accu), which is the proportion of
correctly classified instances, and the Matthews Correlation
Coefficient (MCC), which involves all the elements
of the confusion matrix [43] and is therefore considered
a more complete figure of merit. Being defined as a
correlation coefficient between the observed and the predicted
classification, the MCC ranges from -1 to 1, where
1 corresponds to a perfect classification, 0 to a random
classification and -1 to complete misclassification.
In our experiments, we measure the Precision, Recall
and MCC at class or subtype level (i.e. at the level
of the binary classifier) and measure the Accuracy and
MCC at the global level (i.e., at the level of the multi-class
classifier). All these figures of merit, described in
Tables 1 and 2, are based on the concept of true and false
predictions in binary classification with "positive" and
"negative" classes. True positives (tp) and true negatives
(tn) are correctly classified cases of, in turn, the positive
and negative classes. Accordingly, false positives (fp) and
false negatives (fn) are misclassified cases of, in turn, the
negative and positive classes.
Table 1 Performance measures for binary classifiers
Measure     Formula                                                                      Meaning
Accuracy    $\frac{tp + tn}{tp + fn + fp + tn}$                                          Measure of correctness
Precision   $\frac{tp}{tp + fp}$                                                         Measure of quality
Recall      $\frac{tp}{tp + fn}$                                                         Measure of completeness
MCC         $\frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$   Correlation coefficient
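The binary measures of Table 1 translate directly into code. The sketch below is illustrative: the function name and the example counts are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the article.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Accuracy, precision, recall and MCC from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 on empty denominator (assumption)
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}


# Illustrative counts, not results from the study.
example = binary_measures(tp=40, tn=50, fp=5, fn=5)
```

For these counts the accuracy is 0.9 while the MCC is 1975/2475, roughly 0.80, which shows how the MCC penalizes errors on both classes more visibly than the accuracy does.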
By using a 5-fold cross-validation (CV) procedure to
evaluate the multi-class trained classifier, the reported
measures are the mean values of the respective metrics
over the five iterations of the 5-CV.
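The multi-class Accuracy and MCC can be computed from a $K \times K$ confusion matrix whose entry $(i, j)$ counts the examples of true class $i$ assigned to class $j$. The sketch below uses row and column sums, which is algebraically the same as the triple-sum form of the multi-class MCC; the function names and the 3-class matrix are illustrative assumptions.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix (rows = true classes)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    trace = sum(confusion[i][i] for i in range(k))
    row_sums = [sum(confusion[i]) for i in range(k)]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    numerator = trace * total - sum(r * c for r, c in zip(row_sums, col_sums))
    denominator = (math.sqrt(sum(c * (total - c) for c in col_sums))
                   * math.sqrt(sum(r * (total - r) for r in row_sums)))
    return numerator / denominator if denominator else 0.0


def average_accuracy(confusion):
    """Mean over classes of (tp_i + tn_i) / (tp_i + fn_i + fp_i + tn_i)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for i in range(k):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(k)) - tp
        tn = total - tp - fn - fp
        per_class.append((tp + tn) / total)
    return sum(per_class) / k


# Illustrative 3-class confusion matrix, not data from the study.
confusion = [[5, 1, 0],
             [0, 4, 1],
             [1, 0, 6]]
mcc = multiclass_mcc(confusion)
accuracy = average_accuracy(confusion)
```

In a 5-fold CV setting, one such matrix would be produced per fold and the reported figure would be the mean of the five values.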
Table 2 Performance measures for multi-class classifiers. $tp_i$, $tn_i$,
$fp_i$ and $fn_i$ are, in turn, tp, tn, fp and fn for class $i$ [59]. The
multi-class MCC is calculated taking into account all the entries of
the confusion matrix $C_{K \times K}$ involving all $K$ classes [60]. The $ij$-th
entry ($c_{ij}$) is the number of examples of the true class $i$ that have
been assigned to the class $j$ by the classifier
Measure     Formula
Accuracy    $\frac{1}{K} \sum_{i=1}^{K} \frac{tp_i + tn_i}{tp_i + fn_i + fp_i + tn_i}$
MCC         $\frac{\sum_{k,l,m=1}^{K} \left( C_{kk} C_{ml} - C_{lk} C_{km} \right)}{\sqrt{\sum_{k=1}^{K} \left( \sum_{l=1}^{K} C_{lk} \right) \left( \sum_{f,g=1,\, f \neq k}^{K} C_{gf} \right)} \; \sqrt{\sum_{k=1}^{K} \left( \sum_{l=1}^{K} C_{kl} \right) \left( \sum_{f,g=1,\, f \neq k}^{K} C_{fg} \right)}}$
Methods - A systematic approach to GPCR misclassification
analysis
Given a transformed dataset, our proposed systematic
approach to the analysis of the classification errors consists
of three steps or phases:
1. Estimation of the frequency of misclassification of
each pattern (sequence) using different SVM models
to select a subset of frequently misclassified patterns.
2. For each pattern in the subset selected in step 1,
evaluation of the relation of votes of all the SVM
classifiers between its true (label) class and its
most-predicted class.
3. For each pattern in the subset selected in step 1,
assessment of the decision values of the SVM binary
classifiers between its true (label) class and its
most-predicted class.
The aim of the first step is the detection of those patterns
that, most of the time, are not classified as belonging
to the class defined by their formal database label, but
without considering the distribution of predicted classes
in the misclassifications. Instead, the aim of the second
and third steps is to confirm the consistency of the misclassifications
to the most-predicted class. The last two
steps differ in whether only the
votes (i.e. the binary decisions of the SVM classifiers) are
taken into account, or also the confidence (i.e. the decision
values) of the binary SVM classifiers that confront
just the class label against the most-predicted class.
The union of patterns obtained as
a result in steps 2 and 3 forms the final subset of frequently
and consistently misclassified sequences that are
shortlisted as label noise candidates.
In the following subsections, further details of each one
of the three steps are provided.
Repeated classification with different SVM models
The first phase entails repeating the following procedure
100 times. Although this constant value could be changed,
100 is adequate both to obtain a statistically reliable result
and to express the frequencies of misclassification directly
as percentages (or error rates, $ER_s$, for each sequence $s$).
This type of repeated cross-validation approach has been
proposed as well in [44] and applied in [45].
• First, the dataset is randomly reordered and a 5-fold
cross validation (5-CV) is used, so that, for each of
the five training-test partitions, the current training
set is employed to construct an RBF-SVM model [42]
with an optimal value of the $\gamma$ parameter of the
kernel function and with the error penalty parameter
$C$ varying within a small range near its previously
established optimum value.
• Second, a test set classification is carried out using
the trained model, registering which GPCR
sequences are misclassified and generating the
corresponding confusion matrix.
The use of CV in each of the 100 repetitions of this procedure
ensures that each instance is classified exactly one
time as a test pattern in each iteration of the outer loop.
Note that $C$ is slightly modified in each iteration of the
inner loop.
With this, we obtain detailed results of how many times
a sequence was misclassified when included in the test
set and how many of these times it was assigned to
specific classes. Note that the classification results
obtained when the sequence belongs to the training set are not
taken into account. In order to focus only on the most
recurrent classification errors, a conservative misclassification
boundary of $e = 75\,\%$ on the individual error rate
$ER_s$ was set (i.e., only sequences $s$ misclassified in at least
75 % of the test occasions were deemed to be strong
misclassifications and selected for further analysis). This
threshold $e$ is merely illustrative; in a real application of the
method, it should be set according to the expert analyst's
decision. A high threshold would ensure that only the
most extreme misclassifications are singled out for further
detailed analysis, whereas low thresholds would be more
adequate in case a more global exploration is required.
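The bookkeeping of this first phase can be sketched as follows. The example is a simplified illustration under stated assumptions: `fit_predict` is a hypothetical callback standing in for training and applying the RBF-SVM, a nearest-centroid rule on one-dimensional toy data replaces the SVM so that the block runs without external libraries, and the variation of $C$ across the inner loop is omitted.

```python
import random
from collections import Counter, defaultdict


def repeated_cv_error_rates(X, y, fit_predict, repetitions=100, folds=5, seed=0):
    """Percentage of repetitions in which each item is misclassified as a test pattern."""
    rng = random.Random(seed)
    errors = [0] * len(X)
    assigned = defaultdict(Counter)  # item index -> wrongly predicted labels
    for _ in range(repetitions):
        order = list(range(len(X)))
        rng.shuffle(order)  # random reordering before each 5-CV
        for f in range(folds):
            test_idx = order[f::folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            predictions = fit_predict([X[i] for i in train_idx],
                                      [y[i] for i in train_idx],
                                      [X[i] for i in test_idx])
            for i, predicted in zip(test_idx, predictions):
                if predicted != y[i]:
                    errors[i] += 1
                    assigned[i][predicted] += 1
    return [100.0 * e / repetitions for e in errors], assigned


def nearest_centroid(train_X, train_y, test_X):
    """Stand-in classifier (NOT an SVM): predict the class with the closest mean."""
    sums, counts = defaultdict(float), Counter()
    for x, label in zip(train_X, train_y):
        sums[label] += x
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in counts}
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in test_X]


# Toy data: class "a" near 0, class "b" near 10, item 9 deliberately mislabeled.
X = [0.2 * k for k in range(9)] + [9.9] + [9.0 + 0.2 * k for k in range(10)]
y = ["a"] * 10 + ["b"] * 10
rates, assigned = repeated_cv_error_rates(X, y, nearest_centroid)
selected = [i for i, rate in enumerate(rates) if rate >= 75.0]  # threshold e = 75 %
```

Because every item is tested exactly once per repetition, the error count over 100 repetitions is directly a percentage. In this toy setting only the deliberately mislabeled item passes the 75 % boundary.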
Analysis of misclassifications according to the voting scheme
Since we are facing a multi-class ($K$ classes) classification
problem in which the underlying classification scheme
of the SVM implementation [42] was "one-vs-one", it is
interesting to analyze the results of the voting scheme as
applied to the $K(K-1)/2$ resulting classifiers, including
the votes for each one, for each pattern in each test iteration.
According to LIBSVM, the subtype with the highest
number of votes in each case becomes the predicted class
for the test pattern.
For each frequently misclassified sequence $s$, selected
in the first phase, we focus the analysis on the relation
between the total number of votes $VT_s$ obtained by the
true (label) class in the 100 iterations and those obtained
by the most frequently predicted class for that sequence,
$VP_s$. That is, we define the voting ratio
$$R_s = \frac{VT_s}{VP_s} \qquad (3)$$
and, given some threshold $\theta_R$, we consider that $R_s \leq \theta_R$
indicates a consistent (also deemed as large) classification
error, while $R_s > \theta_R$ denotes a more doubtful (or small)
misclassification. We fixed a threshold $\theta_R = 0.5$ to obtain
our results discussed later.
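A worked example of the voting ratio in (3) follows. The vote totals are invented for illustration; with $K = 7$ a class can receive at most 6 votes per test iteration, so totals over 100 iterations cannot exceed 600. The function names are assumptions of this sketch.

```python
def voting_ratio(true_votes, predicted_votes):
    """R_s = VT_s / VP_s for one frequently misclassified sequence."""
    if predicted_votes <= 0:
        raise ValueError("the most-predicted class must have received votes")
    return true_votes / predicted_votes


def error_consistency(ratio, theta_r=0.5):
    """'consistent' (large) error if R_s <= theta_R, otherwise 'doubtful' (small)."""
    return "consistent" if ratio <= theta_r else "doubtful"


# Hypothetical vote totals accumulated over 100 test iterations.
r_clear = voting_ratio(true_votes=150, predicted_votes=600)     # 0.25
r_unclear = voting_ratio(true_votes=480, predicted_votes=560)   # about 0.857
labels = (error_consistency(r_clear), error_consistency(r_unclear))
```

A low ratio means that the label class rarely wins its pairwise contests, so the misclassification is stable rather than an artifact of a few close votes.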
Analysis of misclassifications according to the decision values
In the third and last phase of our proposed approach, we
go deeper into the analysis of misclassifications by taking
into account the confidence (decision values) of the
100 binary SVM classifiers involving only the label class
and the most frequently predicted class, when classifying
a sequence $s$ as test pattern. For each frequently misclassified
sequence $s$ selected in the first phase, we define a
cumulative decision value, $CDV_s$, as follows:
$$CDV_s = \sum_{k=1}^{100} DV_s(i, j, k) \qquad (4)$$
where $DV_s(i, j, k)$ is the decision value given by the binary
SVM classifier confronting the class with label $i$ to which $s$
formally belongs and the most-frequently predicted class
for sequence $s$, with label $j$, in the $k$-th test iteration. GPCR
subtype labels were numbered 1 to 7 in the order they are
presented in the data description section. For subtypes $i$, $j$,
a large positive $CDV_s$ value if $i > j$ and a large negative one
if $i < j$ both indicate clear misclassifications. Hence, the
magnitude of the error is deemed large or small depending
on whether the $CDV_s$ exceeds a certain threshold $\theta_{CDV}$
in absolute value or not. A threshold $\theta_{CDV} = 60$ was
chosen for the experiments.
Note that the information conveyed by $CDV_s$ complements
that of $R_s$. For instance, a misclassified sequence
with high $R_s$ would suggest that the voting process discards
all subtypes but the true and the predicted ones,
that is, a very narrow transfer of subtype assignment. If
this is accompanied by a large $CDV_s$ in absolute value, the
predicted subtype, even if wrong assuming that the identifying
label of the sequence is trusted, is strongly preferred
by the SVM classifiers.
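The cumulative decision value of (4) and its thresholding can be sketched as below. The lists of decision values are invented for illustration, and the function names are assumptions; in practice the values would come from the binary classifier that confronts the label class with the most-predicted class in each of the 100 test iterations.

```python
def cumulative_decision_value(decision_values):
    """CDV_s: sum of the decision values DV_s(i, j, k) over the test iterations k."""
    return sum(decision_values)


def error_magnitude(cdv, theta_cdv=60.0):
    """'large' if |CDV_s| exceeds theta_CDV, otherwise 'small'."""
    return "large" if abs(cdv) > theta_cdv else "small"


# Hypothetical decision values for two sequences over 100 test iterations.
confident = [0.9] * 100    # classifier consistently far from the boundary
hesitant = [-0.3] * 100    # same side every time, but close to the boundary
cdv_confident = cumulative_decision_value(confident)
cdv_hesitant = cumulative_decision_value(hesitant)
magnitudes = (error_magnitude(cdv_confident), error_magnitude(cdv_hesitant))
```

The sign of the sum depends on the order of the two labels, as described in the text, so only the absolute value is compared with the threshold.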
Methods - External validation of SVM-based classification
Mislabeling validation with phylogenetic trees
For proteins, a PT is a dendrogram-like graphical representation
of the evolutionary relationship between taxonomic
groups which share a set of homologous sequence
segments. This evolutionary relationship is a form
of hierarchically structured similarity-based grouping
process.
In this study, PTs were used to visualize the analyzed
class C GPCR sequences and thus provide an alternative
way to externally validate the misclassification results
reported in the previous sections. There are two sound
reasons why we use PTs for this task: first, because they
have de facto become standard tools in bioinformatics
[24] and, particularly, in protein homology detection, so
that protein database curators are more likely to trust
them. Second, because the protein sequence alignment
that underlies the tree construction has no direct link with
the sequence transformations from which the SVM classifiers
are built, therefore guaranteeing the independence
of the results.
Our software tool of choice, Treevolution1 [46], was
developed in Java and integrates the Processing2 package.
This tool supports visual and exploratory analysis of PTs
in either Newick or PhyloXML formats as radial dendrograms,
with high-level user-controlled data interaction at
the user request and offers several methods very useful for
large PT: sector distortion, tree rotation, pruning, labeling,
tracking of ancestors and descendants and text search,
among others.
The color-guided highlighting of protein families helps
the user to focus on sequence groupings of interest and
gives an overall idea of groups with the same ancestor
within the tree. The PT is created from an MSA obtained
with Clustal Omega [47]. This application, in which
sequence data are introduced in FASTA format, performs
distance-based MSA [48].
Results
This section starts with a brief summary of some preliminary
research that inspired the current study. This
is followed by a detailed analysis of misclassifications
according to a voting scheme and classifier decision values.
This analysis is validated using a standard LN detection
filter and sequence visualization through PTs.
Results from previous research
The experiments reported in this section extend some
basic preliminary results reported in [23]. In previous
research [49], we investigated the supervised classification
of the data set described in the previous section using
different classifiers, namely decision trees (DT), naïve
Bayes (NB) and SVM, for different alignment-free transformations
of the sequences, including AA composition
(AAC), the Mean Transformation [50] and Auto-Cross
Covariance (ACC) [28]. In this previous study, focus was
placed on the accuracy of the classifiers' performance
and the experimental results showed that SVM clearly
outperformed the rest of classifiers independently of the
transformation applied to the data set. This led to the conclusion
that a nonlinear classifier with the ability to find
a linear separation of instances in a higher-dimensional
feature space, such as SVM, was the adequate choice for
the data set under analysis in the task of subtype discrimination.
The second conclusion from this previous
study was that, at subtype level, classification accuracies
showed only small variations depending on data transformations.
Even a superficial analysis of the confusion
matrices showed recurrent patterns of subtype misclassification,
which hinted at LN as their cause. Such observations
provided support for a more detailed analysis of
sequence misclassification.
Repeated classification with different SVM models using
different transformations of the dataset
These previous results led us to decide on the
convenience of using a more diverse set of data transformation
techniques. Tables 3 and 4 summarize the best
subtype classification results obtained with SVM for the
four different transformed data sets, measured by
average accuracy (overall correct classification rate) and MCC.
These results are complemented by the box-plot
representation of the distributions of the accuracy and MCC
values, for each of the transformed data sets, over the 100
outer iterations of the classification procedure, shown in
Figs. 1 and 2. For all transformations, a low variability
of the results is observed, suggesting consistent estimates
that make the average figures of Table 3 quite reliable. Out
of these, the best classification results were found for the
Digram and ACC transformed data sets, although the
relative differences of accuracy and MCC make PDBT also a
reasonable choice.
Table 3 SVM classifier results: Global results for the four data
transformations; accuracy (Accu), Matthews Correlation
Coefficient (MCC)
Data    Accu  MCC
AAC     0.88  0.84
Digram  0.93  0.91
ACC     0.93  0.91
PDBT    0.92  0.90
Best results highlighted in bold
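As an illustration of the two global measures reported in Table 3, the following Python sketch computes the overall accuracy and a multiclass MCC from a confusion matrix. It assumes the common multiclass generalization of the MCC (rows are true classes, columns are predicted classes); whether the reported figures were obtained with exactly this formulation is not stated here, and the function names are hypothetical.

```python
import math

def accuracy(cm):
    """Overall correct classification rate of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total

def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient (assumed generalization).

    cm[i][j] counts instances of true class i predicted as class j.
    """
    n = len(cm)
    s = sum(sum(row) for row in cm)                           # all instances
    c = sum(cm[k][k] for k in range(n))                       # correctly classified
    t = [sum(cm[k]) for k in range(n)]                        # true count per class
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0
```

For a two-class matrix this expression reduces to the usual binary MCC, which is a convenient sanity check.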
A detailed analysis of the results per subtype revealed
relatively minor differences between those obtained with
each of the four transformed data sets. This observation
suggests that the main causes of misclassification might lie
beyond the differences between data transformations and
that a more systematic analysis of the classification errors
is required.
Table 5 shows a few illustrative misclassification
statistics for the ACC transformed data set. For instance,
sequence 6, which belongs to subtype VN according to
its database label, was misclassified 100 out of 100 times:
in 96 of them it was assigned to Ph and in 4 to Od (see Table 6
for the mapping between the number and the protein
database Id).
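The per-sequence statistics described above can be tallied with a few lines of Python. This is a minimal sketch, assuming that the predictions for one sequence over the repeated runs are available as a list of subtype labels; the function name is hypothetical and the example reproduces the counts quoted for sequence 6.

```python
from collections import Counter

def misclassification_profile(true_label, predictions):
    """Return the error rate (in %) and the count of each wrong prediction."""
    wrong = Counter(p for p in predictions if p != true_label)
    error_rate = 100.0 * sum(wrong.values()) / len(predictions)
    return error_rate, dict(wrong)

# Sequence 6 (database label VN): 96 runs predicted Ph and 4 predicted Od.
predictions_seq6 = ["Ph"] * 96 + ["Od"] * 4
error_rate, wrong_counts = misclassification_profile("VN", predictions_seq6)
```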
This misclassification analysis was repeated for each of
the transformed data sets. The AAC, Digram, ACC and
PDBT sets yielded, in turn, 143, 88, 85 and 100 strong
misclassifications. A detailed analysis of these frequently
misclassified sequences revealed that they are nearly
identical for ACC and Digram. There are some differences
with the PDBT misclassifications that might be the result
of the very different type of transformation. Importantly,
52 frequently misclassified sequences were common to
all four data sets and there was strong agreement on the
most-often predicted subtypes. These sequences are listed
in Additional file 1.
Table 4 SVM classifier results: Class C GPCR results per subtype
for the ACC data set only, including MCC, Precision (Prec) and
Recall (Rec)
Class  MCC   Prec  Rec
mG     0.95  0.95  0.99
CS     0.93  1.00  0.88
GB     0.98  0.99  0.99
VN     0.89  0.91  0.92
Ph     0.86  0.89  0.90
Od     0.79  0.89  0.74
Ta     0.99  1.00  0.98
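Per-subtype figures such as those in Table 4 are typically derived by treating each class in turn as the positive class against all the others. The sketch below follows that one-vs-rest convention, which is an assumption here rather than a documented detail of the study; the function name and the dictionary layout are hypothetical.

```python
import math

def per_class_metrics(cm, labels):
    """One-vs-rest MCC, precision and recall for each class.

    cm[i][j] counts instances of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                           # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp      # other classes predicted as k
        tn = total - tp - fn - fp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / den if den else 0.0
        results[name] = {"mcc": mcc, "prec": prec, "rec": rec}
    return results
```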
Fig. 1 Boxplot representation of the Accu of the AAC, Digram, ACC
and PDBT dataset
Analysis of misclassifications according to the voting
scheme
Interestingly, these results suggest the existence of
subtypes with recurrently wrong class assignments. So, we
applied the second step of our systematic approach based
on the voting scheme, as described earlier, to confirm
consistent misclassifications. To illustrate the results obtained
in this step, we show the voting scheme results for the
selected instances of Table 5. Sequence 6, for instance,
is a VN consistently misclassified as Ph. The magnitude
of the error is small, though, as the voting ratio (R_s) of
true class to predicted class is relatively high (0.67 > 0.5).
Sequence 2 is a CS, consistently misclassified as mG. The
magnitude of the error is large, as the R_s is quite low
(0.15 ≤ 0.5).
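The rule applied in these two examples can be written down compactly. The following Python sketch assumes, as the caption of Table 5 indicates, that the voting ratio is the sum of votes for the true class divided by the sum of votes for the most frequently predicted class, and that an error counts as large when this ratio does not exceed 0.5. The function names are hypothetical, and the vote totals are those listed in Table 5.

```python
def voting_ratio(votes_true, votes_pred):
    """Ratio of votes for the true class to votes for the predicted class."""
    return votes_true / votes_pred

def error_magnitude_by_votes(votes_true, votes_pred, threshold=0.5):
    """Label the error 'large' when the voting ratio is at or below the threshold."""
    ratio = voting_ratio(votes_true, votes_pred)
    return "large" if ratio <= threshold else "small"

# Vote totals (true class, predicted class) for the sequences of Table 5.
table5_votes = {2: (91, 600), 6: (404, 596), 7: (300, 600)}
magnitudes = {s: error_magnitude_by_votes(vt, vp)
              for s, (vt, vp) in table5_votes.items()}
```

Note that a sequence whose ratio is exactly 0.5, such as sequence 7, falls on the large-error side of the threshold under this reading.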
Only 7 of the 85 frequently misclassified ACC-transformed
sequences yielded large errors (see Table 6).
Similarly, for the AAC, Digram and PDBT sets, the majority
of sequences have small errors.
Fig. 2 Boxplot representation of the MCC of the AAC, Digram, ACC
and PDBT dataset
Table 5 Illustrative example of misclassification statistics of the
ACC data set. For some sequences s identified by numbers, the
error rate (ER_s), the true class (TC_s), and how many times this
sequence was misclassified as belonging to each of the other
subtypes (from mG to Ta), are displayed. The three last columns
list the sum of the votes for the true class (VT_s), for the most
frequently predicted class (VP_s), and the ratio (R_s) of one to the
other
s  ER_s  TC_s  mG   CS  GB  VN  Ph  Od  Ta  VT_s  VP_s  R_s
2  100   CS    100  0   0   0   0   0   0   91    600   0.15
6  100   VN    0    0   0   0   96  4   0   404   596   0.67
7  100   VN    100  0   0   0   0   0   0   300   600   0.5
Analysis of misclassifications according to the decision
values
Clear differences in the magnitude of the recurrent
classification errors were found. Pursuing further insight, we
applied the third step of our approach based on the
cumulative decision value (CDV_s) specifically for the binary
classifier that involves the true class and the predicted
class.
As previously mentioned, the magnitude of the error
was deemed large or small depending on whether the
CDV exceeded the threshold of 60 in absolute value or
not. A total of 21 out of the 85 frequently misclassified
instances of the ACC-transformed data set have a large
error according to this criterion, whereof 4 yield a very
large one (|CDV_s| ≥ 95: see Table 6).
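The decision-value criterion can be sketched in the same way. The snippet below only encodes the two thresholds quoted in the text (an absolute value above 60 for a large error, and of at least 95 for a very large one); how the cumulative decision value itself is accumulated is described elsewhere in the paper and is not reproduced here. The function name is hypothetical.

```python
def cdv_magnitude(cdv, large=60, very_large=95):
    """Classify an error by the absolute cumulative decision value."""
    size = abs(cdv)
    if size >= very_large:
        return "very large"
    if size > large:
        return "large"
    return "small"

# A few cumulative decision values taken from Table 6.
examples = {value: cdv_magnitude(value) for value in (-95, 109, 70, -66, 50)}
```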
Summary of the analysis of misclassifications
The proposed subtype classification approach revealed
the existence of a number of instances that, independently
of the sequence transformation method, induce
classification errors that could be deemed either large or small. The
information provided by R_s and CDV_s should be
understood as complementary, given that not fully coincident
instances are singled out in each approach.
Table 6 Sequences with large classification errors: For each
sequence s numbered s, the GPCRDB Identifier (Id_s), the true
class (TC_s), the predicted class (PC_s), the voting ratio (R_s) and the
cumulative decision value (CDV_s) are displayed
s   Id_s          TC_s  PC_s  R_s   CDV_s
1   q5i5c3_9tele  mG    Od    0.75  -95
2   XP_002123664  CS    mG    0.15  50
3   q8c0m6_mouse  CS    Ph    0.15  -46
4   XP_002740613  CS    mG    0     -66
5   XP_002936197  VN    Ph    0.83  -96
6   XP_002940476  VN    Ph    0.67  -95
7   XP_002941777  VN    mG    0.5   45
8   B0UYJ3_DANRE  Ph    mG    0.79  109
9   XP_001518611  Od    mG    0.31  46
10  XP_002940324  Od    VN    0.49  70
11  GPC6A_DANRE   Od    Ph    0.5   74
Extreme R_s and CDV_s values highlighted in bold
Importantly, this analysis showed that the
misclassifications of a sizeable proportion of sequences have a small
magnitude, so that they could be ignored unless a
thorough revision of the database labels is required. A small
number of instances, though, showed consistent and large
classification errors and they should be the focus of
interest from the database curation viewpoint. In Table 6, we
list GPCRs with either very large absolute value of CDV_s
(4 items) or small R_s (7 items) using the ACC transformed
dataset.
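The complementarity of the two criteria can be checked directly on the values of Table 6. The following Python sketch selects the sequences with a small voting ratio (at most 0.5) and those with a very large cumulative decision value (absolute value of at least 95); the tuple layout and function names are hypothetical, while the numbers are copied from the table.

```python
# Rows of Table 6: (sequence number, true class, predicted class,
# voting ratio, cumulative decision value).
TABLE6 = [
    (1, "mG", "Od", 0.75, -95),
    (2, "CS", "mG", 0.15, 50),
    (3, "CS", "Ph", 0.15, -46),
    (4, "CS", "mG", 0.0, -66),
    (5, "VN", "Ph", 0.83, -96),
    (6, "VN", "Ph", 0.67, -95),
    (7, "VN", "mG", 0.5, 45),
    (8, "Ph", "mG", 0.79, 109),
    (9, "Od", "mG", 0.31, 46),
    (10, "Od", "VN", 0.49, 70),
    (11, "Od", "Ph", 0.5, 74),
]

def small_ratio(rows, threshold=0.5):
    """Sequence numbers whose voting ratio is at or below the threshold."""
    return [row[0] for row in rows if row[3] <= threshold]

def very_large_cdv(rows, threshold=95):
    """Sequence numbers whose |CDV| reaches the 'very large' threshold."""
    return [row[0] for row in rows if abs(row[4]) >= threshold]

by_ratio = small_ratio(TABLE6)
by_cdv = very_large_cdv(TABLE6)
```

With these thresholds the two selections do not overlap, which is consistent with reading the two measures as complementary.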
Mislabeling validation
Validation through PT-based visualization of class C GPCRs
Figure 3 displays the Treevolution radial PT plot of the
complete set of 1,510 GPCRs of class C, additionally
showing the approximate distribution of its main seven
subtypes. In this representation, each outer branch
corresponds to one GPCR sequence. Tree colors are used
to represent families of descendant nodes. Note though
that these colors do not correspond to subtype labels.
We observe that some families correspond to not one but
several evolutionary branches. For example, the two
different colors assigned to Pheromone provide quantitative
Fig. 3 Radial PT plot showing the main areas of distribution of the
seven class C GPCR subtypes. Treevolution radial PT in which the
main sections occupied by each of the seven class C GPCR subtypes
are explicitly represented by arcs or groups of arcs in the periphery
of the tree. Note that branch colors are automatically generated
during PT construction and do not correspond to class C subtypes