Implementation of IFPTML Computational Models in Drug Discovery Against Flaviviridae Family
Yendrek Velásquez-López,* Andrea Ruiz-Escudero, Sonia Arrasate, and Humberto González-Díaz*
Cite This: J. Chem. Inf. Model. 2024, 64, 1841−1852
ABSTRACT: The Flaviviridae family consists of single-stranded positive-sense RNA viruses and contains the genera Flavivirus, Hepacivirus, Pegivirus, and Pestivirus. Currently, there is an outbreak of viral diseases caused by this family affecting millions of people worldwide, leading to significant morbidity and mortality rates. Advances in computational chemistry have greatly facilitated the discovery of novel drugs and treatments for diseases associated with this family. Chemoinformatic techniques, such as the perturbation theory machine learning method, have played a crucial role in developing new approaches based on ML models that can effectively aid drug discovery. The IFPTML models have shown their capability to handle, classify, and process large data sets with high specificity. The results obtained from different models indicate that this methodology is proficient in processing the data, resulting in a reduction of the false positive rate by 4.25%, along with an accuracy of 83% and reliability of 92%. These values suggest that the model can serve as a computational tool in assisting drug discovery efforts and the development of new treatments against Flaviviridae family diseases.
1. INTRODUCTION
Arboviruses (ARthropod-BOrne VIRUSes) are organisms transmitted by hematophagous arthropods.¹ Within this group, we find the Chikungunya virus (CHIKV), West Nile Virus (WNV), dengue virus (DENV), yellow fever virus (YFV), and Zika virus (ZIKV), transmitted by mosquitoes of the genus Aedes, notably Aedes aegypti.² Generally, these viruses, especially those transmitted by the same vector, share a similar symptomatology, but approximately 80% of infections are asymptomatic.³ There are variations, however: Zika presents a distinctive set of illnesses (Guillain−Barré syndrome and microcephaly in neonates).⁴
Taxonomic analyses have classified DENV, CHIKV, YFV, and ZIKV as members of the Flaviviridae family, Flavivirus genus.⁵⁻⁷ This genus is characterized by a positive-sense RNA of approximately 11 kb.⁸,⁹ It has been reported that the proteins encoded by the RNA are conserved among members of the Flavivirus genus; for example, DENV serotype 4 shows high similarity to its counterparts in the genus transmitted by mosquitoes.¹⁰,¹¹
Currently, the development of vaccines and drugs against certain members of the Flavivirus genus (YFV, DENV, WNV, Japanese Encephalitis Virus (JEV), etc.) is more widespread than against ZIKV.¹² Because the genome is highly conserved between them and ZIKV, the same treatments and medications have been used to treat this disease, hoping that they will have the same success among the other members of the family.¹²⁻¹⁴ However, the development of these medications has not managed to surpass the second phase of drug development.¹⁵,¹⁶
To address this dilemma, the development of computational chemistry has favored the discovery and design of novel drugs.¹⁷⁻²⁰ Currently, the identification and optimization of chemical compounds as potential candidates for pharmaceutical targets of interest have been investigated.²¹ The development of computational models based on machine learning (ML) is a widely used technique in computer-aided drug design.²² These models use the structural information on compounds and established assay conditions to elucidate new compounds that can interact with the desired targets.²³,²⁴
With the rise of Big Data and the advent of the digital era, a large amount of structural information about proteins and small molecules has been obtained.²⁵ This information could provide researchers and experimental developers with new pharmacological targets to be tested, opening a range of possibilities and opportunities for these methods to be used for drug discovery based on learning techniques.²⁶ The amount of data that can be obtained from these new methods and techniques can provide insights into the intrinsic relationships that could enable the generation of prediction models to encode chemical structures. These new techniques are based on experimental and computational information; therefore, they can be used to predict the activity of new compounds of pharmaceutical interest.

Received: November 7, 2023
Revised: February 26, 2024
Accepted: February 27, 2024
Published: March 11, 2024
© 2024 The Authors. Published by American Chemical Society
https://doi.org/10.1021/acs.jcim.3c01796
This article is licensed under CC-BY 4.0

By developing ML models, new systems can be created with the ability to learn and improve without being explicitly programmed beforehand.²⁷⁻²⁹ The accuracy and effectiveness of these models must meet criteria of responsibility, such as systematic identification and search of regularities, to guarantee the prediction and ensure that the analysis was carried out correctly.³⁰
These kinds of methods have been used in other scientific fields such as robotics, data mining, chemistry, and biotechnology.³¹⁻³³ However, one of the limitations of these conventional methods arises from the fact that they do not cover the large data sets that appear in trials. Taking the available clinical data as an example, there are useful data such as the cells, proteins, and organisms that were used in the trials, and these extra data can be used to enrich these models.³⁴
Perturbation Theory Machine Learning (PTML) models were developed to solve this problem by combining perturbation theory (PT) with ML models.²⁴ They are based on a reference property f(vij)ref.³⁵ PT perturbation operators are added to this property to measure the deviations that can occur with respect to it; in this way, the model can predict the properties of an unknown system that is similar to the original system used as the reference.³⁶⁻³⁸
It has been shown that PTML models are applicable to various correlation problems. However, most applications of these methods have focused on classification problems. PTML methods present the calculated function and relate it to the membership of a system in different classes.³⁵ These systems can have different values of vij = biological activity in an nth system under multiple cj = test conditions. This can be extended to multiple parameters within vij that have been measured in various cj assays. These parameters can be optimized.²¹ As mentioned earlier, the model takes f(vij)ref = p(vij/cj)ref as input data, which represents the probability that other similar systems take favorable values f(vij)obs = 1 in clinical trials that usually have the same conditions cj.³⁵
Currently, there is an outbreak of new viral diseases, as well as new variants emerging from previously reported diseases. Flaviviruses, in particular, have been a major problem since the last century and continue to be one of the leading causes of death in South America.³⁹ Although compounds that can prevent the replication of these viruses have been synthesized, the variability of this family makes them a challenge for researchers and public health. Nowadays, DENV has four different serotypes. In 2016, ZIKV was classified as a severe disease due to its effects on newborns.¹⁰,⁴⁰ Most viruses in this family have similar structures; however, there is no specific drug to treat any of them.⁹,⁴¹ Therefore, it is necessary to accelerate the methods of drug discovery and development. The traditional approach involves synthesizing compounds and testing them, which is a trial-and-error process.⁴²,⁴³ The use of chemoinformatic techniques, such as the PTML method, can accelerate the discovery of potential drug candidates that can inhibit the development of this type of virus.
Figure 1. General workflow.
The general flowchart showing the interconnections between the different parts of this work, (1) chemoinformatic study, (2) statistical analysis, and (3) biological assays probability, is depicted in Figure 1.
2. MATERIALS AND METHODS
2.1. Data Preparation and Processing. 2.1.1. Data Set Generation. The generation of the data set or sample to be analyzed consists of the search and compilation of compounds tested in certified databases. First, it is necessary to determine the target or underlying disease to create the model. The members of the Flaviviridae family transmitted by the tropical vector Aedes aegypti were considered for the development of the data set. ZIKV was selected as the base due to previous studies conducted with this virus, which provided a curated database. However, because it is a disease of developing countries, the existing data are not very significant in terms of quantity. The amount of data required to train a model is much larger than what is available; therefore, this database was expanded to include other members of the family with genetic similarity (>90%) that are transmitted by mosquitoes.
Using the ChEMBL database,⁴⁴ a search was made for compounds with previously reported activity against the selected targets (ZIKV, DENV, DENV2, DENV3, DENV4, HepC, WNV, YFV, etc.). The large data set obtained should report the biological activity of the assays as data. Subsequently, this data set was sorted using Office360. The resulting data set included 47,405 compounds with biological activity. The reported activity is detailed in different measurement forms such as IC50 (nM), Ki (nM), inhibition (%), potency (nM), activity (%), etc. (see Supporting Information file DATASETS.xlsx for details of assay conditions).
The collected data must be prepared before use in the PTML-based model. Molecular descriptors are used as part of its equation, with the most used ones in this type of classification model being D1: MW, D2: ALogP, and D3: TPSA. These values are easy to calculate and are present in the literature, providing reliable information. As for the assay conditions, they are obtained during the generation of the database because they represent the conditions under which the assays were performed. The assay conditions include C1: target name, C2: target organism, C3: assay organism, C4: assay tissue name, C5: assay cell type, and C6: subcellular assay. In principle, these conditions can be modified, or additional conditions can be added to improve the model or to consider specific conditions desired in an assay. In this study, three molecular descriptors and six assay conditions were used in the model development. Subsequently, this selection was expanded to include thirty-two descriptors under the six reported assay conditions. The molecular descriptors of the compounds were calculated with the DRAGON software,⁴⁵ which is widely used and user-friendly according to the literature. Thus, the descriptors completed the information required for the model to analyze.
2.1.2. Post Processing. The processing of data is a crucial step in PTML models. It utilizes the molecular descriptors Dk, as well as their deviation from the expected value of the reference system. The reference value of a system is measured as the average descriptor value of each assay found in the database under the given conditions, ⟨Dk(cj)⟩. The model itself takes into account the descriptor values and the assay conditions in which they were obtained (where and how?). The problem that arises from using all this information lies not in the numerical values but in the nominal variables that may be encountered. Therefore, it is necessary to include a new variable that can accept these nominal variables and interpret them numerically. The moving average (MA) is used as a statistical homogenization tool to analyze ordered sets, thereby eliminating the randomness present in the data. Within this tool, the chosen conditions directly influence the MAs. An equation is proposed that considers a variety of conditions simultaneously, resulting in a multiple MA. Under this premise, eq 1 was utilized.
$$\Delta D_k(\mathrm{system}_i, c_j) = \left( D_k(\mathrm{system}_i)_{new} - \langle D_k(c_j) \rangle \right) \qquad (1)$$
Here, ⟨Dk(cj)⟩ is the average of the values present in the variables D1, D2, ..., Dn, where D represents the molecular descriptors. The assay conditions are represented by cj, where j = 1, ..., n. Finally, Dk(systemi)new represents the Dk values of the compounds found in the database. With these data, the MA equation seeks to measure how much the descriptors of an assay deviate from the average under specific conditions cj.
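
The deviation in eq 1 can be illustrated with a minimal sketch. It assumes each assay record is a plain dictionary holding one descriptor value and one nominal assay condition; the function name, the keys `MW` and `target_organism`, and the numbers are hypothetical and are not taken from the study's data set.

```python
from collections import defaultdict


def delta_descriptors(records, descriptor, condition):
    """Return D_k(system_i) - <D_k(c_j)> for every record (eq 1).

    <D_k(c_j)> is the mean of the descriptor over all records that
    share the same value of the nominal assay condition c_j.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[condition]] += rec[descriptor]
        counts[rec[condition]] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    return [rec[descriptor] - means[rec[condition]] for rec in records]


# Hypothetical toy records: molecular weight under one assay condition.
records = [
    {"MW": 300.0, "target_organism": "ZIKV"},
    {"MW": 500.0, "target_organism": "ZIKV"},
    {"MW": 420.0, "target_organism": "DENV"},
]
deltas = delta_descriptors(records, "MW", "target_organism")
print(deltas)  # [-100.0, 100.0, 0.0]
```

Grouping by the nominal condition is what turns a categorical label into a numerical input: the label itself never enters the model, only the deviation from its group mean.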
The expected outcome of this model is for it to be capable of predicting the experimental value vijk of the compounds in the database. Similarly, the model will be able to determine the activity value vijk of unknown compounds. In this way, the variable vijk is defined as a value measured from the reported biological activity, which takes into account the assay conditions directly associated with the target diseases.
Due to the variability of the units present in the databases, it is not possible to use the values directly. This is why a transformation of these values must be performed in order to classify them. A new parameter d(c0) was established, which decides whether a value is desirable or undesirable. A value of d(c0) = 1 indicates that an increase in the analyzed parameter is desirable, while d(c0) = −1 indicates that a decrease in the parameter is desirable. The observed function must establish a limit or cutoff point to define whether a unit is desirable or not. Concentrations are set as −1, percentages and activities as 1, and effectiveness and speed as 1. In cases where the cutoff points are not defined, vij > 1000 is established. If these conditions are not met, the average activity ⟨vij⟩ is used.
In order to calculate the average value ⟨vij⟩ of compounds that have the same activity measure, values must be established based on whether they are favorable or not. This determination depends on whether their values are above or below the mean value of the data, classifying them into a binary system (active = 1 and inactive = 0). The value of the function f(vijk)obs is defined in this study as an experimental value because its result will be the output variable, based on whether a compound in the database is active or not, providing an idea of its activity. With the previous calculations, the conditions are established as follows.
• If d(c0) = 1 and (vijk)obs > cutoff and/or above the established mean value, f(vijk)obs = 1. Otherwise, f(vijk)obs = 0.
Likewise:
• If d(c0) = −1 and (vijk)obs < cutoff and/or below the established mean value, f(vijk)obs = 1. Otherwise, f(vijk)obs = 0.
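
The two rules above can be sketched as a small function. This is an illustration only: the function names, the example values, and the 1000 nM cutoff used for IC50 are assumptions made for the sketch, and the fallback to the mean reflects one reading of the text (use the mean when no cutoff is defined).

```python
from statistics import mean


def resolve_cutoff(values, cutoff=None):
    """Use the given cutoff, or fall back to the mean of the values."""
    return cutoff if cutoff is not None else mean(values)


def observed_class(value, desirability, cutoff):
    """Return f(v)obs: 1 (active) or 0 (inactive).

    desirability is d(c0): 1 when larger values are desirable,
    -1 when smaller values are desirable.
    """
    if desirability == 1:      # e.g. inhibition (%)
        return 1 if value > cutoff else 0
    if desirability == -1:     # e.g. IC50 (nM)
        return 1 if value < cutoff else 0
    raise ValueError("desirability must be 1 or -1")


# Hypothetical IC50 values (nM) with an assumed explicit cutoff.
ic50_values = [120.0, 5400.0, 800.0]
ic50_cutoff = resolve_cutoff(ic50_values, cutoff=1000.0)
ic50_classes = [observed_class(v, -1, ic50_cutoff) for v in ic50_values]

# Hypothetical inhibition values (%) with no cutoff: the mean is used.
inhibition_values = [20.0, 60.0, 85.0]
inhibition_cutoff = resolve_cutoff(inhibition_values)
inhibition_classes = [
    observed_class(v, 1, inhibition_cutoff) for v in inhibition_values
]

print(ic50_classes, inhibition_classes)  # [1, 0, 1] [0, 1, 1]
```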
The function f(vijk)obs applies the cutoffs as control points, enabling it to predict the activity of a compound based on the condition c0. The accuracy of f(vijk)obs is defined by the specificity of the classification cutoff, making it crucial to define this limit for the model’s integrity. Finally, it is necessary to define a reference variable with known values that have been previously reported to be active in experimental assays. This function uses probabilities and represents the likelihood of compounds being reported as active under the established condition c0 and for each sublevel j; see eqs 2 and 3.
$$f(v_{ij})_{ref} = p(f(v_{ij}) = 1, c_0) \qquad (2)$$

$$f(v_{ij})_{ref} = n(f(v_{ij})_{obs} = 1)/n_j \qquad (3)$$
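
Eq 3 amounts to counting, for each sublevel j of the condition c0, the fraction of assays labeled active. A minimal sketch follows; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def reference_probability(observed, conditions):
    """Return f(v)ref = n(f(v)obs = 1) / n_j for each sublevel j (eq 3).

    observed:   list of 0/1 labels, one per assay
    conditions: list of c0 sublevels (e.g. the activity measure), one per assay
    """
    n_j = Counter(conditions)
    n_active = Counter(c for c, y in zip(conditions, observed) if y == 1)
    return {level: n_active[level] / n_j[level] for level in n_j}


# Hypothetical labels for five assays measured as IC50 or Ki.
observed = [1, 0, 1, 1, 0]
conditions = ["IC50", "IC50", "IC50", "Ki", "Ki"]
f_ref = reference_probability(observed, conditions)
print(f_ref)
```

Every assay then receives the f(v)ref value of its own sublevel as an input variable, so the model starts from the prior expectation of activity for that type of measurement.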
2.2. Computational Methods. 2.2.1. PTML Model Development. The PTML model uses the variables calculated previously in the methodology as input variables: the function f(vij)ref, the Δ(Dk), and Dx. The output variable f(vij)exp, obtained from this model, enables the binary classification (1 and 0). Linear discriminant analysis (LDA) is employed to find a linear combination of these variables, allowing the model to effectively separate the two types of values within a single statistical process.
For data processing, the Statistica 10.0⁴⁶ software was used. Out of the total data, 75% was allocated to training, and the remaining 25% was used for method validation. The resulting statistical parameters (specificity and sensitivity) of the equation obtained should fall between 75 and 95%. A prediction capability below 70% would be insufficient, rendering the model unacceptable. Following the LDA statistical test, the model yields an output variable f(vij)calc, whose values correspond to the predicted activity based on probability. The coefficients of the PTML equation are also obtained from this analysis. Finally, Mahalanobis distances are employed to transform the dimensionless results of the equation into preprobability functions. This enables binary classification and facilitates future predictions for the development or discovery of new compounds.
$$f(v_{ij})_{calc} = a_0 + a_1 \cdot f(v_{ij})_{ref} + \sum_{k=1}^{k_{max}} a_k \cdot \Delta D_k(c_j) \qquad (4)$$
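
Eq 4 is a linear scoring function, which the following sketch evaluates for one assay. The coefficients and inputs are made-up numbers chosen for easy arithmetic; they are not the fitted values of the study, and the function name is hypothetical.

```python
def ptml_score(f_ref, deltas, a0, a1, coeffs):
    """Evaluate eq 4: a0 + a1 * f(v)ref + sum_k a_k * dD_k(c_j)."""
    if len(deltas) != len(coeffs):
        raise ValueError("one coefficient is required per perturbation term")
    return a0 + a1 * f_ref + sum(a * d for a, d in zip(coeffs, deltas))


# Illustrative (not fitted) coefficients and inputs.
score = ptml_score(
    f_ref=0.5,
    deltas=[2.0, 4.0],      # dD_1(c_j), dD_2(c_j)
    a0=-1.0,
    a1=2.0,
    coeffs=[0.5, -0.25],    # a_k for each perturbation term
)
print(score)  # 0.0
```

In the study this score is not thresholded directly; it is converted into a class probability through the LDA step described above.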
2.2.2. ROC Validation Method. A receiver operating characteristic (ROC)⁴⁷ curve was used as a graphical representation to evaluate the screening method. The graph shows the success and error of the model: the true positive values are placed on the Y-axis and the apparent (false) positive values on the X-axis. This arrangement allows the analysis of the accuracy of the model. The ROC curve represents the proportion of values that were correctly predicted versus those that were predicted incorrectly. By calculating the area under the curve, it is possible to summarize this proportion in a single value, which should be as high as possible.
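
As an illustration of how the area under a ROC curve is obtained, the sketch below builds the curve from predicted scores and integrates it with the trapezoidal rule. It uses only the standard library; the labels and scores are toy values, and the function names are hypothetical rather than part of the workflow used in the study.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    Each distinct score is used once as a decision threshold,
    from the strictest to the most permissive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the curve with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: three actives and three inactives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
print(round(auc, 4))  # 0.8889
```

The result equals the fraction of active/inactive pairs that the scores rank correctly (8 of 9 here), which is why 0.5 corresponds to chance and 1 to perfect separation.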
2.2.3. Classification ML Models through Python. For the development of ML classification models, the Python programming language was used together with the NumPy, Scikit-learn, and PyCaret libraries. The data set for the training and validation of the model was IFPTML-Flaviviridae Dk30. To compare the performance of the previously created LDA model with that of the Python models, the training and validation subsets remained unchanged. Different models were compared by using the PyCaret classification function “compare_models”. This function trains the algorithms that are available in the library and orders the best models based on their accuracy metric by default. The performance metrics listed by this function are the accuracy, AUCROC, precision, recall, F1-score, Cohen kappa score, and Matthews correlation coefficient. The model that showed the best overall performance was optimized using the function “tune_model” to find the optimal hyperparameters. The final model was evaluated with the “evaluate_model” function, which shows a variety of results including the hyperparameters, AUCROC curve, confusion matrix, and feature importance, among others.
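
The sequence of PyCaret calls named above can be outlined as follows. This is a sketch under stated assumptions, not the authors' script: the wrapper function, its arguments, and the seed are hypothetical; the `test_data` argument of `setup` is used here to keep the validation subset fixed, and argument names should be checked against the installed PyCaret version. PyCaret is imported inside the function so that the module loads even where the library is not installed.

```python
def run_pycaret_comparison(train_df, valid_df, target_column, seed=42):
    """Compare, tune, and evaluate classifiers with PyCaret (sketch).

    train_df and valid_df are pandas DataFrames that share the same
    columns, including the binary target column.
    """
    # Imported lazily: PyCaret is an optional, heavy dependency.
    from pycaret.classification import (
        compare_models,
        evaluate_model,
        setup,
        tune_model,
    )

    # Passing test_data keeps the validation subset fixed instead of
    # letting PyCaret draw its own random hold-out split.
    setup(
        data=train_df,
        target=target_column,
        test_data=valid_df,
        session_id=seed,
    )
    best_model = compare_models()        # ranked by accuracy by default
    tuned_model = tune_model(best_model)  # hyperparameter search
    evaluate_model(tuned_model)           # interactive plots in a notebook
    return tuned_model
```

A call would look like `run_pycaret_comparison(train_df, valid_df, "f_obs")`, where `"f_obs"` stands for whatever the binary activity column is called in the data frame.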
Accuracy is the rate of correctly classified cases. Precision measures the fraction of true positives among all the predicted positives. Recall (sensitivity) is the rate of true positives. The F1 score is the harmonic mean of recall and precision.⁴⁸ AUCROC is a metric that assesses the ability of the model to discriminate between classes. A perfect model would have an AUCROC value of 1, indicating a perfect classification, while a value of 0.5 suggests random performance, equivalent to chance.⁴⁹ MCC correlates the real and predicted scores in binary classifications considering all the true and false instances.⁵⁰ Cohen’s Kappa is typically used in binary classification problems to assess the agreement between two classifiers using the traditional 2 × 2 confusion matrix⁵¹ (see Table 1).
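
The formulas of Table 1 can be checked numerically with a short sketch. The counts used below are the training-series counts reported in Table 2 for the IFPTML-Flaviviridae-LDA model; specificity is added because it is reported throughout the paper, although it is not listed in Table 1. The function name is hypothetical, and no guard against empty classes (zero denominators) is included.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 1 metrics from a 2 x 2 confusion matrix."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    kappa_denominator = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # sensitivity, Sn
        "specificity": tn / (tn + fp),     # Sp
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "kappa": 2 * (tp * tn - fp * fn) / kappa_denominator,
    }


# Training-series counts from Table 2.
metrics = classification_metrics(tp=12227, tn=14726, fp=4662, fn=3273)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sensitivity, specificity, and accuracy obtained this way agree with the Sn, Sp, and Ac percentages listed for the training series in Table 2.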
3. RESULTS AND DISCUSSION
3.1. IFPTML-Flaviviridae Model. A model capable of predicting the probability of a compound being biologically active against a disease would be a tool that helps reduce costs and time in drug discovery. This method must be reliable and reproducible. The goal of this investigation was to build a classification model based on the Flaviviridae family that has the best statistical parameters and includes variables of interest.
The first result to achieve was proper data cleansing, which is the first checkpoint to get a functional model. This step is essential for the development and correct functioning of the model. The filtering of the 47,382 assays had to be done in a way that allows their use in the construction of the model without causing erroneous results or generating undesired false positives.

Table 1. Formulas of Accuracy, Precision, Recall, F1, MCC and Cohen’s Kappa Performance Metricsᵃ

| performance metric | formula |
|---|---|
| accuracy | (TP + TN) / (TP + FN + FP + TN) |
| precision | TP / (TP + FP) |
| recall | TP / (TP + FN) |
| F1 | 2 • TP / (2 • TP + FP + FN) |
| MCC | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) |
| Cohen’s kappa | 2 • (TP • TN − FP • FN) / ((TP + FP) • (FP + TN) + (TP + FN) • (FN + TN)) |

ᵃ TP = true positive, TN = true negative, FP = false positive, FN = false negative.

The 46,518 assays resulting after cleansing should be ready for analysis. Of these, 10,910 are members of the Flavivirus genus, and the rest are members of the Flaviviridae family. It is expected that the clean data will not affect the calculations that will be performed, as the model learns, trains, and improves with each treatment it receives.
Due to the wide variety of units and measurements found in the databases, these must be standardized to a common measure or unit. As mentioned in the Methods Section, calculations were performed to homogenize the data, resulting in MA values for every c0. From these results, experimental values and reference values were calculated, as shown in the annexes.
The obtained experimental values f(vij)obs and the reference values f(vij)ref, together with the LDA using Statistica 10.0 software, allowed the construction of several models, considering statistical parameters such as specificity (Sp (%), for the class f(vij)obs = 0), sensitivity (Sn (%), for the class f(vij)obs = 1), and accuracy (Ac (%) = percentage of correct predictions within the analyzed data). To determine whether a construct is acceptable, it is established that only those with specificity, sensitivity, and accuracy values above 75% in both the training and validation series are considered.
The PTML models start with the input variables f(vij)obs, f(vij)ref, and ΔDk(cj), to which the effects of the perturbation operators are added according to the established parameters and selected variables. The resulting equation takes into account the corresponding operators for all possible cases of Dk = MW, ALogP, and TPSA, and their respective cj. The expectation is to obtain an equation that covers the greatest number of possible scenarios. Equation 5 presents a PTML-LDA model considering the simple and simplified variables. A Chi-square test was also performed as a classifier between the classes (f(vij)obs = 0 vs f(vij)ref = 1).
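$$
\begin{aligned}
f(v_{ij})_{calc} = {} & \pm 4.148277915 \pm 6.54191072 \cdot f(v_{ij})_{ref} \pm 0.00139111 \cdot \Delta D_1(c_1) \\
& \pm 0.055988681 \cdot \Delta D_2(c_1) \pm 0.000138105 \cdot \Delta D_3(c_1) \pm 0.000300427 \cdot \Delta D_3(c_2) \\
& \pm 0.000377903 \cdot \Delta D_3(c_3) \pm 0.000423529 \cdot \Delta D_3(c_4) \pm 0.001371082 \cdot \Delta D_3(c_5) \\
& \pm 0.002736297 \cdot \Delta D_3(c_6) \\
& n = 46518 \qquad \chi^2 = 17510.55 \qquad p < 0.05
\end{aligned}
\qquad (5)
$$

(The signs of the individual terms are not legible in the source; coefficient magnitudes are shown.)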
The resulting equation was selected after comparing it to several other models constructed using the same input variables but with different effects during the LDA. For a model to be considered optimal, it must contemplate the greatest possible number of conditions in its equation. In the statistical analysis, the model can be programmed to include all possible effects or to choose the effects with the highest influence. Among the various constructs obtained, there were several that, despite considering all Dk, did not reflect the effect of all conditions cj. These were discarded because they could present undesired results, such as false positives. This occurs when the model is tested with new data and none of the desired conditions are found within the cj of the equation. In consequence, the model will not consider them, resulting in the loss of information or incomplete results.
Some of the constructs obtained consider the influence of Dk while leaving out certain conditions cj. Likewise, constructs that reflect the influence of the conditions cj while leaving out the Dk were also obtained. The changes in these constructs are based on correctly establishing the input variables and ensuring that the data are properly homogenized and filtered. Incorrect settings of the cutoffs can alter the final result, leading to more frequent false positives. One of these changes was very significant in an obtained construct, corroborating the influence of these limits and how their variations can favor or hinder the selection of the model. Many times, it is not possible to obtain an equation that considers all the input variables because there may be cj conditions or Dk that cannot be related, so they will be excluded. In these cases, it is possible to obtain a viable equation that provides favorable results through proper data handling. The more specific the information obtained, the more accurate the cutoffs and the more related the data are, and the more the model can be improved. The model selected as the final result takes into account all these critical points for selection, encompassing the greatest number of cj conditions and Dk in its equation (see eq 5). Table 2 summarizes the results obtained.
During the development of this method, several constructs were created in order to reach the final model. Twenty analyses were performed for each model using this method. The goal was to find the best model based on three statistical parameters to satisfy. Since the models must have statistical values higher than 75%,³⁶ a balance between specificity, sensitivity, and accuracy was sought. In consequence, the model will be able to correctly predict compounds that have high probabilities of being active and differentiate between active and inactive compounds. The training and validation values should also be similar to each other; see Table 3 (see Supporting Information file DATASETS.xlsx for details of the data set used and detailed results of the model for each case).

Table 2. Results of the IFPTML-Flaviviridae-LDA Model, Where the Values of Sp, Sn and Ac of the Best Model Obtained are Presented (p(1) = 0.85)

| series | observed set | stat. param. | value (%) | nj | f(vij)pred = 0 | f(vij)pred = 1 |
|---|---|---|---|---|---|---|
| training | f(vij)obs = 0 | Sp | 75.95 | 19,388 | 14,726 | 4662 |
| training | f(vij)obs = 1 | Sn | 78.88 | 15,500 | 3273 | 12,227 |
| training | total | Ac | 77.25 | 34,888 | | |
| validation | f(vij)obs = 0 | Sp | 75.5 | 6457 | 4875 | 1582 |
| validation | f(vij)obs = 1 | Sn | 79.29 | 5172 | 1071 | 5683 |
| validation | total | Ac | 77.19 | 11,629 | | |

The results in the tables present an equitable distribution according to the established parameters. The chi-square test allows us to confirm that the classification groups are separated, with a p-value less than or equal to 0.05 (eq 5).
Among the various constructs created to determine the best model, several included both desirable characteristics (conditions cj and Dk). However, their statistical parameters were poor, leading to their classification as incomplete due to the characteristics they exhibited. The model that presented the best statistical parameters, as well as meeting the criteria for cj and Dk, obtained a precision value of 77%. When comparing the results of the models, it was corroborated that an IFPTML-LDA model was found whose equation and statistical parameters both met the desired characteristics. In the final part of this results segment, the ROC validation method was employed to verify the model.
3.2. A Comparison with Genus Flavivirus and Flaviviridae IFPTML Method. During the development of the final IFPTML model, several test models were created. One of them used only the data from members of the Flavivirus genus that are transmitted by hematophagous arthropods. These data were processed according to the procedure described in the experimental development, with Dk = 3 and cj = 6. The data used for the IFPTML-Flavivirus model represented 24% of the total data used in the main model. The statistical parameters showed favorable values in general, but the classification matrix failed, as it produced a significant number of false positives (see Table 4).
The IFPTML-Flavivirus database consists only of members of the Flavivirus genus; as mentioned in the Introduction, the members are closely related to each other and present conserved regions in their genomes.¹⁰ One of the most studied Flaviviruses is DENV, and its drugs are used as models to treat other members of this genus.¹⁰,¹²,⁴¹ Accordingly, several studies have reported similarities in their compound conditions or characteristics. However, these studies vary in reported activity because of the 10% difference, which makes each of these members unique in its own way.⁹,¹⁶,⁴⁰ Taking this into consideration, the unique characteristics of these variables should be expanded to make them more specific. The false identification of an element can be due to its similarity to others that fit the model (true positives).⁵²,⁵³ Therefore, a new model was proposed to encompass these new characteristics with the aim of classifying the data more effectively and reducing the presence of false positives.
To improve the IFPTML-Flaviviridae model, the Dk was expanded from 3 to 30 for every cj = 6. The increase in Dk provides new unique characteristics of the compounds that can be used to compare the data in a better way and reduce false positives that may arise due to similarities. The number of combinations nn allows for a more comprehensive discrimination of the information and, therefore, a more specific classification by having more parameters to evaluate, which helps determine the influence of each descriptor in the system.
The new model uses the 46,518 assays from the IFPTML-Flaviviridae model as the data set. After the LDA analysis, favorable statistical results were obtained for Sp (%), Sn (%), and Ac (%) in the training and validation sets. When comparing the IFPTML-Flaviviridae model with the IFPTML-Flaviviridae Dk30 model, an increase in the statistical parameters of the Dk30 model was observed, with the accuracy Ac (%) increasing from 77 to 79%, Sp (%) from 75.95 to 78.58%, and Sn (%) from 78.88 to 80.06%.
When the Flavivirus genus model is included in the comparison, it can be observed that the Sp (%) value of the Flavivirus model is higher than that of its counterparts (the Flaviviridae models). Thus, the selection of negative values is done correctly, as shown in its classification matrix. When comparing Sn (%), it is evident that the Flaviviridae Dk30 model shows better values, indicating that it classifies true values more accurately. This is reflected in the classification matrix results, where the false positives reported by the IFPTML-Flaviviridae Dk30 model are in a smaller proportion than those of the IFPTML-Flavivirus model, indicating an improvement in data classification. The accuracy of the IFPTML-Flaviviridae Dk30 model improves by 2 points compared with the IFPTML-Flaviviridae model. Although the IFPTML-Flavivirus model has an accuracy Ac (%) of 82%, the IFPTML-Flaviviridae Dk30 model shows better overall results in terms of statistical parameters and classification matrix. Finally, the IFPTML-Flaviviridae Dk30 model was selected as the final model (see Table 5).
3.3. ROC Validation IFPTML-Flaviviridae-LDA Method. The validation method that was used is an AUCROC curve.⁴⁷ This method verifies the reliability of the models by plotting sensitivity Sn (%) against the false positive rate (1 − Sp (%)), obtaining an AUCROC curve based on these input data. Subsequently, the area under the curve is calculated. The Sn (%) and (1 − Sp (%)) values of the training and validation models were used to represent the curve with a layout that changes with the prior probability. In the graph, it can be observed how the values of Sn (%) and (1 − Sp (%)) degrade uniformly as the prior probability changes (see Figure 2).

Table 3. Results of PTML-LDA Models, the Comparison of Their Values of Sp, Sn and Ac

| param. | training model no. 16 | validation series no. 16 | training model no. 24 | validation series no. 24 | training model proposed | validation series proposed |
|---|---|---|---|---|---|---|
| Sp | 77.63 | 76.88 | 77.42 | 76.71 | 75.95 | 75.50 |
| Sn | 75.88 | 75.97 | 76.06 | 76.18 | 78.88 | 79.29 |
| Ac | 76.85 | 76.47 | 76.81 | 76.47 | 77.25 | 77.19 |

Table 4. Results of the IFPTML-Flaviviruses-LDA Model, Statistical Parameters Sp, Sn and Ac of Training and Validation

| series | observed set | stat. param. | value (%) | nj | f(vij)pred = 0 | f(vij)pred = 1 |
|---|---|---|---|---|---|---|
| training | f(vij)obs = 0 | Sp | 82.61 | 6551 | 5412 | 1139 |
| training | f(vij)obs = 1 | Sn | 79.66 | 1632 | 332 | 1300 |
| training | total | Ac | 82.02 | 8183 | | |
| validation | f(vij)obs = 0 | Sp | 80.40 | 2168 | 1743 | 425 |
| validation | f(vij)obs = 1 | Sn | 75.85 | 559 | 135 | 424 |
| validation | total | Ac | 79.46 | 2727 | | |

Within the AUCROC curve graph, the diagonal line represents random
probability, where the probability of the positive class (value 1) is 0.5.
The area under the curve (AUROC) of a random classification model is 0.5.
The AUROC value obtained by the IFPTML model is 0.862, indicating that the
model discriminates between the classes well above the random level of 0.5
and does not follow a random pattern. This type of validation test uses
probability based on Bayes’ theorem.⁵⁴,⁵⁵ Throughout the study, prior
probability values were used before applying the theory, allowing the study
of the change of Sn (%) and (1 − Sp (%)) over time to obtain an optimal
value.

The technique itself is based on the area under the curve, so plotting this
ROC curve gives the predictive power of the model. The AUROC value of 86.2%
indicates that this model can decide which elements are related or not with
a high degree of correct classification.
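
The construction of the curve described above can be sketched in a few lines of Python. This is only an illustrative sketch: the labels and scores are made-up values, not data from this study, the helper names `roc_points` and `auc_trapezoid` are hypothetical, and a decision threshold on the predicted score stands in for the changing prior probability (an assumption of this sketch). Each threshold gives one (1 − Sp, Sn) point, and the area is obtained with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (1 - Sp, Sn) pairs obtained by sweeping a decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score, from highest to lowest, as the threshold.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as x-sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative (made-up) observed classes and predicted scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_points(labels, scores)
toy_auc = auc_trapezoid(curve)
random_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])  # the diagonal line

print(toy_auc)
print(random_auc)
```

The second call integrates the diagonal itself and returns 0.5, the reference value against which an AUROC such as 0.862 is compared.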
3.4. Comparison of Classification ML Models through Python. Currently,
several ML algorithms are used for classification and prediction tasks,
each with its own unique characteristics. LDA, random forest (RF), and
gradient boosting (GB) are among the most popular methods.⁵⁵⁻⁵⁷

The IFPTML-LDA models use the LDA supervised learning algorithm for
classification. This method assumes that the input data follow a Gaussian
distribution and that the classes have an equal covariance matrix.⁵⁸ The
algorithm finds linear combinations of features that maximize the
separation between classes. This methodology is useful when the classes are
well-separated.⁵⁹ On the other hand, models based on Random Forest are
ensemble learning methods that combine multiple decision trees to make
predictions.⁶⁰ They create a collection of decision trees, where each tree
is trained on a random subset of data and features.⁶¹ This learning method
can handle both classification and regression tasks. It is known for its
ability to handle complex relationships and interactions in data.⁶²
Finally, Gradient Boosting combines multiple weak learners (usually
decision trees) to create a strong learner. It builds a model in an
iterative manner, where each new model focuses on correcting the mistakes
made by previous models.⁶³ Also, it is known for its high predictive
accuracy and ability to handle complex data sets, and it works well for
both classification and regression tasks.⁶⁴
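
The iterative principle of gradient boosting described above can be illustrated with a toy sketch. It is not the implementation used in this study: it assumes a squared-error loss, one-split decision stumps as weak learners, a single made-up input variable, and hypothetical helper names (`fit_stump`, `boost`, `mse`). Each round fits a new stump to the residuals left by the current ensemble.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor (a weak learner) for 1-D inputs."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Squared-loss boosting: every stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a pattern that a single stump cannot capture.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

for rounds in (1, 5, 20):
    print(rounds, round(mse(boost(xs, ys, rounds), xs, ys), 4))
```

The training error shrinks as rounds are added, which is the sense in which later models correct the mistakes of earlier ones.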
The different learning methods used to compare the ML techniques have pros
and cons, as mentioned earlier in this section. The LDA methodology is more
suitable for well-separated classes and dimensionality reduction. RF is
effective in handling complex relationships and providing feature
importance measures. GB is known for its high predictive accuracy and
ability to handle complex data sets. The complete list of ML methods used
is presented below (see Table 6).

The model was tuned with 5- and 10-fold cross-validation. Although the
values are quite similar, the base model shows the best performance; thus,
it was selected as the best model (see Table 7).
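
As a reminder of what 5- and 10-fold cross-validation means, the following sketch splits sample indices into k folds, each used once as the test part. The function name `k_fold_indices` and the sample count of 23 are illustrative assumptions; in practice the data are usually shuffled or stratified by class before splitting, which is omitted here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


for k in (5, 10):
    folds = list(k_fold_indices(23, k))
    # Size of the held-out part in every fold.
    print(k, [len(test) for _, test in folds])
```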

Table 5. Results of the IFPTML-Flaviviridae Dk30-LDA Model, Statistical Parameters Sp, Sn and Ac of Training and Validation

sets             stat. param.   predicted value (%)   nj       f(vij)pred = 0   f(vij)pred = 1
Training Series
f(vij)obs = 0    Sp             78.58                 19,388   15,236           4152
f(vij)obs = 1    Sn             80.06                 15,500   3091             12,409
total            Ac             79.23                 34,888
Validation Series
f(vij)obs = 0    Sp             78.35                 6457     5059             1398
f(vij)obs = 1    Sn             80.41                 5172     1013             4159
total            Ac             79.27                 11,629
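
The statistical parameters in Table 5 follow directly from the four cells of each classification matrix. The short sketch below recomputes them from the predicted-class counts listed in the table; the function name `sp_sn_ac` is a hypothetical helper, and the results should agree with the tabulated percentages to within rounding.

```python
def sp_sn_ac(tn, fp, fn, tp):
    """Specificity, sensitivity and accuracy (in %) from confusion-matrix counts."""
    sp = 100.0 * tn / (tn + fp)
    sn = 100.0 * tp / (tp + fn)
    ac = 100.0 * (tn + tp) / (tn + fp + fn + tp)
    return sp, sn, ac


# Cell counts as listed in Table 5 (IFPTML-Flaviviridae Dk30-LDA model).
training = dict(tn=15236, fp=4152, fn=3091, tp=12409)
validation = dict(tn=5059, fp=1398, fn=1013, tp=4159)

for name, counts in (("training", training), ("validation", validation)):
    sp, sn, ac = sp_sn_ac(**counts)
    print(f"{name}: Sp = {sp:.2f}%, Sn = {sn:.2f}%, Ac = {ac:.2f}%")
```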

Figure 2. AUCROC curve graph. The diagonal line represents random probability (0.5). The graph shows how the values change as the prior probability changes; orange represents the validation set and blue the training set.

The IFPTML-Flaviviridae Dk30 model was evaluated using the validation set,
and the LightGBM model showed the best overall performance metrics (see
Table 8). Therefore, the optimization step was performed using this model.

The classification report is used as an evaluation method to measure the
performance of the classification model. This report provides information
on the performance of the model for each class in terms of accuracy,
recall, F1, etc.⁶⁵ The LGBM model presents a precision value of 0.82. This
value measures the proportion of correctly predicted positive instances out
of all instances predicted as positive. This indicates the reliability of
the model’s positive predictions; a higher value means fewer false
positives. The recall value of 0.79 represents the sensitivity, or true
positive rate. This shows the proportion of correctly predicted positive
instances out of all actual positive instances, indicating how well the
model identifies positive instances. A higher value means fewer false
negatives.⁶⁶ The F1 value of 0.80 is the harmonic mean of precision and
recall, providing a single metric that balances both. Figure 3 shows the
classification report of the IFPTML-Flaviviridae Dk30-LGBM model.
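
The definitions of precision, recall, and F1 given above can be written out as a short sketch. Because the full LGBM confusion matrix is not listed in the text, the example uses the validation counts of the Dk30-LDA model from Table 5 instead, so the printed values are expected to differ from the 0.82, 0.79, and 0.80 reported for LGBM. The function name `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) for the positive class."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Validation counts of the Dk30-LDA model (Table 5), used only as an example.
tp, fp, fn = 4159, 1398, 1013
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```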

The reliability curve was used to assess the calibration of the
classification model. This type of calibration plot helps determine whether
the predicted probabilities are well-calibrated and provide reliable
estimates of true probabilities. Figure 4 presents the calibration plot
using the LGBM model. The curve closely follows the diagonal line,
suggesting good calibration of the model. No overconfidence or
underconfidence was identified in the curve.

The calibration plots provide information on the agreement between the
probabilities predicted by the model and the actual probabilities. For
example, if the model predicts a probability of 0.83 for a given class, the
actual probability of that class should be close to that value.²⁴,²⁵
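
A reliability curve of this kind can be sketched by grouping predictions into probability bins and comparing, within each bin, the mean predicted probability with the observed fraction of positives. The data below are made up and deliberately perfectly calibrated, and `reliability_curve` with its `n_bins` parameter is a hypothetical helper rather than the routine used to draw Figure 4.

```python
def reliability_curve(labels, probs, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve


# Made-up, perfectly calibrated example: 1/10, 5/10 and 9/10 positives.
labels = [1] * 1 + [0] * 9 + [1] * 5 + [0] * 5 + [1] * 9 + [0] * 1
probs = [0.1] * 10 + [0.5] * 10 + [0.9] * 10

calibration = reliability_curve(labels, probs)
for mean_pred, frac_pos in calibration:
    print(f"predicted {mean_pred:.2f} -> observed {frac_pos:.2f}")
```

Points that fall on the diagonal (predicted equal to observed) correspond to good calibration; points below it indicate overconfidence and points above it underconfidence.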

As mentioned in the previous method analysis, the variation between the
members of the Flaviviridae family and the limited amount of available data
for the assays can result in the presence of false positives. During the
expansion of Dk in the IFPTML-LDA models, a reduction in the number of
false positives was achieved. Using the same data set in the LGBM model,
another reduction in false positives was achieved, resulting in an
improvement in the reliability of the model, with the number of false
positives decreasing from 1398 to 904 (cell 0:1 of the confusion matrix;
see Figure 5). A comparison between the Dk30-LDA and Dk30-LGBM models
showed a reduction of 4.25% in false positives.
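
The arithmetic behind this reduction can be checked with the counts given in the text and in Table 5. The sketch assumes that both models were evaluated on the same 11,629 validation cases; under that assumption the 4.25% figure corresponds to the share of all validation cases, while the drop relative to the LDA false positives alone is larger.

```python
n_validation = 11629          # total validation cases (Table 5)
fp_lda, fp_lgbm = 1398, 904   # false positives: Dk30-LDA and Dk30-LGBM

removed = fp_lda - fp_lgbm
share_of_validation = 100.0 * removed / n_validation  # relative to all cases
share_of_lda_fp = 100.0 * removed / fp_lda            # relative to LDA false positives

print(removed)
print(round(share_of_validation, 2))
print(round(share_of_lda_fp, 2))
```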

Table 6. IFPTML-Flaviviridae Dk30 Model Comparison of the Most Used ML Methods in Classification and Regression Taskᵃ

Model comparison by 10-fold CV
id         model                             accuracy  AUC    recall  prec.  F1     kappa  MCC    TT (s)
lightgbm   light gradient boosting machine   0.784     0.891  0.749   0.761  0.749  0.561  0.568  8.812
xgboost    extreme gradient boosting         0.783     0.886  0.749   0.759  0.749  0.559  0.565  4.896
ridge      ridge classifier                  0.783     NA     0.693   0.785  0.729  0.552  0.561  0.736
lda        linear discriminant analysis      0.782     0.872  0.697   0.781  0.729  0.551  0.559  2.654
gbc        gradient boosting classifier      0.781     0.882  0.713   0.772  0.733  0.550  0.559  57.433
ada        ada boost classifier              0.774     0.868  0.691   0.771  0.720  0.534  0.544  11.047
lr         logistic regression               0.771     0.855  0.701   0.761  0.724  0.531  0.538  14.002
rf         random forest classifier          0.770     0.852  0.724   0.749  0.731  0.531  0.537  16.707
dt         decision tree classifier          0.766     0.812  0.716   0.748  0.726  0.523  0.529  3.626
et         extra trees classifier            0.765     0.833  0.706   0.749  0.722  0.520  0.526  13.657
knn        K neighbors classifier            0.731     0.788  0.706   0.691  0.696  0.455  0.458  4.982
svm        SVM-linear kernel                 0.628     NA     0.561   0.634  0.553  0.244  0.266  5.079
qda        quadratic discriminant analysis   0.612     0.673  0.779   0.547  0.638  0.247  0.272  1.656
nb         naive bayes                       0.609     0.667  0.609   0.554  0.578  0.216  0.218  0.254
dummy      dummy classifier                  0.556     0.500  0.000   0.000  0.000  0.000  0.000  1.209

ᵃ The Pycaret library was used to build the ML models, and some of the algorithms, such as ridge and SVM, do not support “predict_proba”. In those cases, the AUC value is shown as “NA”.

Table 7. Performance Metrics of IFPTML-Flaviviridae Dk30 Model with 5- and 10-Folds

model   folds  accuracy  AUC   recall  prec.  F1    kappa  MCC
base    10     0.78      0.89  0.75    0.76   0.75  0.56   0.57
tuned   10     0.78      0.89  0.71    0.78   0.73  0.55   0.56
tuned   5      0.78      0.88  0.73    0.77   0.74  0.55   0.56

Table 8. Light Gradient Boosting Machine Presents the Best Results Using the Flaviviridae Dk30 Data Set

model                             accuracy  AUC   recall  prec.  F1    kappa  MCC
light gradient boosting machine   0.83      0.92  0.79    0.82   0.80  0.65   0.65

The feature importance plot is a graphical representation of the importance
of each feature in a ML model. This helps identify the features that have
the most significant impact on the model’s predictions. Figure 6 presents a
feature importance plot where the most influential features are ranked.
Each feature is assigned a score that represents its importance. The higher
the score, the more influential the feature is in making predictions.³⁶ In
tree-based models such as gradient boosting models, the importance is
calculated based on the number of times a feature is used to split the data
across all trees. However, its importance is not a definitive measure of
causality. It only indicates the relative importance of the features within
the context of the model.
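
The split-count notion of importance can be made concrete with a small sketch. The two trees below are hypothetical nested dictionaries with placeholder feature names (`x1`, `x2`, `x3`); they are not taken from the fitted model, and `count_splits` and `split_importance` are illustrative helpers that simply tally how often each feature appears as a split.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used as a split in one tree."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_importance(trees):
    """Rank features by the number of splits that use them across all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()


# Two hypothetical trees; feature names are placeholders.
leaf = {"value": 0.0}
trees = [
    {"feature": "x1",
     "left": {"feature": "x2", "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": "x1",
     "left": leaf,
     "right": {"feature": "x3",
               "left": leaf,
               "right": {"feature": "x1", "left": leaf, "right": leaf}}},
]

ranking = split_importance(trees)
print(ranking)
```

A feature that is split on often ranks high even if each split improves the fit only slightly, which is one reason such scores describe the model rather than any causal effect.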

A graphical representation of the AUCROC curve illustrates the performance
of the binary classifier system. Figure 7 shows the AUCROC curve of the
IFPTML-LGBM model, which provides a visual representation of the trade-off
between the true and false positive rates. A good classifier will have a
curve close to the top-left corner of the plot.

The AUCROC value of 0.92 can be interpreted as the probability that the
classifier will rank a randomly chosen positive instance higher than a
randomly chosen negative instance. A higher AUCROC indicates better
performance.⁴⁷ Comparing the AUCROC of the LDA model (0.862) with that of
the LGBM model (0.92) shows an increase of about 6 percentage points,
indicating an upgrade in the classification model compared with its older
version.
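
The ranking interpretation of the AUCROC can be verified directly on a small example. The sketch below uses made-up labels and scores, and the helper name `auc_by_ranking` is hypothetical; it counts the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
from itertools import product


def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Illustrative (made-up) observed classes and predicted scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

pairwise_auc = auc_by_ranking(labels, scores)
print(pairwise_auc)  # 12 of the 16 positive-negative pairs are ordered correctly
```

On this reading, a value of 0.92 means that about 92 of every 100 randomly drawn positive-negative pairs would be ordered correctly by the classifier.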

Finally, this representation indicates that the data are related
and that the assay conditions can be tested in several assays of
Figure 3. LGBM Classification Report of the IFPTML-Flaviviridae Dk30 model, validation data set.
Figure 4. Calibration plots of the IFPTML-Flaviviridae Dk30-LGBM.
Figure 5. LGBM confusion matrix using the validation data set of Flaviviridae Dk30.