
EP-Pred: A Machine Learning Tool for Bioprospecting Promiscuous Ester Hydrolases

Author: Xiang, Ruite; Fernandez-Lopez, Laura; Robles-Martín, Ana; Ferrer, Manuel; Guallar, Victor
Publisher: Zenodo
DOI: 10.5281/zenodo.17671860
Source: https://zenodo.org/records/17671860/files/biomolecules-12-01529.pdf
Citation: Xiang, R.; Fernandez-Lopez, L.; Robles-Martín, A.; Ferrer, M.; Guallar, V. EP-Pred: A Machine Learning Tool for Bioprospecting Promiscuous Ester Hydrolases. Biomolecules 2022, 12, 1529. https://doi.org/10.3390/biom12101529
Academic Editor: Jian Zhang
Received: 2 September 2022
Accepted: 18 October 2022
Published: 21 October 2022
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
biomolecules
Article
Ruite Xiang 1,†, Laura Fernandez-Lopez 2,†, Ana Robles-Martín 1, Manuel Ferrer 2 and Victor Guallar 1,3,*
1 Department of Life Sciences, Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain
2 Department of Applied Biocatalysis, ICP, CSIC, 28049 Madrid, Spain
3 Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
* Correspondence: victor.[email protected]
† These authors contributed equally to this work.
Abstract: When bioprospecting for novel industrial enzymes, substrate promiscuity is a desirable property that increases the reusability of the enzyme. Among industrial enzymes, ester hydrolases have great relevance, and the demand for them has not ceased to increase. However, the search for new substrate promiscuous ester hydrolases is not trivial since the mechanism behind this property is greatly influenced by the active site's structural and physicochemical characteristics. These characteristics must be computed from the 3D structure, which is rarely available and expensive to measure, hence the need for a method that can predict promiscuity from sequence alone. Here we report such a method called EP-pred, an ensemble binary classifier that combines three machine learning algorithms: SVM, KNN, and a Linear model. EP-pred has been evaluated against the Lipase Engineering Database together with a hidden Markov approach leading to a final set of ten sequences predicted to encode promiscuous esterases. Experimental results confirmed the validity of our method since all ten proteins were found to exhibit a broad substrate ambiguity.
Keywords: biocatalysts; bioprospecting; esterases/lipases; hydrolases; machine learning; supervised learning
1. Introduction
Enzymes are of great interest to a vast majority of industries, partially due to the increasing concerns over environmental issues. Among the many classes of enzymes, hydrolases stand out for their industrial relevance because of their high stereoselectivity, commercial availability and stability in organic solvents [1]. Indeed, the demand for newer and better hydrolases that can work in industrial settings has only increased exponentially over the years. Specifically, ester hydrolases (EC 3.1), which hydrolyze ester bonds, have received considerable attention, and are extensively used in various areas such as food, detergents, agriculture, pharmaceuticals, and so on [2].
Searching for new esterase candidates is not trivial. In fact, there are strict requirements regarding stability, activity and substrate promiscuity, which are difficult to find conjointly in natural enzymes [3]. One of the most common issues that an industrial enzyme will face is low substrate promiscuity [4], promiscuity being the ability to catalyze a specific reaction for a variety of different substrates. It is a desirable characteristic since one single enzyme could be used for multiple applications, thus reducing the cost and time of development and production of multiple biocatalysts [5].
While some substrate promiscuous enzymes might suffer from limited stereoselectivity and lower catalytic rates [6,7], they are typically enzymes amenable to enzyme engineering, which could restore these properties. In previous studies, we have investigated the determinants of substrate ambiguity of esterases at a molecular level, establishing rules for its prediction [4], and introducing a significant increase in the number of substrates hydrolyzed through engineering [8]. However, it must be noted that the necessary molecular metrics must be computed from the 3D structures, which are rarely available, and that they involve significant computational time. Even with the recent advancements in the accuracy of deep learning structural predictions, such as AlphaFold 2.0 [9], it is still unfeasible to analyze substrate specificity from the ever growing number of annotated sequences. For instance, as of 9 March 2021, the Lipase Engineering Database (LED) [10], which holds data on esterases/lipases and a few other homologous sequences, contains 280,638 entries. In addition, AlphaFold 2.0 models tend to generate APO structures, with a significant modification (mostly volume reduction) in the active site that precludes, for example, efficient ligand rigid docking [11]. These changes would largely affect promiscuity predictions using the molecular descriptors.
Therefore, methods to directly identify substrate promiscuity from sequence alone would greatly increase the efficiency of bioprospecting for new esterase candidates, a task ideal for machine learning algorithms (see, for example, other recently developed methods for activity predictions [12]). Several studies have already predicted enzyme substrate promiscuity using molecular descriptors [13] or machine learning [14–16] approaches, although for other enzyme families. In addition, there are several differences in our approach. First, other studies mainly used training samples listed in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), which includes reaction data for most of the studied enzyme families. In these databases the number of tested compounds is usually not large, hindering the correct identification of promiscuous enzymes. Additionally, those studies that revolved around single enzyme families are usually constrained by the number of samples from which to derive the molecular descriptors or the classification models, which might decrease their applicability. Second, the goal of the developed methods seems to differ. Previous projects evaluate whether a specific compound will be catalyzed by a set of characterized enzymes or, the reverse, try to find novel substrates for enzymes. Our approach, in addition, tries to classify how promiscuous an enzyme might be in a bioprospecting setting.
Here we report the development and application of an ensemble binary classifier trained on a dataset of 145 diverse esterases and 96 substrates [4], using a combination of physicochemical and evolutionary feature vectors extracted from the primary sequence. Our classifier, named EP-pred, combines three types of classification algorithms: SVM (support vector machines), KNN (k-nearest neighbors) and RidgeClassifier, one of the linear models implemented in Scikit-Learn. EP-pred was then evaluated against LED and, from those predicted to be positives, a final set of ten sequences was isolated and tested experimentally. All selected enzymes were confirmed to be substrate promiscuous, highlighting the potential of our machine learning bioprospecting method.
2. Materials and Methods
2.1. Esterase Dataset
The dataset employed to train the models is the same used in our previous molecular modeling studies and it is formed by 145 diverse microbial ester hydrolases with pairwise sequence identities ranging from 0.2 to 99.7% and an average pairwise identity of 13.7% [4]. The heterogeneity of the sequences can be attributed to the diversity of the source from which they were isolated, including both terrestrial bacteria from 28 geographically distinct sites and marine bacteria. The phylogenetic analysis performed in the previous study further supports the diversity of the source bacteria since they were found to be distributed across the phylogenetic tree.
The substrate profiles of the enzymes were assessed on a set of 96 diverse esters, with the most promiscuous one capable of hydrolyzing 72 esters out of 96 tested and the least promiscuous one only capable of catalyzing 1 out of 96 substrates. The distinction between promiscuous and not promiscuous, or positive and negative classes, was also established according to the threshold of the previous study: 20 substrates.
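The labelling rule described above can be sketched in a few lines of Python. The enzyme names and substrate counts are made up for illustration, and treating exactly 20 substrates as promiscuous (an inclusive cutoff) is an assumption, since the text only gives the threshold value.

```python
# Threshold taken from the text; whether the cutoff is inclusive is an assumption.
PROMISCUITY_THRESHOLD = 20


def label_promiscuity(substrate_counts, threshold=PROMISCUITY_THRESHOLD):
    """Map each enzyme to 1 (promiscuous) or 0 (not promiscuous)."""
    return {name: int(n >= threshold) for name, n in substrate_counts.items()}


# Hypothetical counts of esters hydrolyzed out of the 96 tested.
counts = {"enzyme_a": 72, "enzyme_b": 20, "enzyme_c": 1}
labels = label_promiscuity(counts)
print(labels)
```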
2.2. Lipase Engineering and UniRef50 Databases
The lipase engineering database (https://led.biocatnet.de/, accessed on 9 March 2021), used to evaluate the final model, gathers information on the sequence, structure and function of esterases/lipases and other related proteins sharing the same α/β hydrolase fold. The whole sequence database was downloaded in Fasta format, containing 280,638 entries.
The evolutionary-related features were based on PSSM (Position Specific Scoring Matrix) profiles, which were generated with Psi-Blast by querying the input sequence against UniRef50. UniRef, or the UniProt Reference Clusters, contains records of the UniProt Knowledgebase and the UniParc sequence archive at several resolutions, 100%, 90% and 50%, each one generated from the clustering of the previous one. Thus, UniRef50 is generated from clustering the UniRef90 seed sequences at the identity threshold of 50% [17].
2.3. Feature Extraction
Two web servers, Possum [18] and iFeature [19], were used to extract evolutionary information and physicochemical properties, respectively, from all protein sequences. iFeature can generate 53 different types of descriptors, from which 32 were extracted using the default parameters, resulting in a total of 2274 feature vector dimensions. The rest of the feature types were discarded because they can only be applied to sequences of the same length. These features might be simplistic descriptors like the amino acid composition (AAC), which counts the frequency of each amino acid in the sequence. However, there are also more elaborate forms of descriptors that account for the distribution, transition, or correlation of different properties, like hydrophobicity, along the sequence.
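As a minimal sketch of the simplest descriptor mentioned above, the amino acid composition can be computed as the relative frequency of each of the 20 standard residues. This is an illustration written from the definition, not iFeature's own code; the function name and the toy sequence are hypothetical, and a non-empty sequence of standard residues is assumed.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid (20 values)."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    length = len(sequence)  # assumes a non-empty sequence
    return [counts[aa] / length for aa in AMINO_ACIDS]


aac = amino_acid_composition("MKKLLA")  # toy sequence, not a real esterase
print(dict(zip(AMINO_ACIDS, aac)))
```

Because the output always has 20 entries, sequences of any length map to vectors of the same size, which is the property that makes such descriptors usable as machine learning features.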
Possum generates features based on the PSSM (Position Specific Scoring Matrix) profiles that contain evolutionary information of the sequences, since they specify the scores for observing particular amino acids at specific positions of the sequence. Although very informative, the downside of these profiles is that they depend on the length of the sequences, which hampers their direct use as features for machine learning applications. By applying different matrix transformations to make them length-independent, Possum was able to generate 18 different descriptors that were extracted, resulting in a vector of 18,730 dimensions. Some transformations are inspired by sequence-based features, such as ACC, which reduces the PSSM profile from a matrix of L × 20 dimensions, L being the length of a sequence, to a vector of 20 dimensions by averaging the scores of the rows in the PSSM. Other transformations are more complex constructs that first scale, filter, or group the values in the PSSM and then apply various operations to fix the dimensions of the matrix.
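The row-averaging transformation described above can be illustrated as follows. This is a sketch of the idea only: the function name is hypothetical, the toy matrices are made up, and any score normalization that Possum may apply before averaging is omitted.

```python
def pssm_column_means(pssm):
    """Collapse an L x 20 PSSM (a list of L rows) into a 20-dimensional vector."""
    n_rows = len(pssm)
    n_cols = len(pssm[0])
    # Average over the L rows, one value per amino acid column.
    return [sum(row[j] for row in pssm) / n_rows for j in range(n_cols)]


# Toy "PSSMs" for two sequences of different lengths (values are made up).
pssm_short = [[1] * 20, [3] * 20]
pssm_long = [[0] * 20, [2] * 20, [4] * 20]
vec_short = pssm_column_means(pssm_short)
vec_long = pssm_column_means(pssm_long)
print(len(vec_short), len(vec_long))
```

Both inputs yield vectors of length 20 regardless of L, which is what makes the descriptor length-independent.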
The concatenation of both feature vectors yields a feature set of about 21,000 dimensions (2274 + 18,730 = 21,004) for the esterase dataset. After generating the features, some cleaning was needed because many columns had zeros or identical values in most of the rows, which carried little information. As a result, the 2274 iFeature and 18,730 Possum features were reduced to 1203 and 14,606 dimensions, respectively.
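One way to perform this kind of cleaning is to drop every column whose most frequent value occupies nearly all rows. The sketch below assumes a cutoff of 90% of the rows, which is a hypothetical choice since the text does not state the criterion used; the helper names are also illustrative.

```python
from collections import Counter


def near_constant_columns(rows, max_fraction=0.9):
    """Indices of columns whose most common value fills more than max_fraction of rows."""
    n_rows = len(rows)
    dropped = []
    for j in range(len(rows[0])):
        most_common_count = Counter(row[j] for row in rows).most_common(1)[0][1]
        if most_common_count / n_rows > max_fraction:
            dropped.append(j)
    return dropped


def drop_columns(rows, dropped):
    """Return a copy of rows without the dropped column indices."""
    skip = set(dropped)
    return [[v for j, v in enumerate(row) if j not in skip] for row in rows]


# Toy matrix: column 0 is all zeros, column 1 varies, column 2 is constant.
features = [[0, 0.1, 5], [0, 0.4, 5], [0, 0.9, 5], [0, 0.3, 5]]
dropped = near_constant_columns(features)
cleaned = drop_columns(features, dropped)
print(dropped, cleaned)
```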
2.4. Feature Selection
Even with the cleaning, the number of original features remained exceedingly high; therefore, feature selection was needed to eliminate noise and avoid overfitting. As recommended for this step [20], data was split into two sets, a test set and a training set, and the selection was performed on the training set only. In addition, the number of dimensions was reduced to less than 1/2 of the number of samples as it was shown to reduce overfitting.
It must be noted that the dimensionality reduction of iFeature and Possum descriptors was carried out independently and concatenated later to generate the features. However, the proportion of the two descriptors was not even because evolutionary-related features seem to bear more information [18,21,22], so they were given a larger weight during the construction of the feature set compared to the iFeature descriptors. Furthermore, following this idea, five other sets of features with varying dimensions were also constructed and tested. The whole process was repeated ten times, once for each of the selection algorithms, resulting in a total of 60 feature sets.
The selection methods could be divided into three categories: filter methods that assessed the degree of dependence between the features and the labels, and wrapper and embedded methods that applied machine learning algorithms to rank those features based on their relevance to the performance [23].
A total of 5 libraries were used to implement the methods from the different categories: (I) ITMO_FS [24], which provided filter methods such as Chi-square and Information gain. (II) Boruta [25], a library containing a single embedded method. (III) Scikit-feature [26], which implemented the filter methods MRMR (minimum redundancy maximum relevancy) and CIFE (conditional infomax feature extraction). (IV) Scikit-learn, which provided the filter methods mutual information and Fisher score; the wrapper method RFE (recursive feature elimination) combined with a linear model or SVM; and the embedded method random forest. Finally, (V) XGBoost [27], like Boruta, is a library that contains only an embedded method.
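The principle of fitting the selector on the training split only can be sketched with a Chi-square filter. Note the substitutions: this example uses Scikit-learn's `SelectKBest` with `chi2` as a stand-in rather than the ITMO_FS implementation named above, the data are random numbers, and k = 20 is chosen only because it stays below half of the 116 training samples.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the esterase feature matrix (145 samples, 200 features).
X = rng.random((145, 200))
y = rng.integers(0, 2, size=145)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# chi2 needs non-negative inputs, so scale using training-set statistics only.
scaler = MinMaxScaler().fit(X_train)
selector = SelectKBest(chi2, k=20).fit(scaler.transform(X_train), y_train)

# The test set never influences which features are kept; it is only transformed.
X_train_sel = selector.transform(scaler.transform(X_train))
X_test_sel = selector.transform(scaler.transform(X_test))
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the scaler and the selector before splitting would leak information from the test set into the chosen features and inflate the measured performance.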
2.5. Model Training
SVM, KNN and RidgeClassifier, which are all implemented in Scikit-Learn, were selected for classification. To correctly evaluate the model's performance, we employed a similar strategy to nested cross-validation [20]. The data was split into a 20% test set and 80% training set 5 times, each time generating different sets. Then, for each split, the training set was used for model development to find the optimal sets of hyperparameters using 5-fold cross-validation while the test set was used for the evaluation. This generates five measurements from models with different sets of hyperparameters that can be used to compute the statistics on the model's performance.
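The evaluation scheme above can be sketched with Scikit-learn as follows. The data are synthetic, the hyperparameter grid is illustrative only, and using the split index as the random seed is an assumption; the actual grids and seeds are not given in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the selected esterase features.
X, y = make_classification(n_samples=145, n_features=20, random_state=0)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # illustrative grid
mcc_scorer = make_scorer(matthews_corrcoef)

test_scores = []
for seed in range(5):  # five different 80/20 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Inner loop: 5-fold CV on the training set picks the hyperparameters.
    search = GridSearchCV(SVC(), param_grid, cv=5, scoring=mcc_scorer)
    search.fit(X_train, y_train)
    # Outer evaluation: the refit best model is scored on the untouched test set.
    test_scores.append(matthews_corrcoef(y_test, search.predict(X_test)))

print(f"test MCC: {np.mean(test_scores):.2f} +/- {np.std(test_scores):.2f}")
```

Each split may end up with a different best hyperparameter set, which is why the five test scores describe the procedure as a whole rather than a single fitted model.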
2.6. Performance Metrics
Using the TP (true positive), TN (true negative), FN (false negative) and FP (false positive) values, the precision (Pr), recall (Re), F1 [28] and Matthews correlation coefficient (MCC) were calculated [29] to evaluate the performance of the models.

$$\mathrm{Pr} = \frac{TP}{FP + TP} \tag{1}$$

$$\mathrm{Re} = \frac{TP}{FN + TP} \tag{2}$$

$$F1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \tag{3}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$
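Equations (1) to (4) translate directly into code. The sketch below uses a made-up confusion matrix and does not guard against zero denominators, which would need handling for degenerate cases such as a model that never predicts the positive class.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Made-up confusion matrix for illustration.
metrics = classification_metrics(tp=8, tn=9, fp=2, fn=1)
print(metrics)
```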
2.7. Applicability Domain
There are several aspects that might affect the reliability of the model's predictions apart from the performance metrics. Indeed, there should be limitations in the applicability of the models so that they are used only on those samples that are similar to the training samples, because otherwise, the model would be predicting sequences that it has not seen and fitted before. In other words, we should define the applicability domain (AD) of the models and filter the predictions accordingly.
There are several approaches to define the similarity of the AD, all within the feature or descriptor space, but we decided to follow one inspired by KNN [30]. In this approach, a distance threshold was computed for each training sample and compared to the Euclidean distance between a new sample and each training sample. If any of the distances between the new and the training samples was less than or equal to the threshold associated with that training sample, the prediction was deemed reliable and kept.
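A minimal sketch of such a KNN-inspired applicability domain check is given below. The text does not specify how each training sample's threshold is derived, so the rule used here (mean distance to its k nearest training neighbours) is an assumption, as are the function names and the toy 2D data.

```python
import math


def distance_thresholds(train, k=3):
    """Per-sample threshold: mean distance to the k nearest other training samples.

    This rule is an assumption made for illustration.
    """
    thresholds = []
    for i, x in enumerate(train):
        dists = sorted(math.dist(x, other) for j, other in enumerate(train) if j != i)
        thresholds.append(sum(dists[:k]) / k)
    return thresholds


def in_domain(sample, train, thresholds):
    """True if the sample lies within the threshold of at least one training sample."""
    return any(math.dist(sample, x) <= t for x, t in zip(train, thresholds))


# Toy 2D feature vectors (made up): the four corners of a unit square.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
thresholds = distance_thresholds(train, k=2)
near = in_domain((0.5, 0.5), train, thresholds)    # close to the training data
far = in_domain((10.0, 10.0), train, thresholds)   # far from the training data
print(near, far)
```

Predictions for samples like the second one would be discarded as falling outside the domain.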
2.8. Hidden Markov Model (HMM) Profiles
LED contains sequences other than esterases/lipases, so it would be wise to filter them and keep only those most likely to be esterases before the predictions. For this purpose, we employed HMM profiles, probabilistic models that capture the evolutionarily conserved patterns revealed by multiple sequence alignments (MSA). They allow more sensitive homology searches than BLAST while retaining the speed.
We followed the protocol described by Pérez-García et al. [31]. The program used to build such profiles was HMMER (http://hmmer.org/, accessed on 10 March 2021), using an MSA of the esterases with 35 or more substrates. The MSA was generated by T-Coffee with the default parameters. The program can then use the HMM profile to search for homologs in sequence databases and filter them based on E-values; here we used an E-value cutoff of 10⁻¹⁰ to add precision. Notice that the resulting HMM model is aimed at filtering esterases rather than promiscuity. In fact, when applied to the training dataset, it cannot distinguish well between promiscuous and non-promiscuous enzymes, with a precision score of 0.6 at an E-value of 0.001.
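The E-value filtering step can be illustrated by parsing the tabular output that HMMER's hmmsearch writes with its `--tblout` option, where the first whitespace-separated column is the target name and the fifth is the full-sequence E-value. The sketch below is not the authors' script: the function name, the sequence identifiers and the scores are hypothetical, and the example lines are truncated to the first seven columns.

```python
def filter_hits(tblout_lines, max_evalue=1e-10):
    """Return target names whose full-sequence E-value is at or below max_evalue."""
    hits = []
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue:
            hits.append(target)
    return hits


# Hypothetical excerpt following the column order of a --tblout file:
# target, accession, query, accession, full-sequence E-value, score, bias, ...
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "seq_A  -  esterase_profile  -  3.2e-45  150.1  0.1",
    "seq_B  -  esterase_profile  -  1.0e-04   20.3  0.0",
]
kept = filter_hits(example)
print(kept)
```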
2.9. Homology Modelling (HM) and Active Site Analysis
The top selected predictions were modeled using ModWeb [32], a web server for protein structure modeling, which automatically generates a homology model of the target sequence. Note that here we only aimed at a fast structural method to generate approximate active site structures so we could discard clearly wrong ones. Notice that for those sequences tested experimentally, we also constructed the AlphaFold2 models (not yet available at the conception of the project). AlphaFold2 results clearly agreed with the ones predicted by ModWeb, with low values of RMSD when compared (see Table S6). We also checked the structural quality of the models from the top 10 promiscuous esterases with ProSA-web [33]. It compares the structure models with experimentally determined proteins from the Protein Data Bank and estimates a Z-score for each model, the lower the better.
The active site of the homology models was analyzed to filter out esterases with the catalytic triads not arranged in an active conformation, to make sure they were indeed esterases. Next, the properties of their active site were calculated using SiteMap, Schrodinger [34,35], including hydrophobicity, enclosure and exposure, which gave an idea of how solvent-exposed the cavity of the enzymes was. These metrics are relevant because we found [8] that the active sites of the promiscuous esterases share some common physicochemical features, such as high hydrophobicity and large volumes, and are more enclosed compared to their non-promiscuous counterparts. Accordingly, we used this information to rank and isolate the 10 sequences tested in this paper when we needed to reduce the number of candidates.
2.10. Enzyme Source, Production, and Purification
The sequences encoding AJP48854.1, ART39858.1, PHR82761.1, WP_014900537.1, WP_026140314.1, WP_042877612.1, WP_059541090.1, WP_069226497.1, WP_089515094.1 and WP_101198885.1 were used as templates for gene synthesis (GenScript Biotech, EG Rijswijk, The Netherlands), and genes were codon-optimized to maximize expression in Escherichia coli. Genes were flanked by BamHI and HindIII (stop codon) restriction sites and inserted in a pET-45b(+) expression vector with an ampicillin selection marker (GenScript Biotech, Rijswijk, The Netherlands), which was further introduced into E. coli BL21(DE3). The soluble N-terminal histidine (His) tagged proteins were produced and purified (98% purity, as determined by SDS–PAGE analysis using a Mini PROTEAN electrophoresis system, Bio-Rad, Madrid, Spain) at 4 °C after binding to a Ni-NTA His-Bind resin (Merck Life Science S.L.U., Madrid, Spain), as previously described [4,7], and stored at −86 °C until use at a concentration of 10 mg mL⁻¹ in 40 mM 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) buffer (pH 7.0).

2.11. Activity Tests
The hydrolysis of esters was assayed using a pH indicator assay in 384-well plates (ref. 781162, Greiner Bio-One GmbH, Kremsmünster, Austria) at 40 °C and pH 8.0 in a Synergy HT Multi-Mode Microplate Reader in continuous mode at 550 nm over 24 h (extinction coefficient (ε) of phenol red, 8450 M⁻¹ cm⁻¹), as reported [36,37]. The conditions for determining the specific activity (units mg⁻¹) were as follows: [proteins]: 270 µg mL⁻¹; [ester]: 20 mM; reaction volume: 44 µL; T: 30 °C; and pH: 8.0 (5 mM 4-(2-hydroxyethyl)-1-piperazinepropanesulfonic acid (EPPS) buffer). In all cases, all values in triplicate were corrected for nonenzymatic transformation, with the absence of activity defined as having at least a twofold background signal. In all cases, the activity was calculated by determining the absorbance per minute from the slopes generated [38].
The activity toward the model esters p-nitrophenyl (p-NP) acetate (ref. N-8130; Merck Life Science S.L.U., Madrid, Spain), propionate (Santa Cruz Biotechnology, Inc., Heidelberg, Germany, ref. sc-256813) and butyrate (ref. N-9876; Merck Life Science S.L.U., Madrid, Spain) was assessed in 5 mM EPPS buffer at pH 8.0 and 30 °C by monitoring the production of 4-nitrophenol at 348 nm (pH-independent isosbestic point, ε = 4147 M⁻¹ cm⁻¹) over 5 min and determining the absorbance per minute from the generated slopes [36]. The reactions were performed in 96-well plates (ref. 655801, Greiner Bio-One GmbH, Kremsmünster, Austria). For specific activity determinations, the following conditions were used: [proteins]: 7 µg mL⁻¹; [ester]: 1 mM; reaction volume: 200 µL; T: 30 °C; and pH: 8.0 (5 mM EPPS buffer). For Km and kcat determinations (using p-NP propionate), the following conditions were used: [proteins]: 0.06–25 µg mL⁻¹; [ester]: 0–0.04 mM; reaction volume: 100 µL; T: 30 °C; and pH: 8.0 (5 mM EPPS buffer). The values correspond to the fit obtained from the regression of the data (each obtained in triplicates) using SigmaPlot 14.0 software. kcat was calculated by using the following equation: kcat = Vmax/[E], where [E] = total enzyme (for raw data see Supplementary Material).
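As a worked illustration of kcat = Vmax/[E], the sketch below converts an enzyme concentration given in µg mL⁻¹ to µM before dividing. All numbers, including the 30 kDa molecular weight, are made up for the example, and the function name is hypothetical.

```python
def kcat_per_second(vmax_um_per_min, enzyme_ug_per_ml, molecular_weight_da):
    """kcat = Vmax / [E], with [E] converted from ug/mL to uM."""
    # 1 ug/mL = 1 mg/L; dividing by the molecular weight (g/mol) and
    # multiplying by 1000 gives umol/L.
    enzyme_um = enzyme_ug_per_ml * 1000.0 / molecular_weight_da
    return vmax_um_per_min / enzyme_um / 60.0  # min^-1 -> s^-1


# Made-up numbers: Vmax = 6 uM/min, 3 ug/mL of a 30 kDa enzyme (0.1 uM).
kcat = kcat_per_second(6.0, 3.0, 30000.0)
print(round(kcat, 3))
```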
Meta-cleavage product (MCP) hydrolase activity was assayed using 2-hydroxy-6-oxo-6-phenylhexa-2,4-dienoate (HOPHD) and 2-hydroxy-6-oxohepta-2,4-dienoate (HOHD), freshly produced as described [38]. The reactions were performed at 30 °C in 96-well plates (ref. 655801, Greiner Bio-One GmbH, Kremsmünster, Austria), and they contained 7.0 µg mL⁻¹ proteins and 0.2 mM HOPHD or HOHD in a total volume of 200 µL 50 mM K/Na-phosphate (pH 7.5) buffer (this buffer was shown to be optimal for measuring MCP hydrolytic activity [38]). Hydrolysis was monitored at 388 nm (for HOPHD) or 434 nm (for HOHD) over 5 min and the absorbance per minute was determined from the generated slopes [38].
3. Results and Discussion
3.1. Model Buildup
The accuracy of machine learning classifiers depends greatly on the features and the hyperparameters used. To construct the features, we derived physicochemical and evolutionary information from the sequences via two web servers, iFeature [19] and Possum [18] (see Tables S1 and S2, respectively), reduced their dimension through feature selection and built a total of 60 sets of features to be tested.
The classifiers were then trained on one of the feature sets, during which the hyperparameters were tuned using 5-fold cross-validation (CV). Lastly, the model with the optimal hyperparameters was evaluated on the test set. The process was repeated five times, once for each of the data splits, from which the statistics on the model performance were computed and compared against other models using a distinct feature set (Figure 1). Among the 60 feature sets, two generated the best models, called thereafter ch_20 and random_30, since they were produced by the Chi-squared and Random Forest feature selection methods, respectively. Both SVM and RidgeClassifier performed the best when trained on ch_20 and both showed a mean MCC score of 0.54 for the training set and a mean MCC score of around 0.62 for the test set. In contrast, KNN performed the best when trained on random_30 and showed a mean MCC score of 0.67 for the training set and a mean MCC score of 0.65 for the test set (Figure 2). KNN slightly outperformed the others in the training set score, which might imply that the algorithm is better suited to fitting this type of data.
Figure 1. Graphical representation of the model training process. Data was split five times into different test and training sets. The training set was then used for tuning the hyperparameters using 5-fold cross-validation (CV) while the test set was used to evaluate the trained models.
Figure 2. Matthews correlation coefficient (MCC) scores of the different classifiers. Ridge is the RidgeClassifier, which is one of the linear models implemented in Scikit-Learn; SVM is the support vector machine; KNN is the K-nearest neighbors and EP-pred is the ensemble classifier that combined all 3 of the previous classifiers.
PSSM-based features, which encode the evolutionary information of the sequences, seem to be more relevant for the prediction accuracy. In both selected sets, the proportion of PSSM features was larger than that of physicochemical ones. For instance, 19 out of 25 descriptors in random_30 were extracted from PSSM. This is in line with our previous findings [4], where phylogeny was a predictive marker of the substrate promiscuity in ester hydrolases.
The MCC scores of the three classifiers, which indicate the correlation between the predicted and the true labels, are good, but different models can be grouped to improve the performance since they have different biases that might complement each other. There are many possible combinations because each machine learning algorithm was trained on five different data sets, resulting in five distinct classifiers. However, to generate the combinations, only two or three models from each algorithm were chosen, those with better MCC scores. EP-pred, which aggregates all the models (two SVM, three RidgeClassifier and two KNN models), displayed a mean MCC score of 0.73 for the training set and a score of 0.72 for the test set (Figure 2). The observed increase in the models' scores can be attributed to the fact that only samples for which the predictions between different classifiers agreed were kept for scoring, thus making the predictions more robust.
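The agreement-based scoring described above can be sketched as a consensus filter. The sketch assumes that "agreed" means a unanimous vote across all models, which the text does not state explicitly; the function name and the predictions are made up.

```python
def unanimous_predictions(model_predictions):
    """Keep samples where every model predicts the same label.

    model_predictions: list of per-model label lists, all of the same length.
    Returns {sample_index: agreed_label} for the retained samples.
    """
    agreed = {}
    for i, votes in enumerate(zip(*model_predictions)):
        if len(set(votes)) == 1:  # unanimity is an assumed agreement rule
            agreed[i] = votes[0]
    return agreed


# Made-up predictions from three classifiers on five samples.
svm_pred = [1, 0, 1, 1, 0]
ridge_pred = [1, 0, 0, 1, 0]
knn_pred = [1, 1, 0, 1, 0]
consensus = unanimous_predictions([svm_pred, ridge_pred, knn_pred])
print(consensus)
```

Samples 1 and 2 are dropped because the classifiers disagree, so a score computed on the remaining samples reflects only the cases in which the ensemble is confident.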
3.2. The Workflow of In Silico Bioprospecting
LED gathers sequences from various families apart from esterases/lipases, which is why we applied an HMM profile, built from the esterase dataset, as a filtering step and ended up with approximately 70,000 sequences. Then, the final model EP-pred was evaluated against them and predicted around 500 positive (promiscuous) sequences, which were still too many for the experimental validation. Thus, several filters were applied to decrease the number of hits to a final set of ten.
The top 100 sequences according to E-values returned by HMM were selected to be modeled and their active site cavity analyzed in search of the catalytic triad and geometric descriptors. Only 73 sequences passed this second filter and were forwarded to the subsequent analysis by SiteMap, a widely used binding site analysis tool, which then generated various binding cavity descriptors. As seen in our previous engineering studies, two metrics, hydrophobicity and the ratio of enclosure/exposure, were useful in ranking promiscuity (see Table S3); thus, we used these to rank the final set of ten proteins for experimental validation, picking those that intersected at the top in both metrics (Figure 3).
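One possible reading of "intersected at the top in both metrics" is sketched below: the top-N cutoff is widened until enough candidates appear in the top N of both rankings. This procedure is an assumption made for illustration, and the candidate names and descriptor values are made up.

```python
def top_in_both(candidates, n_final):
    """Grow the top-N cutoff until n_final candidates rank in the top N of both metrics."""
    by_hydro = sorted(candidates, key=lambda c: c["hydrophobicity"], reverse=True)
    by_ratio = sorted(candidates, key=lambda c: c["enclosure_exposure"], reverse=True)
    for n in range(1, len(candidates) + 1):
        top_hydro = {c["name"] for c in by_hydro[:n]}
        shared = [c["name"] for c in by_ratio[:n] if c["name"] in top_hydro]
        if len(shared) >= n_final:
            return shared[:n_final]
    # Fewer candidates than requested: return them all.
    return [c["name"] for c in by_ratio]


# Made-up descriptor values for four hypothetical candidates.
candidates = [
    {"name": "cand_1", "hydrophobicity": 2.1, "enclosure_exposure": 1.9},
    {"name": "cand_2", "hydrophobicity": 0.4, "enclosure_exposure": 1.0},
    {"name": "cand_3", "hydrophobicity": 1.8, "enclosure_exposure": 1.5},
    {"name": "cand_4", "hydrophobicity": 0.3, "enclosure_exposure": 0.5},
]
selected = top_in_both(candidates, n_final=2)
print(selected)
```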
Biomolecules 2022, 12, x FOR PEER REVIEW 8 o 14
Figu e 2. Ma hew’s co ela ion coe icien (MCC) sco es o he di e en classi ie s. Ridge is he
RidgeClassi ie which is one o he linea models implemen ed in Sciki -Lea n; SVM is he suppo
ec o machine; KNN is he K-nea es neighbo s and EP-p ed is he ensemble classi ie ha com-
bined all 3 o he p e ious classi ie s.
Figure 3. A description of the bioprospecting workflow. A, Since there was a mix of different
families in LED, first we applied an HMM profile created from the esterase dataset to clean the
database and keep only esterases. B, EP-pred evaluated the remaining sequences and predicted
around 500 positive hits. C, The top 100 sequences according to E-values returned by HMM in
step A were isolated and analyzed according to molecular descriptors from homology modeling
(HM) and SiteMap calculations. D, A final set of 10 sequences with the highest hydrophobicity and
enclosure/exposure scores were gathered and sent to be validated experimentally.
3.3. The Experimental Validation
All ten recombinant presumptive substrate promiscuous hydrolases (AJP48854.1,
ART39858.1, PHR82761.1, WP_014900537.1, WP_026140314.1, WP_042877612.1,
WP_059541090.1, WP_069226497.1, WP_089515094.1 and WP_101198885.1) were
successfully expressed in soluble form and purified by nickel affinity chromatography. Then, three
model p-nitrophenyl (p-NP) ester substrates with different chain lengths, p-NP acetate (C2),
p-NP propionate (C3), and p-NP butyrate (C4), were first used to determine the substrate
specificity of the enzymes. Their hydrolytic activity was assessed and recorded under
standard assay conditions described in Section 2.11. We found specific activities ranging
from 5.85 U mg−1 to 2.19 U mg−1 for p-NP propionate, which was the best substrate in all
cases (Table 1).
Table 1. Specific activity against p-NP esters. The results are the mean ± SD of triplicates.
                  Specific Activity (Units mg−1)
Enzyme            p-NP Acetate    p-NP Propionate    p-NP Butyrate
AJP48854.1        2.96 ± 0.36     3.99 ± 0.25        1.68 ± 0.10
ART39858.1        2.53 ± 0.39     3.63 ± 0.26        1.48 ± 0.13
PHR82761.1        4.39 ± 0.44     2.75 ± 0.19        2.41 ± 0.25
WP_014900537.1    0.64 ± 0.02     2.19 ± 0.17        1.31 ± 0.18
WP_026140314.1    1.47 ± 0.05     3.33 ± 0.14        1.22 ± 0.09
WP_042877612.1    0.51 ± 0.01     2.57 ± 0.24        0.99 ± 0.06
WP_059541090.1    0.97 ± 0.02     2.96 ± 0.23        1.06 ± 0.08
WP_069226497.1    0.56 ± 0.01     2.26 ± 0.13        1.01 ± 0.04
WP_089515094.1    3.75 ± 0.17     4.46 ± 0.19        2.20 ± 0.15
WP_101198885.1    4.32 ± 0.14     5.85 ± 0.08        3.15 ± 0.25
Once the esterase activity was confirmed, we further tested the hydrolytic activity
towards a set of 96 structurally different esters based on Tanimoto-Combo similarity [4,39].
As shown in Figure 4, all enzymes were able to hydrolyze an ample set of esters, ranging
from 27 (for AJP48854.1) to 68 (for WP_069226497.1). The specific activity ranged from
6.50 U mg−1 (WP_069226497.1, being the most active) to 0.01 U mg−1 (WP_014900537.1, being the
least active), depending on the substrate (Table S4). According to the criteria previously
established [4], nine of the enzymes could be considered as having high-to-prominent
substrate promiscuity as they hydrolyze 30 or more esters, whereas one (AJP48854.1) could
be considered as moderately substrate promiscuous as it used fewer than 30 esters but
more than 10 (a number below which an esterase could be considered substrate specific).
Based only on the number of esters converted (Figure 5), these enzymes could be ranked
among the hydrolases with the highest substrate promiscuity within a total of 145 esterases
previously tested with a similar set of esters. Kinetic characterization using the model ester
substrate p-NP propionate confirmed the high affinity of nine out of the ten tested hydrolases
for this substrate (Km from 14.5 to 48.7 µM) and the high conversion rates (kcat from 2060 to
5043 min−1) of six of them (Table 2; Figure S1).
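The promiscuity criteria quoted above reduce to two thresholds on the number of esters hydrolyzed. A minimal sketch follows; the function name `promiscuity_class` is hypothetical, and treating a count of exactly 10 as "moderate" is an assumption, since the text only states "more than 10" and "below which".

```python
def promiscuity_class(n_esters_hydrolyzed):
    """Classify an ester hydrolase by the number of esters it hydrolyzes.

    Thresholds follow the text: 30 or more -> high-to-prominent;
    fewer than 30 but at least 10 -> moderate; fewer than 10 -> specific.
    The handling of exactly 10 is an assumption made for this sketch.
    """
    if n_esters_hydrolyzed < 0:
        raise ValueError("the number of esters cannot be negative")
    if n_esters_hydrolyzed >= 30:
        return "high-to-prominent"
    if n_esters_hydrolyzed >= 10:
        return "moderate"
    return "specific"


# The two extremes reported in the text (out of 96 esters tested).
for enzyme, n_esters in [("AJP48854.1", 27), ("WP_069226497.1", 68)]:
    print(enzyme, n_esters, promiscuity_class(n_esters))
```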
Table 2. Kinetic parameters of selected hydrolases for p-NP propionate. The results are the mean
± SD of triplicates. Note: Kinetic parameters could not be determined (n.d.) because no reliable Km
could be determined (low affinity for the substrate under our experimental conditions).
Enzyme            kcat (min−1)      Km (µM)
AJP48854.1        n.d.              n.d.
ART39858.1        33.1 ± 0.1        33.3 ± 3.1
PHR82761.1        2569.3 ± 6.3      33.7 ± 3.6
WP_014900537.1    46.8 ± 0.0        16.9 ± 1.6
WP_026140314.1    5.9 ± 0.0         14.5 ± 1.1
WP_042877612.1    5043.0 ± 1268     65.8 ± 26.1
WP_059541090.1    3246.8 ± 10.6     31 ± 3.3
WP_069226497.1    3775.1 ± 37.2     46.5 ± 11.8
WP_089515094.1    2060.8 ± 5.7      23.5 ± 3.5
WP_101198885.1    3452 ± 15.9       48.7 ± 7.5
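To relate the two columns of Table 2, the sketch below assumes standard Michaelis-Menten kinetics, in which the turnover per enzyme molecule at substrate concentration [S] is kcat·[S]/(Km + [S]) and the ratio kcat/Km is commonly used as a catalytic efficiency. The ratio is not reported in the text; it is derived here from the table's mean values only, ignoring the standard deviations, and the function names are hypothetical.

```python
def turnover_rate(kcat_per_min, km_um, substrate_um):
    """Michaelis-Menten turnover per enzyme molecule (min^-1)."""
    return kcat_per_min * substrate_um / (km_um + substrate_um)


def catalytic_efficiency(kcat_per_min, km_um):
    """kcat / Km in min^-1 uM^-1."""
    return kcat_per_min / km_um


# Mean values copied from Table 2: (kcat in min^-1, Km in uM).
table2 = {
    "PHR82761.1": (2569.3, 33.7),
    "WP_026140314.1": (5.9, 14.5),
}

for enzyme, (kcat, km) in table2.items():
    eff = catalytic_efficiency(kcat, km)
    half = turnover_rate(kcat, km, km)  # at [S] = Km the rate is kcat / 2
    print(f"{enzyme}: kcat/Km = {eff:.2f} min^-1 uM^-1, rate at [S]=Km = {half:.2f} min^-1")
```

The comparison illustrates why a low Km alone does not imply efficient conversion: WP_026140314.1 has the lowest Km in the table but also one of the lowest kcat values.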
A database search indicated that all ten hydrolases showed from 60 to 82.9% identity
with the meta-cleavage product hydrolase (MCP hydrolase) from Pseudomonas fluorescens
IP01 (CumD) [40]. These hydrolases participate in the aerobic pathways for the bacterial
degradation of aromatic carbons, in which aromatic compounds are cleaved into meta-ring
fission compounds [35,38].