EP-Pred: A Machine Learning Tool for Bioprospecting Promiscuous Ester Hydrolases

Author: Xiang, Ruite; Fernandez-Lopez, Laura; Robles-Martín, Ana; Ferrer, Manuel; Guallar, Victor

Publisher: Zenodo

DOI: 10.5281/zenodo.17671860

Source: https://zenodo.org/records/17671860/files/biomolecules-12-01529.pdf

Citation: Xiang, R.; Fernandez-Lopez, L.; Robles-Martín, A.; Ferrer, M.; Guallar, V. EP-Pred: A Machine Learning Tool for Bioprospecting Promiscuous Ester Hydrolases. Biomolecules 2022, 12, 1529. https://doi.org/10.3390/biom12101529
Academic Editor: Jian Zhang
Received: 2 September 2022
Accepted: 18 October 2022
Published: 21 October 2022
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
biomolecules
Article
EP-Pred: A Machine Learning Tool for Bioprospecting Promiscuous Ester Hydrolases
Ruite Xiang 1,†, Laura Fernandez-Lopez 2,†, Ana Robles-Martín 1, Manuel Ferrer 2 and Victor Guallar 1,3,*
1 Department of Life Sciences, Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain
2 Department of Applied Biocatalysis, ICP, CSIC, 28049 Madrid, Spain
3 Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
* Correspondence: victor[email protected]
† These authors contributed equally to this work.
Abstract: When bioprospecting for novel industrial enzymes, substrate promiscuity is a desirable property that increases the reusability of the enzyme. Among industrial enzymes, ester hydrolases have great relevance, and the demand for them has not ceased to increase. However, the search for new substrate promiscuous ester hydrolases is not trivial since the mechanism behind this property is greatly influenced by the active site's structural and physicochemical characteristics. These characteristics must be computed from the 3D structure, which is rarely available and expensive to measure, hence the need for a method that can predict promiscuity from sequence alone. Here we report such a method called EP-pred, an ensemble binary classifier that combines three machine learning algorithms: SVM, KNN, and a Linear model. EP-pred has been evaluated against the Lipase Engineering Database together with a hidden Markov approach, leading to a final set of ten sequences predicted to encode promiscuous esterases. Experimental results confirmed the validity of our method since all ten proteins were found to exhibit a broad substrate ambiguity.
Keywords: biocatalysts; bioprospecting; esterases/lipases; hydrolases; machine learning; supervised learning
1. Introduction
Enzymes are of great interest to a vast majority of industries, partially due to the increasing concerns over environmental issues. Among the many classes of enzymes, hydrolases stand out for their industrial relevance because of their high stereoselectivity, commercial availability and stability in organic solvents [1]. Indeed, the demand for newer and better hydrolases that can work in industrial settings has only increased exponentially over the years. Specifically, ester hydrolases (EC 3.1), which hydrolyze ester bonds, have received considerable attention, and are extensively used in various areas such as food, detergents, agriculture, pharmaceuticals, and so on [2].
Searching for new esterase candidates is not trivial; in fact, there are strict requirements regarding stability, activity and substrate promiscuity which are difficult to find conjointly in natural enzymes [3]. Actually, one of the most common issues that an industrial enzyme will face is having low substrate promiscuity [4], that is, a limited ability to catalyze a specific reaction for a variety of different substrates. Promiscuity is a desirable characteristic since one single enzyme could be used for multiple applications, thus reducing the cost and time of development and production of multiple biocatalysts [5].
While some substrate promiscuous enzymes might suffer from limited stereoselectivity and lower catalytic rates [6,7], they are typically enzymes prone to accept enzyme engineering, which could restore these properties. In previous studies, we have investigated the determinants of substrate ambiguity of esterases at a molecular level, establishing rules for its prediction [4], and introducing a significant increase in the number of substrates hydrolyzed through engineering [8]. However, it must be noted that the necessary molecular metrics must be computed from the 3D structures, which are rarely available, and that they involve significant computational time. Even with the recent advancements in the accuracy of deep learning structural predictions, such as AlphaFold 2.0 [9], it is still unfeasible to analyze substrate specificity from the ever growing number of annotated sequences. For instance, as of 9 March 2021, the Lipase Engineering Database (LED) [10], which holds data on esterases/lipases and a few other homologous sequences, contains about 280,638 entries. In addition, AlphaFold 2.0 models tend to generate APO structures, with a significant modification (mostly volume reduction) in the active site that precludes, for example, efficient ligand rigid docking [11]. These changes would largely affect promiscuity predictions using the molecular descriptors.
Therefore, methods to directly identify substrate promiscuity from sequence alone would greatly increase the efficiency of bioprospecting for new esterase candidates, a task ideal for machine learning algorithms (see, for example, other recently developed methods for activity predictions [12]). Several studies have already predicted enzyme substrate promiscuity using molecular descriptors [13] or machine learning [14–16] approaches, although for other enzyme families. In addition, there are several differences in our approach. First, other studies mainly used training samples listed in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), which includes reaction data for most of the studied enzyme families. In these databases the number of tested compounds is usually not large, hindering the correct identification of promiscuous enzymes. Additionally, those studies that revolved around single enzyme families are usually constrained by the number of samples from which to derive the molecular descriptors or the classification models, which might decrease their applicability. Second, the goal of the developed methods seems to differ. Previous projects evaluate whether a specific compound will be catalyzed by a set of characterized enzymes or the reverse, to find novel substrates for enzymes. Our approach, in addition, tries to classify how promiscuous an enzyme might be in a bioprospecting setting.
Here we report the development and application of an ensemble binary classifier trained on a dataset of 145 diverse esterases and 96 substrates [4], using a combination of physicochemical and evolutionary feature vectors extracted from the primary sequence. Our classifier, named EP-pred, combines three types of classification algorithms: SVM (support vector machines), KNN (k-nearest neighbors) and RidgeClassifier, one of the linear models implemented in Scikit-Learn. EP-pred was then evaluated against LED and, from those predicted to be positives, a final set of ten sequences was isolated and tested experimentally. All selected enzymes were confirmed to be substrate promiscuous, highlighting the potential of our machine learning bioprospecting method.
2. Materials and Methods
2.1. Esterase Dataset
The dataset employed to train the models is the same used in our previous molecular modeling studies and it is formed by 145 diverse microbial ester hydrolases with pairwise sequence identities ranging from 0.2 to 99.7% and an average pairwise identity of 13.7% [4]. The heterogeneity of the sequences can be attributed to the diversity of the source from which they were isolated, including both terrestrial bacteria from 28 geographically distinct sites and marine bacteria. The phylogenetic analysis performed in the previous study further supports the diversity of the source bacteria since they were found to be distributed across the phylogenetic tree.
The substrate profiles of the enzymes were assessed on a set of 96 diverse esters, with the most promiscuous one capable of hydrolyzing 72 esters out of 96 tested and the least promiscuous one only capable of catalyzing 1 out of 96 substrates. The distinction between promiscuous and not promiscuous, or positive and negative classes, was also established according to the threshold of the previous study: 20 substrates.
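The labelling rule can be sketched in a few lines of Python. This is only an illustration: the text gives the threshold (20 substrates) but not whether an enzyme with exactly 20 substrates counts as promiscuous, so the inclusive comparison is an assumption, and the function and enzyme names are hypothetical.

```python
PROMISCUITY_THRESHOLD = 20  # threshold from the text; inclusive use is an assumption


def label_promiscuity(substrate_counts, threshold=PROMISCUITY_THRESHOLD):
    """Map enzyme name -> 1 (promiscuous) or 0 (not promiscuous)."""
    return {name: int(count >= threshold) for name, count in substrate_counts.items()}


# Hypothetical enzyme names; 72 and 1 are the extremes reported for the dataset.
counts = {"most_promiscuous": 72, "borderline": 20, "least_promiscuous": 1}
labels = label_promiscuity(counts)
```

With a strict comparison instead, the borderline enzyme would fall into the negative class, so the choice matters only for enzymes sitting exactly on the threshold.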
2.2. Lipase Engineering and UniRef50 Databases
The lipase engineering database (https://led.biocatnet.de/, accessed on 9 March 2021), used to evaluate the final model, gathers information on the sequence, structure and function of esterases/lipases and other related proteins sharing the same α/β hydrolase fold. The whole sequence database was downloaded in Fasta format, containing 280,638 entries.
The evolutionary-related features were based on PSSM (Position Specific Scoring Matrix) profiles, which were generated with Psi-Blast by querying the input sequence against UniRef50. UniRef, or the UniProt Reference Clusters, contains records of the UniProt Knowledgebase and the UniParc sequence archive at several resolutions, 100%, 90% and 50%, each one generated from the clustering of the previous one. Thus, UniRef50 is generated from clustering the UniRef90 seed sequences at the identity threshold of 50% [17].
2.3. Feature Extraction
Two web servers, Possum [18] and iFeature [19], were used to extract evolutionary information and physicochemical properties, respectively, from all protein sequences. iFeature can generate 53 different types of descriptors, from which 32 were extracted using the default parameters, resulting in a total of 2274 feature vector dimensions. The rest of the feature types were discarded because they can only be applied to sequences of the same length. These features might be simplistic descriptors like the amino acid composition (AAC), which counts the frequency of each amino acid in the sequence. However, there are also more elaborate forms of descriptors that account for the distribution, transition, or correlation of different properties, like hydrophobicity, along the sequence.
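As an illustration of the simplest descriptor mentioned above, the following sketch computes an amino acid composition vector in plain Python. It is not the iFeature implementation; the function name and example sequences are hypothetical, and residues outside the 20 standard letters are simply ignored in the output (they still count towards the sequence length).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    return [counts[aa] / length for aa in AMINO_ACIDS]


# Sequences of different lengths map to vectors of the same size (20).
short_vec = amino_acid_composition("GDSAG")
long_vec = amino_acid_composition("MKGDSAGGNLAAHHHHHH")
```

Because the output length is fixed at 20 regardless of sequence length, descriptors of this kind can be fed directly to standard classifiers.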
Possum generates features based on the PSSM (Position Specific Scoring Matrix) profiles, which contain evolutionary information on the sequences since they specify the scores of observing particular amino acids at specific positions of the sequence. Although very informative, the downside of these profiles is that they depend on the length of the sequences, which hampers their direct use as features for machine learning applications. By applying different matrix transformations to make them length-independent, Possum was able to generate 18 different descriptors, which were extracted resulting in a vector of 18,730 dimensions. Some transformations are inspired by sequence-based features, such as AAC, which reduces the PSSM profile from a matrix of L × 20 dimensions, L being the length of a sequence, to a vector of 20 dimensions by averaging the scores over the rows of the PSSM. Other transformations are more complex constructs that first scale, filter, or group the values in the PSSM and then apply various operations to fix the dimensions of the matrix.
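The length-independent averaging described above can be sketched as follows. This is a minimal illustration, not Possum's code: it assumes the PSSM is already available as a list of L rows with 20 scores each, and it reads "averaging the rows" as taking the mean of each of the 20 columns over all L positions.

```python
def average_pssm(pssm):
    """Collapse an L x 20 PSSM (one row per residue position) into a
    20-dimensional vector by averaging each column over the L positions."""
    if not pssm:
        raise ValueError("empty PSSM")
    n_positions = len(pssm)
    n_columns = len(pssm[0])
    if any(len(row) != n_columns for row in pssm):
        raise ValueError("ragged PSSM")
    return [sum(row[j] for row in pssm) / n_positions for j in range(n_columns)]


# Toy 3 x 20 matrix with made-up scores; a real PSSM would come from Psi-Blast.
toy_pssm = [[1] * 20, [2] * 20, [3] * 20]
profile_vector = average_pssm(toy_pssm)
```

Whatever the value of L, the result always has 20 entries, which is what makes the descriptor usable across sequences of different lengths.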
The concatenation of both feature vectors yields a feature set of roughly 21,000 dimensions (2274 + 18,730 = 21,004) for the esterase dataset. After generating the features, some cleaning was needed because many columns had zeros or identical values in most of the rows and therefore carried little information. As a result, the 2274 iFeature and 18,730 Possum features were reduced to 1203 and 14,606 dimensions, respectively.
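One possible way to implement such a cleaning step is sketched below. The exact criterion used is not given in the text, so the rule here (drop a column when its most common value fills more than 90% of the rows) and the `max_dominant_fraction` parameter are assumptions chosen for illustration.

```python
from collections import Counter


def drop_uninformative_columns(rows, max_dominant_fraction=0.9):
    """Keep only columns whose most common value fills at most
    max_dominant_fraction of the rows. Returns (kept_indices, filtered_rows)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    kept = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        dominant_count = Counter(column).most_common(1)[0][1]
        if dominant_count / n_rows <= max_dominant_fraction:
            kept.append(j)
    return kept, [[row[j] for j in kept] for row in rows]


# Toy matrix: column 0 is all zeros, column 1 is constant, column 2 varies.
features = [
    [0.0, 1.0, 0.12],
    [0.0, 1.0, 0.50],
    [0.0, 1.0, 0.33],
    [0.0, 1.0, 0.91],
]
kept_columns, cleaned = drop_uninformative_columns(features)
```

Returning the kept indices alongside the filtered matrix makes it possible to apply the same column mask later to new sequences.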
2.4. Feature Selection
Even with the cleaning, the number of original features remained exceedingly high; therefore, feature selection was needed to eliminate noise and avoid overfitting. As recommended for this step [20], data was split into two sets, a test set and a training set, and the selection was performed on the training set only. In addition, the number of dimensions was reduced to less than 1/2 of the number of samples, as this was shown to reduce overfitting. It must be noted that the dimensionality reduction of iFeature and Possum descriptors was carried out independently and the results were concatenated later to generate the features. However, the proportion of the two descriptors was not even because evolutionary-related features seem to bear more information [18,21,22], so they were given a larger weight during the construction of the feature set compared to the iFeature descriptors. Furthermore, following this idea, five other sets of features with varying dimensions were also constructed and tested. The whole process was repeated ten times, one for each of the selection algorithms, resulting in a total of 60 feature sets.
The selection methods could be divided into three categories: filter methods, which assessed the degree of dependence between the features and the labels, and wrapper and embedded methods, which applied machine learning algorithms to rank those features based on their relevance to the performance [23].
A total of 5 libraries were used to implement the methods from the different categories: (I) ITMO_FS [24], which provided filter methods such as Chi-square and Information gain. (II) Boruta [25], a library containing a single embedded method. (III) Scikit-feature [26], which implemented the filter methods MRMR (minimum redundancy maximum relevancy) and CIFE (conditional infomax feature extraction). (IV) Scikit-learn, which provided the filter methods mutual information and Fisher score; the wrapper method RFE (recursive feature elimination) combined with a linear model or SVM; and the embedded method random forest. Finally, (V) XGBoost [27], which, like Boruta, is a library that contains only an embedded method.
2.5. Model Training
SVM, KNN and RidgeClassifier, which are all implemented in Scikit-Learn, were selected for classification. To correctly evaluate the model's performance, we employed a strategy similar to nested cross-validation [20]. The data was split into a 20% test set and an 80% training set 5 times, each time generating different sets. Then, for each split, the training set was used for model development, to find the optimal sets of hyperparameters using 5-fold cross-validation, while the test set was used for the evaluation. This generates five measurements from models with different sets of hyperparameters that can be used to compute the statistics on the model's performance.
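The index bookkeeping behind this scheme can be sketched with the standard library alone. The sketch below is an assumption-laden illustration rather than the authors' code: the splits are random and unstratified, the fold assignment is a simple stride over the shuffled training indices, and the function name and `seed` parameter are hypothetical. In practice the same pattern could be built from scikit-learn utilities such as `train_test_split` and `GridSearchCV`.

```python
import random


def repeated_splits(n_samples, n_repeats=5, test_fraction=0.2, n_folds=5, seed=0):
    """Yield (train_idx, test_idx, folds) for each repeat; `folds` is a list of
    (inner_train_idx, validation_idx) pairs built from the training indices only."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        folds = []
        for k in range(n_folds):
            validation = train_idx[k::n_folds]  # every n_folds-th training sample
            held_out = set(validation)
            inner_train = [i for i in train_idx if i not in held_out]
            folds.append((inner_train, validation))
        yield train_idx, test_idx, folds


# 145 samples, as in the esterase dataset: 29 test and 116 training samples per split.
splits = list(repeated_splits(n_samples=145))
```

The key property is that the test indices never appear in any inner fold, so hyperparameter tuning cannot see the samples later used for evaluation.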
2.6. Performance Metrics
Using the TP (true positive), TN (true negative), FN (false negative) and FP (false positive) values, the precision (Pr), recall (Re), F1 [28] and Matthews correlation coefficient (MCC) were calculated [29] to evaluate the performance of the models.
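$$\mathrm{Pr} = \frac{TP}{FP + TP} \tag{1}$$
$$\mathrm{Re} = \frac{TP}{FN + TP} \tag{2}$$
$$F1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \tag{3}$$
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$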
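Equations (1) to (4) translate directly into code. The sketch below is a plain Python illustration; returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the text, and the example counts are hypothetical rather than results from this work.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts for a 29-sample test set (not results from the paper).
scores = classification_metrics(tp=10, tn=14, fp=2, fn=3)
```

Unlike precision, recall and F1, the MCC uses all four cells of the confusion matrix, which makes it a reasonable single summary when the two classes are not balanced.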
2.7. Applicability Domain
There are several aspects that might affect the reliability of the model's predictions apart from the performance metrics. Indeed, the models should be applied only to samples that are similar to the training samples, because otherwise they would be predicting sequences unlike any they have seen and been fitted on before. In other words, we should define the applicability domain (AD) of the models and filter the predictions accordingly.
There are several approaches to define the similarity to the AD, all within the feature or descriptor space, but we decided to follow one inspired by KNN [30]. In this approach, a distance threshold was computed for each training sample and compared to the Euclidean distance between a new sample and each training sample. If any of the distances between the new and the training samples was less than or equal to the threshold associated with that training sample, the prediction was deemed reliable and kept.
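The filtering rule can be illustrated with a short sketch. How the per-sample thresholds are derived is not detailed in this section, so the rule used here (the mean distance from each training sample to its k nearest training neighbours) is a simplifying assumption, as are the function names and the toy coordinates.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def training_thresholds(train, k=3):
    """One threshold per training sample: mean distance to its k nearest
    training neighbours (an assumed, simplified rule)."""
    thresholds = []
    for i, sample in enumerate(train):
        distances = sorted(
            euclidean(sample, other) for j, other in enumerate(train) if j != i
        )
        nearest = distances[:k]
        thresholds.append(sum(nearest) / len(nearest))
    return thresholds


def in_applicability_domain(new_sample, train, thresholds):
    """True if the new sample lies within the threshold of at least one training sample."""
    return any(euclidean(new_sample, s) <= t for s, t in zip(train, thresholds))


# Toy 2D feature space: four training samples on the corners of a unit square.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
thresholds = training_thresholds(train, k=2)
inside = in_applicability_domain((0.5, 0.5), train, thresholds)
outside = in_applicability_domain((10.0, 10.0), train, thresholds)
```

A sample in a dense region of the training set gets a small threshold, so the domain is tight where data is abundant and looser where training samples are sparse.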
2.8. Hidden Markov Model (HMM) Profiles
LED contains sequences other than esterases/lipases, so it would be wise to filter them and keep only those most likely to be esterases before the predictions. For this purpose, we employed HMM profiles, probabilistic models that capture the evolutionarily conserved patterns revealed by multiple sequence alignments (MSA). They allow more sensitive homology searches than BLAST while retaining the speed.
We followed the protocol described by Pérez-García et al. [31]. The program used to build such profiles was HMMER (http://hmmer.org/, accessed on 10 March 2021), using an MSA of the esterases with 35 or more substrates. The MSA was generated by T-Coffee with the default parameters. The program can then use the HMM profile to search for homologs in sequence databases and filter them based on E-values; here we used an E-value cutoff of 10⁻¹⁰ to add precision. Notice that the resulting HMM model is aimed at filtering esterases rather than promiscuity. In fact, when applied to the training dataset, it cannot distinguish well between promiscuous and non-promiscuous enzymes, with a precision score of 0.6 at an E-value of 0.001.
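The E-value filter itself is a one-line comparison once the search results have been parsed. The sketch below deliberately skips the parsing of HMMER output and starts from (identifier, E-value) pairs; the identifiers and E-values are hypothetical, and treating the cutoff as inclusive is an assumption.

```python
E_VALUE_CUTOFF = 1e-10  # cutoff stated in the text


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep (sequence_id, e_value) pairs at or below the cutoff, best hits first."""
    kept = [(seq_id, e_value) for seq_id, e_value in hits if e_value <= cutoff]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical identifiers and E-values, for illustration only.
hits = [("seq_A", 3e-45), ("seq_B", 2e-4), ("seq_C", 1e-10), ("seq_D", 7e-12)]
esterase_like = filter_hits(hits)
```

Sorting by E-value at this stage is convenient because the later workflow selects the top-ranked sequences by the same criterion.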
2.9. Homology Modelling (HM) and Active Site Analysis
The top selected predictions were modeled using ModWeb [32], a web server for protein structure modeling, which automatically generates a homology model of the target sequence. Note that here we only aimed at a fast structural method to generate approximate active site structures so we could discard clearly wrong ones. Notice that for those sequences tested experimentally, we also constructed the AlphaFold2 models (not yet available at the conception of the project). AlphaFold2 results clearly agreed with the ones predicted by ModWeb, with low values of RMSD when compared (see Table S6). We also checked the structural quality of the models from the top 10 promiscuous esterases with ProSA-web [33]. It compares the structure models with experimentally determined proteins from the Protein Data Bank and estimates a Z-score for each model, the lower the better.
The active site of the homology models was analyzed to filter out esterases with the catalytic triads not arranged in an active conformation, to make sure they were indeed esterases. Next, the properties of their active site were calculated using SiteMap, Schrodinger [34,35], which includes hydrophobicity, enclosure and exposure, metrics that gave an idea of how solvent-exposed the cavity of the enzymes was. These metrics are relevant because we found [8] that the active sites of the promiscuous esterases share some common physicochemical features, such as high hydrophobicity and large volumes, and are more enclosed compared to their non-promiscuous counterparts. Accordingly, we used this information to rank and isolate the 10 sequences tested in this paper when we needed to reduce the number of candidates.
2.10. Enzyme Source, Production, and Purification
The sequences encoding AJP48854.1, ART39858.1, PHR82761.1, WP_014900537.1, WP_026140314.1, WP_042877612.1, WP_059541090.1, WP_069226497.1, WP_089515094.1 and WP_101198885.1 were used as templates for gene synthesis (GenScript Biotech, EG Rijswijk, The Netherlands), and genes were codon-optimized to maximize expression in Escherichia coli. Genes were flanked by BamHI and HindIII (stop codon) restriction sites and inserted in a pET-45b(+) expression vector with an ampicillin selection marker (GenScript Biotech, Rijswijk, The Netherlands), which was further introduced into E. coli BL21(DE3). The soluble N-terminal histidine (His) tagged proteins were produced and purified (98% purity, as determined by SDS–PAGE analysis using a Mini PROTEAN electrophoresis system, Bio-Rad, Madrid, Spain) at 4 °C after binding to a Ni-NTA His-Bind resin (Merck Life Science S.L.U., Madrid, Spain), as previously described [4,7], and stored at −86 °C until use at a concentration of 10 mg mL⁻¹ in 40 mM 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) buffer (pH 7.0).

2.11. Activity Tests
The hydrolysis of esters was assayed using a pH indicator assay in 384-well plates (ref. 781162, Greiner Bio-One GmbH, Kremsmünster, Austria) at 40 °C and pH 8.0 in a Synergy HT Multi-Mode Microplate Reader in continuous mode at 550 nm over 24 h (extinction coefficient (ε) of phenol red, 8450 M⁻¹ cm⁻¹), as reported [36,37]. The conditions for determining the specific activity (units mg⁻¹) were as follows: [proteins]: 270 µg mL⁻¹; [ester]: 20 mM; reaction volume: 44 µL; T: 30 °C; and pH: 8.0 (5 mM 4-(2-hydroxyethyl)-1-piperazinepropanesulfonic acid (EPPS) buffer). In all cases, all values in triplicate were corrected for nonenzymatic transformation, with the absence of activity defined as having at least a twofold background signal. In all cases, the activity was calculated by determining the absorbance per minute from the slopes generated [38].
The activity toward the model esters p-nitrophenyl (p-NP) acetate (ref. N-8130; Merck Life Science S.L.U., Madrid, Spain), propionate (Santa Cruz Biotechnology, Inc., Heidelberg, Germany, ref. sc-256813) and butyrate (ref. N-9876; Merck Life Science S.L.U., Madrid, Spain) was assessed in 5 mM EPPS buffer at pH 8.0 and 30 °C by monitoring the production of 4-nitrophenol at 348 nm (pH-independent isosbestic point, ε = 4147 M⁻¹ cm⁻¹) over 5 min and determining the absorbance per minute from the generated slopes [36]. The reactions were performed in 96-well plates (ref. 655801, Greiner Bio-One GmbH, Kremsmünster, Austria). For specific activity determinations, the following conditions were used: [proteins]: 7 µg mL⁻¹; [ester]: 1 mM; reaction volume: 200 µL; T: 30 °C; and pH: 8.0 (5 mM EPPS buffer). For K_m and k_cat determinations (using p-NP propionate), the following conditions were used: [proteins]: 0.06–25 µg mL⁻¹; [ester]: 0–0.04 mM; reaction volume: 100 µL; T: 30 °C; and pH: 8.0 (5 mM EPPS buffer). The values correspond to the fit obtained from the regression of the data (each obtained in triplicates) using SigmaPlot 14.0 software. k_cat was calculated by using the following equation: k_cat = V_max/[E], where [E] = total enzyme (for raw data see Supplementary Material).
Meta-cleavage product (MCP) hydrolase activity was assayed using 2-hydroxy-6-oxo-6-phenylhexa-2,4-dienoate (HOPHD) and 2-hydroxy-6-oxohepta-2,4-dienoate (HOHD), freshly produced as described [38]. The reactions were performed at 30 °C in 96-well plates (ref. 655801, Greiner Bio-One GmbH, Kremsmünster, Austria), and they contained 7.0 µg mL⁻¹ proteins and 0.2 mM HOPHD or HOHD in a total volume of 200 µL 50 mM K/Na-phosphate (pH 7.5) buffer (this buffer was shown to be optimal for measuring MCP hydrolytic activity [38]). Hydrolysis was monitored at 388 nm (for HOPHD) or 434 nm (for HOHD) over 5 min and the absorbance per minute was determined from the generated slopes [38].
3. Results and Discussion
3.1. Model Buildup
The accuracy of machine learning classifiers depends greatly on the features and the hyperparameters used. To construct the features, we derived physicochemical and evolutionary information from the sequences via two webservers, iFeature [19] and Possum [18] (see Tables S1 and S2, respectively), reduced their dimension through feature selection and built a total of 60 sets of features to be tested.
The classifiers were then trained on one of the feature sets, during which the hyperparameters were tuned using 5-fold cross-validation (CV). Lastly, the model with the optimal hyperparameters was evaluated on the test set. The process was repeated five times, one for each of the data splits, from which the statistics on the model performance were computed and compared against other models using a distinct feature set (Figure 1). Among the 60 feature sets, two generated the best models, called thereafter ch_20 and random_30, since they were produced by the Chi-squared and Random Forest feature selection methods, respectively. Both SVM and RidgeClassifier performed the best when trained on ch_20 and both showed a mean MCC score of 0.54 for the training set and a mean MCC score of around 0.62 for the test set. In contrast, KNN performed the best when trained on random_30 and showed a mean MCC score of 0.67 for the training set and a mean MCC score of 0.65 for the test set (Figure 2). KNN slightly outperformed the others in the training set score, which might imply that the algorithm is better suited to fitting this type of data.
Figure 1. Graphical representation of the model training process. Data was split five times into different test and training sets. The training set was then used for tuning the hyperparameters using 5-fold cross-validation (CV) while the test set was used to evaluate the trained models.
Figure 2. Matthews correlation coefficient (MCC) scores of the different classifiers. Ridge is the RidgeClassifier, which is one of the linear models implemented in Scikit-Learn; SVM is the support vector machine; KNN is the K-nearest neighbors and EP-pred is the ensemble classifier that combined all 3 of the previous classifiers.
PSSM-based features, which encode the evolutionary information of the sequences, seem to be more relevant to the prediction accuracy. In both selected sets, the proportion of PSSM features was larger than that of physicochemical ones. For instance, 19 out of 25 descriptors in random_30 were extracted from PSSM. This is in line with our previous findings [4], where phylogeny was a predictive marker of the substrate promiscuity in ester hydrolases.
The MCC scores of the three classifiers, which indicate the correlation between the predicted and the true labels, are good, but different models can be grouped to improve the performance since they have different biases that might complement each other. There are many possible combinations because each machine learning algorithm was trained on five different data sets, resulting in five distinct classifiers. However, to generate the combinations, only two or three models from each algorithm were chosen, those with better MCC scores. EP-pred, which aggregates all the models, two SVM, three RidgeClassifier and two KNN models, displayed a mean MCC score of 0.73 for the training set and a score of 0.72 for the test set (Figure 2). The observed increase in the models' scores can be attributed to the fact that only samples for which the predictions between different classifiers agreed were kept for scoring, thus making the predictions more robust.
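The agreement filter described above can be sketched as follows. The text does not say whether all seven models must agree or only a majority, so the unanimous rule used here is an assumption, and the model names and predictions are hypothetical.

```python
def consensus_predictions(predictions_by_model):
    """predictions_by_model: dict model_name -> list of 0/1 labels (same sample order).
    Returns {sample_index: label} for the samples on which every model agrees."""
    per_sample = zip(*predictions_by_model.values())
    return {
        i: votes[0]
        for i, votes in enumerate(per_sample)
        if len(set(votes)) == 1
    }


# Hypothetical outputs of three fitted classifiers on five sequences.
predictions = {
    "svm":   [1, 0, 1, 0, 1],
    "ridge": [1, 0, 0, 0, 1],
    "knn":   [1, 1, 1, 0, 1],
}
agreed = consensus_predictions(predictions)
```

Samples on which the models disagree are dropped rather than resolved by voting, which trades coverage for more reliable labels on the samples that remain.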
3.2. The Workflow of In Silico Bioprospecting
LED gathers sequences from various families apart from esterases/lipases, which is why we applied an HMM profile, built from the esterase dataset, as a filtering step and ended up with approximately 70,000 sequences. Then, the final model EP-pred was evaluated against them and predicted around 500 positive (promiscuous) sequences, which were still too many for the experimental validation. Thus, several filters were applied to decrease the number of hits to a final set of ten.
The top 100 sequences according to E-values returned by HMM were selected to be modeled and their active site cavity analyzed in search of the catalytic triad and geometric descriptors. Only 73 sequences passed this second filter and were forwarded to the subsequent analysis by SiteMap, a widely used binding site analysis tool, which then generated various binding cavity descriptors. As seen in our previous engineering studies, two metrics, hydrophobicity and the ratio of enclosure/exposure, were useful in ranking promiscuity (see Table S3); thus, we used these to rank the final set of ten proteins for experimental validation, picking those that intersected at the top in both metrics (Figure 3).
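One way to read "intersected at the top in both metrics" is to rank the candidates by each metric separately and keep those that appear in both top lists. The sketch below follows that reading, which is an assumption; the `top_n` parameter, candidate names and values are hypothetical and are not taken from Table S3.

```python
def top_in_both(candidates, top_n):
    """candidates: dict name -> (hydrophobicity, enclosure_exposure_ratio).
    Returns the names ranked in the top_n for both metrics, sorted alphabetically."""
    by_hydrophobicity = sorted(candidates, key=lambda n: candidates[n][0], reverse=True)[:top_n]
    by_ratio = sorted(candidates, key=lambda n: candidates[n][1], reverse=True)[:top_n]
    return sorted(set(by_hydrophobicity) & set(by_ratio))


# Hypothetical SiteMap-style values for four candidates.
candidates = {
    "cand_1": (2.1, 1.8),
    "cand_2": (0.9, 2.0),
    "cand_3": (1.7, 1.5),
    "cand_4": (0.4, 0.6),
}
selected = top_in_both(candidates, top_n=2)
```

Increasing `top_n` widens both lists, so the size of the final selection can be tuned until the desired number of candidates is reached.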
Biomolecules 2022, 12, x FOR PEER REVIEW 8 o 14
Figu e 2. Ma hew’s co ela ion coe icien (MCC) sco es o he di e en classi ie s. Ridge is he
RidgeClassi ie which is one o he linea models implemen ed in Sciki -Lea n; SVM is he suppo
ec o machine; KNN is he K-nea es neighbo s and EP-p ed is he ensemble classi ie ha com-
bined all 3 o he p e ious classi ie s.
Figure 3. A description of the bioprospecting workflow. A, Since there was a mix of different
families in LED, first we applied an HMM profile created from the esterase dataset to clean the
database and keep only esterases. B, EP-pred evaluated the remaining sequences and predicted around
500 positive hits. C, The top 100 sequences according to E-values returned by HMM in step A were
isolated and analyzed according to molecular descriptors from homology modeling (HM) and
SiteMap calculations. D, A final set of 10 sequences with the highest hydrophobicity and
enclosure/exposure scores were gathered and sent to be validated experimentally.
3.3. The Experimental Validation
All ten recombinant presumptive substrate promiscuous hydrolases (AJP48854.1,
ART39858.1, PHR82761.1, WP_014900537.1, WP_026140314.1, WP_042877612.1,
WP_059541090.1, WP_069226497.1, WP_089515094.1 and WP_101198885.1) were
successfully expressed in soluble form and purified by nickel affinity chromatography. Then, three
model p-nitrophenyl (p-NP) ester substrates with different chain lengths, p-NP acetate (C2),
p-NP propionate (C3), and p-NP butyrate (C4), were first used to determine the substrate
specificity of the enzymes. Their hydrolytic activity was assessed and recorded under
standard assay conditions described in Section 2.11. We found specific activities ranging
from 5.85 U mg−1 to 2.19 U mg−1 for p-NP propionate, which was the best substrate in all
cases (Table 1).
Table 1. Specific activity against p-NP esters. The results are the mean ± SD of triplicates.
                   Specific Activity (Units mg−1)
Enzyme             p-NP Acetate    p-NP Propionate    p-NP Butyrate
AJP48854.1         2.96 ± 0.36     3.99 ± 0.25        1.68 ± 0.10
ART39858.1         2.53 ± 0.39     3.63 ± 0.26        1.48 ± 0.13
PHR82761.1         4.39 ± 0.44     2.75 ± 0.19        2.41 ± 0.25
WP_014900537.1     0.64 ± 0.02     2.19 ± 0.17        1.31 ± 0.18
WP_026140314.1     1.47 ± 0.05     3.33 ± 0.14        1.22 ± 0.09
WP_042877612.1     0.51 ± 0.01     2.57 ± 0.24        0.99 ± 0.06
WP_059541090.1     0.97 ± 0.02     2.96 ± 0.23        1.06 ± 0.08
WP_069226497.1     0.56 ± 0.01     2.26 ± 0.13        1.01 ± 0.04
WP_089515094.1     3.75 ± 0.17     4.46 ± 0.19        2.20 ± 0.15
WP_101198885.1     4.32 ± 0.14     5.85 ± 0.08        3.15 ± 0.25
Once the esterase activity was confirmed, we further tested the hydrolytic activity
towards a set of 96 structurally different esters based on Tanimoto-Combo similarity [4,39].
As shown in Figure 4, all enzymes were able to hydrolyze an ample set of esters, ranging
from 27 (for AJP48854.1) to 68 (for WP_069226497.1). The specific activity ranges from
6.50 U mg−1 (WP_069226497.1, being the most active) to 0.01 U mg−1 (WP_014900537.1, being the
least active), depending on the substrate (Table S4). According to the criteria previously
established [4], nine of the enzymes could be considered as having high-to-prominent
substrate promiscuity as they hydrolyze 30 or more esters, whereas one (AJP48854.1) could
be considered as moderately substrate promiscuous as it used less than 30 esters but
more than 10 (a number below which an esterase could be considered substrate specific).
Based only on the number of esters converted (Figure 5), these enzymes could be ranked
among the hydrolases with the highest substrate promiscuity within a total of 145 esterases
previously tested with a similar set of esters. Kinetic characterization using the model ester
substrate p-NP propionate confirmed the high affinity of nine out of the ten tested hydrolases
for this substrate (Km from 14.5 to 48.7 µM) and the high conversion rates (kcat from 2060 to
5043 min−1) of six of them (Table 2; Figure S1).
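The promiscuity criteria quoted above reduce to two thresholds on the number of esters hydrolyzed. A minimal sketch follows; the function name is hypothetical, and treating a count of exactly 10 as moderate is an assumption, since the text only specifies "more than 10" and "below" 10.

```python
def promiscuity_class(n_esters):
    """Classify an esterase by the number of esters it hydrolyzes."""
    if n_esters >= 30:
        return "high-to-prominent"
    if n_esters >= 10:  # boundary at exactly 10 is an assumption of this sketch
        return "moderate"
    return "substrate specific"

# Ester counts reported in the text for the two extremes of the ten enzymes.
counts = {"AJP48854.1": 27, "WP_069226497.1": 68}
classes = {enzyme: promiscuity_class(n) for enzyme, n in counts.items()}
print(classes)
```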
Table 2. Kinetic parameters of selected hydrolases for p-NP propionate. The results are the mean
± SD of triplicates. Note: Kinetic parameters could not be determined (n.d.) because no reliable Km
could be determined (low affinity for the substrate under our experimental conditions).
Enzyme             kcat (min−1)      Km (µM)
AJP48854.1         n.d.              n.d.
ART39858.1         33.1 ± 0.1        33.3 ± 3.1
PHR82761.1         2569.3 ± 6.3      33.7 ± 3.6
WP_014900537.1     46.8 ± 0.0        16.9 ± 1.6
WP_026140314.1     5.9 ± 0.0         14.5 ± 1.1
WP_042877612.1     5043.0 ± 1268     65.8 ± 26.1
WP_059541090.1     3246.8 ± 10.6     31 ± 3.3
WP_069226497.1     3775.1 ± 37.2     46.5 ± 11.8
WP_089515094.1     2060.8 ± 5.7      23.5 ± 3.5
WP_101198885.1     3452 ± 15.9       48.7 ± 7.5
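To give a feel for what the two parameters in Table 2 mean together, the sketch below evaluates the standard Michaelis-Menten expression, turnover = kcat [S] / (Km + [S]). It assumes the tabulated values come from a Michaelis-Menten fit, which this section does not state explicitly; the function name is hypothetical and the standard deviations are ignored.

```python
def turnover_rate(kcat, km, substrate):
    """Michaelis-Menten turnover per enzyme molecule (min^-1): kcat * [S] / (Km + [S])."""
    if substrate < 0 or km <= 0:
        raise ValueError("substrate must be >= 0 and km must be > 0")
    return kcat * substrate / (km + substrate)

# Mean values copied from Table 2 (kcat in min^-1, Km in µM).
table2 = {
    "PHR82761.1": (2569.3, 33.7),
    "WP_026140314.1": (5.9, 14.5),
}

for enzyme, (kcat, km) in table2.items():
    # At [S] = Km the turnover is half of kcat by definition of Km.
    print(enzyme, turnover_rate(kcat, km, km))
```

The example shows why a low Km alone does not imply fast conversion: WP_026140314.1 has the lowest Km in the table but also a very low kcat.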
A database search indicated that all ten hydrolases showed from 60 to 82.9% identity
with the meta-cleavage product hydrolase (MCP hydrolase) from Pseudomonas fluorescens
IP01 (CumD) [40]. These hydrolases participate in the aerobic pathways of the bacterial
degradation of aromatic carbons, in which aromatic compounds are cleaved into meta-ring
fission compounds [35,38].
