Landscape of Machine Learning Methods and Data Representations for
Antimicrobial Resistance: Toward a Benchmarking Framework in HPC
Environments
Camilo Cerda Sarabia1, Esteban Gómez Terán1, Fernanda Bravo Cornejo1, Belén Díaz Díaz1, Fausto Cabezas-Mera2, Raúl
Caulier-Cisterna3, Jorge Vergara-Quezada3, Ana Moya-Beltrán3.
1Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
2Programa de Doctorado en Informática Aplicada a Salud y Medio Ambiente, Escuela de Postgrado, Universidad Tecnológica Metropolitana, Santiago, Chile.
3Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
To analyze and compare the performance of various coding strategies and ML models for predicting antimicrobial
resistance, identifying optimal, reproducible, and scalable configurations in HPC environments.
Models available in the literature
Conclusions
References
Antimicrobial resistance (AMR) is one of the most urgent
threats to global health, demanding computationally robust
and scalable solutions. In recent years, machine learning
(ML) has emerged as a powerful strategy for analyzing
large-scale genomic data to predict resistance profiles and
uncover genetic patterns linked to resistance mechanisms.
However, existing tools vary in terms of input features,
encoding strategies, model architectures, and execution
environments.
AMR prediction studies exhibit considerable heterogeneity:
data sets vary in species, sample size, and data type.
Computational environments differ across laboratories;
feature extraction and encoding methods are inconsistent;
and model architectures range from convolutional neural
networks (CNNs) to ensemble methods. This diversity limits
reproducibility, complicates performance comparisons, and
increases the time and resources required to reliably
evaluate tools.
MLP And CNN
Acknowledgments:
Laboratorio de Investigación Aplicada, Departamento de Informática y Computación, UTEM; Escuela de Informática,
UTEM. This work was supported in part by the “Competition for Research Regular Projects”, year
2023, code LPR23-09, and the “Competition for Research Assistant Funding UTEM”, year 2023, code AI23-06, Universidad
Tecnológica Metropolitana (AM-B).
Contact:
[email protected]
[email protected]
Conclusions
Nguyen, M., Olson, R., Shukla, M., VanOeffelen, M., & Davis, J. J. (2020). Predicting antimicrobial resistance using conserved genes. PLOS Computational Biology, 16(8), e1008319.
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008319
Wang, S.-C. (2024). E-CLEAP: An ensemble learning model for efficient and accurate identification of antimicrobial peptides. PLOS ONE, 19(3), e0300125.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0300125
Ren, Y., Chakraborty, T., Doijad, S., Falgenhauer, L., Falgenhauer, J., Goesmann, A., Hauschild, A.-C., Schwengers, O., & Heider, D. (2022). Prediction of antimicrobial resistance based on whole-genome sequencing
and machine learning. Bioinformatics, 38(2), 325–334. https://academic.oup.com/bioinformatics/article/38/2/325/6382301
The CNN.model.h5 model was evaluated with over 2,400 Escherichia coli genomes paired with
resistance phenotypes, and the E-CLEAP model was evaluated with 3,500 peptide samples. These
models were tested in Ubuntu and Windows environments, respectively, with one-hot encoding and
PseAAC (Pseudo Amino Acid Composition), with a runtime of 2,589.87 seconds and a size of 87 MB for
CNN and 100 seconds and 0.84 MB for MLP.
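As a rough illustration of the one-hot encoding mentioned above, the following minimal Python sketch maps each nucleotide to a four-element indicator vector. The function name `one_hot_encode`, the `ACGT` alphabet, and the choice to leave unknown symbols as all-zero rows are assumptions made for illustration; they are not taken from the evaluated CNN code.

```python
# Hypothetical helper: the alphabet and the handling of unknown symbols
# are assumptions, not details of the evaluated CNN pipeline.
ALPHABET = "ACGT"


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Return one row per position, with a single 1 in the matching column."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        if symbol in index:  # symbols outside the alphabet (e.g. N) stay all-zero
            row[index[symbol]] = 1
        encoded.append(row)
    return encoded


# Toy sequence, invented for the example.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```

Each sequence of length n becomes an n x 4 matrix, which is the kind of input a convolutional layer can scan along the sequence axis.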
Goal
A state-of-the-art analysis was carried out using three bibliographic repositories,
Scopus, WOS, and PubMed.
Method
RF-LR-SVM-training E-CLEAP TrainXGBoost CNN.model.h5
Models evaluated
Models Optimized
SVM And XGBoost
The Logistic Regression and SVM models were optimized using strategic Grid Search, evaluating
16 and 12 combinations of hyperparameters, respectively. Logistic Regression implemented
L1/L2/ElasticNet regularization with StandardScaler, improving ROC-AUC from 85% to 95.42%.
SVM used linear and RBF kernels with RobustScaler, increasing ROC-AUC from 92.4% to 96.53%.
Both models achieved >95% discriminatory power with robust cross-validation.
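To make the reported metric concrete, here is a minimal sketch of how ROC-AUC can be computed as a rank statistic: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are invented for illustration and do not come from the evaluated models.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = resistant, 0 = susceptible (invented values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show the definition; library implementations typically work from sorted scores instead.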
Random Forest was optimized with a strategic Grid Search that evaluated eight combinations of key
hyperparameters: n_estimators (150-300 trees versus the original fixed 200), max_depth (limited to 15
levels for regularization, versus no limit to capture complex patterns), and class_weight (equal treatment
versus automatic balancing by class frequency). This achieved significant improvements, with a ROC-AUC of
97.14% and especially an MCC of 86.74%, indicating a better balance between sensitivity and specificity,
which is critical for clinical diagnosis.
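The eight combinations can be read as two candidate values for each of the three hyperparameters. The sketch below enumerates such a grid with the standard library only. Treating "150-300" as the two endpoints, and writing the options as `None` and `"balanced"` in the scikit-learn style, are assumptions; `expand_grid` is a hypothetical helper, not code from the study.

```python
from itertools import product

# Assumed grid: two values per hyperparameter, inferred from the description.
param_grid = {
    "n_estimators": [150, 300],
    "max_depth": [15, None],            # None = no depth limit
    "class_weight": [None, "balanced"],  # equal treatment vs. frequency balancing
}


def expand_grid(grid):
    """Return every combination of the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combinations = expand_grid(param_grid)
print(len(combinations))  # 8
for combo in combinations:
    print(combo)
```

A grid search fits and cross-validates one model per combination, so the cost grows multiplicatively with each added hyperparameter value.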
The RF-LR-SVM model was evaluated with its dataset
of over 2,400 Escherichia coli genomes paired with
resistance phenotypes, and the TrainXGBoost model
was evaluated with its dataset of Staphylococcus
aureus genomic data (1,274 genomes paired with seven
antibiotic resistance profiles), each with different
environments and encoding methods.
RF
LR
SVM
MLP (Multilayer Perceptron)
DOI:10.1093/bioinformatics/btab681
MLP
AUC = 97.33%
Resistant phenotypes with F1 = 98% for methicillin.
Convolutional Neural Network (CNN)
CNN
AUC = 93%
DOI:10.1093/bioinformatics/btab681
XGBoost study
Random Forest
AUC = 96%
DOI:10.1371/journal.pone.0300125
DOI:10.1371/journal.pcbi.1008319
The second model mentioned had an execution time
of 1.42 hours and a size of 186.36 MB, whereas the
first required 11 seconds and 25 MB. This contrast
illustrates the heterogeneity of the data used in
antimicrobial resistance prediction.
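The size of this gap can be made explicit with a short calculation. The sketch below takes the figures quoted above (11 seconds and 25 MB versus 1.42 hours and 186.36 MB) and expresses the second model's cost as a multiple of the first. The function and variable names are hypothetical.

```python
def resource_ratios(runtime_a_s, size_a_mb, runtime_b_s, size_b_mb):
    """Return (runtime ratio, size ratio) of model B relative to model A."""
    return runtime_b_s / runtime_a_s, size_b_mb / size_a_mb


# Figures quoted in the text; hours are converted to seconds.
first = {"runtime_s": 11.0, "size_mb": 25.0}
second = {"runtime_s": 1.42 * 3600, "size_mb": 186.36}

runtime_ratio, size_ratio = resource_ratios(
    first["runtime_s"], first["size_mb"], second["runtime_s"], second["size_mb"]
)
print(f"runtime: {runtime_ratio:.1f}x, size: {size_ratio:.2f}x")
# runtime: 464.7x, size: 7.45x
```

The ratios compare models run on different datasets and environments, so they describe the reported costs rather than a controlled benchmark.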
The aim was to see what is currently available
in terms of antimicrobial resistance and
machine learning models.
Modifications were made to the Python code
to remove library incompatibilities
(Python 3 to Python 2.7).
Hyperparameter optimization was performed
on the three methods (RF, LR, SVM)
of the training model.
The models were downloaded and evaluated
with their respective environments,
datasets, and encodings.
Hyperparameter optimization using Grid Search achieved improvements in the methodological models,
although critical trade-offs between accuracy and computational efficiency persist, with some models
requiring 60 times more resources while maintaining similar performance. Incompatibilities between Python
versions and obsolete libraries underscore the urgent need for standardized computational frameworks that
facilitate knowledge transfer between laboratories and support the effective clinical implementation of AMR
predictive tools.
By systematically evaluating different data types, encoding methods, and feature sets across diverse
computational environments, our study highlights that heterogeneity remains a major barrier to
reproducibility in ML-based AMR prediction. Identifying optimal encoding-model combinations on unified
datasets provides a foundation for reliable, scalable, and reproducible AMR prediction pipelines,
supporting equitable access to computational tools in the fight against antimicrobial resistance.
Models Optimized
Models evaluated
Virulence factors and antibiotic resistance
of Streptococcus pyogenes.
Methods
Performance
Reference
Problem