Comparative Performance Analysis of DNA Sequence
Encoding Methods for Machine Learning-Based Bacterial
Classification
Authors: Diego Santibáñez Oyarce1, Esteban Gómez Terán1, Jorge Vergara-Quezada2, Ana
Moya-Beltrán2.
Affiliations: 1Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica
Metropolitana, Santiago, Chile. 2Departamento de Informática y Computación, Facultad de
Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
Efficient encoding of DNA sequences remains a major challenge for applying machine learning
(ML) in genomics. The choice of encoding method significantly impacts model performance,
computational cost, and memory efficiency, particularly in taxonomic classification, antimicrobial
resistance prediction, and metagenomic analysis.
We systematically evaluate multiple DNA encoding strategies using the universal 16S rRNA
marker gene for bacterial genus classification in ML. We compare traditional approaches
(One-Hot, K-mers) against signal transformation techniques (Fast Fourier Transform, Wavelet)
and hybrid combinations, tested with SVM, Random Forest, and XGBoost classifiers. All
methods were evaluated using both aligned sequences (AS) from multiple sequence alignment
and padded sequences (PS) with N-padding to uniform length. Transform-based encodings
demonstrated notable efficiency: Fourier methods achieved 20.1 seconds execution time
compared to 48.2 seconds for One-Hot encoding, while using 1.9 GB versus 19.1 GB of
memory, representing a 2.4× speed improvement and a 10× memory reduction. Wavelet transform
showed similar efficiency at 21.9 seconds and 3.8 GB memory usage. Peak classification
accuracy reached 99.6% with hybrid approaches, while efficient methods like AS-K-mers
achieved 98.7% accuracy. Across 30 bacterial genera and 5,256 sequences, aligned sequences
consistently outperformed padded sequences, with SVM showing superior performance across
most encoding strategies.
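As a rough illustration of the encodings compared above, the following minimal Python sketch pads a sequence with N, then builds One-Hot, K-mer count, and Fourier-magnitude features. It is not the authors' implementation: the function names, the integer base mapping (A=1, C=2, G=3, T=4, N=0), the choice of k, and the use of simple concatenation as a "hybrid" feature vector are all assumptions made for this example, and the Wavelet encoding is not covered.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def pad_sequence(seq: str, length: int) -> str:
    """Right-pad with 'N' (or truncate) so every sequence has the same length."""
    return seq[:length].ljust(length, "N")


def one_hot(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) matrix; 'N' and other symbols become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def kmer_counts(seq: str, k: int = 3) -> np.ndarray:
    """Count overlapping k-mers over ACGT; windows with other symbols are skipped."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers), dtype=np.float32)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in index:
            counts[index[window]] += 1.0
    return counts


def fft_magnitude(seq: str) -> np.ndarray:
    """Map bases to integers (assumed mapping) and return the real-FFT magnitudes."""
    mapping = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}  # 'N' and others map to 0
    signal = np.array([mapping.get(base, 0.0) for base in seq])
    return np.abs(np.fft.rfft(signal))


# Hypothetical toy input; real 16S rRNA sequences are far longer.
padded = pad_sequence("ACGTAC", 8)
# Assumed hybrid: concatenate K-mer counts with the Fourier magnitude spectrum.
features = np.concatenate([kmer_counts(padded, k=2), fft_magnitude(padded)])
```

Note that the One-Hot matrix grows with sequence length times four, while the K-mer vector has a fixed size of 4^k and the real-FFT spectrum has roughly half as many values as the sequence has positions, which is consistent in spirit with the memory differences reported above.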
Our results demonstrate that encoding choice is pivotal to scaling ML models in genomics, with
implications beyond taxonomic classification. The evaluated methods can be extended to other
sequence-based prediction tasks, enabling more efficient and scalable pipelines. This work
contributes to standardizing sequence representation strategies, supporting broader ML
adoption in computational biology.
Keywords: 16S rRNA marker, Wavelet transformation, Fast Fourier Transform, Machine learning
Acknowledgement: Departamento de Informática y Computación, UTEM; Escuela de Informática, UTEM;
Laboratorio de Investigación Aplicada, Departamento de Informática y Computación, UTEM. This work was
supported in part by the “Competition for Research Regular Projects”, year 2023, code
LPR23-09, and in part by the “Scientific and Technological Equipment Projects Competition”, year 2024, code
LE24-03.