
Application of Language Models for the Functional Annotation of Conserved Domains in Biological Data

Author: Hugo, Osses Prado; Raul, Caulier Cisternas; Ana, Moya-Beltrán
Publisher: Zenodo
DOI: 10.5281/zenodo.17307097
Source: https://zenodo.org/records/17307097/files/poster.pdf
Laboratorio de Investigación Aplicada, Departamento de Informática y Computación, UTEM; Escuela de Informática, UTEM. This work was supported in part by the “Competition for Research Regular Projects”, year 2023, code LPR23-09, and the “Competition for Research Assistant Funding UTEM”, year 2023, code AI23-06, Universidad Tecnológica Metropolitana (AM-B).
Modern LLMs effectively recognized functionally important regions in amino acid sequences when enhanced with appropriate techniques. RAG demonstrated that this conserved domain recognition capability is determinant for COG functional classification, achieving an F1 score of 0.258. The improvement obtained using the RAG method revealed that while providing more information enables the model to better recognize functional categories, the model exhibited uncertainty and bias toward prompt examples despite conducting correct sequence analyses. Integration of additional metadata such as "Footprint" or "Professional Analyses" could potentially enable more confident model predictions and reduce bias.
DeepSeek R1:1.5B
DeepSeek LLM:7B
Llama 3.2:1B
Llama 3.2:3B
Qwen 2.5:7B
Qwen 3:8B
Mistral 7B
Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri, "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome," Bioinformatics, vol. 37, no. 15, pp. 2112-2120, 2021.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," arXiv preprint arXiv:1706.03762, 2017.
J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives, "Language models enable zero-shot prediction of the effects of mutations on protein function," bioRxiv preprint bioRxiv:2022.07.20.500902, 2022.
Y. Gao and others, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv preprint arXiv:2312.10997, 2024.
E. J. Hu and others, "LoRA: Low-Rank Adaptation of Large Language Models," arXiv preprint arXiv:2106.09685, 2021.
T. Dettmers and others, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
Hugo Osses Prado¹ (hosses@utem.cl), Raul Caulier Cisternas² (rcaulier@utem.cl) and Ana Moya-Beltrán² (amoya@utem.cl)
¹ Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
² Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
Introduction:
Confusion Matrix:
Performance Metrics:
Conclusions:
Acknowledgments:
Methodology:
This study employs a systematic three-phase approach to evaluate LLM strategies for functional genomic annotation, as illustrated in the experimental workflow design (Figure 2). Phase 1 involves downloading the COG database, conducting exploratory data analysis, and preparing standardized datasets for machine learning applications. Phase 2 implements three distinct approaches: fine-tuning creates specialized COG models using QLoRA/LoRA techniques, prompt engineering evaluates zero-shot capabilities with balanced datasets, and the RAG system combines embedding processes with a comprehensive database for dynamic information retrieval. Phase 3 conducts systematic performance evaluation across all approaches using precision, recall, F1-score, and efficiency metrics, providing comparative analysis and evidence-based recommendations. This framework ensures robust and comparable results by maintaining consistent protocols throughout the experimental pipeline, from data preparation through model implementation to comprehensive performance assessment.
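The Phase 3 metrics can be illustrated with a minimal sketch. The poster does not say whether its F1 scores are macro-averaged or weighted, so the macro average used here is an assumption, and the function name `macro_scores` and the toy labels (COG category letters J, K, L) are hypothetical.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed labels.

    Assumption: every label counts equally (macro average); the poster
    does not state which averaging it used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy example with hypothetical COG category labels.
    truth = ["J", "J", "K", "L"]
    preds = ["J", "K", "K", "L"]
    precision, recall, f1 = macro_scores(truth, preds)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With many COG categories and imbalanced predictions, a macro average penalizes a model that concentrates on a few categories, which is one reason accuracy and F1 can rank models differently.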
Transformers Architecture:
The transformer architecture serves as the foundation for modern LLMs, as shown in the architectural diagram (Figure 1). It uses attention mechanisms to determine data importance through several key components: encoders for input processing, decoders for output generation, multi-head attention for capturing relevant information, feed-forward networks for data transformation, normalization layers for training stability, and positional encoding for token position information. Different LLM families implement distinct architectural innovations and training approaches: DeepSeek employs iterative reasoning architectures for complex problem-solving, LLaMA maintains balanced designs for general-purpose applications, Qwen features expanded transformer architectures for enhanced capacity, and Mistral specializes in advanced attention mechanisms.
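As a rough illustration of how attention "determines data importance", the sketch below implements the scaled dot-product attention from "Attention Is All You Need", softmax(QK^T / sqrt(d_k)) V, for a single head in plain Python. It is not the code of any model listed on this poster; the function names and toy vectors are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, values are lists of equal-length float vectors;
    keys and values must have the same number of rows.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[10.0, 0.0], [0.0, 10.0]]
    # The first key matches the query better, so the first value dominates.
    print(scaled_dot_product_attention(q, k, v))
```

Multi-head attention runs several such heads on different learned projections of the input and concatenates their outputs.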
Models Used:
The confusion matrices reveal distinctive bias patterns between strategies. Zero-Shot (Figure 5) and One-Shot (Figure 4) show bias toward specific categories, with One-Shot more pronounced by "sticking" to given examples. Fine-Tuning (Figure 6) exhibits concentrated patterns with marked bias due to overfitting. RAG (Figure 3) demonstrates better prediction distribution across categories, explaining its superior F1 score of 0.257 and greater generalization capacity, validating its effectiveness in reducing categorical bias for annotation.
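The kind of bias described above can be read off a confusion matrix by checking how predictions concentrate on a few categories. The following is a minimal sketch under stated assumptions: the helper names `confusion_matrix` and `predicted_share` and the toy labels are hypothetical, not taken from the poster.

```python
from collections import Counter, defaultdict


def confusion_matrix(y_true, y_pred):
    """Map each true label to a Counter of the labels predicted for it."""
    matrix = defaultdict(Counter)
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return dict(matrix)


def predicted_share(y_pred):
    """Fraction of all predictions that fall on each label.

    A share far above a label's true frequency suggests the model is
    "sticking" to that category.
    """
    counts = Counter(y_pred)
    total = len(y_pred)
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Toy example: the model over-predicts category J.
    truth = ["J", "K", "L", "M"]
    preds = ["J", "J", "J", "M"]
    for label, row in confusion_matrix(truth, preds).items():
        print(label, dict(row))
    print(predicted_share(preds))
```

In a one-shot setting, comparing the predicted share of the example's category against its share in the ground truth gives a quick numeric check of the "sticking" effect.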
Experimental evaluation revealed distinct performance patterns across four strategies. Zero-Shot and One-Shot Metrics (Figures 7 and 9) showed DeepSeek-R1 1.5B achieving the highest accuracy (6%) while Llama 3.2 models led F1 scores, with One-Shot improving over Zero-Shot. FT Metrics (Figure 8) demonstrated modest improvements, with QLoRA significantly enhancing computational efficiency (0.079). RAG Metrics (Figure 10) achieved the breakthrough with 24.5% accuracy using Llama 3.2 3B, substantially outperforming all other strategies and validating the effectiveness of combining RAG with ESM-2 embeddings.
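The retrieval step of a RAG pipeline over sequence embeddings can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the poster does not describe its retrieval method, so cosine similarity and top-k selection are assumed, the three-dimensional vectors are toy stand-ins for ESM-2 embeddings, and `retrieve`, `build_prompt`, and the prompt wording are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k index entries most similar to query_vec.

    index is a list of (embedding, cog_category, description) tuples.
    """
    ranked = sorted(index, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return ranked[:k]


def build_prompt(sequence, neighbours):
    """Assemble a prompt from the retrieved annotated examples."""
    context = "\n".join(
        f"- category {category}: {description}"
        for _, category, description in neighbours
    )
    return (
        "Retrieved annotated examples:\n"
        f"{context}\n"
        f"Sequence: {sequence}\n"
        "Answer with a single COG functional category letter."
    )


if __name__ == "__main__":
    # Toy index; real embeddings would come from a protein language model.
    index = [
        ([1.0, 0.0, 0.0], "J", "Translation, ribosomal structure and biogenesis"),
        ([0.0, 1.0, 0.0], "K", "Transcription"),
        ([0.9, 0.1, 0.0], "J", "Translation, ribosomal structure and biogenesis"),
    ]
    neighbours = retrieve([1.0, 0.05, 0.0], index, k=2)
    print(build_prompt("MKVLAA", neighbours))
```

The resulting prompt would then be passed to the LLM. Because the retrieved examples appear in the prompt, they can also steer the model toward their categories, which matches the bias toward prompt examples reported in the conclusions.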
References:
Figure 3: RAG Confusion Matrix.
Figure 4: One-Shot Confusion Matrix.
Figure 5: Zero-Shot Confusion Matrix.
Figure 6: Fine-Tuning Confusion Matrix.
Figure 8: Fine-Tuning Metrics.
Figure 9: One-Shot Metrics.
Figure 10: RAG Metrics.
Figure 2: Experimental design of the workflow.
Figure 1: Transformers Architecture.
Large Language Models (LLMs) have revolutionized natural language processing through their transformer-based architectures, demonstrating unprecedented capabilities in text generation, translation, and information retrieval. These advanced computational models are increasingly being applied to computational biology, showing significant promise in protein structure prediction, genetic variant classification, and biological sequence annotation. Functional annotation enables researchers to identify how pathogenic organisms interact with each other and with host systems, providing crucial insights for disease prevention and treatment strategies. However, the analysis of these organisms generates massive volumes of biological data that require automated processing to identify functions, roles, and characteristics efficiently. Gene functions, organized through structured databases like NCBI COG, encompass fundamental biological processes essential for understanding life, health, and disease. This research addresses the critical challenge of efficiently annotating biological sequences and predicting protein functions by leveraging the advanced language understanding capabilities of LLMs. By improving the accuracy and efficiency of functional genomic annotation, this study aims to accelerate discoveries in biotechnological and pharmaceutical research, ultimately contributing to better understanding and treatment of priority health conditions identified by global health organizations.
Figure 7: Zero-Shot Metrics.