Application of Language Models for the Functional Annotation of Conserved Domains in Biological Data

Author: Hugo, Osses Prado; Raul, Caulier Cisternas; Ana, Moya-Beltrán

Publisher: Zenodo

DOI: 10.5281/zenodo.17307097

Source: https://zenodo.org/records/17307097/files/poster.pdf

Acknowledgments:
Laboratorio de Investigación Aplicada, Departamento de Informática y Computación, UTEM; Escuela de Informática, UTEM. This work was supported in part by the “Competition for Research Regular Projects”, year 2023, code LPR23-09, and the “Competition for Research Assistant Funding UTEM”, year 2023, code AI23-06, Universidad Tecnológica Metropolitana (AM-B).
Conclusions:
Modern LLMs effectively recognized functionally important regions in amino acid sequences when enhanced with appropriate techniques. RAG demonstrated that this conserved domain recognition capability is a determining factor for COG functional classification, achieving an F1 score of 0.258. The improvement obtained using the RAG method revealed that, while providing more information enables the model to better recognize functional categories, the model exhibited uncertainty and bias toward prompt examples despite conducting correct sequence analyses. Integration of additional metadata such as "Footprint" or "Professional Analyses" could potentially enable more confident model predictions and reduce bias.
Models Used:
DeepSeek R1:1.5B
DeepSeek LLM:7B
Llama 3.2:1B
Llama 3.2:3B
Qwen 2.5:7B
Qwen 3:8B
Mistral 7B
Hugo Osses Prado (1) (hosses@utem.cl), Raul Caulier Cisternas (2) (rcaulier@utem.cl) and Ana Moya-Beltrán (2) (amoya@utem.cl)
(1) Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
(2) Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
Introduction:
Confusion Matrix:
Performance Metrics:
Conclusions:
Acknowledgments:
Methodology:
This study employs a systematic three-phase approach to evaluate LLM strategies for functional genomic annotation, as illustrated in the experimental workflow design (Figure 2). Phase 1 involves downloading the COG database, conducting exploratory data analysis, and preparing standardized datasets for machine learning applications. Phase 2 implements three distinct approaches: fine-tuning creates specialized COG models using QLoRA/LoRA techniques, prompt engineering evaluates zero-shot capabilities with balanced datasets, and the RAG system combines embedding processes with a comprehensive database for dynamic information retrieval. Phase 3 conducts systematic performance evaluation across all approaches using precision, recall, F1-score, and efficiency metrics, providing comparative analysis and evidence-based recommendations. This framework ensures robust and comparable results by maintaining consistent protocols throughout the experimental pipeline, from data preparation through model implementation to comprehensive performance assessment.
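The retrieval step of the RAG approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's implementation: the toy 3-dimensional vectors stand in for real sequence embeddings, and the names `cosine`, `retrieve`, `build_prompt`, the `INDEX` entries, and the prompt wording are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k index entries most similar to the query embedding."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(sequence, neighbours):
    """Assemble an LLM prompt from the query sequence and retrieved context."""
    context = "\n".join(
        f"- sequence {n['id']} has COG category {n['category']}" for n in neighbours
    )
    return (
        "Similar annotated sequences:\n"
        f"{context}\n"
        f"Assign one COG functional category to: {sequence}"
    )

# Hypothetical toy index; real vectors would come from an embedding model.
INDEX = [
    {"id": "p1", "category": "J", "vec": [0.9, 0.1, 0.0]},
    {"id": "p2", "category": "K", "vec": [0.1, 0.9, 0.0]},
    {"id": "p3", "category": "J", "vec": [0.8, 0.2, 0.1]},
]

hits = retrieve([1.0, 0.0, 0.0], INDEX, k=2)
prompt = build_prompt("MKTAYIAKQR", hits)
```

Here the two nearest neighbours both carry category J, so the prompt gives the model consistent context. In a real system the exhaustive scan would likely be replaced by a vector database lookup.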
Transformers Architecture:
The transformer architecture serves as the foundation of modern LLMs, as shown in the architectural diagram (Figure 1). It uses attention mechanisms to determine data importance through key components: encoders for input processing, decoders for output generation, multi-head attention for capturing relevant information, feed-forward networks for data transformation, normalization layers for training stability, and positional encoding for token position information. Different LLM families implement distinct architectural innovations and training approaches: DeepSeek employs iterative reasoning architectures for complex problem-solving, LLaMA maintains balanced designs for general-purpose applications, Qwen features expanded transformer architectures for enhanced capacity, and Mistral specializes in advanced attention mechanisms.
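The attention mechanism at the core of this architecture can be illustrated with single-head scaled dot-product attention, softmax(QK^T / sqrt(d)) V. The sketch below is a simplification using plain Python lists and toy values: one head, no masking, and no learned projections. The function names are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output vector per query.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Toy input: two tokens with 2-dimensional representations.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a weighted average of the value vectors, pulled toward the value whose key best matches the query.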
Models Used:
Confusion Matrix:
The confusion matrices reveal distinctive bias patterns between strategies. Zero-Shot (Figure 5) and One-Shot (Figure 4) show bias toward specific categories, with One-Shot more pronounced by "sticking" to given examples. Fine-Tuning (Figure 6) exhibits concentrated patterns with marked bias due to overfitting. RAG (Figure 3) demonstrates better prediction distribution across categories, explaining its superior F1 score of 0.257 and greater generalization capacity, validating its effectiveness in reducing categorical bias for annotation.
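As a rough illustration of how such bias shows up numerically, the sketch below builds a confusion matrix from toy labels and computes the share of predictions landing in each category. The labels and data are hypothetical and the helper names are not from the poster. A column that absorbs a disproportionate share of predictions indicates bias toward that category.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true categories, columns are predicted categories."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def prediction_share(matrix):
    """Fraction of all predictions that fall in each predicted column."""
    total = sum(sum(row) for row in matrix)
    return [sum(row[j] for row in matrix) / total for j in range(len(matrix))]

# Hypothetical toy data: three categories, two true examples each.
labels = ["J", "K", "L"]
y_true = ["J", "J", "K", "K", "L", "L"]
y_pred = ["J", "J", "J", "K", "J", "L"]

cm = confusion_matrix(y_true, y_pred, labels)
share = prediction_share(cm)
```

In this toy case category J receives four of the six predictions although it accounts for only two of the true labels, which is the kind of concentration described above.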
Performance Metrics:
Experimental evaluation revealed distinct performance patterns across the four strategies. Zero-Shot and One-Shot Metrics (Figures 7 and 9) showed Deepseek-R1 1.5b achieving the highest accuracy (6%) while Llama3.2 models led F1 scores, with One-Shot improving over Zero-Shot. FT Metrics (Figure 8) demonstrated modest improvements, with QLoRA significantly enhancing computational efficiency (0.079). RAG Metrics (Figure 10) achieved the breakthrough with 24.5% accuracy using Llama3.2 3b, substantially outperforming all other strategies and validating the effectiveness of combining RAG with ESM-2 embeddings.
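For reference, accuracy, precision, recall, and F1 can be computed from predictions as in the sketch below. It assumes macro averaging across categories, which the poster does not specify. The toy labels and function names are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one category (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-category F1 (assumed averaging scheme)."""
    return sum(per_class_scores(y_true, y_pred, l)[2] for l in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data.
labels = ["J", "K", "L"]
y_true = ["J", "J", "K", "K", "L", "L"]
y_pred = ["J", "J", "J", "K", "J", "L"]

precision_j, recall_j, f1_j = per_class_scores(y_true, y_pred, "J")
score = macro_f1(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
```

Macro averaging weights every category equally, so a model that predicts only a few frequent categories is penalized more heavily than it would be under accuracy alone.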
Figure 1: Transformers Architecture
Figure 2: Experimental design of the Workflow
Figure 3: RAG Confusion Matrix
Figure 4: One-Shot Confusion Matrix
Figure 5: Zero-Shot Confusion Matrix
Figure 6: Fine-Tuning Confusion Matrix
Figure 8: Fine-Tuning Metrics.
Figure 9: One-Shot Metrics.
Figure 10: RAG Metrics.
Introduction:
Large Language Models (LLMs) have revolutionized natural language processing through their transformer-based architectures, demonstrating unprecedented capabilities in text generation, translation, and information retrieval. These advanced computational models are increasingly being applied to computational biology, showing significant promise in protein structure prediction, genetic variant classification, and biological sequence annotation. Functional annotation enables researchers to identify how these organisms interact with each other and with host systems, providing crucial insights for disease prevention and treatment strategies. However, the analysis of pathogenic organisms generates massive volumes of biological data that require automated processing to identify functions, roles, and characteristics efficiently. Gene functions, organized through structured databases like NCBI COG, encompass fundamental biological processes essential to understanding life, health, and disease. This research addresses the critical challenge of efficiently annotating biological sequences and predicting protein functions by leveraging the advanced language understanding capabilities of LLMs. By improving the accuracy and efficiency of functional genomic annotation, this study aims to accelerate discoveries in biotechnological and pharmaceutical research, ultimately contributing to better understanding and treatment of priority health conditions identified by global health organizations.
Figure 7: Zero-Shot Metrics.
