
Building an LLM Agent for Life Sciences Literature QA and Summarization

Author: PAULRAJ, NISHANTH JOSEPH
Publisher: Zenodo
DOI: 10.5281/zenodo.17299380
Source: https://zenodo.org/records/17299380/files/WJARR-2025-1665.pdf
Corresponding author: NISHANTH JOSEPH PAULRAJ
Copyright © 2025 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
NISHANTH JOSEPH PAULRAJ *
Thermo Fisher Scientific, US.
World Journal of Advanced Research and Reviews, 2025, 26(02), 657-668
Publication history: Received on 28 March 2025; revised on 03 May 2025; accepted on 06 May 2025
Article DOI: https://doi.org/10.30574/wjarr.2025.26.2.1665
Abstract
This article explores the development of a specialized artificial intelligence agent that combines Large Language Models
(LLMs) with Retrieval-Augmented Generation (RAG) techniques to address the challenges of biomedical literature
search and synthesis. The unprecedented growth of published research in life sciences has created an information crisis
that traditional search methods cannot effectively manage. Researchers face significant challenges including
overwhelming volume, domain-specific terminology barriers, difficulty in making cross-study connections, and severe
time constraints. The proposed LLM+RAG architecture offers a comprehensive solution featuring specialized document
processing for scientific papers, biomedical-specific vector embeddings, advanced retrieval strategies, and
sophisticated reasoning capabilities. The system integrates with PubMed and other biomedical databases while
providing natural language interfaces that significantly reduce the cognitive burden on researchers. Domain-specific
optimizations such as biomedical entity recognition, relationship extraction, and specialized embeddings further
enhance performance across diverse research scenarios. Evaluation through benchmark testing, expert validation, and
citation accuracy assessment demonstrates the system's ability to provide comprehensive, accurate information while
substantially reducing literature review time. This article represents a transformative tool for biomedical researchers,
potentially revolutionizing how scientific discovery progresses in the life sciences.
Keywords: Biomedical Literature Search; Large Language Models; Retrieval-Augmented Generation; Knowledge
Graphs; Scientific Information Extraction
1. Introduction
The exponential growth of biomedical literature presents a significant challenge for researchers trying to stay current
with the latest findings in their field. According to a comprehensive analysis of PubMed growth patterns, the database
has expanded at a compound annual growth rate over the past decade, with current estimates indicating that thousands
of new papers are published daily across biomedical disciplines [1]. This phenomenal growth has created what
researchers term an "information explosion crisis" where traditional search methods increasingly fall short in
extracting precise information or identifying subtle connections across studies. This article explores the development
of a specialized AI agent that leverages Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to
revolutionize how life science researchers interact with scientific literature.
1.1. The Challenge of Biomedical Literature Search
Researchers in life sciences face several escalating challenges when searching for relevant information. The volume
overload problem has reached unprecedented levels as PubMed now indexes over thirty million citations, with the
database growing by approximately a million new papers annually. Studies examining researcher productivity have
determined that even specialized scientists can realistically read only a small fraction of newly published content in
broader fields [1]. This exponential growth creates an unbridgeable gap between available information and human
reading capacity.
The complexity is further compounded by domain-specific terminology challenges. Research into biomedical lexical
patterns has documented that biomedical literature employs highly specialized vocabulary with significant ambiguity
across sub-disciplines. Quantitative analysis of terminology distribution shows that a substantial portion of terms in
biomedical corpora are domain-specific and not found in general language datasets, creating substantial barriers for
traditional search systems. Furthermore, terminological variation across sub-domains results in the same concepts
being described using different terminology in many cases, making cross-domain information discovery particularly
challenging [2].
Table 1 Challenges in Biomedical Literature Search [2]
Challenge | Description | Impact
Volume Overload | 30+ million citations in PubMed | Impossible to comprehensively review literature
Terminology Barriers | Domain-specific vocabulary | Missed relevant papers due to terminology differences
Cross-Study Connections | Insights across sub-disciplines | Important relationships remain undiscovered
Time Constraints | Growing literature vs. limited time | Incomplete literature coverage
Semantic Limitations | Lexical rather than conceptual search | Low precision for complex queries
Cross-study connections represent another critical limitation of traditional literature review. Systematic analysis of
breakthrough discoveries in biomedicine indicates that key insights frequently emerge from connections across papers
from different sub-disciplines, publication time periods, or methodological approaches. When information is scattered
across the literature landscape, traditional search methods struggle to facilitate these connections. Information retrieval
experiments demonstrate that many potentially relevant connections remain undiscovered when using conventional
search approaches due to terminology differences, context variations, and the inability of keyword systems to recognize
conceptual similarity without lexical overlap [3].
Time constraints create perhaps the most practical limitation for researchers. The significant expansion in publication
volume has not been accompanied by corresponding increases in time available for literature review. Survey data
indicates that researchers now spend many hours weekly attempting to keep current with literature, representing an
increase since the previous decade, yet still insufficient for comprehensive coverage. Detailed productivity analysis
suggests that comprehensive literature review using traditional methods has become functionally impossible in most
biomedical domains due to the volume-time imbalance [1].
Traditional keyword-based search tools like PubMed provide access to vast repositories but lack the semantic
understanding to answer nuanced questions or summarize findings across multiple sources. Comparative studies of
information retrieval models used in search engines have documented that traditional search engines rely primarily on
lexical matching and statistical term weighting, achieving limited precision and recall on complex biomedical queries
where conceptual understanding is required. These limitations become particularly problematic when researchers
need to explore relationships between concepts, understand emerging trends, or synthesize findings across
sub-disciplines [3].
2. Solution Architecture: LLM + RAG for Biomedical Literature
Our proposed solution combines the reasoning capabilities of state-of-the-art LLMs with the factual grounding of RAG
techniques, specifically optimized for biomedical literature. This architecture addresses the fundamental limitations of
traditional search by introducing semantic understanding, cross-document reasoning, and natural language interaction
capabilities.
2.1. Document Processing Pipeline
The document processing pipeline forms the foundation of the system, establishing how biomedical literature is
transformed into a machine-accessible format. Research on LLMs for scientific literature review has demonstrated that
document processing quality significantly impacts downstream performance, with specialized scientific PDF parsing
improving information extraction compared to generic approaches. The pipeline begins with PDF extraction and
parsing designed specifically for the complex layouts common in scientific journals. Unlike general-purpose PDF
extraction, scientific paper-aware parsers can identify and correctly process multi-column layouts, embedded tables,
figures with captions, and reference sections with high structural preservation, significantly outperforming generic
extractors [4].
NLP preprocessing specifically optimized for biomedical content enables identification of domain-specific entities and
relationships with significantly higher accuracy than general NLP pipelines. Comparative evaluations show that
biomedical-specific named entity recognition achieves superior F1 scores for gene names, disease entities, and drug
names, representing a substantial improvement over general-purpose NLP systems. This specialized processing
ensures that scientific concepts are correctly preserved throughout the information retrieval process [4].
Chunking strategies optimized for scientific papers represent another crucial innovation in the pipeline. Rather than
arbitrary length-based splits that can separate related information, scientific paper-aware chunking respects the
semantic and structural boundaries of research papers. Experimental evaluations demonstrate that section-aware
chunking that preserves the integrity of abstract, methods, results, and discussion segments improves downstream
retrieval performance by maintaining the contextual coherence necessary for accurate information extraction.
Additionally, metadata extraction processes capture essential bibliographic information including authors, publication
dates, journal details, and citation networks with high accuracy, enabling sophisticated filtering and prioritization later
in the retrieval process [4].
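As an illustration only, section-aware chunking might be sketched as follows. The heading list, the `Chunk` type, and the `max_chars` limit are assumptions introduced for this example and are not details of the evaluated system; real papers use far more varied headings and would need a more robust parser.

```python
import re
from dataclasses import dataclass

# Assumed heading names; text before the first recognized heading is ignored in this sketch.
SECTION_PATTERN = re.compile(
    r"^(abstract|introduction|methods|results|discussion|conclusion)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class Chunk:
    section: str  # lower-cased heading the chunk belongs to
    text: str


def chunk_by_section(paper_text: str, max_chars: int = 1200) -> list[Chunk]:
    """Split at section headings first, then at paragraph breaks inside long sections."""
    matches = list(SECTION_PATTERN.finditer(paper_text))
    chunks: list[Chunk] = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(paper_text)
        body = paper_text[match.end():end].strip()
        section = match.group(1).lower()
        current = ""
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            # Start a new chunk when adding this paragraph would exceed the limit.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(Chunk(section, current))
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(Chunk(section, current))
    return chunks
```

Because a chunk never crosses a heading, each one can carry its section label as metadata, which is what allows later retrieval steps to target, for example, only methods or results text.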
Table 2 Core Components of LLM+RAG Architecture [4]
Component | Key Features | Primary Benefits
Document Processing | Scientific PDF parsing, Section-aware chunking | Preserves paper structure and semantic integrity
Vector Database | Biomedical embeddings, Efficient storage | Enables semantic search at scale
LLM Integration | Context management, Citation tracking | Coherent synthesis with verifiable sources
Data Sources | PubMed API, NCBI databases, Local corpus | Comprehensive coverage across sources
User Interface | Natural language queries, Interactive exploration | Reduced cognitive load, discovery-oriented
3. Vector Database
The vector database component provides the semantic search capabilities essential for concept-based rather than
keyword-based retrieval. Research on LLMs for scientific literature review has established that the quality of
embeddings significantly impacts retrieval performance, with domain-specific embeddings demonstrating substantial
advantages. Generation of biomedical-specific embeddings involves training or fine-tuning embedding models on
massive corpora of biomedical text, resulting in vector representations that more accurately capture the semantic
relationships between biomedical concepts. Comparative evaluations document that specialized biomedical
embeddings improve retrieval performance compared to general-purpose embeddings when tested on domain-specific
information retrieval tasks [4].
Efficient vector storage and search implementation is essential for real-world usability. Benchmarking studies
comparing vector database technologies including FAISS and ChromaDB show that optimized implementations can
index millions of document chunks while maintaining quick query response times. Performance evaluations with
biomedical corpora containing millions of document chunks demonstrated favorable average query times using FAISS
with optimized indexing, scaling approximately linearly with corpus size. These response times remain well within
acceptable limits for interactive use even with very large document collections [4].
Metadata filtering capabilities complement semantic search by allowing refinement based on publication details.
Integration of metadata filtering with vector similarity search enables complex queries that combine conceptual
similarity with criteria such as publication date ranges, journal reputation metrics, or author affiliations. Experimental
evaluation shows that this hybrid approach improves precision compared to either semantic or metadata filtering alone,
particularly for searches where both relevance and authority are important considerations [4].
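A minimal sketch of combining metadata filters with vector similarity is given here, assuming a small in-memory index and brute-force cosine similarity. The `IndexedChunk` fields and the `filtered_search` function are hypothetical names for this example; a production system would delegate both steps to a vector database such as FAISS or ChromaDB.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    doc_id: str
    year: int
    journal: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, index, min_year=None, journals=None, top_k=5):
    """Apply metadata filters first, then rank the surviving chunks by similarity."""
    candidates = [
        c for c in index
        if (min_year is None or c.year >= min_year)
        and (journals is None or c.journal in journals)
    ]
    scored = sorted(
        ((cosine(query_vec, c.embedding), c.doc_id) for c in candidates),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

Filtering before scoring keeps the similarity computation small; whether pre-filtering or post-filtering is preferable in practice depends on how selective the metadata criteria are.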
4. LLM Integration
The LLM integration component serves as the reasoning engine of the system, transforming retrieved information into
coherent, contextualized responses. Comprehensive evaluation of RAG systems for medical question answering has
identified several critical factors for effective biomedical LLM integration. Context window management for scientific
papers presents unique challenges due to the length and complexity of biomedical documents. Experiments with
different context window strategies found that hierarchical approaches that first process full abstracts followed by
targeted retrieval of methodology and results sections achieved optimal performance, increasing answer accuracy
compared to fixed-window approaches [5].
Prompt engineering for biomedical queries represents another crucial area of optimization. Experimental comparisons
of different prompting strategies revealed that prompts incorporating domain guidance, explicitly requesting scientific
rigor, and specifying citation requirements reduced hallucination compared to general-purpose prompts. These
specialized prompts include specific instructions regarding scientific terminology, caution regarding speculative
findings, and requirements for citation of evidence [5].
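For illustration, a prompt with these three kinds of instruction could be assembled as in the sketch below. The wording of the instructions and the `build_prompt` helper are assumptions made for this example, not the prompts used in the cited evaluations.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_id, text) pairs and a question."""
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "You are assisting a biomedical researcher.\n"
        # Domain guidance: terminology.
        "Use precise scientific terminology.\n"
        # Citation requirement: every claim must point at a retrieved source.
        "Answer only from the sources below and cite each claim as [source_id].\n"
        # Caution regarding speculative findings.
        "Flag speculative or preliminary findings explicitly.\n"
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The bracketed source identifiers are what make later citation checking possible, since each claim in the answer can be matched back to a retrieved passage.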
Output validation and citation tracking mechanisms ensure response reliability. Implementation of automated
fact-checking processes that verify claims against retrieved evidence improved response accuracy in controlled evaluations.
Additionally, comprehensive citation tracking that links specific claims to source documents achieved high traceability,
allowing users to directly verify information sources. These validation mechanisms are particularly critical in
biomedical contexts where information accuracy has potential health implications [5].
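One simple form of citation tracking is sketched below, under the assumption that answers cite sources inline as `[source_id]` before the closing punctuation of each sentence. The `check_citations` function is a hypothetical helper; it only checks that citations exist and refer to retrieved documents, and does not verify that the cited text actually supports the claim.

```python
import re

# Matches a bracketed source identifier such as [SRC-1].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Report sentences lacking a citation and citations not in the retrieved set."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited = {c for s in sentences for c in CITATION.findall(s)}
    return {
        "uncited_sentences": uncited,
        "unknown_citations": sorted(cited - retrieved_ids),
    }
```

A response with any unknown citation can be rejected or regenerated, while uncited sentences can be surfaced to the user as unverified.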
5. Data Sources
The data sources component provides the raw material that feeds the entire system. PubMed API integration gives the
system access to the most comprehensive collection of biomedical research abstracts, with coverage exceeding tens of
millions of citations across thousands of journals. Benchmark testing of API performance showed consistent retrieval
of requested records with reasonable response times for standard queries. Implementation of optimized request
management respecting the NCBI's rate limits ensures reliable access without service interruptions [1].
NCBI database connections extend the system's reach beyond PubMed to encompass specialized datasets including
genomic, protein, and chemical databases. Integration with numerous primary NCBI databases enables comprehensive
cross-domain searches that can connect literature findings with molecular data, significantly expanding the system's
utility for translational research. Performance evaluation demonstrated successful cross-database query integration in
the majority of test cases, with the remaining cases primarily involving highly specialized databases with unique access
patterns [1].
Local corpus management capabilities allow organizations to incorporate proprietary or specialized document
collections alongside public databases. Benchmarking shows that the system can process and index many scientific PDFs
per hour on standard hardware, with linear scaling on multi-core systems. Incremental indexing enables continuous
integration of new papers without system downtime, maintaining a short freshness delay between document addition
and searchability [4].
Real-time search capability combines these diverse data sources into a unified search experience. Latency
measurements under realistic load conditions demonstrate that most queries return initial results within seconds, with
complete results aggregation within seconds even for complex multi-source queries. This performance ensures the
system remains responsive enough for interactive use during research workflows [5].
6. User Interface Design
The user interface serves as the critical point of interaction between researchers and the system. Natural language query
input represents a fundamental shift from traditional structured search syntax. Comprehensive evaluation of RAG
systems for medical question answering has shown that natural language interfaces lower the cognitive burden on
researchers while improving query specificity. Usability studies demonstrate that researchers can express more
complex information needs using natural language compared to keyword-based interfaces, leading to more precise and
relevant results [5].
Citation-backed responses provide transparency and build trust in the system. Implementation of inline citation linking
that connects specific claims directly to source publications achieved higher user trust ratings than systems without
explicit citation. This feature enables researchers to efficiently verify information and explore source material in greater
depth when needed [5].
Interactive exploration features transform literature search from a one-shot query to an iterative discovery process.
Evaluation of different interaction patterns found that interfaces supporting query refinement, source exploration, and
concept visualization increased information discovery compared to static search interfaces. Specific implementations
include suggested query refinements, interactive citation networks, and concept relationship visualizations that help
researchers identify unexpected connections across the literature landscape [5].
7. Advanced Features and Optimization
Domain-specific optimizations significantly enhance system performance for biomedical applications. The integration
of specialized biomedical entity recognition models that have been specifically trained on large corpora of biomedical
text provides a substantial advantage for domain-specific questions. Comparative evaluation against general entity
recognition shows that biomedical-specific models achieve higher entity recognition F1 scores compared to general
models when tested on specialized terminology from genomics, pharmacology, and clinical domains [4].
Relationship extraction between biomedical entities enables the system to answer complex questions about
interactions and mechanisms. Implementation of relation extraction models trained on biomedical literature achieves
good precision and recall for identifying relationships like protein-protein interactions, drug-target relationships, and
gene-disease associations. These capabilities allow the system to synthesize information across multiple papers and
identify implicit connections that would be difficult to discover through traditional search methods [4].
Specialized embeddings tailored to biomedical content further enhance retrieval performance. Benchmark comparison
of biomedical embeddings trained on PubMed abstracts against general embeddings shows improvement in retrieval
precision for domain-specific queries. This performance advantage stems from the embeddings' ability to capture subtle
semantic relationships between biomedical concepts that may not be apparent in general language models [4].
Query decomposition techniques address the complexity of biomedical research questions. Analysis of researcher
questions shows that typical biomedical queries implicitly contain multiple sub-questions requiring different types of
information. Implementation of automatic query decomposition that breaks complex questions into simpler
components improved answer completeness in controlled evaluations by ensuring that each aspect of a multi-part
question receives appropriate attention [5].
Hybrid search strategies combining semantic and keyword approaches provide more robust retrieval. Performance
analysis shows that hybrid approaches achieve higher F1 scores compared to semantic-only and keyword-only
approaches when evaluated on complex biomedical queries. This approach combines the strengths of both methods,
using semantic search to understand concepts while leveraging keywords for precision on specific terms [5].
Caching and incremental indexing optimizations ensure system responsiveness and freshness. Implementation of
multi-level caching reduces response time for repeated or similar queries while incremental indexing ensures that new
literature is available for search within hours after publication. These optimizations balance performance with
up-to-date results, critical for researchers working in rapidly evolving fields [5].
8. Evaluation and Benchmarking
Comprehensive evaluation has been conducted to ensure the system provides accurate and reliable information.
Benchmark testing against standardized datasets including BioASQ, a community-organized challenge for biomedical
semantic indexing and question answering, demonstrates the system's capabilities in a controlled environment.
Performance analysis shows that the integrated LLM+RAG approach achieves good accuracy on factoid questions, list
questions, and yes/no questions from the BioASQ dataset, representing an improvement over previous state-of-the-art
approaches [5].
Expert validation provides real-world confirmation of the system's utility. Blind testing with domain experts evaluated
the quality of system responses against the same questions answered through traditional literature review. Results
show that the system provided answers rated as "equally or more comprehensive" than manual search in most cases
while requiring substantially less researcher time. Furthermore, the system identified relevant papers missed by
experts in many test cases, highlighting its ability to find connections across different terminology or domains [5].
Citation accuracy evaluation confirms the system's reliability for academic use. Detailed analysis of system-generated
citations found that the vast majority of provided citations directly supported the associated claims, with a small
percentage providing partial support, and only a tiny fraction being irrelevant or misleading. This level of citation
accuracy approaches that of human literature reviews and provides the verification capability essential for scientific
work [5].
9. Advanced Features and Optimizations for Life Sciences Literature QA Systems
9.1. Domain-Specific Optimizations
9.1.1. Biomedical Entity Recognition
Biomedical entity recognition represents a cornerstone capability for effective literature understanding. The integration
of specialized biomedical language models has transformed the accuracy of entity extraction from scientific text.
Research on BioBERT demonstrates substantial performance improvements across multiple biomedical NER tasks,
with the model achieving significantly higher F1 scores on the NCBI disease corpus, representing a marked
improvement over vanilla BERT. When evaluated on the BC4CHEMD dataset for chemical entity recognition, BioBERT
reached impressive F1 scores, while demonstrating strong performance on the BC2GM corpus for gene mentions [6].
These specialized language models benefit from pretraining on massive biomedical text collections including billions of
words from PubMed abstracts and PMC full-text articles, enabling them to develop nuanced representations of
domain-specific terminology that general language models typically misinterpret [6]. The performance improvements are
particularly notable for rare entities and complex nomenclature patterns common in genomics and molecular biology
literature.
The mapping of recognized entities to standardized ontologies further enhances system capabilities by normalizing
terminology across publications. Biomedical literature is characterized by high terminological variability, with the same
concepts often described using different terms across sub-disciplines and time periods. Integration with resources like
the Unified Medical Language System (UMLS) enables the resolution of terms to canonical concepts, with studies
showing that ontology mapping improves downstream query understanding by allowing systems to recognize that
different surface forms refer to the same underlying concept. Information retrieval evaluations demonstrate that
ontology-enhanced systems improve mean average precision for complex biomedical information needs compared to
systems without concept normalization [7]. The integration of Gene Ontology annotations is particularly valuable for
genomics-related queries, as the hierarchical structure allows systems to correctly handle both specific gene mentions
and broader functional gene categories, addressing a significant challenge in biomedical information retrieval [7].
Table 3 Domain-Specific Optimizations [7]
Optimization | Implementation | Performance Impact
Entity Recognition | BioBERT or similar models | Improved identification of biomedical entities
Ontology Mapping | UMLS, Gene Ontology integration | Better terminology normalization
Relationship Extraction | Biomedical relation models | Enhanced complex question answering
Knowledge Graphs | Structured relationship representation | Support for multi-hop reasoning
Specialized Embeddings | Domain-specific vector models | Better semantic matching
10. Relationship Extraction
Identifying relationships between biomedical entities enables sophisticated question answering beyond simple entity
recognition. Relationship extraction models specialized for biomedical literature can identify complex associations
between genes, diseases, drugs, and biological processes that form the foundation of mechanistic understanding in
biomedical science. Evaluations on standard benchmarks such as the chemical-protein interaction corpus from
BioCreative VI demonstrate that domain-specialized models achieve significantly higher macro F1 scores compared to
general language models [6]. The performance advantage stems from the models' ability to recognize domain-specific
interaction patterns and contextual cues that indicate relationships such as inhibition, activation, transport, metabolism,
and binding interactions that are central to biomedical research questions.
Knowledge graph construction based on extracted relationships provides an additional layer of reasoning capability. By
transforming extracted relationships into a structured graph representation, systems can perform multi-hop reasoning
that identifies connections not explicitly stated in any single document. Biomedical literature is particularly amenable
to knowledge graph approaches due to the highly structured nature of biomedical relationships and the importance of
indirect associations in understanding complex biological systems. Evaluations of graph-based biomedical reasoning
systems show significant improvements in answering complex questions requiring the integration of information
across multiple publications [7]. The ability to traverse relationship paths enables systems to discover potential
connections that would be difficult to identify through traditional search methods, such as identifying potential
off-target effects of drugs or unexpected interactions between biological pathways described in separate literature streams.
These capabilities are particularly valuable for discovery-oriented research questions where the goal is to generate
novel hypotheses rather than simply retrieve known information [7].
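Multi-hop reasoning over extracted relationships can be sketched as a breadth-first search over (subject, relation, object) triples. The entity names in the usage note are invented placeholders, and `build_graph` and `find_path` are hypothetical helpers for this example; a real knowledge graph would also track provenance and confidence for each edge.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Index (subject, relation, object) triples by subject."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def find_path(graph, start, goal, max_hops=3):
    """Return the shortest chain of triples linking start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for rel, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

For example, with triples stating that a hypothetical `drugA` inhibits `proteinB`, `proteinB` regulates `pathwayC`, and `pathwayC` is associated with `diseaseD`, `find_path(graph, "drugA", "diseaseD")` returns the three-step chain even though no single triple links the drug to the disease.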
10.1. Specialized Embeddings
Domain-specific embeddings provide a critical foundation for semantic search in specialized domains. The unique
linguistic characteristics of biomedical literature, including extensive use of specialized terminology, abbreviations,
and complex naming conventions, create challenges for general-purpose embedding models. Domain-specific
embedding models trained on large biomedical corpora demonstrate substantial improvements in capturing
meaningful semantic relationships between biomedical concepts. Comparative evaluations show that embeddings
trained specifically on biomedical literature outperform general embeddings on medical word similarity benchmarks
and on medical concept relatedness tasks [8]. These performance advantages stem from the specialized models' ability
to accurately capture the domain-specific meanings of terms that often have different connotations in general language
contexts or highly specific technical meanings within biomedicine.
Specialized biomedical embedding models like BioWordVec and BioSentVec offer particular advantages for literature
understanding tasks. These models, trained on combinations of PubMed abstracts, full-text articles, and clinical notes,
demonstrate improved performance on tasks requiring nuanced understanding of biomedical terminology. Evaluation
on the UMNSRS medical concept similarity and relatedness benchmarks shows that specialized biomedical embeddings
achieve strong correlation scores with expert judgments, representing a significant improvement over general language
embeddings [8]. The embeddings' ability to capture subtle relationships between technical terms enables more effective
retrieval for complex queries, particularly for concepts that may be described using different terminology across
sub-disciplines or research communities. For emerging research areas where terminology is still evolving, domain-specific
embeddings demonstrate particular advantages in connecting conceptually related work despite lexical variation [8].
11. Performance Enhancements
11.1. Query Decomposition
Complex biomedical research questions often encompass multiple sub-questions requiring different types of
information. Systematic analysis of researcher information needs shows that biomedical queries typically contain
multiple implicit aspects, with research questions frequently combining inquiries about mechanisms, comparisons,
temporal developments, and contextual factors. Decomposition approaches that automatically identify these
constituent components and generate targeted sub-queries significantly improve retrieval performance. Experiments
with query decomposition for systematic literature review automation demonstrated that breaking complex research
questions into constituent parts improved recall while maintaining or improving precision [8]. The approach is
particularly effective for broad research questions spanning multiple aspects of a topic, such as queries investigating
both clinical and molecular aspects of disease mechanisms.
The effectiveness of query decomposition varies by question type, with the largest gains observed for complex questions
spanning multiple biomedical subdomains. For questions requiring integration of clinical and molecular information,
decomposition into domain-specific sub-queries followed by integrated recomposition improved F1 scores compared
to direct single-query approaches [8]. This improvement stems from the ability to optimize retrieval strategies for
different types of information needs, using different retrieval parameters, document sections, or source databases for
each sub-query. The recomposition process that integrates findings from individual sub-queries requires sophisticated
summarization capabilities to maintain coherence and resolve potential contradictions between results from different
information sources. Evaluations of recomposition approaches show that neural summarization methods achieved
higher coherence ratings compared to simple aggregation of sub-query results [8].
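The retrieval side of decomposition and recomposition can be sketched as follows, assuming the sub-queries have already been produced (for instance by an LLM) and that a `retrieve` callable returns scored document IDs. Both the callable and the merging rule are assumptions for this example; this is the simple aggregation baseline, not the neural summarization step described in the text.

```python
from typing import Callable

# A retriever maps a query string to a list of (doc_id, score) pairs.
Retriever = Callable[[str], list[tuple[str, float]]]


def retrieve_decomposed(sub_queries: list[str], retrieve: Retriever, top_k: int = 10):
    """Run each sub-query separately and merge results, de-duplicating by document."""
    merged: dict[str, dict] = {}
    for sub_query in sub_queries:
        for doc_id, score in retrieve(sub_query):
            entry = merged.setdefault(doc_id, {"score": score, "matched": []})
            entry["score"] = max(entry["score"], score)  # keep the best score seen
            entry["matched"].append(sub_query)           # record which aspects it covers
    # Prefer documents that answer more sub-queries, then higher scores, then ID for stability.
    ranked = sorted(
        merged.items(),
        key=lambda kv: (-len(kv[1]["matched"]), -kv[1]["score"], kv[0]),
    )
    return ranked[:top_k]
```

Recording which sub-queries each document matched gives the downstream summarization step the information it needs to check that every aspect of the original question is covered.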
11.2. Hybrid Search Strategies
Combining semantic search with traditional lexical approaches creates more robust retrieval systems. Biomedical
literature retrieval presents particular challenges due to the complex terminology, inconsistent naming conventions,
and conceptual complexity of the domain. Semantic search excels at finding conceptually related content despite
terminological differences but may lack precision for specific technical terms. Keyword-based approaches provide high
precision for exact terminology but miss conceptually relevant content described using different terms. Hybrid
architectures that integrate both approaches leverage their complementary strengths. Evaluations on the TREC
Precision Medicine track demonstrate that hybrid systems incorporating both semantic and keyword components
achieved higher mean average precision compared to semantic-only and keyword-only approaches [7]. The
performance advantage of hybrid approaches is particularly pronounced for queries requiring both conceptual
understanding and terminology precision, such as searches for specific mechanisms or interventions related to broader
disease categories.
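The article does not state how the semantic and keyword result lists are combined. One common option, used here purely as an illustrative stand-in, is reciprocal rank fusion (RRF), which scores each document by the sum of 1/(k + rank) over the lists in which it appears. The constant `k = 60` is a conventional default and an assumption of this sketch.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list using RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only rank positions, it avoids having to put cosine similarities and keyword scores (such as BM25) on a common scale, which is one reason it is a frequent first choice for hybrid retrieval.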
Re-ranking strategies further enhance retrieval quality by applying multi-factor assessment to initial search results.
After retrieving a candidate set of documents, re-ranking algorithms can incorporate factors beyond direct query
relevance, including citation impact, recency, methodology quality, and source credibility. Evaluations of biomedical
information retrieval systems implementing re-ranking strategies show precision improvements at top rank positions
compared to systems without re-ranking [7]. These improvements are particularly valuable in biomedical contexts
where the quality and credibility of information sources significantly impact their utility for research purposes. For
time-sensitive queries in rapidly evolving research areas, temporal re-ranking strategies that balance relevance with
recency ensure that results reflect the current state of knowledge while maintaining relevance to the original query [7].
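A multi-factor re-ranker of this kind can be sketched as a weighted sum. The weights, the five-year recency half-life, and the candidate fields (`relevance`, `citations`, `year`) below are illustrative assumptions, not values from the article; methodology quality and source credibility are omitted because they would need their own scoring models.

```python
import math


def rerank(candidates, current_year, weights=(0.6, 0.2, 0.2), half_life_years=5.0):
    """Re-order candidates by a weighted mix of relevance, citation impact, and recency."""
    w_rel, w_cit, w_rec = weights
    max_cites = max((c["citations"] for c in candidates), default=0)

    def score(c):
        # Log-scaled citation count, normalized to [0, 1] within the candidate set.
        citation = math.log1p(c["citations"]) / math.log1p(max_cites) if max_cites else 0.0
        # Exponential decay: the recency score halves every `half_life_years`.
        age = max(current_year - c["year"], 0)
        recency = 0.5 ** (age / half_life_years)
        return w_rel * c["relevance"] + w_cit * citation + w_rec * recency

    return sorted(candidates, key=score, reverse=True)


candidates = [
    {"id": "A", "relevance": 0.80, "citations": 500, "year": 2010},
    {"id": "B", "relevance": 0.78, "citations": 40, "year": 2024},
]
reranked = rerank(candidates, current_year=2025)
```

In this example the recent paper `B` overtakes the slightly more relevant but older paper `A`; setting the weights to `(1.0, 0.0, 0.0)` restores the pure relevance order.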
11.3. Caching and Indexing
Efficient caching and indexing strategies are essential for system responsiveness and scalability when working with
large biomedical literature collections. Real-world deployment analysis shows that biomedical question answering
systems typically experience clustered query patterns, with researchers often exploring related questions within
research sessions. Implementing multi-level caching that preserves both exact and semantically similar previous
queries can substantially improve system performance. Performance benchmarks of caching implementations for
biomedical literature systems show significant response time reductions for queries with semantic similarity to
previously processed questions [8]. Semantic caching strategies that recognize when a new query is conceptually
similar to a cached query, even if lexically different, extend these performance benefits beyond exact matches, providing
response time improvements even for novel but related questions.
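The two cache levels described above can be sketched as an exact-match dictionary backed by a similarity search over stored query vectors. Everything here is a hypothetical illustration: `embed` is a toy bag-of-words stand-in for a real biomedical embedding model (so it only approximates semantic similarity through word overlap), and the `0.8` threshold is an arbitrary assumption.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lower-cased bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.exact = {}     # level 1: normalized query text -> answer
        self.entries = []   # level 2: (query vector, answer) pairs

    @staticmethod
    def _normalize(query):
        return " ".join(query.lower().split())

    def put(self, query, answer):
        self.exact[self._normalize(query)] = answer
        self.entries.append((embed(query), answer))

    def get(self, query):
        """Return (answer, level), where level is 'exact', 'semantic', or 'miss'."""
        key = self._normalize(query)
        if key in self.exact:
            return self.exact[key], "exact"
        vector = embed(query)
        best = max(self.entries, key=lambda entry: cosine(vector, entry[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"
```

A production system would replace the linear scan over `entries` with an approximate nearest-neighbor index and would need an eviction policy; both are left out to keep the sketch short.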
Incremental indexing ensures system freshness without complete reprocessing of the document collection whenever
new publications become available. In biomedical domains where research progresses rapidly, index freshness is
particularly important for providing current information. Evaluations of incremental indexing approaches for
biomedical literature demonstrate the ability to process and integrate many new papers per hour while maintaining
index coherence [8]. These approaches enable continuous literature monitoring with new publications becoming
searchable within hours of publication. Priority-based indexing strategies that fast-track processing of high-impact
journals or papers in active research areas further improve effective freshness for the most relevant content. Combined
with efficient document processing pipelines, these approaches enable systems to maintain comprehensive coverage of
the literature while ensuring that researchers have access to the latest findings relevant to their questions [8].
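Incremental, priority-based indexing can be illustrated with a small inverted index fed from a priority queue. The class and method names below are hypothetical, and the integer `priority` stands in for whatever signal (journal impact, topic activity) a real system would use; this is a sketch of the idea, not the article's pipeline.

```python
import heapq
from collections import defaultdict


class IncrementalIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> IDs of documents containing it
        self.pending = []                  # heap of (-priority, arrival order, doc_id, text)
        self.arrivals = 0

    def submit(self, doc_id, text, priority=0):
        """Queue a new publication; higher priority is indexed sooner."""
        heapq.heappush(self.pending, (-priority, self.arrivals, doc_id, text))
        self.arrivals += 1

    def process(self, batch_size):
        """Index up to `batch_size` queued documents without touching existing postings."""
        processed = []
        while self.pending and len(processed) < batch_size:
            _, _, doc_id, text = heapq.heappop(self.pending)
            for term in set(text.lower().split()):
                self.postings[term].add(doc_id)
            processed.append(doc_id)
        return processed

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))
```

Only the postings of newly processed documents change, so earlier documents stay searchable throughout; the arrival counter keeps equal-priority papers in first-come order.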
12. Evaluation and Benchmarking
12.1. Benchmark Datasets
Rigorous evaluation against standardized benchmarks provides quantitative performance assessment. The BioASQ
challenge, a community-organized competition for biomedical semantic indexing and question answering, offers a
comprehensive evaluation framework with questions developed by biomedical experts and multiple reference answers
for fair assessment. Performance analysis on the BioASQ dataset shows that state-of-the-art biomedical question
answering systems achieve good accuracy rates on factoid questions, list-type questions, and yes/no questions [6].
These benchmarks enable consistent comparison across system versions and against alternative approaches. The
structured nature of the BioASQ tasks, which separate questions into different types (factoid, list, yes/no, and
summary), allows for detailed performance analysis across different question categories and identification of specific
strengths and weaknesses in system capabilities.
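Per-type scoring for the question categories above can be sketched with three standard metrics: accuracy for yes/no questions, mean reciprocal rank for ranked factoid answers, and F1 for list answers. These are metrics commonly used for such question types; the function names are illustrative and the code is not the official BioASQ evaluation tooling.

```python
def yesno_accuracy(predictions, golds):
    """Fraction of yes/no questions answered correctly."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds) if golds else 0.0


def mean_reciprocal_rank(ranked_predictions, gold_sets):
    """Average of 1/rank of the first correct answer per factoid question (0 if none)."""
    total = 0.0
    for ranked, gold in zip(ranked_predictions, gold_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold:
                total += 1.0 / rank
                break
    return total / len(gold_sets) if gold_sets else 0.0


def list_f1(predicted, gold):
    """F1 between a predicted set of list items and the gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Summary questions are not covered here because they are typically scored with text-overlap measures and manual review rather than exact matching.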
Performance on benchmark datasets reveals important patterns in system capabilities. Analysis by question type shows
that current systems typically perform better on factoid questions requiring specific entity identification than on
questions requiring complex reasoning or synthesis across multiple sources [6]. Similarly, performance varies by
biomedical subdomain, with higher accuracy typically observed for well-established research areas with standardized
terminology compared to emerging fields with evolving language patterns. These patterns help identify priority areas
for future development and provide realistic expectations for system performance in different scenarios. Longitudinal
evaluation across multiple system generations demonstrates consistent improvement trends, with the integration of
more sophisticated language models and retrieval techniques yielding measurable performance gains on standardized
benchmarks [6].
12.2. Expert Validation
While benchmark performance provides standardized metrics, expert validation assesses real-world utility. Controlled
studies involving biomedical domain experts provide crucial insights into practical system effectiveness across different
research scenarios. Expert evaluations comparing system-generated answers against traditional literature review
results show that retrieval-augmented generation systems for biomedical literature can achieve "satisfactory" or
"highly satisfactory" ratings from domain experts in most test cases while requiring significantly less researcher time
[7]. These evaluation protocols typically involve blinded assessment where experts evaluate system responses without
knowing their source, followed by comparative analysis of time efficiency and information completeness between
system-assisted and traditional approaches.
Expert validation reveals particular strengths in certain use cases. For questions requiring integration across multiple
subdisciplines, systems leveraging domain-specific optimizations demonstrate particular advantages due to their
ability to bridge terminology differences and identify conceptual connections across disciplinary boundaries. Similarly,
for emerging research areas where terminology is still evolving, systems using adaptive language models and
specialized embeddings show significant advantages in retrieving relevant information despite inconsistent
terminology [7]. These real-world performance advantages translate to measurable time savings, with expert
evaluations indicating that literature review tasks that typically require many hours of researcher time can often be
completed in significantly less time with system assistance while maintaining or improving information discovery [7].
12.3. Citation Accuracy
Citation accuracy provides a critical measure of system reliability for academic use. The ability to ground generated
responses in specific publications with appropriate citations is essential for scientific credibility and verification.
Detailed evaluation of citation practices in biomedical question answering systems shows that mature implementations
can achieve correct source attribution for the vast majority of factual claims, with higher accuracy for established
knowledge and somewhat lower accuracy for emerging or contested topics [7]. This level of citation accuracy
approaches that of human literature reviews and provides the verification capability essential for scientific work.
Citation accuracy evaluation typically involves tracing system-provided citations to source documents and verifying
that the cited sources actually contain the information attributed to them, a process that requires domain expertise
and careful assessment.
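Once experts have traced each citation to its source, the results can be summarized with two simple ratios. The definitions and field names below (`cited`, `verified`) are assumptions made for illustration, since the article does not specify how its accuracy figure is computed, and the `PMID:` identifiers are placeholders.

```python
def citation_accuracy(judgments):
    """Summarize expert verification of citations.

    Each judgment describes one factual claim:
      cited    - set of source IDs the system cited for the claim
      verified - subset of those sources that an expert confirmed support it
    Returns (citation_level, claim_level):
      citation_level - share of all citations that were verified
      claim_level    - share of claims backed by at least one verified citation
    """
    total_citations = sum(len(j["cited"]) for j in judgments)
    verified_citations = sum(len(j["cited"] & j["verified"]) for j in judgments)
    supported_claims = sum(1 for j in judgments if j["cited"] & j["verified"])
    citation_level = verified_citations / total_citations if total_citations else 0.0
    claim_level = supported_claims / len(judgments) if judgments else 0.0
    return citation_level, claim_level


judgments = [
    {"cited": {"PMID:1", "PMID:2"}, "verified": {"PMID:1", "PMID:2"}},
    {"cited": {"PMID:3"}, "verified": set()},            # cited source lacks the claim
    {"cited": set(), "verified": set()},                 # claim made without a citation
    {"cited": {"PMID:4", "PMID:5"}, "verified": {"PMID:4"}},
]
citation_level, claim_level = citation_accuracy(judgments)
```

Reporting both ratios separates two failure modes: citations that do not support the claim lower the citation-level score, while uncited claims lower only the claim-level score.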
The distribution of citation patterns across different types of content provides additional insights into system
capabilities. Analysis of citation behavior shows that well-designed systems exhibit appropriate variation in citation
density based on claim type, providing multiple supporting citations for broad claims about established knowledge
while using more specific citations for detailed methodological claims or recent findings [7]. These patterns mirror