Building a System for Text Information Extraction from Georock Research Papers
Mohd Uwaish
University of Göttingen
Abstract: Efficient retrieval of geochemical data remains challenging due to its heterogeneous nature, encompassing elemental compositions, isotopic analyses, mineralogical datasets, and computational models. These datasets require specialized contextual understanding, such as normalization in rare earth element (REE) analysis or mass spectrometry protocols for stable isotope interpretation. Traditional methods struggle with semantic variability, unstructured formats, and context-aware extraction, limiting efficient knowledge retrieval.
This study introduces Hybrid Retrieval-Augmented Generation (HybridRAG) to enhance knowledge extraction from Georock research papers. HybridRAG integrates VectorRAG, leveraging dense embeddings for semantic retrieval, and GraphRAG, utilizing entity-relationship modeling for structured knowledge extraction. A query stratification mechanism classifies queries into explicit fact retrieval, implicit reasoning, and rationale-driven explanations, dynamically applying retrieval strategies to improve contextual relevance.
HybridRAG is evaluated against a baseline RAG system that employs a vector similarity search retrieval mechanism. Performance assessment is conducted using Response Quality Metrics, including Factual Correctness, Semantic Similarity, BLEU, and ROUGE, which help determine the accuracy, coherence, and relevance of generated responses. These metrics identify how well HybridRAG aligns with reference answers, preserves semantic meaning, and maintains textual similarity. Retrieval-Based Metrics, such as Context Precision, Context Recall, Entity Recall, Noise Sensitivity, and Faithfulness, measure the system's ability to retrieve complete and relevant knowledge while minimizing retrieval errors and ensuring consistency between retrieved contexts and generated outputs. These evaluations provide insights into HybridRAG's effectiveness in retrieving high-quality information and reducing misinformation.
Results demonstrate that HybridRAG outperforms baseline RAG in complex queries by leveraging structured knowledge retrieval. It retrieves more precise and relevant contexts while reducing retrieval noise, particularly excelling in implicit and rationale-driven queries where structured knowledge enhances response quality. These findings highlight the advantages of hybrid retrieval strategies in geochemical data extraction. By combining vector- and graph-based retrieval, HybridRAG offers a more adaptive and accurate approach to information retrieval in scientific literature. Future work may refine retrieval selection, integrate geoscientific ontologies, and optimize retrieval fusion techniques for improved performance.
Keywords: Retrieval-Augmented Generation (RAG), HybridRAG, VectorRAG, GraphRAG, Knowledge Extraction, Information Retrieval, Large Language Models (LLMs), Georock Research Papers, Scientific Text Processing, Semantic Retrieval, Knowledge Graphs, Query Stratification, Context-Aware Retrieval
1. Introduction
Geochemistry research involves analyzing heterogeneous datasets, including elemental compositions, isotopic ratios, mineralogical properties, and experimental measurements. Extracting relevant insights from these diverse and unstructured data sources remains challenging due to semantic variability, domain-specific terminology, and implicit relationships between geochemical parameters. Traditional keyword-based retrieval methods fail to capture these complexities effectively, often leading to inaccurate or incomplete knowledge extraction (Sarmah et al., 2024).
Recent advances in Retrieval-Augmented Generation (RAG) have introduced hybrid approaches that combine vector-based retrieval (VectorRAG) and graph-based retrieval (GraphRAG) for more accurate and context-aware information retrieval (Zhao et al., 2024). However, existing solutions often lack dynamic query stratification, which is crucial for adapting retrieval strategies based on query type. To address this, we propose a HybridRAG system that integrates VectorRAG for capturing semantic similarities and GraphRAG for structured knowledge representation. A query classification mechanism ensures that retrieval methods are dynamically selected based on query intent, enhancing efficiency and contextual accuracy (Li et al., 2025).
Our HybridRAG framework consists of five key components: (1) Query Classification – categorizes queries into fact-based retrieval, reasoning-based retrieval, or explanation-driven retrieval, optimizing retrieval strategy (Zhao et al., 2024); (2) Vector-Based Retrieval (VectorRAG) – uses ChromaDB to retrieve semantically similar documents (Sarmah et al., 2024); (3) Graph-Based Retrieval (GraphRAG) – extracts structured knowledge using Neo4j knowledge graphs, improving interpretability (Khemakhem et al., 2024); (4) HybridRAG Pipeline – merges retrieved contexts and feeds them into the GPT-4 LLM for response generation; and (5) Context-Augmented Response Generation – ensures that final responses are faithful to retrieved knowledge while maintaining geochemical interpretability (Ganesh et al., 2024).
This study aims to optimize retrieval efficiency by dynamically selecting VectorRAG or GraphRAG based on query classification; improve contextual accuracy in knowledge extraction by leveraging semantic and structured retrieval methods; ensure interpretability by structuring retrieved responses with geochemistry-specific relationships; and develop a scalable retrieval framework applicable to other scientific disciplines requiring knowledge extraction.
2. Related Work
The field of Retrieval-Augmented Generation (RAG) has evolved to address limitations of traditional LLMs that rely exclusively on parametric knowledge. RAG systems enhance factual accuracy and domain relevance by retrieving external information during inference (Zhao et al., 2024). Initial RAG architectures predominantly employed vector-based retrieval utilizing dense passage embeddings, which improved contextual alignment but exhibited limitations in structured knowledge representation and explicit reasoning capabilities (Khemakhem et al., 2024).
Vector-based retrieval (VectorRAG) systems leverage embedding models such as BERT and Sentence Transformers to encode semantically meaningful representations in high-dimensional vector spaces. These representations enable efficient similarity search through approximate nearest neighbor algorithms implemented in platforms like ChromaDB and FAISS (Sarmah et al., 2024). Despite advantages in scalability and semantic relevance, VectorRAG demonstrates inherent constraints in handling structured reasoning tasks that require explicit knowledge representation.
Graph-based retrieval (GraphRAG) architectures utilize Neo4j and RDF-based knowledge graphs to model entity-relationship structures and enable multi-hop reasoning pathways. These systems implement entity-linking and relation extraction models to improve knowledge representation, while ontology-based reasoning enhances retrieval precision through domain-specific relation structuring (Ganesh et al., 2024). GraphRAG demonstrates particular efficacy in geochemistry research, where chemical properties, isotopic compositions, and mineralogical relationships adhere to well-defined structural patterns.
Hybrid retrieval approaches integrate vector-based semantic similarity with graph-based structural reasoning, creating flexible knowledge extraction pipelines. The HybridRAG methodology enables dynamic switching between retrieval mechanisms based on query classification, facilitating semantic context alignment for unstructured queries and structured knowledge retrieval for entity-centric inquiries (Li et al., 2025). Contemporary evaluation frameworks have expanded beyond traditional precision metrics to incorporate Contrastive In-Context Learning (CICL), Context-Aware Retrieval Evaluation (CARE), and Factuality Scoring (FactScore) methodologies.
University of Göttingen March 10, 2025 1–9
In domain-specific applications, particularly geochemistry, RAG systems must process complex interdependencies between elemental compositions, isotopic fractionation patterns, and mineralogical transformations. Context-Augmented Retrieval (CAR) techniques optimize retrieval workflows by implementing dynamic partitioning of the information space based on real-time query classification, thereby reducing retrieval noise and enhancing computational efficiency for large scientific datasets (Ganesh et al., 2024).
Despite significant advancements, critical research gaps persist: predominant implementations employ static retrieval strategies rather than dynamic selection methods; current architectures lack efficient hybrid pipelines optimized for specialized scientific domains; and domain-specific reasoning validation metrics remain insufficiently explored (Li et al., 2025). Our HybridRAG system addresses these limitations through a comprehensive query classification-based pipeline that bridges vector-based retrieval efficiency with knowledge graph interpretability for geochemistry research and domain-specific knowledge extraction.
3. Methodology
The proposed HybridRAG system is designed to enhance geochemical knowledge extraction by dynamically selecting VectorRAG or GraphRAG based on query classification. The methodology follows a structured pipeline comprising data preprocessing, knowledge graph extraction, model training, vector-based retrieval, graph-based retrieval, and hybrid response generation.
3.1. Data Preprocessing Pipeline
To facilitate efficient retrieval and response generation, a structured and cleaned dataset is essential. The data preprocessing pipeline consists of the following stages:
3.1.1 Text Extraction
Scientific papers from the Georock Database are processed using Grobid, which converts PDFs into a structured XML format. The extracted XML content is then transformed into JSON format, including metadata, sections, and references for organized storage.
3.1.2 Text Cleaning and Normalization
To enhance data quality, unwanted characters, special symbols, and redundant metadata are removed. Additionally, stopwords are filtered, and stemming and lemmatization techniques are applied to standardize the text.
3.1.3 Chunking and Metadata Assignment
A fixed-window chunking strategy (512 tokens per chunk with a 100-token overlap) is implemented to divide the text into retrievable units. Each chunk is assigned relevant metadata and stored in MongoDB for structured indexing.
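The fixed-window strategy described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `chunk_tokens` is hypothetical, and the tokens are placeholder strings because the source does not state which tokenizer was used.

```python
def chunk_tokens(tokens, size=512, overlap=100):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each window starts 412 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder tokens standing in for a tokenized paper (assumption).
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

With these defaults, consecutive chunks repeat the last 100 tokens of the previous chunk, so a sentence that straddles a boundary still appears whole in at least one chunk.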
3.1.4 Embedding Generation
To enable efficient vector-based retrieval, text chunks are converted into high-dimensional embeddings using OpenAI's text-embedding-ada-002 model. These embeddings are stored in ChromaDB, ensuring fast and accurate retrieval.
3.2. Knowledge Graph Extraction
Since VectorRAG lacks explicit structured relationships, we extract knowledge graphs from processed text to capture geochemical dependencies.
3.2.1 Chunk Refinement
Extracted chunks are further refined to identify key scientific relationships. The system processes relationships between chemical elements, isotopic compositions, mineralogical data, and experimental measurements.
3.2.2 Triple Extraction
Using Named Entity Recognition (NER) and relation extraction models, structured triplets are formed in the format:

    (Head Entity, Relationship, Tail Entity)
    (Element A, Has Isotope, Isotopic Ratio)
    (Mineral X, Contains, Element Y)

Code 1. Triplets Format.
The triplets are stored with metadata to retain context.
Knowledge Graph Construction: Extracted triplets are appended to a Neo4j-based knowledge graph. Metadata, such as source documents, references, and experiment details, is linked to nodes in the graph. This step ensures that GraphRAG retrieval can retrieve not just semantic text passages but also structured relationships for reasoning-based queries.
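As an in-memory illustration of triplets that carry source metadata, the sketch below indexes them by head entity. It is only a stand-in for the Neo4j graph: the `Triple` type, the `build_graph` helper, and the `paper_001`-style source identifiers are hypothetical names introduced for this example.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type: a triplet plus the document it was extracted from.
Triple = namedtuple("Triple", ["head", "relation", "tail", "source"])


def build_graph(triples):
    """Index triplets by head entity so relationships can be looked up per node."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.head].append((t.relation, t.tail, t.source))
    return graph


triples = [
    Triple("Element A", "Has Isotope", "Isotopic Ratio", "paper_001"),
    Triple("Mineral X", "Contains", "Element Y", "paper_002"),
]
graph = build_graph(triples)
```

Keeping the source next to each edge is what lets a later answer point back to the paper a relationship came from.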
3.3. Query Classification Model Training
To optimize retrieval strategies, user queries are categorized into four distinct types: Explicit Fact Retrieval, Implicit Reasoning, Hidden Rationale, and Interpretable Rationale. The system classifies queries dynamically to determine the most suitable retrieval method.
3.3.1 Dataset Preparation
A synthetic dataset was generated using Claude AI, ensuring a diverse and well-balanced set of queries across the four categories. Example classifications include:
• Explicit Fact Retrieval → "What is the isotopic fractionation of Uranium?"
• Implicit Reasoning → "How does mineral composition affect element diffusion?"
• Hidden Rationale → "Why do certain isotopes exhibit fractionation under pressure?"
• Interpretable Rationale → "What factors influence isotopic fractionation trends?"
The dataset was stored in CSV format, with fields for query text and category labels.
3.3.2 Text Vectorization
Queries were vectorized using TF-IDF with the following parameters:
• max-features = 7000 (selecting the most informative terms)
• ngram-range = (1,3) (capturing single words, bigrams, and trigrams)
This transformation ensures that the classifier captures meaningful patterns in the query text.
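To make the effect of the n-gram range concrete, the sketch below lists the word n-grams that a (1,3) range produces for one of the example queries. It is a simplified illustration, assuming whitespace tokenization and lowercasing; a real TF-IDF vectorizer also handles punctuation and weights each term, which is not shown here. The helper name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` for n within the inclusive range."""
    words = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


grams = word_ngrams("isotopic fractionation of uranium")
# Four unigrams, three bigrams, and two trigrams are produced.
```

Bigrams and trigrams such as "isotopic fractionation" let the classifier treat a multi-word geochemical term as one feature instead of two unrelated words.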
3.3.3 Model Training and Evaluation
Multiple classification models were trained, and Logistic Regression with the following hyperparameters outperformed the other models:
• C = 50 (inverse of regularization strength; larger values mean weaker regularization)
• max-iter = 3000 (ensuring convergence)
• class-weight = balanced (handling class imbalance)
The trained model was saved as a pickle file for real-time query classification.
3.3.4 Classification at Query Time
When a user submits a query, it is classified into one of the four predefined categories, and the corresponding retrieval strategy is applied:
• Explicit Fact Retrieval – Simple Similarity Search
  – Utilizes vector-based retrieval to fetch direct fact-based information.
  – Implemented via ChromaDB similarity search.
• Implicit Reasoning – Multi-Hop Retrieval
  – Uses MultiQueryRetriever to retrieve multiple relevant passages for inferential queries.
  – Employs OpenAI's LLM to generate multi-hop reasoning responses.
• Interpretable Rationale – Keyword Expansion Retrieval
  – Expands the query with LLM-based keyword expansion to improve search accuracy.
  – Uses ContextualCompressionRetriever to refine retrieved documents.
• Hidden Rationale – Hybrid Retrieval
  – Combines sparse (keyword-based) and dense (vector-based) retrieval for complex reasoning.
  – Uses an LLM-powered agent to interact with ChromaDB and select optimal information sources.
This classification-driven approach ensures that retrieval is adaptive and context-aware, optimizing information access for geochemical queries.
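The category-to-strategy mapping above can be sketched as a simple dispatch table. The strategy identifiers and the `toy_classifier` function are hypothetical placeholders: the real system uses the trained Logistic Regression model and actual retriever objects, neither of which is reproduced here.

```python
# Strategy names are illustrative labels, not identifiers from the source code.
RETRIEVAL_STRATEGIES = {
    "Explicit Fact Retrieval": "simple_similarity_search",
    "Implicit Reasoning": "multi_hop_retrieval",
    "Interpretable Rationale": "keyword_expansion_retrieval",
    "Hidden Rationale": "hybrid_retrieval",
}


def route_query(query, classify):
    """Return (category, strategy name) for a query, given a classifier callable."""
    category = classify(query)
    if category not in RETRIEVAL_STRATEGIES:
        raise ValueError(f"unknown query category: {category!r}")
    return category, RETRIEVAL_STRATEGIES[category]


def toy_classifier(query):
    # Hypothetical stand-in for the trained model: a crude keyword rule.
    if query.lower().startswith("why"):
        return "Hidden Rationale"
    return "Explicit Fact Retrieval"


route = route_query(
    "Why do certain isotopes exhibit fractionation under pressure?", toy_classifier
)
```

Passing the classifier in as a callable keeps the routing logic independent of how the category is predicted, so the pickled model can be swapped in without touching the table.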
3.4. Vector-Based Retrieval (VectorRAG)
The VectorRAG approach is designed for fact-based and semantic retrieval using vector embeddings and similarity search.
3.4.1 Query Embedding
The user query is first preprocessed to remove unnecessary characters and stopwords. It is then converted into a high-dimensional vector representation using OpenAI's text-embedding-ada-002 model.
3.4.2 Similarity Search in ChromaDB
The vectorized query is used to perform a similarity search within ChromaDB. The system retrieves the top-k most relevant document chunks based on vector similarity.
3.4.3 Context Retrieval and Aggregation
Retrieved chunks are ranked based on similarity scores. Duplicate information is filtered, and the most relevant passages are merged into a final retrieved context.
This method ensures fast, efficient retrieval of semantically relevant information. However, it does not capture structured relationships for reasoning-based queries.
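The rank, deduplicate, and merge steps can be illustrated with a brute-force cosine ranking over toy vectors. This is a sketch under stated assumptions: the three-dimensional vectors are invented for the example, exact-text matching stands in for duplicate filtering, and a vector database would replace the linear scan with an indexed nearest-neighbor search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Rank (text, vector) pairs by similarity and keep the k best distinct texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    seen, results = set(), []
    for text, _ in ranked:
        if text not in seen:  # filter exact duplicates
            seen.add(text)
            results.append(text)
        if len(results) == k:
            break
    return results


# Toy embeddings (assumption); real embeddings have far more dimensions.
chunks = [
    ("uranium redox", [1.0, 0.0, 0.0]),
    ("uranium redox", [1.0, 0.0, 0.0]),  # duplicate chunk
    ("REE patterns", [0.6, 0.8, 0.0]),
    ("strontium isotopes", [0.0, 0.0, 1.0]),
]
context = top_k([1.0, 0.1, 0.0], chunks, k=2)
```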
3.5. Graph-Based Retrieval (GraphRAG)
The GraphRAG approach enables structured retrieval for reasoning-based queries by leveraging a Neo4j knowledge graph for multi-hop reasoning.
3.5.1 Entity Extraction from Query
The system extracts key geochemical entities from the query, such as isotopes, elements, and minerals, using GPT-4 for Named Entity Recognition (NER). These extracted entities serve as the starting points for querying the knowledge graph.
3.5.2 Knowledge Graph Query Execution
A Cypher query is executed in Neo4j to retrieve relevant subgraphs based on the extracted entities. The system utilizes APOC procedures to expand the subgraph and include relevant relationships and connections within the graph. The query depth is dynamically adjusted depending on the complexity of the reasoning task.

    MATCH (start {name: $entity})
    CALL apoc.path.subgraphAll(start, {
      maxLevel: $depth
    })
    YIELD nodes, relationships
    UNWIND nodes AS node
    UNWIND relationships AS rel
    RETURN
      node.name AS entity_name,
      labels(node) AS entity_types,
      type(rel) AS relationship_type,
      startNode(rel).name AS start_node,
      endNode(rel).name AS end_node

Code 2. Cypher query to extract the subgraph
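To show what a depth-limited subgraph expansion does, the sketch below performs a breadth-first traversal over an in-memory list of triplets. It is a simplified analogue of the query above, not a reimplementation of the APOC procedure: it only collects edges reached while expanding nodes, and the entity and relationship names in the example are hypothetical.

```python
from collections import deque


def subgraph(edges, start, max_level):
    """Collect nodes and edges reachable from `start` within `max_level` hops.

    `edges` is a list of (head, relation, tail) triplets, traversed in both
    directions.
    """
    neighbours = {}
    for head, rel, tail in edges:
        neighbours.setdefault(head, []).append((head, rel, tail))
        neighbours.setdefault(tail, []).append((head, rel, tail))
    depth = {start: 0}
    found_edges = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_level:
            continue  # do not expand beyond the requested depth
        for head, rel, tail in neighbours.get(node, []):
            found_edges.add((head, rel, tail))
            other = tail if node == head else head
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return set(depth), found_edges


# Hypothetical toy graph.
edges = [
    ("Uranium", "HAS_ISOTOPE", "U-238"),
    ("Uraninite", "CONTAINS", "Uranium"),
    ("Uraninite", "FOUND_IN", "Granite"),
]
nodes, rels = subgraph(edges, "Uranium", max_level=1)
```

Raising `max_level` to 2 pulls in "Granite" through "Uraninite", which is the kind of multi-hop connection a deeper query is meant to surface.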
3.5.3 Context Retrieval and Graph Expansion
The retrieved subgraph provides structured facts that form the basis of the response. Extracted relationships and entities are formatted into structured contextual information. Redundant information is filtered, and if necessary, the query scope is expanded dynamically to incorporate additional related knowledge from the graph.
Triplets extracted using GraphRAG are combined with the retrieved textual information from VectorRAG and passed to the LLM, enabling the model to generate responses that integrate both structured knowledge and semantic text retrieval. This ensures factual consistency, multi-hop reasoning, and enriched context generation while leveraging the strengths of both retrieval methods.
3.6. HybridRAG: Integrating Vector and Graph Retrieval
The HybridRAG pipeline integrates VectorRAG and GraphRAG to enable both semantic text retrieval and structured knowledge retrieval, ensuring a comprehensive response generation process. By combining these retrieval strategies, the system retrieves factual knowledge while incorporating structured relationships essential for reasoning-based queries.
3.6.1 Merging Retrieved Contexts
For every query, the system retrieves information from both VectorRAG and GraphRAG. The extracted text chunks from VectorRAG provide direct factual insights, while the knowledge graph responses from GraphRAG ensure structured reasoning. These two contexts are merged into a unified representation, ensuring that structured facts complement semantic text retrieval.
3.6.2 Final Context Formatting
The combined response is formatted into a structured context dictionary, preserving both textual and graph-based knowledge. The response follows the format:

    {
      "VectorRAG_Context": {
        "0": "Uranium isotopic fractionation in sedimentary environments is primarily influenced by redox conditions. Under reducing conditions, uranium precipitates as U(IV) and accumulates in organic-rich sediments.",
        "1": "Rare earth elements (REEs) in hydrothermal fluids show systematic variations that provide insights into fluid source, temperature, and mineralization processes.",
        "2": "Strontium isotope ratios in carbonate deposits are widely used as tracers for reconstructing paleoceanographic conditions and fluid-rock interactions.",
        "3": "Magmatic differentiation processes are often inferred from the geochemical trends of major and trace elements in igneous rock suites.",
        "4": "The Makran subduction zone shows significant geochemical heterogeneity, with volcanic arc magmatism influenced by sediment subduction and slab-derived fluid metasomatism."
      },
      "GraphRAG_Context": {
        "0": "Uranium Under Reducing Conditions Precipitates as U(IV)",
        "1": "REEs Show Systematic Variations Hydrothermal Fluids",
        "2": "Strontium Isotopes Used for Paleoceanographic Reconstruction",
        "3": "Igneous Rock Suite Displays Magmatic Differentiation Trends",
        "4": "Makran Subduction Zone Influenced by Slab-Derived Fluid Metasomatism"
      }
    }

Code 3. Context Passed to LLM
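A context dictionary with this shape could be assembled as sketched below. The function name `build_context` is hypothetical, and joining each triplet with spaces is an assumption inferred from how the graph entries appear in Code 3; the original formatting code is not shown in the paper.

```python
import json


def build_context(vector_chunks, graph_triples):
    """Merge text chunks and graph triplets into the two-part context dictionary."""
    return {
        "VectorRAG_Context": {str(i): text for i, text in enumerate(vector_chunks)},
        "GraphRAG_Context": {
            str(i): " ".join(triple) for i, triple in enumerate(graph_triples)
        },
    }


context = build_context(
    ["Strontium isotope ratios in carbonate deposits are widely used as tracers."],
    [("Strontium Isotopes", "Used for", "Paleoceanographic Reconstruction")],
)
# Serialize the merged context so it can be placed into an LLM prompt.
prompt_context = json.dumps(context, indent=2)
```

Keeping the two sources under separate keys lets the prompt tell the model which statements are free text and which are graph facts.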
3.6.3 Response Generation using GPT-4
The structured context from both VectorRAG and GraphRAG is passed to GPT-4, which generates a context-grounded response. The system ensures that both factual and structured knowledge are incorporated into the final response, improving interpretability and accuracy. By consistently including both retrieval approaches, HybridRAG guarantees a well-rounded and knowledge-rich response generation process.
This integration of vector-based semantic retrieval and graph-based structured reasoning ensures that HybridRAG provides factually accurate, contextually rich, and reasoning-aware responses. The system always utilizes both retrieval mechanisms in tandem, implementing a comprehensive knowledge retrieval framework that enhances response quality and reasoning depth.
4. Evaluation Metrics
To assess the performance of the proposed HybridRAG system, we evaluate it based on a combination of retrieval-based and response quality metrics.
4.1. Context Precision
Context Precision measures the proportion of relevant chunks retrieved within the retrieved contexts. It is computed as the mean of Precision@K for each chunk in the retrieved context list.
\[
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{|R_K|} \tag{1}
\]
\[
\text{Precision@}k = \frac{\text{TP@}k}{\text{TP@}k + \text{FP@}k} \tag{2}
\]
where $K$ represents the total number of chunks in the retrieved contexts, $|R_K|$ is the number of relevant items in the top $K$ results, and $v_k \in \{0, 1\}$ is a binary indicator of relevance at rank $k$. TP and FP represent true positives and false positives.
LLM-Based Context Precision: Two variations of this metric are used:
• Without Reference: Compares retrieved context with the generated response.
• With Reference: Uses both retrieved context and reference answer.
A higher Precision indicates highly relevant retrieved documents. A lower Precision suggests inclusion of irrelevant contexts.
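Equations (1) and (2) can be computed directly from a ranked list of relevance indicators. The sketch below assumes the indicators $v_k$ are already known; in the LLM-based variants they would come from a judge model, which is not shown. The function name `context_precision` is chosen for this example.

```python
def context_precision(relevance):
    """Context Precision@K from a ranked list of 0/1 relevance indicators v_k."""
    total_relevant = sum(relevance)  # |R_K|
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, v in enumerate(relevance, start=1):
        hits += v                 # TP@k: relevant chunks among the first k
        score += (hits / k) * v   # Precision@k counts only at relevant ranks
    return score / total_relevant


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6.
value = context_precision([1, 0, 1])
```

Because Precision@k is weighted by $v_k$, the same two relevant chunks score higher when they appear at ranks 1 and 2 than at ranks 1 and 3, so the metric rewards ranking relevant context first.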
4.2. Context Recall
Context Recall measures the proportion of reference claims supported by the retrieved context.
\[
\text{Context Recall} = \frac{|C_{sup}|}{|C_{ref}|} \tag{3}
\]
where $|C_{sup}|$ is the number of supported claims and $|C_{ref}|$ is the total number of claims in the reference.
LLM-Based Context Recall: Evaluates reference information captured in the retrieved context by checking claim support.
Higher Recall values indicate comprehensive retrieval. Lower recall suggests missing important claims.
4.3. Context Entities Recall
Context Entities Recall evaluates preservation of named entities from the reference answer.
\[
\text{CER} = \frac{|RCE \cap RE|}{|RE|} \tag{4}
\]
where $RE$ represents the entities in the reference answer, and $RCE$ is the set of entities in the retrieved context.
Interpretation:
• Higher CER: Better retention of key entities.
• Lower CER: Missing critical named entities.
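Equation (4) reduces to a set intersection once entities have been extracted. The sketch below assumes entities are compared as exact strings, which is a simplification; the entity names in the example are illustrative only.

```python
def context_entities_recall(reference_entities, context_entities):
    """CER = |RCE ∩ RE| / |RE| over sets of entity names."""
    reference = set(reference_entities)
    if not reference:
        return 0.0
    return len(reference & set(context_entities)) / len(reference)


# Two of the three reference entities appear in the retrieved context.
cer = context_entities_recall(
    {"Uranium", "U(IV)", "Strontium"},
    {"Uranium", "Strontium", "REE"},
)
```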
4.4. Noise Sensitivity
Noise Sensitivity quantifies incorrect claims due to retrieval noise.
\[
\text{NS} = \frac{|C_{incorrect}|}{|C_{total}|} \tag{5}
\]
Lower NS indicates more accurate claims. Higher NS suggests the model is prone to using misleading contexts.
4.5. Faithfulness
Faithfulness measures factual consistency between the generated response and the retrieved context.
\[
\text{FS} = \frac{|C_{supported}|}{|C_{response}|} \tag{6}
\]
Interpretation:
• Higher FS: Claims supported by retrieved context.
• Lower FS: Contains unsupported/hallucinated claims.
Faithfulness is crucial for ensuring the reliability of the RAG system, especially in scientific applications where factual accuracy is paramount.
4.6. Factual Correctness
Factual Correctness measures how accurately the generated response aligns with the reference answer. It checks whether the system introduces misinformation or hallucinated claims. This metric is computed using Precision, Recall, and F1 Score, which assess the factual overlap between the response and reference.
4.6.1. Computing True Positives, False Positives, and False Negatives
\[
TP = \text{Number of claims in response that are present in reference} \tag{7}
\]
\[
FP = \text{Number of claims in response that are not present in reference} \tag{8}
\]
\[
FN = \text{Number of claims in reference that are not present in response} \tag{9}
\]
4.6.2. Precision, Recall, and F1 Score
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{10}
\]
\[
\text{Recall} = \frac{TP}{TP + FN} \tag{11}
\]
\[
\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}
\]
Interpretation:
• A higher F1 Score (close to 1) means the response accurately reflects the reference.
• A lower F1 Score (close to 0) indicates factual inconsistency or hallucinations in the response.
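Equations (7) to (12) can be traced with a small worked example. The sketch below assumes claims have already been extracted and can be matched as exact strings; in practice claim matching is done by a model, so this only illustrates the arithmetic. The placeholder claim labels are hypothetical.

```python
def factual_correctness(response_claims, reference_claims):
    """Return (precision, recall, f1) from claim overlap, per equations (7)-(12)."""
    response, reference = set(response_claims), set(reference_claims)
    tp = len(response & reference)   # claims in response and in reference
    fp = len(response - reference)   # claims in response only
    fn = len(reference - response)   # claims in reference only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder claims: TP = 2, FP = 1, FN = 2.
precision, recall, f1 = factual_correctness({"a", "b", "c"}, {"a", "b", "d", "e"})
```

Here precision is 2/3 and recall is 1/2, so the F1 score is 4/7: the response is penalized both for the unsupported claim and for the two reference claims it omits.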
4.7. Semantic Similarity
Semantic Similarity evaluates the meaning-based closeness between the generated response and the reference. It is computed using a bi-encoder model that converts both texts into vector embeddings and calculates their cosine similarity.
\[
\text{Semantic Similarity} = \cos(\theta) = \frac{A \cdot G}{\|A\| \, \|G\|} \tag{13}
\]
where:
• $A$ is the embedding of the generated response.
• $G$ is the embedding of the reference answer.
• $A \cdot G$ is the dot product of the two vectors.
Interpretation:
• Higher Semantic Similarity (close to 1) → The response conveys the same meaning as the reference.
• Lower Semantic Similarity (close to 0) → The response deviates significantly from the reference in meaning.
4.8. Non-LLM String Similarity
Non-LLM String Similarity assesses the textual resemblance between the generated response and the reference using traditional string distance measures, without relying on language models. It is computed using techniques like Levenshtein Distance, Hamming Distance, and Jaro Similarity.
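4.8.1. Levenshtein Similarity
\[
\text{Levenshtein Similarity} = 1 - \frac{\text{Levenshtein Distance}(\text{Response}, \text{Reference})}{\max(\text{len}(\text{Response}), \text{len}(\text{Reference}))} \tag{14}
\]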
4.8.2. Hamming Similarity
\[
\text{Hamming Similarity} = 1 - \frac{\text{Hamming Distance}(\text{Response}, \text{Reference})}{\text{len}(\text{Reference})} \tag{16}
\]
4.8.3. Jaro Similarity
\[
\text{Jaro Similarity} = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) \tag{17}
\]
where:
• $m$ is the number of matching characters.
• $t$ is the number of transpositions.
• $s_1, s_2$ are the two compared strings.
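The three string measures above can be written out in plain Python as a sketch. The implementations follow the textbook definitions rather than any particular library: the Hamming version assumes strings of equal length, and the Jaro version uses the usual matching window of half the longer length minus one, which the formula above leaves implicit.

```python
def levenshtein_similarity(a, b):
    """1 - edit distance / max length, using a two-row dynamic programme."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def hamming_similarity(response, reference):
    """1 - fraction of differing positions; defined only for equal lengths."""
    if len(response) != len(reference):
        raise ValueError("Hamming distance needs strings of equal length")
    if not reference:
        return 1.0
    distance = sum(x != y for x, y in zip(response, reference))
    return 1 - distance / len(reference)


def jaro_similarity(s1, s2):
    """Jaro similarity with the standard matching window."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters that line up in a different order; t is half that count.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(x != y for x, y in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

For the classic pair "MARTHA" and "MARHTA", all six characters match and one transposition occurs, giving a Jaro similarity of 17/18.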
4.9. BLEU Score
The BLEU (Bilingual Evaluation Understudy) score measures the similarity between the response and the reference based on n-gram precision and a brevity penalty to prevent overly short responses.
\[
\text{BLEU} = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{18}
\]
where:
• $p_n$ is the n-gram precision.
• $w_n$ is the weight assigned to each n-gram.
• $BP$ (brevity penalty) is calculated as:
\[
BP = \begin{cases} 1, & \text{if } c > r \\ \exp(1 - r/c), & \text{if } c \le r \end{cases} \tag{19}
\]
where $c$ is the response length and $r$ is the reference length.
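Equations (18) and (19) can be sketched for a single response and a single reference. This minimal version assumes whitespace tokenization, uniform weights $w_n = 1/N$ with $N = 2$, and no smoothing, so any n-gram order with zero matches gives a score of zero; library implementations differ in these choices.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(response, reference, max_n=2):
    """Sentence-level BLEU with uniform weights and one reference."""
    cand, ref = response.split(), reference.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # log(0) is undefined; unsmoothed BLEU is zero here
        log_sum += math.log(clipped / total) / max_n
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty, equation (19)
    return bp * math.exp(log_sum)


# A response that is correct but half as long as the reference: BP = exp(-1).
short_score = bleu("a b", "a b c d")
```

The example shows why the brevity penalty matters: every n-gram of "a b" occurs in the reference, so precision alone would give a perfect score, and only the penalty lowers it.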
4.10. ROUGE Score
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score is used to evaluate the similarity between the generated response and the reference based on n-gram overlap. The default ROUGE score provided by RAGAS was used in this study, without modifications to the ROUGE type.
The ROUGE score is calculated as follows:
\[
\text{ROUGE} = \frac{\text{Number of overlapping words between response and reference}}{\text{Total words in reference}} \tag{20}
\]
Interpretation:
• Higher ROUGE Score (close to 1) → Indicates that the response contains a high proportion of words from the reference, suggesting better alignment.
• Lower ROUGE Score (close to 0) → Suggests that the response differs significantly from the reference in terms of word overlap.
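The word-overlap ratio in equation (20) can be sketched as a unigram recall. This follows the formula as written in this section, assuming lowercased whitespace tokens and clipped counts; it is not a reproduction of the RAGAS default, whose internal ROUGE variant may differ.

```python
from collections import Counter


def rouge1_recall(response, reference):
    """Overlapping words / total words in the reference, with clipped counts."""
    ref_counts = Counter(reference.lower().split())
    resp_counts = Counter(response.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, resp_counts[w]) for w, count in ref_counts.items())
    return overlap / total


# Four of the seven reference words appear in the response.
score = rouge1_recall(
    "uranium precipitates as U(IV)",
    "uranium precipitates as U(IV) under reducing conditions",
)
```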
5. Comparison of HybridRAG vs. BaselineRAG
The performance of our HybridRAG system was compared with the traditional RAG system, which follows a Vector Similarity-Based Retrieval approach. This traditional RAG system acts as a baseline for our system.
5.1. BaselineRAG: Traditional Vector Similarity-Based Retrieval System
The BaselineRAG system follows a traditional retrieval-augmented generation (RAG) approach, where document retrieval is performed using vector similarity search. Given a user query, it retrieves relevant text chunks from a vector database using cosine similarity or other distance metrics. The retrieved contexts are then passed to a language model to generate the response.
5.2. Evaluation Data Preparation
To ensure a systematic comparison between HybridRAG and BaselineRAG, evaluation data was carefully prepared before computing the metrics. Two queries were selected from each category, covering Explicit Facts, Implicit Facts, Hidden Rationale, and Interpretable Rationale queries. For each query, a ground truth reference answer was generated using ChatGPT, which served as the baseline for factual correctness and similarity-based metrics. Along with the reference answers, the retrieved contexts and system-generated responses from both HybridRAG and BaselineRAG were collected. These elements, including retrieved contexts, generated responses, and reference answers, were used to compute various evaluation metrics.
5.3. Evaluation Plots
All the metrics were plotted to evaluate the performance of HybridRAG compared to BaselineRAG, using both Retrieval Metrics and Response Metrics as our evaluation criteria.
5.3.1. Response Quality Metrics
These metrics evaluate the correctness and quality of the generated responses compared to reference answers. We plotted the following metrics:
• Factual Correctness
• Semantic Similarity
• Non-LLM String Similarity
• BLEU Score
• ROUGE Score
5.3.2. Retrieval-Based Metrics
Retrieval metrics analyze the effectiveness of retrieving relevant contexts, ensuring that the system provides useful information for response generation. We plotted the following retrieval-based metrics:
• Context Precision
• Context Precision (Without Reference)
• Context Recall
• Context Entities Recall
• Noise Sensitivity
• Faithfulness
The following figures illustrate the comparative performance of HybridRAG and BaselineRAG across different query categories.
Figure 1. Comparison of HybridRAG and BaselineRAG in Retrieval-Based Metrics: Context Precision, Context Precision (Without Reference), Context Recall, Context Entities Recall, Noise Sensitivity, and Faithfulness.
Figure 2. Comparison of HybridRAG and BaselineRAG in Response Quality Metrics: Factual Correctness, Semantic Similarity, Non-LLM String Similarity, BLEU Score, and ROUGE Score.
6. Analysis of Results
The comparative evaluation between HybridRAG and BaselineRAG provides valuable insights into the effectiveness of both retrieval and response generation mechanisms. The analysis of both retrieval-based metrics and response-based metrics helps determine the strengths and limitations of each approach.
6.1. Response Quality Analysis
The response quality metrics assess the accuracy and fluency of generated responses in comparison to reference answers. The observations from the evaluation plots are as follows:
• Factual Correctness: HybridRAG achieves higher factual correctness scores in the Hidden Rationale and Interpretable Rationale query categories, indicating that it produces more factually accurate responses than BaselineRAG for these queries.
• Semantic Similarity: HybridRAG outperforms BaselineRAG in maintaining semantic consistency with reference answers, especially for more complex queries.
• Non-LLM String Similarity: The HybridRAG system exhibits better text-level similarity with reference answers, demonstrating its ability to generate responses that closely match expected outputs.
• BLEU Score: Performance fluctuates between both systems, but HybridRAG shows a noticeable improvement in some query categories, indicating better structured responses.
• ROUGE Score: HybridRAG achieves higher ROUGE scores, reflecting stronger word overlap between generated responses and reference texts.
Overall, HybridRAG outperforms BaselineRAG in response quality by producing more factually correct and semantically aligned answers.
6.2. Retrieval Effectiveness Analysis
Retrieval-based metrics evaluate how well the system retrieves relevant contexts to aid response generation. The key findings are:
• Context Precision: BaselineRAG performs better in some cases, especially for explicit facts, but HybridRAG maintains stable precision across query categories.
• Context Precision (Without Reference): HybridRAG achieves significantly higher scores, highlighting its ability to retrieve more relevant contexts.
• Context Recall: Both systems perform similarly in retrieving relevant contexts, though HybridRAG maintains a slight advantage in some query types.
• Context Entities Recall: HybridRAG consistently recalls more entities from the reference, making it more reliable for entity-focused queries.
• Noise Sensitivity: HybridRAG has lower noise sensitivity, meaning it introduces fewer incorrect claims in responses.
• Faithfulness: HybridRAG achieves better faithfulness scores, ensuring that generated responses align with retrieved contexts more effectively.
These results indicate that HybridRAG excels in retrieving precise and relevant contexts while minimizing noise, making it more reliable than BaselineRAG.
6.3. Overall Performance Comparison
• HybridRAG outperforms BaselineRAG in factual correctness, semantic similarity, and faithfulness, proving its superiority in response quality.
• HybridRAG demonstrates stronger retrieval effectiveness, particularly in precision, recall, and entity recall.
• BaselineRAG performs competitively in certain retrieval scenarios, particularly in cases involving explicit fact queries.
• HybridRAG minimizes noise sensitivity, ensuring fewer incorrect claims, making it a more reliable choice for knowledge-intensive applications.
Based on these findings, HybridRAG outperforms the BaselineRAG system, providing more accurate, relevant, and contextually faithful responses while maintaining strong retrieval effectiveness.
7. Conclusion
This study focused on developing a Retrieval-Augmented Generation (RAG) system tailored to the Georock database to enhance knowledge extraction and response generation for geochemistry-related queries. The HybridRAG system integrates a query classification mechanism that dynamically selects between vector-based and graph-based retrieval, ensuring more precise and contextually relevant retrieval, particularly for complex queries requiring structured knowledge representation.
To assess its effectiveness, HybridRAG was compared against BaselineRAG, a traditional RAG system using vector similarity-based retrieval. The results show that HybridRAG achieves higher factual correctness, semantic similarity, and faithfulness, demonstrating its ability to generate more accurate and contextually grounded responses. In retrieval-based metrics, HybridRAG consistently retrieves more relevant contexts with lower noise sensitivity, as reflected in improved Context Precision and Context Entities Recall.
For explicit fact-based queries, both HybridRAG and BaselineRAG exhibit similar performance since HybridRAG also employs vector similarity retrieval for such queries. However, HybridRAG outperforms BaselineRAG in implicit and rationale-driven queries, where its hybrid retrieval approach incorporating graph-based knowledge enhances response quality.
The evaluation methodology, involving a standardized dataset with ChatGPT-generated reference answers, ensured reliable metric computations. The findings highlight that integrating structured knowledge graphs with vector retrieval significantly improves retrieval and response accuracy.
In conclusion, HybridRAG offers significant improvements for complex queries, making it a valuable tool for scientific knowledge extraction. Future work may explore expanding the system with domain-specific ontologies, refining query classification models, and optimizing retrieval strategies to further enhance performance.
References
[1] B. Sarmah, B. Hall, R. Rao, S. Patel, S. Pasquali, and D. Mehta, "HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction," arXiv, vol. 2408.04948, 2024.
[2] S. Zhao, Y. Yang, Z. Wang, Z. He, L. Qiu, and L. Qiu, "Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely," arXiv, vol. 2409.14924, 2024.
[3] S. Li, L. Stenzel, C. Eickhoff, and S. A. Bahrainian, "Enhancing Retrieval-Augmented Generation: A Study of Best Practices," arXiv, vol. 2501.07391, 2025.
[4] M. Khemakhem, H. E. Rekik, and O. Bouaziz, "Enhancing Technical Knowledge Acquisition with RAG Systems: The TEI Use Case," HAL Science, 2024.
[5] S. Ganesh, G. Balakrishnan, and A. Purwar, "Context-augmented Retrieval: A Novel Framework for Fast Information Retrieval Based Response Generation Using Large Language Models," ResearchGate, 2024.
[6] RAGAS Team, "RAGAS: A framework for evaluating retrieval-augmented generation," 2023. [Online]. Available: https://github.com/explodinggradients/ragas. [Accessed: 10-Mar-2025].