LEGAL ASPECTS OF AI TRAINING AND
RETRIEVAL AUGMENTED GENERATION*
L. Beer†, Open Search Foundation e.V., Munich, Germany
P. C. Johannes§, H. Koulani¶, ITeG, University of Kassel, Germany
Abstract
This paper explores legal aspects of the development
of the Open Web Index (OWI), a publicly funded
European initiative designed as an alternative to
proprietary web indexes. It examines how the OWI
supports the training of Large Language Models
(LLMs) and enhances Retrieval Augmented Generation
(RAG) systems. The discussion covers the OWI’s
architecture, including its distributed crawling and
indexing methods, which allow for the collection of
vast amounts of web data. By making high-quality,
accessible data available, this open infrastructure could
benefit smaller companies and research institutions that
might otherwise struggle to compete with large
players. The paper delves into the regulatory landscape
within the European Union, particularly in relation to
the AI Act and copyright law. It considers the legal
challenges surrounding the OWI’s use in LLM training
and RAG, emphasizing the importance of data quality,
legal compliance, and public trust. The conclusion
highlights key areas for future research, including the
need to clarify frameworks for rights of use, consent
and processing authorisation of data in order to address
legal uncertainties.
INTRODUCTION
Artificial intelligence (AI) has become widely
adopted in a variety of areas at lightning speed and has
become an integral element of science, business and
society. RAG, i.e. the combination of search and a
generative component, is no longer a term that only
experts understand. In fact, there are numerous RAG
systems on the market that are used by millions of
people in the EU and elsewhere every day[1].
At the same time, the European Union (EU) has
adopted a large number of regulations and directives in
the area of digital governance that impose numerous
obligations on the developers and users of such
applications. Further legislation is also planned for the
future to ensure fairness in the digital space and
guarantee the competitiveness of European companies.
In his report to the EU Commission on the future of
European competitiveness, Mario Draghi brings up the
problem of extensive regulation: “innovative
companies that want to scale up in Europe are hindered
at every stage by inconsistent and restrictive
regulations”[2]. Deregulation is often a part of the
demands by economic interest groups, as the many
legal requirements are difficult to keep track of,
especially for small and medium-sized companies, and
therefore might hinder innovation.
The PriDI (Privacy enhancing digital infrastructures)
research project has set itself the goal of making the
complex digital legislation at national and EU level
easy to understand for developers of digital
applications[3]. To this end, the consortium of the
University of Kassel and the Open Search Foundation
is analysing user-related and legal requirements for the
training of LLMs and RAG systems if these
applications are based on data from the so-called OWI.
Such an index is currently being developed by the
EU-funded research project OpenWebSearch.EU[4].
This paper briefly describes the creation and curation
of the OWI. It then provides key information on the
processes of LLM training and the operation of RAG
systems. The authors then shed light on the legal
framework within the EU in which the developers of
such applications operate. A particular focus is put on
the obligations arising from the regulation of AI and
the intellectual property rights of the owners of website
content. In addition, the paper analyses requirements
that arise from the user's perspective with regard to
trust and acceptance of AI systems that are based on
the OWI. It concludes with an outlook on further
research work.
THE OPEN WEB INDEX
Similar to the Google or Bing indexes, the Open
Web Index is being created by systematically crawling
the web, analysing the crawled content and storing it
with metadata in a database[5]. The OWI is intended
to strengthen the EU's digital sovereignty by reducing
dependence on the search engine monopolists through
a sustainable, freely accessible web index.
______________________________________________
* Based on research of the project Privacy-enhancing digital infrastructures
(PriDI), funded by the German Federal Ministry of Education and
Research (BMBF). The responsibility for the content of this publication
lies with the authors. Parts of the findings presented in this paper have
already been submitted for publication to CPDP.AI 2025.
† [email protected]
§ [email protected]
¶ [email protected]
https://doi.org/10.5281/zenodo.17229684
______________________________________________
The researchers have set up a distributed crawling,
indexing and hosting architecture for the OWI. This
consists of the combination of a frontier crawler, which
charts the web along embedded links and collects URLs,
and distributed worker crawlers, which later fetch the
websites and store the content in so-called web archive
(WARC) files. Later on, the “raw” web data is further
processed, cleaned, filtered, enriched with metadata,
classified according to language and web genre and
stored as web index charts following the common index
file format (CIFF) and additional metadata sets.
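
The division of labour between the frontier crawler and the worker crawlers described above can be illustrated with a minimal sketch. It is not the OWI implementation: the toy link graph `TOY_WEB`, the function names and the round-robin partitioning are assumptions made for illustration, the frontier reads links directly from the toy graph, and plain dictionaries stand in for WARC records.

```python
from collections import deque

# Hypothetical toy link graph standing in for the web: URL -> (page text, outgoing links).
TOY_WEB = {
    "https://a.example/": ("Home of A", ["https://a.example/about", "https://b.example/"]),
    "https://a.example/about": ("About A", ["https://a.example/"]),
    "https://b.example/": ("Home of B", ["https://c.example/"]),
    "https://c.example/": ("Home of C", []),
}


def frontier_crawl(seeds, max_urls=100):
    """Chart the link graph breadth-first and return discovered URLs in visiting order."""
    seen, queue, order = set(seeds), deque(seeds), []
    while queue and len(order) < max_urls:
        url = queue.popleft()
        order.append(url)
        _, links = TOY_WEB.get(url, ("", []))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def partition(urls, n_workers):
    """Assign URLs to workers round-robin (an assumed, simplistic scheduling rule)."""
    return [urls[i::n_workers] for i in range(n_workers)]


def worker_fetch(urls):
    """Fetch each URL and return simplified records (stand-ins for WARC entries)."""
    return [{"url": u, "content": TOY_WEB[u][0]} for u in urls if u in TOY_WEB]


discovered = frontier_crawl(["https://a.example/"])
batches = partition(discovered, 2)
records = [rec for batch in batches for rec in worker_fetch(batch)]
```

Separating URL discovery from fetching lets the fetching step be spread over several machines, which is the property the federated architecture relies on.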
The system is designed in a way that it federates
storage and computing capacities across several high
performance computing centres across Europe and can
be dynamically extended with additional computing
centres being added to the federation. To access the
index, providers of LLMs and RAG systems or other
scientific users of the OWI can authenticate themselves
via a public system and can access and retrieve parts of
the index via a command line tool. Currently the web
data is made available under a research licence, but the
research team is also working to grant access to the
system for commercial purposes.
The public accessibility of the index is intended to
strengthen freedom in internet searches and to form a
basis for innovations in science and economy. The
researchers have now crawled around 2.23 billion
URLs in 185 different languages. The Open Web Index
currently has a volume of around 14 TB and is already
available to interested developers for initial tests.
However, Google’s index, with a volume of around
100,000 TB, is much larger, as it also includes
thumbnails and other data, whereas the OWI currently
includes text data only[6].
ACTORS
When assessing the Open Web Index and its use
cases of LLM training and RAG from a legal and user
acceptance perspective, a distinction must be made
between different stakeholders. Firstly, there are the
data subjects. This role describes persons or companies
whose personal data and intellectual property are stored
in the OWI and are used or may be accessed by tools
based on the OWI. The index itself is developed and
maintained by the OWI developer. The OWI
developers have joined together to form an independent
legal entity, the operator consortium. The data
retrievers or data consumers of the information
contained in the OWI are referred to as application
developers. They are persons or systems that request
the retrieval of web data from the index in order to
create and develop various tools and models based on
the retrieved data. They can be individuals,
organisations, companies, public institutions or
start-ups that use OWI's open data to develop their own
applications and services. Finally, end users also come
into contact with the index. They are the natural or
legal persons who use the tools and systems developed
by the application developers.
USE CASES OF THE OWI
The Open Web Index can be used in various ways,
e.g. as a basis for search engines (see [7] and [8] for
more details). This paper focuses on the use of the
index’s data to train AI systems, i.e. the training of
LLMs and the development of RAG systems.
Training of LLMs
For the training of LLMs, comprehensive and
high-quality pre-filtered web data are essential. A common
example of such an LLM is Mistral Large 2 by the
French company Mistral AI, on which the company’s
chatbot is based. These models are trained with large
amounts of text data, which mainly come from online
sources. With the help of machine learning using
neural networks and deep learning methods, LLMs
learn to recognise statistical relationships between
words and sentences in order to understand and
generate texts.
The OWI contributes to the development of new
LLMs by providing smaller companies with a
sufficient amount of training data at low cost. By
offering an open and transparent alternative to
proprietary datasets, the OWI enables start-ups and
research institutions to train their own models without
relying on a few dominant players in the field. This not
only fosters innovation and diversity in AI
development but also promotes fair competition. As a
result, smaller companies can develop high-quality
products that can compete with the leading language
models currently available on the market.
Retrieval Augmented Generation
RAG is an advanced approach in AI that enhances
text generation models by integrating an information
retrieval component. This method combines the
generative capabilities of language models with the
precision of retrieving relevant data from external
knowledge sources such as databases, documents, or
the web.
The first key component of RAG is the language
model, which is trained on vast amounts of text data to
understand language and generate coherent responses.
This model serves as the foundation for answering
queries and can be trained with OWI data. The second
component, retrieval, dynamically accesses a
knowledge base to fetch relevant information in real
time. The OWI can serve as such a knowledge base. By
combining these two elements, RAG enables AI
systems to produce responses that are not only
contextually accurate but also based on the most
current available data.
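
The interplay of the two components can be sketched in a few lines. The following is a simplified illustration under stated assumptions, not a description of any production RAG system: the three-document corpus, the keyword-overlap ranking and all function names are hypothetical, and `generate` is a placeholder where a real system would call a language model.

```python
import re

# Hypothetical mini knowledge base; in the scenario discussed here it would be the OWI.
DOCUMENTS = [
    {"url": "https://example.org/owi", "text": "The Open Web Index is a European web index."},
    {"url": "https://example.org/rag", "text": "Retrieval augmented generation combines search with a language model."},
    {"url": "https://example.org/warc", "text": "Crawled pages are stored in WARC files."},
]


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by the number of terms shared with the query; keep the top k hits."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d["text"])), d) for d in documents]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query, passages):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{i + 1}] {p['text']} ({p['url']})" for i, p in enumerate(passages))
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"


def generate(prompt):
    """Placeholder for a language model call; it only reports the prompt size."""
    return f"(model output for a prompt of {len(prompt)} characters)"


query = "What is the Open Web Index?"
passages = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, passages)
answer = generate(prompt)
```

Because the sources are passed along with their URLs, the generated answer can cite where its information came from, which is relevant for the transparency and copyright questions discussed below.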
This approach is particularly valuable in areas where
precise and up-to-date information is crucial, such as
customer support, medical consultation, legal research,
and knowledge management. By overcoming the
limitations of static training data, RAG ensures that
AI-driven solutions remain relevant and reliable, even in
rapidly evolving fields.
LEGAL ASPECTS
Relevant legislation
The complexity of the Open Web Index raises many
questions as to how the index and its applications fall
under Union and member state law. European law on
data and online services has undergone major changes
in recent years. Where initially mainly the General
Data Protection Regulation (GDPR, Regulation (EU)
2016/679) laid down detailed rules on the handling of
personal data, now a network of more or less
specialized, directly applicable legal acts has emerged.
The Digital Services Act (DSA, Regulation (EU)
2022/2065), the Data Governance Act (DGA,
Regulation (EU) 2022/868), the Data Act (DA,
Regulation (EU) 2023/2854), the Digital Markets Act
(DMA, Regulation (EU) 2022/1925) and the AI Act
(AIA, Regulation (EU) 2024/1689) form a legal
framework for digital services and business models[9].
These regulations are directly and uniformly applicable
in all member states.
At the same time, there is a harmonized copyright
law framework in the European Union. It is primarily
governed by a combination of EU Directives,
international treaties, and national laws of member
states. While the EU aims to harmonize copyright laws
across its member states, variations still exist at the
national level.
This article focuses on legal challenges faced by the
OWI developer and the LLM and RAG application
developers across the regulatory domains of AI regulation
and copyright. The initial assessment of the use case
highlights the complexities and the need for ongoing
studies.
Regulation of AI-Systems
The AIA aims to regulate systems and practices in
the field of artificial intelligence. It was adopted in
order to create a robust and flexible legal framework
that makes the use of AI and automated
decision-making systems trustworthy and secure. The AIA
introduces a uniform framework for AI systems based
on a risk-based approach, see Recital 26 AIA. AI
systems are defined in Article 3 No. 1 as a
machine-based system that is designed to operate with varying
levels of autonomy and that may exhibit adaptiveness
after deployment, and that, for explicit or implicit
objectives, infers, from the input it receives, how to
generate outputs such as predictions, content,
recommendations, or decisions that can influence
physical or virtual environments. The higher the risk,
the more substantial are the obligations put on
operators (see Article 3 No. 8 AIA for definition) of AI
systems. AI systems with unacceptable risks, e.g.
systems that allow “social scoring” by governments or
companies, are considered a clear threat to people's
fundamental rights and are therefore banned pursuant
to Article 5 AIA. To address their specific transparency
risk, AI systems like chatbots must clearly inform users
that they are interacting with a machine, while certain
AI-generated content must be labelled as such, see
Article 50 AIA. Only a few AI systems with limited
risk face no obligation under the AIA. High-risk AI
systems according to Article 6 AIA and Annex I and III
AIA, on the other hand, such as AI-based medical
software or AI systems used for recruitment, must
comply with strict requirements, including
risk-mitigation systems, high-quality data sets, clear user
information and human oversight.
Direct applicability of AIA. The developer of the
OWI would have to examine to what extent the
provisions of the AIA directly apply to the
technologies used to facilitate the OWI. It is plausible
that the algorithms utilized by the OWI to assist and
coordinate web crawling will be categorized as
minimal risk, as they primarily focus on internal
operations. However, to conclusively establish this, a
thorough risk assessment is required. Following the
risk-based approach, AI systems “that may have a
significant adverse impact on the health, safety and
fundamental rights of persons” (Recital 46 AIA) are
classified as high-risk AI systems in Article 6 AIA,
whereby a distinction is made between high-risk AI
systems in connection with product regulation (para. 1)
and stand-alone high-risk AI systems (para. 2).
AI systems that are safety components of products
(Article 3 No. 14 AIA) or are themselves products
covered by the harmonization legislation in Annex I
(e.g. machinery, toys, elevators, radio equipment,
cableways, medical devices, motor vehicles and
aircraft) are deemed high-risk systems. The OWI
would probably not be classified as a high-risk system
pursuant to Article 6 para. 1 AIA, since it is not used as
a safety component covered by Annex I of the AIA.
Article 6 of the AIA also designates high-risk AI
systems as those enumerated in Annex III. This
includes AI systems in biometrics, critical
infrastructures, education, employment, basic services,
law enforcement, migration, asylum and border
control, as well as the administration of justice and
democratic processes. AI used in indexing, as well as in
determining the exclusion or inclusion of certain
website content, generally has tangible external
impacts on the index usage by third parties. However,
it remains unlikely that these operations, or the entirety
of the index, might be classified under one of the
sectors specified in Annex III, with the exception of the
"critical infrastructures" listed as No. 2.
Indirect applicability. Furthermore, it must be
asked if the AIA contains regulations that influence the
use of OWI data for specific AI applications. For
example, certain AI systems are banned under Article 5
AIA. The OWI developer might therefore seek to
prohibit the use of its index for the training of such AI
systems with unacceptable risks. This could be
achieved by means of the index licence or the terms and
conditions of using the OWI as a service. The OWI
operator would have certain leeway, since the
prohibition clause does not prevent scientific research
into the use of AI systems merely capable of prohibited
practices. Furthermore, according to Article 2 para. 6
AIA, the regulation does not apply to AI systems or AI
models, including their outputs, which are developed
and put into operation for the sole purpose of scientific
research and development, see Article 3 No. 11 AIA.
This is intended to promote innovation and protect
scientific freedom, see Recital 25 AIA. In accordance
with Article 13 Charter of Fundamental Rights of the
European Union, scientific research includes activities
with the aim of “gaining new knowledge in a
methodical, systematic and verifiable manner”. This
includes basic research and applied research in the
public (e.g. universities) and private (e.g. industrial
research) sectors. Development includes the
application and implementation of the knowledge
gained through research. Still, the exception is to be
interpreted narrowly in terms of wording as well as
meaning and purpose.
Another example is that, pursuant to
Article 10 para. 2-5 AIA in conjunction with Article 10
para. 1 AIA, high-risk AI systems must be developed
with training, validation and test data sets that meet
certain quality criteria. Article 10 para. 3 of the AIA
stipulates that training, validation and test data sets
must be relevant, sufficiently representative and, as far
as possible, error-free and complete with regard to the
intended purpose. Among other things, it is
questionable whether legally erroneous data (e.g. data
obtained in violation of data protection or copyright
law) or data anonymized or pseudonymized for data
protection reasons (e.g. due to added noise) can still be
considered error-free and complete[10]. The data
records must also have the appropriate statistical
characteristics, if necessary also with regard to the
persons or groups of persons for whom the high-risk AI
system is to be used as intended. OWI data provides a
massive and diverse source of information, including
text and links. This diversity is crucial for training AI
models. In order to be usable under the quality criteria
of high-risk AI systems, the OWI developer should
use specific techniques to ensure data quality, e.g. data
cleaning, augmentation, balancing or annotation. At the
same time, it could try to create its datasets in a way
that subsequent application developers can build
on to ensure data quality for their specific use case.
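
Two of the techniques named above, data cleaning and balancing, can be sketched as follows. This is a deliberately small illustration and not a compliance recipe for Article 10 AIA: the record fields `text` and `lang`, the exact-duplicate rule and the per-language cap are assumptions chosen for the example.

```python
from collections import defaultdict


def clean(records):
    """Normalise whitespace, then drop empty texts and exact duplicates."""
    seen, out = set(), []
    for rec in records:
        text = " ".join(rec["text"].split())
        if text and text not in seen:
            seen.add(text)
            out.append({**rec, "text": text})
    return out


def balance(records, key, cap):
    """Keep at most `cap` records per value of `key` (simple downsampling)."""
    counts, out = defaultdict(int), []
    for rec in records:
        if counts[rec[key]] < cap:
            counts[rec[key]] += 1
            out.append(rec)
    return out


# Hypothetical sample records.
raw = [
    {"text": "Hello   world", "lang": "en"},
    {"text": "Hello world", "lang": "en"},
    {"text": "", "lang": "en"},
    {"text": "Another page", "lang": "en"},
    {"text": "A third page", "lang": "en"},
    {"text": "Hallo Welt", "lang": "de"},
]
cleaned = clean(raw)
balanced = balance(cleaned, key="lang", cap=2)
```

Keeping such steps as separate, documented functions would also allow application developers to rerun or tighten them for their own intended purpose.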
Copyright
Within the European Union, intellectual property
rights are mainly determined by European law, but are
implemented mostly at national level. For copyright
law, which is particularly relevant in the context of
LLM training and RAG, the EU legislator has adopted
provisions in the Copyright Directive (2001/29/EC)
and the Directive on Copyright in the Digital Single
Market (2019/790), which have been implemented into
national law by the member states. In the following, the
legislation in Germany (mainly the German Act on
Copyright and Related Rights – UrhG) is taken as an
example.
The UrhG defines the extent to which
copyright-protected content may be indexed in the OWI and used
by applications based on the index. The OWI contains
a large amount of data, most of which are protected
works under Section 2 UrhG. These works are
regularly reproduced as part of their inclusion in the
index. However, the right of reproduction defined in
Section 16 UrhG is, in principle, granted to the author
of the work and not to the OWI or application
developers, in accordance with Section 15 para. 1 No. 1
UrhG. Copyright-relevant, at least temporary,
reproductions cannot be avoided when creating the
index and training LLMs with the index data.
However, the Copyright Act contains various
exceptions that can justify acts of reproduction. In
2021, the German legislator created the exception rule
of Section 44b UrhG for general text and data mining
in implementing Article 4 of Directive
No. 2019/790. This is designed to make it possible to
analyse large amounts of digital information[11].
According to the legal literature[12-20], the training
of AI models can usually be justified by Section 44b
UrhG, and case law also shows a slight tendency in this
direction[21]. However, any reservations of the creator
pursuant to Section 44b para. 3 UrhG must be taken
into account. If the rights holder opposes the use of
their website content for LLM training or RAG
development, the application and OWI developers must
adhere to the content owner’s reservations.
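
How a crawler or application developer might check for such a reservation can be sketched with the Python standard library. The sketch rests on a loud assumption: it treats a robots.txt rule as the signal of the rights holder's reservation, which this paper does not state, and whether robots.txt satisfies Section 44b para. 3 UrhG is a separate legal question. The user agent names and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that reserves its content against AI training
# but allows general crawling outside /private/.
ROBOTS_TXT = """\
User-agent: example-ai-trainer
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_use(robots_txt, user_agent, url):
    """Return True if the given robots.txt allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://site.example/articles/1"
allowed_for_search = may_use(ROBOTS_TXT, "example-search-crawler", url)
allowed_for_training = may_use(ROBOTS_TXT, "example-ai-trainer", url)
```

Storing the outcome of such a check as metadata alongside each indexed page would let downstream developers filter the data by permitted use.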
In the case of acts of reproduction created by
crawlers during the indexing of web content,
Section 44a UrhG also comes into question. In this
provision, the legislator provides for an exception for
acts of reproduction that are only of a temporary nature
and part of a technical process, have no independent
economic significance and serve one of the purposes
named in Section 44a UrhG.
USER ACCEPTANCE AND TRUST
While legal compliance, data protection, and privacy
are highly valued in Europe, user behaviour suggests
that these factors often take a backseat when choosing
digital services. In practice, other aspects tend to have a
stronger influence on users' decisions to adopt new
technologies. Many research studies have focused on
investigating concepts and aspects of user perception
and users' intention to use and trust technologies[22-26].
Therefore, incorporating key elements of user
acceptance into the development of OWI and related
tools is essential. To apply user acceptance and trust
principles effectively to OWI, our study combines
insights from Trust-TAM[26] and UTAUT2[24]. The
latter identifies seven key factors (such as
performance and effort expectancy, social influence,
and facilitating conditions), all of which we aim to
adapt to the OWI framework. In terms of establishing
trust with developers as actors in these use cases,
knowledge-based trust could be conceived by introducing
familiarity as an antecedent of this trust. Supporting the
developers’ familiarity with the structure and
functionalities of OWI promotes their confidence and
trust in the interaction with OWI. In particular, this
familiarity could be implemented by providing
web data for LLM training in a format that is standard
in these contexts, thus reducing the cognitive load
required to acquire web data from OWI for the
discussed matters. This adaptation is still in its early
stages and represents a work in progress, with the
mentioned ideas serving as an initial foundation for
further development.
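
As an illustration of what delivery in a familiar format could look like, the following sketch serialises index records as JSON Lines, one JSON object per line. Both the choice of JSON Lines and the field names `url`, `lang` and `text` are assumptions made for this example; the paper itself only names WARC and CIFF as formats used by the OWI.

```python
import json


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(rec, ensure_ascii=False, sort_keys=True) for rec in records)


def from_jsonl(payload):
    """Parse a JSON Lines payload back into a list of records."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


# Hypothetical records as an application developer might receive them.
records = [
    {"url": "https://a.example/", "lang": "en", "text": "Home of A"},
    {"url": "https://b.example/", "lang": "de", "text": "Startseite von B"},
]
payload = to_jsonl(records)
restored = from_jsonl(payload)
```

A line-oriented format of this kind can be streamed and split without loading the whole dataset, which keeps the effort for developers low.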
FUTURE WORK
Regardless of the previously conducted analysis of
the legal framework, some issues in the context of the
use of the OWI for LLM training and the development
of RAG are still open. In future work, the interaction,
including contractual relationships, between the
developer of the OWI and the developer of LLMs and
RAG would need to be clarified. Any legal loopholes
or unintended consequences de lege lata should be
addressed by further developing the law, either on the
Union level or, where possible, on the national level. In
this context, the focus should be on creating simple and
concise provisions that are easy for developers to
implement. The PriDI project will also focus on this in
its future work and specify the legal requirements
outlined above in the form of requirement and design
patterns.
The OWI can also be used to train other web
data-based AI applications, such as knowledge
representation and reasoning (KRR) systems. In
contrast to LLMs, which use statistics to produce texts,
KRR systems represent information in a way that a
computer can understand it and solve complex
problems like a human. The aim is to create intelligent
machines that learn from human knowledge and act in
the same way. KRR systems are used, for example, in
quality management to monitor product quality or to
prevent fraud in the insurance industry. Future
publications will need to consider whether the above
legal requirements also apply to KRR systems.
CONCLUSION
The OWI data can be used for the training of LLMs
and RAG development, since its data would be
comprehensive and high-quality, pre-filtered web data.
While the OWI operator conceivably would not fall
under the AIA, the LLM and RAG developer most
likely would. Depending on its specific use case, the
LLM or RAG system could even be classified as a
high-risk AI system. Either way, the quality of the
provided data as well as the legality of its content and
its provision are paramount. Even if the AIA were not
applicable to the specific use case (e.g. because of
Article 2 para. 6 or 12 AIA), data protection law as
well as copyright law most certainly still would be.
With regard to data protection law, the OWI developer
would have to make sure that it is allowed to share or
publish the personal data it has collected. The LLM
and RAG developer would have to make sure that it is
allowed to process the personal data on the basis of one
of the authorisations in Article 6 GDPR. For example,
the AI or RAG developer most likely would be
allowed to process publicly available personal data of
persons linked to a business under Article 6 para. 1
subpara. 1 lit. f GDPR.
Likewise, with regard to copyright, the LLM and RAG
developer would still have to make sure that it may use
the provided data. Ideally it could rely on a legal
limitation of the copyright, like Section 44b UrhG.
REFERENCES
[1] According to DemandSage,
https://www.demandsage.com/perplexity-ai-statistics/
the RAG system Perplexity AI has about
2 million daily active users.
[2] European Commission,
https://commission.europa.eu/topics/eu-competitiveness/draghi-report_en, p. 6.
[3] For more information on the research project see its
website https://pridi-projekt.de/home-en/.
[4] For more information on the OWS.EU project see its
website https://openwebsearch.eu/.
[5] More details on the creation and operation of the OWI
can be found in G. Hendriksen et al., “The Open Web
Index. Crawling and Indexing the Web for Public Use”,
in Advances in Information Retrieval: 46th European
Conf. on Information Retrieval. ECIR 2024, Glasgow.
UK. March 2024. pp. 130-143.
[6] Google,
https://www.google.com/intl/en_us/search/howsearchworks/how-search-works/organizing-information/.
[7] Several of the applications are listed in D. Nowakowski,
N. Zimmermann and L. Kerner, “Market potential
assessment of OpenWebSearch.eu - Exploring the
economic and societal impact of an Open Web Index”,
Mücke/Roth, Germany, Rep., 2024,
https://openwebsearch.eu/wp-content/uploads/2024/09/MarketAssessmentOfOWI-Report-V1.pdf
[8] P. C. Johannes, L. Beer and H. Koulani, “Legal
challenges of using the OWI for search engines”,
presented at OSSYM 2025 - 7th Int. Open Search
Symposium, Helsinki, Finland, Oct. 2025, paper #####,
this conference, submitted.
[9] Also see C. Geminn and P. C. Johannes, Handbuch
europäisches Datenrecht. Baden-Baden, Germany:
Nomos, in preparation.
[10] I. Vogel et al., “Natural Language Processing (NLP) und
der Datenschutz - Chancen und Risiken für den Schutz
der Privatheit”, in Informatik 2022. Lecture Notes in
Informatics (LNI), p./659.
doi: 10.18420/inf2022_24
[11] German Bundestag,
https://dserver.bundestag.de/btd/19/274/1927426.pdf
[12] E.g. D. Bomhard and J. Siglmüller, “AI Act - das
Trilogergebnis”, Recht Digital, vol. 5, no. 2, p. 50, 2024.
[13] M. Dregelies, “KI-Training unter dem AI Act”,
Gewerblicher Rechtsschutz und Urheberrecht, vol. 126,
no. 20, p. 1484, 2024.
[14] R. Heine, “Generative KI: Nutzungsrechte und
Nutzungsvorbehalt”, Gewerblicher Rechtsschutz und
Urheberrecht in der Praxis, vol. 16, no. 4, p. 88, 2024.
[15] F. Hofmann, “Retten Schranken Geschäftsmodelle
generativer KI-Systeme?”, Zeitschrift für Urheber- und
Medienrecht, vol. 68, no. 3, p. 166, 2024.
[16] L. Kaede, “Training generativer KI-Modelle ist (auch)
Text- und Data-Mining. Anwendbarkeit der
TDM-Schranke des §44b UrhG”, Künstliche Intelligenz und
Recht, vol. 1, no. 5, p. 162, 2024.
[17] N. Maamar, “Urheberrechtliche Fragen beim Einsatz
von generativen KI-Systemen”, Zeitschrift für Urheber-
und Medienrecht, vol. 67, no. 7, p. 481, 2023.
[18] K. Wagner, “Generative KI: Eine “Blackbox”
urheberrechtlicher Haftungsrisiken?. Balanceakt
zwischen Innovationsförderung und effektivem
Rechtsschutz für Werke Dritter”, Zeitschrift für IT-Recht
und Recht der Digitalisierung, vol. 27, no. 4, p. 298,
2024.
[19] Contrary opinion: T. W. Dornis and S. Stober,
Urheberrecht und Training generativer KI-Modelle:
Technologische und juristische Grundlagen.
Baden-Baden, Germany: Nomos 2024.
doi: 10.5771/9783748949558
[20] Contrary opinion: T. W. Dornis, “Generatives
KI-Training und Text- und Data-Mining. Eine funktionale
Unterscheidung”, Künstliche Intelligenz und Recht, vol.
1, no. 5, p. 156, 2024.
[21] LG Hamburg, Judgement of 27 September 2024,
Reference 310 O 227/23,
https://www.itm.nrw/wp-content/uploads/2024/09/2495651-en.pdf
[22] F. D. Davis, “User acceptance of information
technology: system characteristics, user perceptions and
behavioral impacts”, International Journal of
Man-Machine Studies, vol. 38, no. 3, pp. 475–487, Mar. 1993.
doi: 10.1006/imms.1993.1022
[23] V. Venkatesh, M. Morris, G. Davis, and F. D. Davis,
“User acceptance of information Technology: toward a
unified view”, MIS Quarterly, vol. 27, no. 3, p. 425, Jan.
2003. doi: 10.2307/30036540
[24] V. Venkatesh, J. Thong, and X. Xu, “Consumer
Acceptance and use of Information technology:
Extending the unified theory of acceptance and use of
technology”, MIS Quarterly, vol. 36, no. 1, p. 157, Jan.
2012. doi: 10.2307/41410412
[25] M. Söllner, A. Hoffmann, and J. M. Leimeister, “Why
different trust relationships matter for information
systems users”, European Journal of Information
Systems, vol. 25, no. 3, pp. 274–287, Dec. 2015.
doi: 10.1057/ejis.2015.17
[26] D. Gefen, E. Karahanna, and D. Straub, “Trust and TAM
in online shopping: an integrated model”, MIS Quarterly,
vol. 27, no. 1, p. 51, Jan. 2003.
doi: 10.2307/30036519