
Small-scale Domain-specific Web Crawling for Complementing Established LLM Data Sources

Author: Helfer, Felix; Eckart, Thomas; Körner, Erik; Schröder, Christopher; Binder, Frank
Publisher: Zenodo
DOI: 10.5281/zenodo.17278835
Source: https://zenodo.org/records/17278835/files/poster_ws_osc25.pdf
[Figure: Source URL Overlaps. Diagram of three datasets: OSCAR 22.01 + 23.01 German, LCC Crawl 2022 German, LCC Crawl 2023 German. Overlap values shown: 4.04%, 4.57%, 20.33%.]
[Figure: Total Number of Source URLs. Values shown: 572,167,630; 643,637,710; 172,075,879. Series: LCC DE 2022, LCC DE 2023, OSCAR DE 22.01+23.01.]
2 - https://oscar-project.org
We compare the LCC's annual news and web crawls of the .de top-level domain
with the German subset of the OSCAR² corpus, a popular resource for machine
learning and artificial intelligence applications, for the same crawling period.
Using identical content extraction and sentence-level deduplication procedures
on both resources and considering their levels of content density,
we establish solid estimates of their degree of complementarity.
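
To make the idea of sentence-level deduplication concrete, here is a minimal sketch. It is not the LCC's actual procedure: the whitespace normalization, the SHA-1 keys, the whitespace tokenizer, and the toy documents are all assumptions made for illustration only.

```python
import hashlib


def dedupe_sentences(documents):
    """Keep only the first occurrence of each sentence across all documents.

    `documents` is a list of documents, each a list of sentence strings.
    """
    seen = set()
    result = []
    for sentences in documents:
        kept = []
        for sentence in sentences:
            # Collapse whitespace so trivially different copies compare equal
            # (an assumed normalization, chosen only for this sketch).
            normalized = " ".join(sentence.split())
            key = hashlib.sha1(normalized.encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(kept)
    return result


def token_count(documents):
    """Whitespace token count, a crude stand-in for a real tokenizer."""
    return sum(len(s.split()) for doc in documents for s in doc)


# Toy documents, invented for illustration.
docs = [
    ["Das ist ein Satz.", "Alle Rechte vorbehalten."],
    ["Ein neuer Satz.", "Alle  Rechte vorbehalten."],
]
deduped = dedupe_sentences(docs)
print(token_count(docs), token_count(deduped))
```

Applying the same function to both resources before counting tokens is what makes the resulting counts comparable; a production pipeline would likely need a more memory-efficient store for the seen keys than an in-memory set.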
The German LCC crawl of 2023 (396 billion tokens in total) yields at least
35.9 billion tokens of cleaned text data deduplicated at document-level, with
around 5% overlap with the German OSCAR subset of 2022 and 2023 and
around 20% overlap with the LCC data of the previous year's crawling round.
Overlaps were calculated by comparing the URL source lists of the three datasets
with each other, as a heuristic for general content overlap.
This demonstrates that:
–Small-scale crawling infrastructures that ensure complete control over source selection and
thematic focus can still provide a significant gain in LLM training material compared to
established resources.
–Established training datasets based on crawling are still far from achieving 100% coverage
of the Web.
–Regular re-crawls (e.g., on an annual basis) continue to result in significant growth in new
data and potential training material.
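
The URL-list comparison used as an overlap heuristic can be sketched as a set intersection. This is a hedged illustration, not the authors' implementation: the normalization rules, the choice of denominator (distinct URLs of the first list), and the example URLs are all assumptions, since the poster does not specify them.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Light URL normalization (an assumption, not the poster's rule)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def overlap_share(urls_a, urls_b) -> float:
    """Share of distinct URLs in A that also occur in B (0.0 to 1.0)."""
    set_a = {normalize_url(u) for u in urls_a}
    set_b = {normalize_url(u) for u in urls_b}
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


# Toy URL lists, invented purely for illustration.
crawl_2023 = [
    "https://example.de/a",
    "https://example.de/b/",
    "https://example.de/c",
    "https://example.de/d",
]
crawl_2022 = ["https://EXAMPLE.de/b", "https://example.de/x"]

print(f"{overlap_share(crawl_2023, crawl_2022):.2%}")
```

For source lists with hundreds of millions of entries, in-memory sets may not be practical; sorting both lists on disk and merging them, or comparing hashed URLs, would be the usual alternatives.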
Dataset Comparisons
[Figure: Processing Impacts on Token Counts (in Millions). Bar chart with the series Raw, After cleaning, After deduplication for LCC DE 2022, LCC DE 2023, and OSCAR-DE; axis from 0 to 450,000.]
Data Usage for Large Language Models
Since June 2021, we have annually collected over 130 billion tokens of cleaned documents (or
~27-35 billion tokens after sentence-wise deduplication), which can be freely used as
current, diverse, high-quality and sourceable training data for LLMs and other forms of text
and data mining.
Use-cases include:
–Training of a German foundation model at the Center for Scalable Data Analytics and
Artificial Intelligence (ScaDS.AI).
–Constrained Retrieval-Augmented LLMs (CORAL): Research project examining LLMs
under legal, technical, resource-based and data-based constraints.
–Multilingual, day-to-day data to monitor European news and for digital press reviews.
[Figure: pipeline diagram with the labels DATA SOURCES, Crawling, Cleaning, Enrichment.]
https://wortschatz-leipzig.de
For more than 30 years, the Leipzig Corpora
Collection (LCC), or Wortschatz Leipzig, has provided
digital corpora and corpus-based dictionaries.
Currently, data for more than 250 languages
can be accessed and downloaded. For many of
those languages, the project provides the
largest freely available text resources on the
web.
This is possible due to the LCC's own crawling
infrastructure and processing pipeline, both of
which were developed and grew with the
project. These processes were built to enable
crawling with limited hardware and the creation
of smaller- and larger-scale datasets, depending
on availability of sources and resources.
This infrastructure allows for the targeted
creation of domain-specific corpora, many of them
the only ones of their kind.
The LCC data processing pipeline can be
divided into three major steps:
1) Crawling and scraping of web page text,
using the Internet Archive's open-source
crawler Heritrix¹.
2) Cleaning of the raw HTML data, i.e. text
extraction, language separation, removal of
non-text and other artefacts, sentence and
token splitting, etc.
3) Various data enrichments are added in
subsequent post-processing steps: different
annotations, like part-of-speech tags, and
language statistics such as word frequencies
and co-occurrence data.
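
The language statistics named in step 3 can be illustrated with a small sketch. It assumes pre-tokenized, lowercased sentences and counts co-occurrence at sentence level; both the window choice and the toy sentences are assumptions for illustration, not a description of the LCC's enrichment tooling.

```python
from collections import Counter
from itertools import combinations


def word_frequencies(sentences):
    """Count how often each token occurs over all sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


def sentence_cooccurrences(sentences):
    """Count how many sentences contain each unordered pair of distinct words."""
    pairs = Counter()
    for tokens in sentences:
        # Sorting the distinct tokens gives each pair one canonical key.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy tokenized sentences, invented for illustration.
sentences = [
    ["leipzig", "ist", "eine", "stadt"],
    ["leipzig", "ist", "schön"],
]
freq = word_frequencies(sentences)
cooc = sentence_cooccurrences(sentences)
print(freq.most_common(2))
print(cooc[("ist", "leipzig")])
```

Raw pair counts like these are usually only a starting point; a significance measure over the counts would typically be applied before pairs are reported as co-occurrences.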
1 - https://github.com/internetarchive/heritrix3
LCC Web Crawling
The NFDI consortium Text+ is funded by the German Research
Foundation (DFG), project number 460033370.
Participants
Universität Leipzig / InfAI / ScaDS.AI: Christopher Schröder
Institut für Angewandte Informatik (InfAI): Frank Binder
Sächsische Akademie der Wissenschaften zu Leipzig: Thomas Eckart, Felix Helfer, Erik Körner