Preserving Scholarly Communications
on the Web with Open Metadata
Martin Czygan¹, Nathaniel Smith¹
¹ Internet Archive
Introduction
Scholarly communication documents on the web, particularly in the long tail, are exposed to
at least two kinds of decay: they vanish entirely (Laakso, Matthias, & Jahn, 2021) or they are
affected by citation or reference rot (Klein et al., 2014). Since 2017, the Internet Archive has
increased its focus on preserving scholarly material on the web to fulfill its mission, and also
to address data decay issues. The access site at scholar.archive.org
(https://scholar.archive.org) was launched in March 2021; the first iteration of a citation graph,
called refcat, was released in October 2021 (Czygan, Holzmann, & Newbold, 2021) [1]. In this
paper, we highlight a few aspects of the preservation process and recent developments,
specifically bulk bibliographic datasets and how potentially relevant links are discovered, and
we provide a brief update on the citation graph dataset (v3).
Open Bibliographic Metadata
First, preservation of scholarly communications on the web relies on openly available
metadata. In the past, we have included metadata gathered from bibliographic aggregators
like PubMed (Canese & Weis, 2013), DBLP (Ley, 2002), and DOAJ (Morrison, 2017) and from
DOI registrars like CrossRef (Hendricks, Tkaczyk, Lin, & Feeney, 2020) and DataCite (Brase,
2009) into our continuous cataloging pipeline. In addition, we used data from the (now
discontinued) Microsoft Academic (Sinha et al., 2015), which has been carried forward in
OpenAlex (Priem, Piwowar, & Orr, 2022), for targeted web archiving. We also expanded our
metadata acquisition from the web by accessing more than 150,000 endpoints across over
40,000 domains implementing OAI-PMH (Lagoze, Van de Sompel, Nelson, & Warner, 2002),
an application-level protocol used by popular web publishing tools like Open Journal Systems
(Willinsky, 2005) and institutional repository software, such as DSpace, among others. A
recent version of this dataset, which we call oaiscrape [2], included about 200M [3] metadata
records in the Dublin Core format (Weibel, Kunze, Lagoze, & Wolf, 1998) representing various
types of records (publications, theses, datasets, media and others). After a cleanup process,
we discovered over 300M unique URL candidates [4] in this dataset alone (across all metadata
fields [5]); these 300M+ URLs are distributed across about 280K different domains [6]. Table 1
shows the frequency of the top-10 domains referenced in the oaiscrape records. Among these
URLs, we also find over 600K links to git hosting sites (mainly github.com), suggesting mentions
of software projects, project websites, or datasets (Escamilla et al., 2023).

[1] Most bibliographic data artifacts are available under the Bulk Bibliographic Metadata collection
(https://archive.org/details/ia_biblio_metadata).
[2] We make this dataset available in regular intervals as part of a metadata collection at IA biblio
metadata (https://archive.org/details/ia_biblio_metadata).
[3] We observe 196204847 unique records by identifier. While most records have distinct identifiers,
there exist records in the dataset (and upstream data) sharing an identifier while referring to different
records, which is permitted by the standard (Lagoze et al., 2002).
[4] For an exact count, additional rounds of data reviews are required.
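The harvesting step can be illustrated with a minimal sketch of an OAI-PMH `ListRecords` exchange. It is not the pipeline used for oaiscrape; the endpoint `https://example.org/oai`, the sample response and the function names are assumptions made for illustration. Only the verb, the `oai_dc` metadata prefix, the resumption token and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header_id = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {}
        # Group all Dublin Core elements by their local name (title, identifier, ...).
        for el in rec.iter():
            if el.tag.startswith(DC) and el.text and el.text.strip():
                fields.setdefault(el.tag[len(DC):], []).append(el.text.strip())
        records.append({"id": header_id.strip(), "fields": fields})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None


# Hypothetical response from a hypothetical endpoint, used instead of a network call.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
          <dc:identifier>https://example.org/thesis/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai", token)
```

A harvester would repeat the request with each returned token until the endpoint stops sending one; real endpoints additionally need error handling for malformed XML.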
Table 1. A list of the top-10 domains of links found in the oaiscrape dataset (2025)

Count       Domain
30535604    hdl.handle.net
15740750    doi.org
14469950    pr.cision.com
 8604306    gallica.bnf.fr
 7127858    www.kb.dk
 6791854    figshare.com
 5599077    prensahistorica.mcu.es
 5224331    kb-images.kb.dk
 4757477    dx.doi.org
 3915844    hal.science
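A domain frequency list like Table 1 can be derived by scanning every field value for URL-like strings and counting hosts over the unique URLs. The following is a minimal sketch under stated assumptions: the regular expression, the trailing-punctuation cleanup and the sample records are illustrative and do not reproduce the cleanup process applied to oaiscrape.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: anything starting with http(s):// up to whitespace or a quote is a candidate.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def url_candidates(record):
    """Yield URL-like strings from every field value of a record."""
    for values in record.values():
        for value in values:
            for match in URL_PATTERN.findall(value):
                # Strip punctuation that commonly trails a URL in running text.
                yield match.rstrip(".,;)")


def domain_counts(records):
    """Count hosts over the set of unique URL candidates."""
    unique_urls = set()
    for record in records:
        unique_urls.update(url_candidates(record))
    counts = Counter()
    for url in unique_urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            counts[host] += 1
    return counts


# Hypothetical records: field name -> list of values, as in Dublin Core.
sample_records = [
    {"identifier": ["https://hdl.handle.net/1234/5", "oai:example.org:1"],
     "relation": ["See https://github.com/example/project."]},
    {"identifier": ["http://hdl.handle.net/1234/6"],
     "description": ["Data at https://doi.org/10.1234/abcd, code at https://github.com/example/project"]},
]

counts = domain_counts(sample_records)
top_domains = counts.most_common(10)
```

Scanning all fields rather than only `identifier` mirrors the observation that not all providers place URLs in the preferred field.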
From these URL candidates, we generate seed lists for targeted crawls, which we conduct
with Heritrix (Mohr, Stack, Rnitovic, Avery, & Kimpton, 2004). Crawlers use custom
configurations for the task, addressing different web crawling situations such as persistent
identifier redirects (up to a maximum), following Google Scholar meta-tags [7], and others.
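One of the crawling situations mentioned above, following Google Scholar meta-tags, can be sketched outside of a crawler. The snippet below is not a Heritrix configuration; it is a standalone illustration that collects `citation_*` meta tags from a landing page and resolves `citation_pdf_url` against the page URL. The class name, function name and sample page are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name.startswith("citation_") and content:
            self.meta.setdefault(name, []).append(content)


def fulltext_links(html, page_url):
    """Return absolute full-text URLs advertised via citation_pdf_url."""
    parser = CitationMetaParser()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.meta.get("citation_pdf_url", [])]


# Hypothetical landing page of a hypothetical journal.
sample_page = """
<html><head>
<meta name="citation_title" content="An example article">
<meta name="citation_pdf_url" content="/article/download/1/2">
</head><body></body></html>
"""

links = fulltext_links(sample_page, "https://journal.example.org/article/view/1")
```

The resolved links could then be appended to a seed list so that the full text is captured alongside the landing page.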
Estimation of Metadata Overlap in Large Bibliographic Datasets
Internet Archive Scholar uses data from different sources to guide archiving efforts, and a
regular question is: to what extent does the metadata of these different sources overlap? We
ran an analysis across seven bibliographic data sources to understand the amount of
duplication in the metadata [8]: we looked at overlaps based on DOI in CrossRef, DataCite,
DOAJ, DBLP, oaiscrape, OpenAlex and PubMed. When we concatenate all DOIs present in
these seven datasets, we find 451,395,081 persistent identifiers. When keeping only unique
identifiers, we arrive at 235,399,926 DOIs [9]. We load the DOI lists into a DuckDB database
(Raasveldt & Mühleisen, 2019) and calculate the cardinality of their pairwise intersections and
differences; the results can be found in Table 2.

[5] While the OAI DC XML schema has more preferred fields for a URL, e.g. dc:identifier, not all metadata
providers follow the guidelines strictly.
[6] After verification against the current List of Top-Level Domains from ICANN:
https://www.icann.org/resources/pages/tlds-2012-02-25-en
[7] As described in: Google Scholar: Inclusion Guidelines for Webmasters
[8] Analysis conducted in 01/2025
Table 2. Pairwise comparison of DOI in bulk upstream datasets (01/2025). We extract and
normalize the DOI found in the various datasets, then calculate their intersection and
differences. All dataset pairs have DOI in common.

A          B               card(A)      card(B)        A ∩ B        A \ B        B \ A
crossref   datacite    165,644,551   62,966,529      177,878  165,466,673   59,394,867
crossref   dblp        165,644,551    6,461,206    6,028,963  159,615,588      432,243
crossref   doaj        165,644,551    9,004,954    7,411,626  158,232,925    1,593,328
crossref   oaiscrape   165,644,551    6,794,290    4,926,119  160,718,432    1,868,171
crossref   openalex    165,644,551  169,057,060  157,182,783    8,461,768   11,874,277
crossref   pubmed      165,644,551   31,466,491   26,771,400  138,873,151    4,695,084
datacite   dblp         62,966,529    6,461,206      303,452   59,269,293    6,157,754
datacite   doaj         62,966,529    9,004,954      107,795   59,464,950    8,897,159
datacite   oaiscrape    62,966,529    6,794,290      905,195   58,667,550    5,889,095
datacite   openalex     62,966,529  169,057,060    8,060,828   51,511,917  160,996,232
datacite   pubmed       62,966,529   31,466,491       10,566   59,562,179   31,455,918
dblp       doaj          6,461,206    9,004,954      284,159    6,177,047    8,720,795
dblp       oaiscrape     6,461,206    6,794,290      258,262    6,202,944    6,536,028
dblp       openalex      6,461,206  169,057,060    6,344,982      116,224  162,712,078
dblp       pubmed        6,461,206   31,466,491      355,755    6,105,451   31,110,729
doaj       oaiscrape     9,004,954    6,794,290      509,456    8,495,498    6,284,834
doaj       openalex      9,004,954  169,057,060    7,737,253    1,267,701  161,319,807
doaj       pubmed        9,004,954   31,466,491    3,830,635    5,174,319   27,635,849
oaiscrape  openalex      6,794,290  169,057,060    4,556,644    2,237,646  164,500,416
oaiscrape  pubmed        6,794,290   31,466,491      662,073    6,132,217   30,804,411
openalex   pubmed      169,057,060   31,466,491   26,968,174  142,088,886    4,498,310
[9] According to DOI.org, approximately 300M DOIs have been assigned to date.
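The pairwise comparison can be sketched on a small scale. The analysis itself was run in DuckDB; the snippet below uses plain Python sets as a stand-in, which expresses the same set operations but would not scale to hundreds of millions of identifiers. The normalization rules (lowercasing, stripping resolver prefixes, requiring a `10.` prefix) and the sample DOIs are assumptions for illustration.

```python
from itertools import combinations

# Assumed prefixes under which DOIs commonly appear in metadata.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI and strip resolver prefixes; return None if it is not a DOI."""
    doi = raw.strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi if doi.startswith("10.") else None


def pairwise_overlap(datasets):
    """Return rows (A, B, card(A), card(B), |A ∩ B|, |A \\ B|, |B \\ A|) for all pairs."""
    normalized = {
        name: {d for d in map(normalize_doi, dois) if d}
        for name, dois in datasets.items()
    }
    rows = []
    for a, b in combinations(sorted(normalized), 2):
        set_a, set_b = normalized[a], normalized[b]
        rows.append((a, b, len(set_a), len(set_b),
                     len(set_a & set_b), len(set_a - set_b), len(set_b - set_a)))
    return rows


# Hypothetical DOI lists standing in for the bulk datasets.
sample = {
    "crossref": ["10.1000/a", "10.1000/B", "https://doi.org/10.1000/c"],
    "dblp": ["10.1000/b", "doi:10.1000/d"],
    "pubmed": ["10.1000/c", "10.1000/e", "not-a-doi"],
}

rows = pairwise_overlap(sample)
```

In this sketch every set is deduplicated before counting, so the intersection and the difference always add up to the cardinality of the set.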
We find that each source has unique DOI contributions (see Table 3), validating our approach
of using multiple upstream sources to build a comprehensive metadata catalog.
Table 3. Unique DOI contributions from the analyzed datasets (01/2025); number of DOI we
find in a dataset, but in none of the others.

Dataset     Unique contributions (DOI)
datacite    50,689,363
crossref     7,764,417
pubmed       4,263,905
openalex     3,362,058
doaj         1,006,264
oaiscrape      820,600
dblp            63,537
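The notion of a unique contribution, a DOI found in one dataset and in none of the others, can be made concrete with a short sketch. It also shows the difference between the concatenated and the unique identifier counts. The data is hypothetical and the in-memory approach is an assumption for illustration, not the method used for the reported numbers.

```python
from collections import Counter


def unique_contributions(datasets):
    """Count, per dataset, the DOIs that occur in no other dataset."""
    seen_in = Counter()
    for dois in datasets.values():
        # Count each DOI once per dataset, regardless of internal duplicates.
        seen_in.update(set(dois))
    return {
        name: sum(1 for doi in set(dois) if seen_in[doi] == 1)
        for name, dois in datasets.items()
    }


def totals(datasets):
    """Return (concatenated count, unique count) over all datasets."""
    concatenated = sum(len(dois) for dois in datasets.values())
    unique = len(set().union(*datasets.values()))
    return concatenated, unique


# Hypothetical, already normalized DOI lists.
sample = {
    "crossref": ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/f"],
    "dblp": ["10.1000/b", "10.1000/d"],
    "pubmed": ["10.1000/c", "10.1000/e"],
}

contributions = unique_contributions(sample)
concatenated, unique = totals(sample)
```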
Additional Link Discovery with Sitemaps
At scale, data acquisition with the OAI-PMH mechanism can exhibit some challenges, as
endpoints use a wide variety of implementations on the server side, from well tested, widely
used open source projects to ad-hoc implementations, some of which do not even emit
well-formed XML. In general, we are interested in the bibliographic data, and harvesting OAI-PMH
metadata offers a computationally lightweight way to acquire this data, as opposed to more
elaborate methods, such as analysis of PDF data with GROBID (Romary & Lopez, 2015) [10] or
similar tools.
Additionally, metadata quality varies considerably across endpoints, which usually requires
additional code to compensate [11]. When focusing on preservation, we likely care about
publications first, and metadata second.
Websites may choose to implement sitemaps (Sitemaps XML format, 2005). Sitemaps follow
a standard schema and can use plain text or XML to list a number of links on a given domain,
in addition to metadata such as date of last update. While the sitemap format is standardized,
sitemap URLs are not, although common locations exist. We iteratively discovered the sitemap
locations specifically for domains running Open Journal Systems (OJS) [12] and expanded them,
if necessary [13].

[10] For GROBID processing, we typically use multicore machines; as GROBID can utilize deep
learning models for article segmentation, it benefits from the presence of a GPU. Meanwhile,
metadata harvesting can be done on low-end systems, merely having enough storage space
available.
[11] Typical issues include incomplete data, multiple data items in a single field, duplication, and
inconsistent formatting, among many other issues.
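Sitemap expansion can be sketched as follows: a sitemap document is either a `urlset` that lists pages or a `sitemapindex` that points to further sitemaps, which are followed until only page URLs remain. This is an illustrative sketch and not the implementation of the tool used for this work; the function names, the `fetch` callable and the sample documents are assumptions, while the XML namespace is the one defined by the sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol, in ElementTree's {uri}tag notation.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locations) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text.strip())
    if root.tag == SM + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == SM + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError("not a sitemap document: %s" % root.tag)
    locs = [el.findtext(SM + "loc", default="").strip() for el in root.iter(SM + child)]
    return kind, [loc for loc in locs if loc]


def expand(start_url, fetch, max_sitemaps=1000):
    """Follow nested sitemaps from start_url; fetch maps a URL to XML text."""
    queue, seen, pages = [start_url], set(), []
    while queue and len(seen) < max_sitemaps:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            kind, locs = parse_sitemap(fetch(url))
        except (ET.ParseError, ValueError):
            continue  # skip malformed or unrelated documents
        if kind == "index":
            queue.extend(locs)
        else:
            pages.extend(locs)
    return pages


# Hypothetical documents keyed by URL, used instead of network access.
NS_ATTR = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
documents = {
    "https://j.example.org/sitemap": (
        "<sitemapindex %s>"
        "<sitemap><loc>https://j.example.org/a/sitemap</loc></sitemap>"
        "<sitemap><loc>https://j.example.org/b/sitemap</loc></sitemap>"
        "</sitemapindex>" % NS_ATTR),
    "https://j.example.org/a/sitemap": (
        "<urlset %s><url><loc>https://j.example.org/a/article/view/1</loc></url></urlset>" % NS_ATTR),
    "https://j.example.org/b/sitemap": "<html>not a sitemap</html>",
}

pages = expand("https://j.example.org/sitemap", documents.__getitem__)
```

The `max_sitemaps` bound is a precaution: a single index can expand to a very large number of links, so a limit keeps an exploratory run predictable.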
We limited the sitemap exploration to OJS, because OJS sites typically contain only few pages
not directly related to actual publications (and we are interested in targeted crawls). We
generated a list of candidate sitemap locations and then used a high-performance link check
tool [14] to confirm existence. A minority of the assumed sitemap locations were valid; however,
we were able to generate a list of about 40M raw URLs and then trim this list down to 19M
likely HTML or PDF links, due to the regular URL structure of sites implementing OJS. After
deduplicating the 19M URLs against already preserved holdings at the Internet Archive, we
were able to generate a crawl seedlist of 10,081,479 unseen URLs from this approach [15].
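The trimming and deduplication steps can be sketched under explicit assumptions. The sitemap suffix `/sitemap` and the path pattern `/article/view/<id>` or `/article/download/<id>/<id>` reflect a common OJS URL layout and are assumptions here, not the rules used to produce the numbers above; the set of preserved URLs is a stand-in for a lookup against existing holdings.

```python
import re
from urllib.parse import urlsplit

# Assumption: OJS article and galley pages follow this path layout.
ARTICLE_PATH = re.compile(r"/article/(?:view|download)/\d+(?:/\d+)*/?$")


def candidate_sitemaps(journal_urls):
    """Derive one candidate sitemap location per journal base URL (assumed suffix)."""
    return [url.rstrip("/") + "/sitemap" for url in journal_urls]


def likely_publication_links(urls):
    """Keep URLs whose path looks like an article landing page or download."""
    return [url for url in urls if ARTICLE_PATH.search(urlsplit(url).path)]


def unseen(urls, preserved):
    """Drop duplicates and URLs that are already preserved, keeping input order."""
    seen = set(preserved)
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical raw URLs gathered from sitemaps of a hypothetical journal.
raw_urls = [
    "https://j.example.org/index.php/jrn/article/view/10",
    "https://j.example.org/index.php/jrn/article/download/10/22",
    "https://j.example.org/index.php/jrn/about/contact",
    "https://j.example.org/index.php/jrn/issue/view/3",
    "https://j.example.org/index.php/jrn/article/view/10",
]
already_preserved = {"https://j.example.org/index.php/jrn/article/view/10"}

candidates = candidate_sitemaps(["https://j.example.org/index.php/jrn/"])
seedlist = unseen(likely_publication_links(raw_urls), already_preserved)
```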
Internet Archive Scholar Citation Graph Update
In February 2024, we released an updated version (v3) of our Internet Archive Scholar citation
graph, called refcat [16] for short. After starting with over 3.5B raw reference entities collected
from metadata and from running GROBID (Romary & Lopez, 2015) over archived PDF
documents, we employed a series of matching techniques to detect relations to publications
in our catalog. The most recent version contains 2.173B edges, an increase of about 64%
compared to the initial dataset from 2021. While most edges are again determined by exact
matches (i.e. by identifier), about 7% of the edges are found through various fuzzy matching
algorithms developed at the Internet Archive [17]. In the spirit of the Open Citations project
(Peroni & Shotton, 2020) we aim to increase the amount of publicly available reference data
to foster open science and reuse of this kind of data for a variety of applications. As our next
iteration's catalog will likely see an increase in metadata coverage, we expect the next
iteration of the citation graph to be more comprehensive as well, as references are matched
against this catalog.
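The distinction between exact and fuzzy matches can be illustrated with a deliberately simple sketch. It is not the refcat matching code: the only fuzzy step shown is a comparison of normalized titles, and the record layout, the minimum title length of 20 characters and the sample data are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(catalog):
    """Index catalog records by DOI and by normalized title."""
    by_doi = {r["doi"].lower(): r["id"] for r in catalog if r.get("doi")}
    by_title = {normalize_title(r["title"]): r["id"] for r in catalog if r.get("title")}
    return by_doi, by_title


def match_reference(ref, by_doi, by_title, min_title_length=20):
    """Return (catalog id, match kind) or (None, None) for a raw reference."""
    doi = (ref.get("doi") or "").lower()
    if doi and doi in by_doi:
        return by_doi[doi], "exact"
    key = normalize_title(ref.get("title") or "")
    # Very short titles are too ambiguous to match on title alone.
    if len(key) >= min_title_length and key in by_title:
        return by_title[key], "fuzzy"
    return None, None


# Hypothetical catalog and references.
catalog = [
    {"id": "w1", "doi": "10.1000/a",
     "title": "Open is not forever: a study of vanished open access journals"},
    {"id": "w2", "doi": None, "title": "Introduction to Heritrix"},
]
by_doi, by_title = build_index(catalog)

exact = match_reference({"doi": "10.1000/A"}, by_doi, by_title)
fuzzy = match_reference(
    {"title": "Open is Not Forever - A Study of Vanished Open Access Journals."},
    by_doi, by_title)
missing = match_reference({"title": "Preface"}, by_doi, by_title)
```

Each successful match corresponds to one edge in a citation graph; trying the identifier first keeps the cheaper and more reliable check ahead of the title comparison.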
Summary and Outlook
We aim to continuously build and extend our metadata catalog, while providing access through
scholar.archive.org and making various data artifacts and open source software tools
available in the process. With the extension of our catalog, we expect the derived citation
graph to become more comprehensive over time as well. We also aim to further extend the
discovery of potential scholarly material by applying metadata and link-gathering techniques,
similar to the described approach using sitemaps, to other software tools regularly found
in use for institutional repositories and open access journals. The significant increase in usage
of large language models (Minaee et al., 2024) in recent years prompted a number of projects
aiming to extract structure from binary document formats, among them docling (Team, 2024)
and markitdown (markitdown, 2024), and we are evaluating these tools for our indexing
pipeline, which feeds documents to the search engine underlying our access portal.

[12] Via: https://openjournaltheme.com/what-is-the-ojs-omp-sitemap-location/
[13] We used a helper CLI tool called sitemapped. A sitemap can contain direct links or links to other
sitemaps. For example, the single sitemap of the CORE project found under core.ac.uk/sitemap.xml
would expand to over 200M links.
[14] Many tools in this space exist; we used a link checker called clinker, which can run several hundred
checks in parallel.
[15] The corresponding crawl was conducted between 07/2024 and 01/2025, yielding 4.1TB of crawl
data.
[16] Available under https://archive.org/details/refcat_2024-02-15
[17] Code available at https://gitlab.com/internetarchive/refcat
References
Brase, J. (2009). Datacite - a global registration agency for research data. In 2009 fourth
international conference on cooperation and promotion of information resources in
science and technology (pp. 257–261).
Canese, K., & Weis, S. (2013). Pubmed: the bibliographic database. The NCBI handbook,
2(1).
Czygan, M., Holzmann, H., & Newbold, B. (2021). Refcat: The internet archive scholar citation
graph. arXiv preprint arXiv:2110.06595. Retrieved from
https://arxiv.org/pdf/2110.06595
Escamilla, E., Salsabil, L., Klein, M., Wu, J., Weigle, M. C., & Nelson, M. L. (2023). It's not just
github: identifying data and software sources included in publications. In International
conference on theory and practice of digital libraries (pp. 195–206).
Hendricks, G., Tkaczyk, D., Lin, J., & Feeney, P. (2020). Crossref: The sustainable source of
community-owned scholarly metadata. Quantitative Science Studies, 1(1), 414–427.
Retrieved from https://direct.mit.edu/qss/article-pdf/1/1/414/1760913/qss_a_00022.pdf
Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K., & Tobin,
R. (2014). Scholarly context not found: one in five articles suffers from reference rot.
PloS one, 9(12), e115253. Retrieved from
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0115253&type=printable
Laakso, M., Matthias, L., & Jahn, N. (2021). Open is not forever: A study of vanished open
access journals. Journal of the Association for Information Science and Technology,
72(9), 1099–1112. Retrieved from
https://refubium.fu-berlin.de/bitstream/handle/fub188/30265/asi.24460.pdf?sequence=2&isAllowed=y
Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2002). Open archives initiative-
protocol for metadata harvesting-v. 2.0.
Ley, M. (2002). The dblp computer science bibliography: Evolution, research issues,
perspectives. In International symposium on string processing and information retrieval
(pp. 1–10).
markitdown. (2024). https://github.com/microsoft/markitdown
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J.
(2024). Large language models: A survey. arXiv preprint arXiv:2402.06196.
Mohr, G., Stack, M., Rnitovic, I., Avery, D., & Kimpton, M. (2004). Introduction to heritrix. In
4th international web archiving workshop (Vol. 15, pp. 109–115).
Morrison, H. (2017). Directory of open access journals (doaj). The Charleston Advisor, 18(3),
25–28.
Peroni, S., & Shotton, D. (2020). Opencitations, an infrastructure organization for open
scholarship. Quantitative Science Studies, 1(1), 428–444. Retrieved from
https://arxiv.org/pdf/1906.11964
Priem, J., Piwowar, H., & Orr, R. (2022). Openalex: A fully-open index of scholarly works,
authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.
Retrieved from https://arxiv.org/pdf/2205.01833
Raasveldt, M., & Mühleisen, H. (2019). Duckdb: an embeddable analytical database. In
Proceedings of the 2019 international conference on management of data (pp. 1981–
1984).
Romary, L., & Lopez, P. (2015). Grobid-information extraction from scientific publications.
ERCIM News, 100.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J., & Wang, K. (2015). An overview of
microsoft academic service (mas) and applications. In Proceedings of the 24th
international conference on world wide web (pp. 243–246). Retrieved from
https://dl.acm.org/doi/pdf/10.1145/2740908.2742839
Sitemaps xml format. (2005). https://www.sitemaps.org/protocol.html
Team, D. S. (2024, 8). Docling technical report (Tech. Rep.). Retrieved from
https://arxiv.org/abs/2408.09869 doi: 10.48550/arXiv.2408.09869
Weibel, S., Kunze, J., Lagoze, C., & Wolf, M. (1998). Dublin core metadata for resource
discovery (Tech. Rep.).
Willinsky, J. (2005). Open journal systems: An example of open source software for journal
management and publishing. Library hi tech, 23(4), 504–519. Retrieved from
https://pkp.sfu.ca/files/Library_Hi_Tech_DRAFT.pdf