From raw affiliations to
persistent identifiers
Myrto Kallipoliti1, Serafeim Chatzopoulos2, Miriam Baglioni3,
Eleni Adamidi2, Paris Koloveas2,4, Thanasis Vergoulis2
1 OpenAIRE AMKE, Athens, Greece
2 ATHENA RC, Athens, Greece
3 CNR-ISTI, Pisa, Italy
4 University of the Peloponnese, Tripolis, Greece
TPDL 2025, Tampere, Finland | 23-26 September 2025
Introduction
•The growth rate of research products is constantly increasing
•Why?
•The number of researchers increases worldwide
•Pressure to publish more – “publish or perish”
Open Science Initiatives
•Open Science Initiatives (e.g., EOSC, I4OC)
•Large amounts of scholarly data are openly available
•Scholarly data are often represented as Scholarly Knowledge Graphs
•E.g., OpenAIRE Graph, OpenAlex
•Rich and relatively clean sources of information about research products
Scholarly Knowledge Graphs (SKGs)
•SKGs are inherently heterogeneous graphs
•Research products and related entities are represented as nodes (with metadata)
•Connections between them are represented as edges (with semantics)
[Figure: example SKG fragment; edge label shown: CITES]
•AFFILIATED_WITH relations are incomplete
•Use available metadata to extract and enrich these relations
Motivation
•Why do affiliation relations matter?
•Track institutional and national contributions to global challenges and innovation
•Data interoperability across research databases
•Large-scale data integration and bibliometric analyses
•Goal: Infer affiliation relations from the metadata
•Map affiliation strings to organization persistent identifiers (like ROR ids)
The Challenge
•Raw affiliation strings are often unstructured
•They can be inconsistent in formatting
•They frequently reference multiple organizations in one string
Our approach
[Figure: pipeline with three phases: Preprocessing phase, Matching phase, Disambiguation phase]
Preprocessing Phase
•Cleaning and stemming
•Lowercase strings, remove stopwords, multi-digit numbers, etc.
•Stemming: reduce words to their root or base form to enable easier comparison and matching
•Keyword labeling and partitioning
•Splits affiliation strings into partitions (or segments)
•Identifies common keywords (such as ‘hospital’, ‘university’, etc.) and country names inside partitions
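The cleaning, stemming, partitioning and labeling steps above can be sketched roughly as follows. This is a minimal illustration, not the AffRo implementation: the word lists, the `toy_stem` suffix stripper, the comma/semicolon split rule and the function names are all assumptions made for the example (a real pipeline would use a proper stemmer and much larger lists, and would also handle non-ASCII characters, which this sketch simply drops).

```python
import re

# Hypothetical resources: tiny stand-ins for real stopword/keyword/country lists.
STOPWORDS = {"of", "the", "and", "for", "at"}
KEYWORDS = {"univers", "hospit", "institut"}  # keywords in stemmed form
COUNTRIES = {"greece", "italy", "finland"}


def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("ity", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, drop multi-digit numbers, drop punctuation except separators."""
    text = text.lower()
    text = re.sub(r"\d{2,}", " ", text)
    return re.sub(r"[^a-z0-9,; ]", " ", text)


def preprocess(affiliation):
    """Split a raw affiliation into labeled partitions (assumed split on , and ;)."""
    partitions = []
    for segment in re.split(r"[,;]", clean(affiliation)):
        tokens = [t for t in segment.split() if t not in STOPWORDS]
        if not tokens:
            continue
        stems = [toy_stem(t) for t in tokens]
        partitions.append({
            "stems": stems,
            "keywords": [s for s in stems if s in KEYWORDS],
            # Countries are detected on the unstemmed tokens.
            "countries": [t for t in tokens if t in COUNTRIES],
        })
    return partitions


parts = preprocess("Dept. of Cardiology, University of Athens, 11527 Athens, Greece")
for part in parts:
    print(part)
```

Each partition keeps its stems plus the keyword and country labels, so later steps can decide which partitions to compare against organization names.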
Preprocessing Phase
•Partition pruning
•Partitions that do not contain any keywords are not considered for the upcoming similarity computations
•Country names are set aside and considered later in the ‘Candidate identification’ step
•Shortening of the remaining partitions
•Preserves terms near the keywords (using a window parameter)
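Pruning and shortening can be illustrated with a small sketch. It assumes partitions are lists of stemmed tokens and that the window is symmetric (tokens at most `window` positions before or after a keyword are kept); the slides do not specify the exact window semantics, so treat this as one plausible reading. The keyword set and function names are hypothetical.

```python
KEYWORDS = {"univers", "hospit", "institut"}  # hypothetical stemmed keywords


def prune(partitions):
    """Keep only partitions that contain at least one keyword."""
    return [p for p in partitions if any(tok in KEYWORDS for tok in p)]


def shorten(partition, window=3):
    """Keep tokens within `window` positions of any keyword (assumed symmetric)."""
    hits = [i for i, tok in enumerate(partition) if tok in KEYWORDS]
    return [
        tok
        for i, tok in enumerate(partition)
        if any(abs(i - h) <= window for h in hits)
    ]


partitions = [
    ["dept", "cardiology"],
    ["school", "medicin", "nation", "technic", "univers", "alpha"],
    ["athen"],
]
kept = prune(partitions)
print(shorten(kept[0], window=1))
```

With `window=1` only the tokens directly adjacent to the keyword survive; a larger window keeps more context at the cost of noisier strings for the similarity step.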
Matching & Disambiguation Phases
•Candidate identification
•Performs string similarity between the remaining partitions and the names in the organization database
•Restricts matching to organizations located in the countries mentioned in the affiliation string
•Applies different similarity thresholds: for universities (sim_u) and other organizations (sim_o)
•Results refinement and disambiguation
•If multiple candidate organization names are identified for a single partition, they are compared to the original (clean and stemmed) affiliation string to determine the best match
•If multiple organizations with the same name exist, city information is considered to determine the best match
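A rough sketch of candidate identification and disambiguation follows. Several things here are assumptions for illustration: the similarity measure is Python's `difflib.SequenceMatcher` ratio (the slides do not name the measure used), the threshold defaults are arbitrary, and the organization records, names and ids are made up (they are placeholders, not ROR ids).

```python
from difflib import SequenceMatcher

# Hypothetical mini organization database; ids are placeholders, not ROR ids.
ORGS = [
    {"id": "org:001", "name": "univers alpha", "country": "greece",
     "city": "athens", "is_university": True},
    {"id": "org:002", "name": "univers alpha", "country": "greece",
     "city": "patras", "is_university": True},
    {"id": "org:003", "name": "alpha gener hospit", "country": "greece",
     "city": "athens", "is_university": False},
    {"id": "org:004", "name": "univers alpha", "country": "italy",
     "city": "pisa", "is_university": True},
    {"id": "org:005", "name": "technic univers alpha", "country": "finland",
     "city": "tampere", "is_university": True},
]


def similarity(a, b):
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def candidates(partition, countries, sim_u=0.5, sim_o=0.7):
    """Organizations whose name is similar enough to the partition."""
    found = []
    for org in ORGS:
        # Restrict to mentioned countries, if any were found in the string.
        if countries and org["country"] not in countries:
            continue
        threshold = sim_u if org["is_university"] else sim_o
        if similarity(partition, org["name"]) >= threshold:
            found.append(org)
    return found


def best_match(partition, full_string, countries, city=None, sim_u=0.5, sim_o=0.7):
    """Pick one organization id for a partition, or None if nothing matches."""
    found = candidates(partition, countries, sim_u, sim_o)
    if not found:
        return None
    # Several candidate names: compare each to the whole cleaned affiliation.
    names = sorted({org["name"] for org in found})
    best_name = max(names, key=lambda n: similarity(n, full_string))
    same_name = [org for org in found if org["name"] == best_name]
    # Several organizations share the name: fall back to city information.
    if len(same_name) > 1 and city:
        same_name = [o for o in same_name if o["city"] == city] or same_name
    return same_name[0]["id"]


print(best_match("univers alpha", "dept medicin univers alpha athen greec",
                 {"greece"}, city="patras"))
```

In the usage line, the country filter leaves two identically named candidates and the city breaks the tie. When no country is mentioned, the comparison against the full string is what separates a short name from a longer, more specific one.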
AffRoDB: Expert-Curated Dataset
•Affiliation strings to ROR ids mappings
•Each entry independently annotated by at least two experts with one of the three categories: exact match, ancestor match, vague
•Stats: 1,500 records from Crossref / 1,374 unique affiliation strings / 1,475 affiliation relations (25% exact / 75% ancestor)
•All expert annotations are openly available on Zenodo*
Regular updates: adding matches to new affiliation strings and revising existing entries
[...]th release will include a curated batch (~70% new strings, ~30% previously processed) to validate identifiers and catch missing matches
*https://zenodo.org/records/15322098
AffRo parameter analysis
•3 parameters: window, sim_u, sim_o
•F1 score improves as window increases, peaking at window = 3
•Precision increases with higher values of the similarity thresholds
•Higher thresholds (sim_o > 0.7, sim_u > 0.5) lead to lower recall
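For reference, the precision, recall and F1 figures discussed above can be computed as in the sketch below. It assumes predictions and gold annotations are compared as sets of (affiliation, organization id) pairs; how the evaluation was actually set up is not stated on the slides, and the ids and pairs are invented for the example.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over (affiliation, org id) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and matcher output.
gold = {("aff1", "org:001"), ("aff1", "org:002"), ("aff2", "org:003")}
predicted = {("aff1", "org:001"), ("aff2", "org:004")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Raising a similarity threshold removes low-scoring pairs from `predicted`, which tends to raise precision while lowering recall, the trade-off noted in the bullets above.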
Conclusions
•Introduced the AffRo algorithm to tackle the problem of affiliation matching
•Released a fully expert-curated dataset to facilitate relevant studies
Future work:
•Use the structure of the SKGs to infer affiliation relations
•Compare against recent studies: AffilGood*
* N Duran-Silva, P Accuosto, P Przybyła, H Saggion: AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis
Thank you!
schatz@athenarc.gr
Code: https://github.com/mkallipo/affiliation-matching
API: https://affro-api.imsi.athenarc.gr/docs
Dataset: https://zenodo.org/records/15322098