Heterogeneous datasets
A tale of integration and exploration
Nelly Barret
Postdoctoral researcher
Data Science group
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano
January 24, 2025
Nelly Barret (DEIB@PoliMi) Data integration and exploration January 24, 2025 1 / 88
Short bio
My CS background:
  Bachelor @ Univ. Lyon
  Master, AI track @ Univ. Lyon
  PhD @ Inria Saclay and Ecole Polytechnique
  Post-doc @ Politecnico di Milano (Italy)
My thesis was about user-oriented exploration of semi-structured data.
My post-doc is about enabling federated analyses of health data.
Outline
1. Motivation: data integration and exploration problems
2. PhD: exploring unknown semi-structured datasets
3. Post-doc: healthcare analytics across hospitals
4. Systems developed
5. Conclusion
Motivation: data integration and exploration problems
Different settings, different needs
Structured data models:
  Tables
  Relational databases
Semi-structured data models:
  XML documents
  JSON documents
  RDF graphs
  Property graphs
Unstructured data models:
  Text
  Images
Different settings, different needs
Various domains:
  Health
  Journalism
  Transports, ...
Sensitivity levels:
  Enforce privacy rules
  EU GDPR rules
Several actors/users:
  Different skills
  Time/money constraints
Dataset integration and exploration is hard: large, complex, irregular
PhD: exploring unknown semi-structured datasets
What does the dataset describe?
Real-world objects and relationships between them
Traditional setting: Entity-Relationship models [RG03]
Need to compute them from the dataset!
Thesis problem and research contributions
Thesis problem statement
  How to facilitate user exploration of unknown heterogeneous semi-structured datasets?
Abstra: semi-structured data overviews [BMU22,BMU24]
  Automatically compute lightweight Entity-Relationship diagrams
  Ideal for first-sight dataset discovery
PathWays: interesting Named Entity connections [BGLM23b,BGLM23a,BGLM25]
  Compute and rank entity paths in and across datasets
  Ideal for exploring connections within and across datasets
Related work
Data summarization
  Structural
    Quotient [GGM20,KC10,MS99] (the one we adopt to build G)
    Non-quotient [GW97]
  Pattern mining [ZLVK16]
  Statistical [HS12]
  Hybrid [RGSB17]
Schema inference
  XML [CGS11]
  JSON [BCGS19]
  RDF [GLSW22]
  PG [LBH21]
Data summarization and schema inference are tied to one data model
Schemas are often not suited to NTUs
[Figure: a JSON schema from social network data using [BCGS19]]
The Abstra approach
1. Integrate all data sources in a graph (ConnectionLens) [ABC+22]
2. Summarize the graph
3. Among summary nodes, identify entities and their attributes
4. In the summary, identify relationships between the entities
5. Propose a simple category for each entity (best-effort)
Background: from heterogeneous data to data graphs
ConnectionLens [ABC+22]:
1. Ingests any dataset into a directed graph
   Generic, flexible, fine granularity
2. Extracts Named Entities (NEs) from all text nodes
   date, email address, People, Place, Organization, ...
Data graph summarization
We need a compact representation of large data graphs
Challenges:
  Heterogeneous graphs originate from different data models
  Node and/or edge labels may be empty
We aim for a quotient graph summary:
  Based on equivalence between nodes of the original graph
  We prefer small summaries (number of nodes)
Identifying entities and relationships
Identifying entities in the collection graph G
[Figure: example collection graph G with collection nodes paper, abstract, year, title, wB, hW, pIn, in, author, conf, name, date, email, affiliation, university, city and campus, and their #val leaf collections]
Which collections represent entities in the E-R diagram?
Which collections represent entity attributes?
Requirements and algorithm
We need an algorithm to identify entity roots and attributes of the E-R diagram
  For complex, potentially cyclic, collection graphs
Greedy selection of few entities in G
1. Assign a score to each collection node
2. While less than E_max entity roots, or data coverage < cov_min:
   1. Elect the next highest-scored eligible collection node as an entity root
   2. Compute its boundary (set of attributes)
   3. Update the collection graph to reflect the selection of an entity
   4. Recompute the scores
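The greedy loop above can be sketched as follows. This is a minimal illustration, not Abstra's implementation: the `score` and `boundary` callables, the `data_share` dictionary and the toy graph are hypothetical, and the sketch assumes the loop stops as soon as either E_max roots are selected or the coverage threshold is reached (the slide's condition can be read in more than one way).

```python
from typing import Callable, Dict, List, Set, Tuple

def greedy_entity_selection(
    nodes: Set[str],
    score: Callable[[Set[str]], Dict[str, float]],
    boundary: Callable[[str, Set[str]], Set[str]],
    data_share: Dict[str, float],
    e_max: int,
    cov_min: float,
) -> List[Tuple[str, Set[str]]]:
    """Greedily elect entity roots; all helpers are hypothetical stand-ins."""
    remaining = set(nodes)   # collection nodes still eligible
    selected = []            # (root, boundary) pairs, in election order
    coverage = 0.0
    while remaining and len(selected) < e_max and coverage < cov_min:
        scores = score(remaining)                               # (re)compute the scores
        root = max(sorted(remaining), key=scores.__getitem__)   # highest-scored eligible node
        attrs = boundary(root, remaining) - {root}              # its boundary
        selected.append((root, attrs))
        covered = {root} | attrs
        coverage += sum(data_share[n] for n in covered)
        remaining -= covered                                    # update the collection graph
    return selected

# Toy collection graph (made up for the example).
children = {"paper": {"title", "year"}, "author": {"name"},
            "title": set(), "year": set(), "name": set()}
share = {n: 0.2 for n in children}   # each collection holds 20% of the data

def toy_score(remaining):
    return {n: len(children[n] & remaining) for n in remaining}

def toy_boundary(root, remaining):
    return children[root] & remaining

result = greedy_entity_selection(set(children), toy_score, toy_boundary,
                                 share, e_max=2, cov_min=0.9)
```

On this toy graph, `paper` is elected first with the boundary `{title, year}`, then `author` with `{name}`.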
How to score a collection node?
Reflect the weight of this node and its structure in the dataset
1. w_desc_k, w_leaf_k: # descendants, leaf descendants, at depth k
   × Not clear how to pick k
2. Directed Acyclic Graph (DAG) rooted in each node: w_DAG
Data weight
Own weight ow of a leaf node: its in-degree
Data weight dw of a leaf collection node: the sum of its nodes’ ow
[Figure: example data graph with paper, author and conf nodes, edges labeled name, hasWritten, writtenBy, keyword, keyword and topic, and value nodes “database” (ow = 2), “Léa” (ow = 1) and “Data lake” (ow = 1); in the corresponding collection graph, the two #val leaf collections have dw = 1 and dw = 3]
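As a small worked example of the two definitions, the sketch below recomputes ow and dw for the values shown on the slide. The edge list and the grouping of values into leaf collections are one plausible reading of the figure, and the node identifiers (`keyword_1`, `topic_1`, ...) are hypothetical.

```python
from collections import Counter

# Hypothetical data-graph edges (parent node, value node), read off the figure:
# "database" is reached by one keyword edge and one topic edge.
edges = [
    ("keyword_1", "database"),
    ("topic_1", "database"),
    ("keyword_2", "Data lake"),
    ("name_1", "Léa"),
]

# Hypothetical grouping of value nodes into leaf collections.
leaf_collections = {
    "#val (keyword/topic)": ["database", "Data lake"],
    "#val (name)": ["Léa"],
}

# Own weight of a leaf node: its in-degree.
own_weight = Counter(target for _, target in edges)

# Data weight of a leaf collection: the sum of its nodes' own weights.
data_weight = {
    collection: sum(own_weight[value] for value in values)
    for collection, values in leaf_collections.items()
}
```

Under these assumptions the code yields ow = 2 for “database”, ow = 1 for the two other values, and dw = 3 and dw = 1 for the two leaf collections, which matches the figure.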
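Data weight DAG propagation
Leaf collection dw is propagated back to all ancestors which are not in a cycle
Edge transfer factor: |nodes in C having a parent in C_s| / |C|
[Figure: the example collection graph (author, hasWritten, paper, writtenBy, keyword, name, topic, conf, and two #val leaf collections with dw = 1 and dw = 3) annotated with edge transfer factors; all factors are 1 except one edge with factor 0.5]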
[Figure: the same collection graph after propagation: author dw = 1, hasWritten dw = 0, paper dw = 3, writtenBy dw = 0, keyword dw = 3, name dw = 1, topic dw = 1.5, conf dw = 1.5; the #val leaf collections keep dw = 1 and dw = 3]
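The propagation can be sketched as a recursive sum over the edges that are not part of a cycle. The rule used here (the dw of a parent is the sum, over its non-cycle outgoing edges, of transfer factor × child dw) and the edge list are assumptions chosen to reproduce the values of the example; they are not taken from the Abstra code.

```python
# Collection-graph edges: (parent, child, transfer factor, edge is in a cycle).
# Hypothetical reading of the slide's example.
edges = [
    ("author", "name", 1.0, False), ("name", "#val_name", 1.0, False),
    ("paper", "keyword", 1.0, False), ("keyword", "#val_kw", 1.0, False),
    ("conf", "topic", 1.0, False), ("topic", "#val_kw", 0.5, False),
    ("author", "hasWritten", 1.0, True), ("hasWritten", "paper", 1.0, True),
    ("paper", "writtenBy", 1.0, True), ("writtenBy", "author", 1.0, True),
]
leaf_dw = {"#val_name": 1.0, "#val_kw": 3.0}

def propagate(edges, leaf_dw):
    """Propagate leaf data weights upward, ignoring in-cycle edges."""
    dag = [(p, c, f) for p, c, f, in_cycle in edges if not in_cycle]
    nodes = {n for p, c, *_ in edges for n in (p, c)}
    memo = dict(leaf_dw)

    def dw(node):
        # Assumes the graph without in-cycle edges is acyclic.
        if node not in memo:
            memo[node] = sum(f * dw(c) for p, c, f in dag if p == node)
        return memo[node]

    return {n: dw(n) for n in nodes}

weights = propagate(edges, leaf_dw)
```

With this edge list, `topic` and `conf` receive 0.5 × 3 = 1.5, `paper` and `keyword` receive 3, `author` and `name` receive 1, and `hasWritten` and `writtenBy` stay at 0 because their only outgoing edges are in a cycle.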
The data-weighted PageRank score
[Figure: the reverse collection graph G^R with dw-tuned PR edge weights; most edges have weight 1, the others have weights 0.66, 0.33, 0.4 and 0.6]
[Figure: PageRank scores of the collection nodes in the example]
  author 0.179, paper 0.178, hW 0.158, wB 0.107, conf 0.067, pIn 0.063, in 0.056
  affiliation 0.027, university 0.024
  abstract, year, title, name, date, email, city, campus: 0.011 each
  #val leaf collections: 0.006 each
Propagates scores across the collection graph
Works on cyclic collection graphs
The score reflects the topology and where the data is
A collection node distributes its weight
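To illustrate how a score can be propagated over a weighted, possibly cyclic graph, here is a textbook weighted PageRank by power iteration. It is a generic sketch, not the exact data-weighted variant of the talk: the damping factor, the iteration count, the treatment of nodes without outgoing edges and the toy graph are all assumptions.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """edges: {(source, target): weight}. Returns {node: score}; scores sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {n: (1.0 - damping + damping * dangling) / len(nodes)
                    for n in nodes}
        for (src, dst), w in edges.items():
            if w > 0:
                # A node distributes its rank proportionally to its edge weights.
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

# Toy reverse graph: attribute collections point to the collection that owns them.
toy_edges = {("title", "paper"): 1.0, ("year", "paper"): 1.0, ("name", "author"): 1.0}
scores = weighted_pagerank(toy_edges)
```

In the toy graph, `paper` collects the rank of two attribute collections and `author` of one, so `paper` ends up above `author`, which ends up above the attribute collections.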
How to compute an entity boundary?
Collections in G representing attributes of this entity
“Those that contribute to the entity’s weight”
The boundary may go far (for deep-structure entities)
Easy to define for w_desc_k, w_leaf_k, w_DAG. Example for w_desc_2
[Figure: the example collection graph G with its edge transfer factors, illustrating the boundary for w_desc_2]
Does not apply to PageRank-based scores
Data-acyclic flooding boundary bound_dfl-ac
Idea: the collection nodes
  Reachable from the entity root
  Mainly part of this entity
    Edge transfer factor ≥ f_min
    At-most-one: each C_s node has at most one child in C
  The path between the entity root and this collection node is not data cyclic
    If the path in G has no in-cycle edges
    Or, the G path has in-cycle edges, but they are not in the data
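A flooding boundary of this kind can be sketched as a breadth-first traversal that only crosses edges satisfying the conditions above. The edge representation (transfer factor plus two precomputed boolean flags), the requirement that all conditions hold together, and the toy graph are assumptions made for the illustration.

```python
from collections import deque

def flooding_boundary(root, edges, f_min):
    """edges: {parent: [(child, transfer_factor, at_most_one, in_data_cycle), ...]}.

    Returns the collection nodes flooded from the entity root.
    """
    boundary, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child, factor, at_most_one, in_data_cycle in edges.get(node, []):
            if child == root or child in boundary:
                continue
            # Cross the edge only if the child is "mainly part of" the entity
            # and the edge does not close a cycle in the data.
            if factor >= f_min and at_most_one and not in_data_cycle:
                boundary.add(child)
                queue.append(child)
    return boundary

# Toy collection graph (flags are made up for the example).
toy_edges = {
    "author": [("name", 1.0, True, False),
               ("affiliation", 0.6, True, False),
               ("hW", 1.0, False, False)],
    "affiliation": [("university", 1.0, True, False)],
    "hW": [("paper", 1.0, True, True)],
}
boundary = flooding_boundary("author", toy_edges, f_min=0.5)
```

Here the flooding reaches `name`, `affiliation` and `university`, and stops at `hW` because that edge violates the at-most-one condition.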
Selected entities and their boundaries
[Figure: the example collection graph G showing the selected entities and their boundaries]
Finding relationships between entities
Relationship: a path from an entity to another
[Figure: the example collection graph G with the paths between the selected entities]
  paper → wB → author
  paper → pIn → conf
  author → hW → paper
  conf → in → author
Entity classification
Assign a semantic category to each entity
Input: an entity E, categories K, semantic properties P
  K: Person, ScientificPaper, Event, Website, Mountain, ...
  P: {label:"address", domain:[Pers., Org.], range:[Place]}, ...
Output: a category for E
Algorithm:
  Compare:
    The common name of all nodes in the entity root (if it exists) with k ∈ K (conf, paper, author)
    Its attribute names with p ∈ P (affiliation, email, ...)
    Its entity profiles with p.range ∈ P (...)
  Each good match votes for one or few categories
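The voting scheme can be sketched as below. This is only an illustration under several assumptions: `difflib.SequenceMatcher` is a stand-in for whatever similarity measure is actually used, the 0.8 threshold is arbitrary, the comparison of entity profiles with `p.range` is left out, and the sample categories and properties are made up.

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not the measure used in the talk."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(entity_name, attribute_names, categories, properties, threshold=0.8):
    """Return the most voted category, or None if nothing matches."""
    votes = Counter()
    for category in categories:               # entity name vs. k in K
        if similarity(entity_name, category) >= threshold:
            votes[category] += 1
    for attribute in attribute_names:         # attribute names vs. p in P
        for prop in properties:
            if similarity(attribute, prop["label"]) >= threshold:
                votes.update(prop["domain"])  # one vote per category in the domain
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical inputs.
categories = ["Person", "ResearchPublication", "Event"]
properties = [
    {"label": "abstract", "domain": ["ResearchPublication", "Book"]},
    {"label": "title", "domain": ["ResearchPublication", "Movie", "Person"]},
    {"label": "email", "domain": ["Person"]},
]
category = classify("paper", ["abstract", "title", "year"], categories, properties)
```

With these inputs, `abstract` and `title` each match one property exactly, so ResearchPublication collects two votes and wins over Book, Movie and Person (one vote each).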
Entity classification
Name    Similar to                   Votes for
paper   ResearchPublication (0.85)   ResearchPublication
        News (0.63)                  News
[Figure: the paper entity with its attributes abstract, year and title]
Entity classification
Attribute   Similar to                   Votes for
abstract    abstract (1.0)               ResearchPublication
            summary (0.92)               Book
            preface (0.47)
title       title (1.0)                  ResearchPublication
            honorific title (0.87)       Movie
                                         Person
year        year publication (0.85 + )   Event
                                         Book
                                         ResearchPublication, ...
[Figure: the paper entity with its attributes abstract, year and title]
Entity classification
paper nodes classified as ResearchPublication
author nodes classified as Researcher
conference nodes classified as Event
[Figure: the example collection graph G]
Abstra output: a lightweight Entity-Relationship diagram
[Figure: E-R diagram computed by Abstra. Each box shows a number, the entity name, its category, and its attributes with a percentage.]
318 person (Person)
  person@id (100%), phone (49%), creditcard (49%), homepage (47%), address (46%), province (52%), zipcode (100%), country (100%), city (100%), street (100%), emailaddress (100%), name (100%)
150 open_auction (Product)
  privacy (56%), interval (100%), end (100%), start (100%), type (100%), current (100%), reserve (51%), initial (100%), open_auction@id (100%), quantity (100%)
12 category (Thing)
  category@id (100%), description (100%), text (73%), parlist (27%), listitem (291%), text (87%), name (100%)
270 item (schema:how_to_item)
  mailbox (64%), mail (101%), date (100%), to (100%), from (100%), text (100%), item@featured (9%), item@id (100%), shipping (94%), description (100%), text (73%), parlist (27%), listitem (291%), text (87%), payment (94%), name (100%), quantity (100%), location (100%)
120 closed_auction (Product)
  price (100%), type (100%), date (100%), quantity (100%)
Relationship labels: watches.watch@open_auction, profile.interest@category, seller@person, buyer@person, annotation.author@person, bidder.personref@person, itemref@item, incategory@category
Experimental evaluation
Entity selection quality with (w_dwPageRank, bound_fl-ac)
Dataset name  |C|  |ME|  |MR|  cov   ME              dmax  |MEi|
Mondial       168  5     8     0.85  City            3     3,152
                                     Province        3     1,455
                                     Country         4     231
                                     Organization    4     168
                                     River           4     135
PubMed        26   1     0     1.0   PubMedArticle   5     957
XMark1        136  5     10    0.91  Person          4     25,500
                                     Item            7     21,750
                                     Open Auction    8     12,000
                                     Closed Auction  8     9,750
                                     Category        2     1,000
XMark4        136  5     10    0.90  Person          4     102,000
                                     Item            7     87,000
                                     Open Auction    8     48,000
                                     Closed Auction  8     39,000
                                     Category        2     4,000
Wikimedia     59   2     0     1.0   Page            4     54,750
                                     Namespace       3     32
Abstra selects frequent, coherent and semantically central entities
Experimental evaluation: scalability
Our abstraction method scales up linearly in the data size
Extending abstractions to Property Graphs
Property Graphs (PGs) are graphs whose nodes and edges may carry named attributes
  Model under standardization [ABD+23,ABD+21]
  Numerous industrial PG databases (Neo4j, Oracle)
  Widely used (the Offshore leaks database)
For interoperability, we derive a PG schema from any (semi)structured dataset following PG-Schema [ABD+23]
Deriving a PG schema from an abstraction
We need to accommodate for nested attributes:
  FLAT: wrap the nested attribute in a JSON object
  CUT: unfold each nested attribute in a PG node
1. For each Abstra entity E:
   1. Create a PG node type for E
   2. For each attribute a:
      1. If a is not nested: add a to the PG node
      2. If a is nested and nesting is FLAT: wrap a
      3. If a is nested and nesting is CUT: unfold a
2. For each Abstra relationship R:
   1. Create a PG edge type with corresponding PG nodes
3. If all G nodes and edges are in E-R: PG graph type is STRICT, else LOOSE
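The derivation steps can be sketched as a small generator of PG-Schema-like text. Everything about the input encoding is an assumption: entities are given as dictionaries, a nested attribute is a nested dictionary (one level only), `json` is a placeholder type name for the FLAT case, and the generated type and edge names are hypothetical naming choices.

```python
def derive_pg_schema(graph_name, entities, relationships, nesting="FLAT", strict=True):
    """entities: {name: {attr: type or nested dict}}; relationships: [(source, label, target)]."""
    node_types, extra_edges = [], []
    for name, attrs in entities.items():
        props = []
        for attr, typ in attrs.items():
            if not isinstance(typ, dict):
                props.append(f"{attr} {typ}")        # plain attribute
            elif nesting == "FLAT":
                props.append(f"{attr} json")         # wrap the nested attribute
            else:                                    # CUT: unfold into its own node type
                nested = ", ".join(f"{k} {v}" for k, v in typ.items())
                node_types.append(f"({attr}Type: {attr.capitalize()} {{{nested}}})")
                extra_edges.append((name, f"has{attr.capitalize()}", attr))
        node_types.append(f"({name}Type: {name.capitalize()} {{{', '.join(props)}}})")
    edge_types = [
        f"(:{s}Type)-[{s}{t.capitalize()}Edge: {label}]->(:{t}Type)"
        for s, label, t in list(relationships) + extra_edges
    ]
    mode = "STRICT" if strict else "LOOSE"
    body = ",\n  ".join(node_types + edge_types)
    return f"CREATE GRAPH TYPE {graph_name} {mode} {{\n  {body}\n}}"

schema = derive_pg_schema(
    "myGraphType",
    {"paper": {"title": "string", "year": "integer"}, "author": {"name": "string"}},
    [("author", "HasWritten", "paper")],
)
```

The `strict` flag is passed in by the caller here; deciding it automatically would require checking that every node and edge of G is covered by the E-R diagram, which this sketch does not do.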
Extending abstractions to Property Graphs
CREATE GRAPH TYPE myGraphType STRICT {
  (paperType: Paper {
    title string,
    OPTIONAL year integer,
    OPTIONAL abstract string, ...
  }),
  (authorType: Author {name string, email string, ...}),
  (confType: Conference {name string, year integer, ...}),
  (:authorType)-[edgeAuthorPaper: HasWritten]->(:paperType),
  (:paperType)-[edgePaperAuthor: WrittenBy]->(:authorType),
  (:paperType)-[edgePaperConf: PublishedIn]->(:confType)
}
Post-doc: healthcare analytics across hospitals
What does multi-source healthcare data have to reveal?
Very low cooperation/normalization between medical centers
Few patient data for rare diseases
Traditional setting: warehouses [DM88]
Need to provide decentralized and federated analyses!
Leverage experts’ knowledge + make it as automatic as possible
Related work
Data platforms:
  EHDEN [PdGdK+23] for tabular data
  Also: OHDSI [HDS+15], UMG-MeDIC [PSS+23], etc.
Conceptual models:
  OMOP [SRR+10] for observational data, also FHIR [fhi]
ETL pipelines:
  D-ETL [OKK+17], also EHDEN’s ETL, OHDSI’s ETL
Data platforms are often tied to a single data type (tables, etc.)
Conceptual models are often designed for one kind of data (observational, etc.)
ETLs provide limited interoperability and require time from experts
The I-ETL approach
1. Analyze datasets and extract their metadata
2. Create an interoperable database in each medical center
3. Assess interoperability along the pipeline
4. Allow federated analyses of data across centers
Interoperability as in FAIR principles
FAIR principles are guidelines for good data management [WDA+16]:
  F (Findable): search for (indexed) resources based on identifiers
  A (Accessible): access data with standard protocols, even after data dies
  I (Interoperable): integrate and refer to datasets following FAIR principles
  R (Reusable): reuse datasets in other settings using provenance, etc.
Step 1: metadata creation
From dataset analysis to metadata
Metadata: each dataset can be described by a set of features
  Name, definition, type, unit, values, ...
  Specified by medical experts
[Figure: tabular data of clinical measurements and phenotypic information]
[Figure: the metadata obtained from the tabular data]
What if “ethnicity” is referred to as “race” in another dataset?
What if datasets refer to “Homme”/“Femme” vs. “Male”/“Female”?
The metadata model
We aim for a conceptual model for expressive and interoperable metadata
  Name: the feature name
  Vocabulary: a vocabulary name
  Code: the code of the term in the selected vocabulary
  Kind: phenotypic, clinical, genomic, ...
  DataType: string, integer, numeric, boolean, category, ...
  Unit: to interpret values when the datatype is numeric
  Categories: list of discrete values for categorical features
  Visibility: public, anonymized, private
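One way to picture this metadata model is as a record type with a few consistency checks. The class below is a hypothetical sketch: the field names follow the slide, but the defaults and the checks in `problems` are assumptions, not part of I-ETL.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FeatureMetadata:
    name: str                            # Name: the feature name
    vocabulary: Optional[str] = None     # Vocabulary: a vocabulary name
    code: Optional[str] = None           # Code: the term code in that vocabulary
    kind: str = "clinical"               # Kind: phenotypic, clinical, genomic, ...
    data_type: str = "string"            # DataType: string, integer, numeric, ...
    unit: Optional[str] = None           # Unit: needed to interpret numeric values
    categories: List[str] = field(default_factory=list)  # Categories: discrete values
    visibility: str = "public"           # Visibility: public, anonymized, private

    def problems(self) -> List[str]:
        """Return human-readable issues (illustrative checks only)."""
        issues = []
        if self.data_type == "numeric" and self.unit is None:
            issues.append("numeric feature without unit")
        if self.data_type == "category" and not self.categories:
            issues.append("categorical feature without categories")
        if self.visibility not in {"public", "anonymized", "private"}:
            issues.append("unknown visibility")
        return issues

weight = FeatureMetadata(name="weight", data_type="numeric")
sex = FeatureMetadata(name="sex", kind="phenotypic", data_type="category",
                      categories=["Male", "Female"])
```

Here `weight.problems()` reports the missing unit, while `sex.problems()` returns an empty list.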
Step 2: mappings to vocabularies
Associate metadata to vocabularies
Vocabularies are dictionaries of concepts/values uniquely identified
  SNOMED CT [SPSW01], LOINC [HRM+98], OMIM [HSA+05], ...
We associate each feature and categorical value to an existing vocabulary code → more interoperability
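The effect of such mappings can be illustrated with the “Homme”/“Male” and “ethnicity”/“race” examples: once names and values are replaced by shared codes, records from different centers become comparable. The mapping tables below are hypothetical, and the `VOCAB:...` codes are placeholders, not real SNOMED CT, LOINC or OMIM codes.

```python
# Placeholder codes, for illustration only.
FEATURE_CODES = {"ethnicity": "VOCAB:F001", "race": "VOCAB:F001", "sex": "VOCAB:F002"}
VALUE_CODES = {"homme": "VOCAB:V001", "male": "VOCAB:V001",
               "femme": "VOCAB:V002", "female": "VOCAB:V002"}

def harmonize(record):
    """Replace feature names and categorical values by vocabulary codes when known."""
    out = {}
    for feature, value in record.items():
        key = FEATURE_CODES.get(feature.lower(), feature)
        if isinstance(value, str):
            value = VALUE_CODES.get(value.lower(), value)
        out[key] = value
    return out

french_record = harmonize({"sex": "Homme", "age": 42})
english_record = harmonize({"Sex": "Male", "age": 42})
```

Both records end up with the same keys and values, although the sources used different languages and capitalization.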
Step 3: generic data model
The healthcare data model
We need a general, extensible healthcare data model
Challenges:
  Use-cases bring very different kinds of data
  Experts’ metadata needs to be represented
We aim for a conceptual model:
  Based on the notions of features and records
  Will be populated automatically by an ETL
Nelly Ba e (DEIB@PoliMi) Da a in eg a ion and explo a ion Janua y 24, 2025 73 / 88
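A feature/record model populated by a loading step could look like the sketch below. It only illustrates the idea of turning tabular rows into one record per (patient, feature) pair; the class names, the `patient_id` column, and the CSV layout are assumptions, not the actual I-ETL design.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    data_type: str  # e.g. "numeric" or "category"


@dataclass(frozen=True)
class Record:
    patient_id: str
    feature: Feature
    value: object


def load(csv_text: str, features: dict[str, Feature],
         id_column: str = "patient_id") -> list[Record]:
    """Turn each non-empty cell of a known column into one Record."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, raw in row.items():
            if column == id_column or column not in features or raw == "":
                continue
            feature = features[column]
            value = float(raw) if feature.data_type == "numeric" else raw
            records.append(Record(row[id_column], feature, value))
    return records


FEATURES = {"weight": Feature("weight", "numeric"), "sex": Feature("sex", "category")}
DATA = "patient_id,weight,sex\np1,70.5,Female\np2,,Male\n"
records = load(DATA, FEATURES)
```

Because records reference features instead of fixed columns, a new use-case adds features rather than changing the schema.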
General, extensible healthcare conceptual data model
How to automatically populate this data model with hospitals’ data?
Step 5: interoperability assessment
Anchor our metrics in FAIR principles
1. (Meta)data use a [...] broadly applicable language for knowledge representation
Our data model can be implemented within any type of database
Metadata can be easily specified using a tabular file
2. (Meta)data use FAIR-first vocabularies
Associate metadata variables and categories to vocabulary resources
Use of widely used vocabularies in the healthcare domain
3. (Meta)data include qualified references to other data and metadata
To db instances: references to a patient, a hospital, and a Feature
To the data: from which dataset the value comes
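One way to turn the second principle into a number is to measure how many features are associated with a vocabulary code. The metric below is a hypothetical example of such an indicator, not necessarily one of the metrics used in the project; the feature entries and codes are placeholders.

```python
def vocabulary_coverage(features: list[dict]) -> float:
    """Fraction of features that carry both a vocabulary name and a code."""
    if not features:
        return 0.0
    mapped = sum(1 for f in features if f.get("vocabulary") and f.get("code"))
    return mapped / len(features)


features = [
    {"name": "sex", "vocabulary": "SNOMED CT", "code": "EX:0001"},   # placeholder code
    {"name": "weight", "vocabulary": "LOINC", "code": "EX:0002"},    # placeholder code
    {"name": "local_flag", "vocabulary": None, "code": None},
    {"name": "notes", "vocabulary": None, "code": None},
]
coverage = vocabulary_coverage(features)  # 2 of the 4 features are mapped
```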
I-ETL at work in the Better project
7 clinical centers across Europe
I-ETL is under deployment at each center → 7 interoperable databases
Working on:
Designing a catalogue to:
List available datasets and their associated metadata
Explore datasets and their aggregated data
With visualizations and queries
Designing a decentralized federated learning platform
To run federated AI algorithms
Secured because no data leaves centers, only aggregates
Better platform: decentralized federated learning
Based on the Personal Health Train: stations (centers), trains (queries), central station (results aggregation) → no data leaves centers = privacy preservation
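The "only aggregates leave the centers" idea can be sketched with a federated mean: each station computes a local sum and count, and the central station combines them. This is a toy illustration under assumed function names and made-up values; it is not the Personal Health Train API and it ignores transport, authentication, and disclosure control for small counts.

```python
def local_aggregate(values: list[float]) -> tuple[float, int]:
    """Runs inside a station: only the sum and the count leave the center."""
    return sum(values), len(values)


def global_mean(aggregates: list[tuple[float, int]]) -> float:
    """Runs at the central station, on aggregates only."""
    total = sum(s for s, _ in aggregates)
    count = sum(n for _, n in aggregates)
    if count == 0:
        raise ValueError("no observations")
    return total / count


# Hypothetical per-center values; in practice they never leave the stations.
stations = {
    "center_a": [60.0, 80.0],
    "center_b": [70.0],
    "center_c": [50.0, 90.0, 70.0],
}
aggregates = [local_aggregate(v) for v in stations.values()]
mean = global_mean(aggregates)
```

Combining sums and counts gives the same result as averaging the pooled data, whereas averaging the per-center means would not when centers have different sizes.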
Systems developed
Systems developed (1/2)
Abstra for data abstraction:
65 Java core classes, 10K LOC
Published in EDBT 2024 [BMU24]
Demonstrated at CIKM and BDA 2022 [BMU22]
PathWays for NE-to-NE paths:
18 Java core classes, 4K LOC
Published in ADBIS 2023 [BGLM23a], Info. Sys. [BGLM25]
Demonstrated at ESWC and BDA 2023 [BGLM23b]
ConnectionStudio for NTU data exploration:
Web interface by CEDAR engineers
Published in CoopIS 2023 [BEG+23]
Demonstrated at BDA and SEAGRAPH 2024 [BEMM24,BBE+24]
Also to journalists at DataJournos (40) and CFI (60)
Systems developed (2/2)
I-ETL for interoperable healthcare databases:
31 Python core classes, 8K LOC (restricted access)
Reviewed at BMC Med. Info. & Decision Making [BBBP25]
Under deployment in the 7 medical centers of the project
Data catalogue and decentralized platform:
Under development by an IT company
With collaboration of Better technical partners
Conclusion
Takeaways and next steps (1/2)
In my PhD, we introduced:
1. A unified view over heterogeneous semi-structured data models
2. Abstra: a dataset abstraction system for semi-structured data
3. PathWays: an entity-focused exploration system
4. ConnectionStudio: a comprehensive data lake exploration tool
Next steps:
Migrate data graphs into PG graphs reusing [BEMM24]
Enrich extracted NEs with RDF knowledge bases
Propose an end-to-end data processing/exploration pipeline
References IV
Benoît Groz, Aurélien Lemay, Slawek Staworko, and Piotr Wieczorek.
Inference of shape graphs for graph databases.
In ICDT, volume 220, 2022.
Roy Goldman and Jennifer Widom.
DataGuides: enabling query formulation and optimization in semistructured databases.
In VLDB, 1997.
George Hripcsak, Jon D Duke, Nigam H Shah, Christian G Reich, Vojtech Huser, Martijn J Schuemie, Marc A Suchard, Rae Woong Park, Ian Chi Kei Wong, Peter R Rijnbeek, et al.
Observational health data sciences and informatics (OHDSI): opportunities for observational researchers.
In MEDINFO 2015: eHealth-enabled Health, pages 574–578. IOS Press, 2015.
Stanley M Huff, Roberto A Rocha, Clement J McDonald, Georges JE De Moor, Tom Fiers, W Dean Bidgood Jr, Arden W Forrey, William G Francis, Wayne R Tracy, Dennis Leavelle, et al.
Development of the logical observation identifier names and codes (LOINC) vocabulary.
Journal of the American Medical Informatics Association, 5(3):276–292, 1998.
Katja Hose and Ralf Schenkel.
Towards benefit-based RDF source selection for SPARQL queries.
In Proceedings of the 4th International Workshop on Semantic Web Information Management, pages 1–8, 2012.
Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick.
Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders.
Nucleic acids research, 33(suppl 1):D514–D517, 2005.
References V
Shahan Khatchadourian and Mariano P Consens.
ExpLOD: summary-based exploration of interlinking and RDF usage in the Linked Open Data Cloud.
In Extended semantic web conference, pages 272–287. Springer, 2010.
Hanâ Lbath, Angela Bonifati, and Russ Harmer.
Schema inference for property graphs.
In EDBT, 2021.
Tova Milo and Dan Suciu.
Index structures for path expressions.
In International Conference on Database Theory, pages 277–295. Springer, 1999.
Toan C Ong, Michael G Kahn, Bethany M Kwan, Traci Yamashita, Elias Brandt, Patrick Hosokawa, Chris Uhrich, and Lisa M Schilling.
Dynamic-ETL: a hybrid approach for health data extraction, transformation and loading.
BMC medical informatics and decision making, 17:1–12, 2017.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The PageRank citation ranking: Bringing order to the web.
Technical report, Stanford InfoLab, 1999.
Daniel Puttmann, Rowdy de Groot, Nicolette de Keizer, Ronald Cornet, et al.
Assessing the FAIRness of databases on the EHDEN portal: A case study on two Dutch ICU databases.
International Journal of Medical Informatics, 176:105104, 2023.
Felipe Pezoa, Juan L Reutter, Fernando Suarez, Martín Ugarte, and Domagoj Vrgoč.
Foundations of JSON schema.
In Proceedings of the 25th international conference on World Wide Web, pages 263–273, 2016.
References VI
Marcel Parciak, Markus Suhr, Christian Schmidt, Caroline Bönisch, Benjamin Löhnhardt, Dorothea Kesztyüs, and Tibor Kesztyüs.
FAIRness through automation: development of an automated medical data integration infrastructure for FAIR health data in a maximum care university hospital.
BMC Medical Informatics and Decision Making, 23(1):94, 2023.
Raghu Ramakrishnan and Johannes Gehrke.
Database Management Systems (3rd edition).
McGraw-Hill, 2003.
Matteo Riondato, David García-Soriano, and Francesco Bonchi.
Graph summarization with quality guarantees.
Data mining and knowledge discovery, 31:314–349, 2017.
Michael Q Stearns, Colin Price, Kent A Spackman, and Amy Y Wang.
SNOMED clinical terms: overview of the development process and project status.
In Proceedings of the AMIA Symposium, page 662. American Medical Informatics Association, 2001.
Paul E Stang, Patrick B Ryan, Judith A Racoosin, J Marc Overhage, Abraham G Hartzema, Christian Reich, Emily Welebob, Thomas Scarnecchia, and Janet Woodcock.
Advancing the science for active surveillance: rationale and design for the observational medical outcomes partnership.
Annals of internal medicine, 153(9):600–606, 2010.
Resource Description Framework (RDF).
https://www.w3.org/RDF/.
References VII
The XML data model.
https://www.w3.org/XML/Datamodel.html.
W3C XML Document Type Specification.
https://www.w3.org/TR/REC-xml/#dt-doctype, 2008.
W3C XML Schema Definition Language (XSD).
https://www.w3.org/TR/xmlschema11-1/, 2012.
Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al.
The FAIR guiding principles for scientific data management and stewardship.
Scientific data, 3(1):1–9, 2016.
Mussab Zneika, Claudio Lucchese, Dan Vodislav, and Dimitris Kotzinos.
Summarizing linked data RDF graphs using approximate graph pattern mining.
In 19th International Conference on Extending Database Technology, 2016.
The relational data model
According to [RG03]:
A relational schema is a set of relations
Each relation has a name and a set of named attributes with their domain
A primary key is a subset of attributes that uniquely identifies a tuple
A foreign key is a reference to a primary key
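These notions can be made concrete with SQLite through Python's standard `sqlite3` module. The `hospital` and `patient` tables are hypothetical; the sketch shows a primary key on each relation and a foreign key that the database refuses to leave dangling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Each relation has a name and named attributes with a domain (type).
conn.execute("CREATE TABLE hospital (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE patient (
        id INTEGER PRIMARY KEY,
        hospital_id INTEGER NOT NULL REFERENCES hospital(id)
    )
""")
conn.execute("INSERT INTO hospital VALUES (1, 'H1')")
conn.execute("INSERT INTO patient VALUES (10, 1)")

# A patient pointing to a non-existent hospital violates the foreign key.
try:
    conn.execute("INSERT INTO patient VALUES (11, 99)")
    violation_detected = False
except sqlite3.IntegrityError:
    violation_detected = True
```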
The XML data model
According to the W3C [W3Cb], a tree of:
A (single) document node
Element nodes with non-empty labels, possibly with named attributes
Text nodes, carrying values, are children of element nodes
Possibility to define a DTD [W3C08] or an XSD [W3C12]
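A short traversal with the standard `xml.etree.ElementTree` module shows the three ingredients (element labels, attributes, text values) on a made-up document. Note that ElementTree is a simplification of the W3C model: it exposes text as a property of its parent element rather than as a separate text node, and it has no explicit document node.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for the example.
doc = ET.fromstring(
    '<patients><patient id="p1"><sex>Female</sex></patient></patients>'
)


def describe(element, depth=0):
    """List each element with its attributes, then its text value if any."""
    lines = [f"{'  ' * depth}{element.tag} {element.attrib}"]
    if element.text and element.text.strip():
        lines.append(f"{'  ' * (depth + 1)}text: {element.text.strip()}")
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


outline = describe(doc)
```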
The JSON data model
According to [PRS+16], a tree where a node can be:
A map (label, one or more key-value elements)
An array (label, zero or more child nodes)
A value (a string)
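The three node kinds can be observed by walking a parsed JSON document. This is a small sketch on a made-up document; the path notation is an assumption, and any non-container value (number, boolean, null, string) is simply reported as a value.

```python
import json


def node_kinds(node, path="$"):
    """Yield (path, kind) for every node of a parsed JSON tree."""
    if isinstance(node, dict):
        yield path, "map"
        for key, child in node.items():
            yield from node_kinds(child, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, "array"
        for index, child in enumerate(node):
            yield from node_kinds(child, f"{path}[{index}]")
    else:
        yield path, "value"


tree = json.loads('{"patient": {"id": "p1", "visits": ["2024-01-03", "2024-02-10"]}}')
kinds = dict(node_kinds(tree))
```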
The RDF data model
According to the W3C [W3Ca], an RDF graph contains triples <s,p,o> where:
s and p are resource identifiers (URIs)
o can be a resource identifier or a literal (string)
Also: blank nodes for anonymous resources (internal ID)
Add semantic information with ontologies (incl. RDFS, OWL)
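Without depending on an RDF library, a graph can be sketched as a plain set of triples. The `Triple` type, the `http://example.org/` namespace, and the data are assumptions made for the example; the `_:b0` subject stands for a blank node.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str                       # subject: URI or blank node
    p: str                       # predicate: URI
    o: str                       # object: URI, blank node, or literal
    o_is_literal: bool = False


EX = "http://example.org/"  # hypothetical namespace
graph = {
    Triple(EX + "p1", EX + "treatedAt", EX + "h1"),
    Triple(EX + "p1", EX + "name", "Alice", o_is_literal=True),
    Triple("_:b0", EX + "locatedIn", EX + "milan"),
}


def objects(graph, s, p):
    """All objects o such that <s, p, o> is in the graph."""
    return {t.o for t in graph if t.s == s and t.p == p}


names = objects(graph, EX + "p1", EX + "name")
```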
The property graph data model
A node is a structured record with:
0..n labels (types)
0..n properties (key-values)
Records with the same type set may have different properties
A relationship is a directed labeled edge, possibly with attributes
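The definitions above can be mirrored by two small record types. The class names and sample data are hypothetical; the point is that two nodes with the same label set may carry different properties.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: frozenset = frozenset()                   # 0..n labels (types)
    properties: dict = field(default_factory=dict)    # 0..n key-values


@dataclass
class Relationship:
    source: Node
    label: str
    target: Node
    properties: dict = field(default_factory=dict)    # optional attributes


alice = Node(frozenset({"Patient"}), {"name": "Alice", "age": 42})
bob = Node(frozenset({"Patient"}), {"name": "Bob"})   # same type set, fewer properties
h1 = Node(frozenset({"Hospital"}), {"city": "Milan"})
treated = Relationship(alice, "TREATED_AT", h1, {"since": 2023})
```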
Data summarization techniques
Build structured and concise summaries out of datasets; many approaches for semi-structured data
Structural approaches: groups of equivalent nodes; different notions of node similarity
Quotient summaries: groups based on an equivalence relation
Non-quotient summaries: other means (dataguides, etc.)
Pattern mining approaches: discovery of patterns
Statistical approaches: counts over data (classes, properties, value types, etc.)
Hybrid approaches: combine above methods
Schema inference techniques: build a schema s.t. the data conforms to it
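A quotient summary can be sketched in a few lines once an equivalence relation is fixed. Here the relation is an assumption chosen for simplicity (two nodes are equivalent when they have the same set of outgoing edge labels); the approaches cited in the talk use richer notions of equivalence. The edge list is made up.

```python
from collections import defaultdict

edges = [
    ("p1", "name", "n1"), ("p1", "treatedAt", "h1"),
    ("p2", "name", "n2"), ("p2", "treatedAt", "h1"),
    ("h1", "city", "c1"),
]


def quotient_summary(edges):
    outgoing = defaultdict(set)
    nodes = set()
    for s, label, o in edges:
        outgoing[s].add(label)
        nodes.update((s, o))
    # Equivalence class of a node = its set of outgoing edge labels (assumed).
    class_of = {n: frozenset(outgoing[n]) for n in nodes}
    groups = defaultdict(set)
    for n, cls in class_of.items():
        groups[cls].add(n)
    # One summary edge per (source class, label, target class).
    summary_edges = {(class_of[s], label, class_of[o]) for s, label, o in edges}
    return dict(groups), summary_edges


groups, summary_edges = quotient_summary(edges)
```

The five data edges collapse into three summary edges, and the two patients fall into the same group.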
A comprehensive data exploration tool for NTUs
The EHDEN platform [PdGdK+23,BVD+21]
“European Health Data and Evidence Network”
Consortium of 15 partners across 10 countries
Mapped their data to the OMOP data model
Semi-automatic mapping
The system proposes mappings
Experts have to select/correct them
Produced 98 databases
Assess quality through a DQD (Data Quality Dashboard)
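The "system proposes, experts select or correct" loop can be illustrated with plain string similarity from the standard `difflib` module. This is a stand-in technique, not the method used by EHDEN, and the target field names are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical target fields of the common data model.
TARGET_FIELDS = ["gender", "year_of_birth", "race", "ethnicity"]


def propose(source_column: str, n: int = 3) -> list[str]:
    """System side: rank candidate target fields by string similarity."""
    return get_close_matches(source_column.lower(), TARGET_FIELDS, n=n, cutoff=0.5)


def confirm(proposals: list[str], expert_choice=None):
    """Expert side: an explicit choice overrides the top proposal."""
    if expert_choice is not None:
        return expert_choice
    return proposals[0] if proposals else None


auto = confirm(propose("Ethnicity"))                      # proposal accepted as is
corrected = confirm(propose("sex"), expert_choice="gender")  # expert correction
```

String similarity alone cannot relate "sex" to "gender", which is exactly the kind of case where the expert step is needed.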
The OHDSI platform [HDS+15]
“Observational Health Data Sciences and Informatics” (pronounced “Odyssey”; grew out of OMOP)
International collaboration for open-source data analytics on healthcare networks
Build tools for data exploration and evidence generation
Achilles: interactive reports and statistics
Hermes: vocabulary browsing and related searches
Heracles: build cohorts to assess clinical features on populations
Homer: risk identification by exploring many clinical dimensions
UMG-MeDIC [PSS+23]
Medical data integration center; relies on Medical Informatics Initiative (MI-I) funds and the HiGHmed consortium
Create a technical and legal framework for cross-site secondary use of routine healthcare data
Aim for high compliance with FAIR Principles, but data integration workflows are complex and inefficient when done manually
Operates on a continuous flow of data (≠ individual datasets)
Periodic integration of new data
A central relational database with anonymized data
Combine individual pre-processing tasks into workflows
Require that each task is documented with “meta-data”
The OMOP data model [SRR+10]
D-ETL [OKK+17]
“Dynamic-ETL”: semi-automatic ETL to map source and target data models
Creation of an ETL specification document (vocabularies, data schema, definitions, conventions)
Data extraction from initial sources and validation
D-ETL rules writing (T1 ⋈ T2 on T1.a = T2.b)
Conversion of rules to SQL statements
Testing rules on data; iterate if the results are not satisfactory
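The step from a join rule to a SQL statement can be sketched as follows. The `JoinRule` class and its fields are hypothetical and do not reproduce the D-ETL rule format; the sketch only shows how a rule such as "T1 ⋈ T2 on T1.a = T2.b" could be rendered as SQL and tested on sample data with `sqlite3`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class JoinRule:
    left: str
    right: str
    left_key: str
    right_key: str
    columns: list

    def to_sql(self) -> str:
        # Identifiers are interpolated directly: acceptable only for
        # trusted, expert-written rules, never for untrusted input.
        cols = ", ".join(self.columns)
        return (f"SELECT {cols} FROM {self.left} "
                f"JOIN {self.right} "
                f"ON {self.left}.{self.left_key} = {self.right}.{self.right_key}")


rule = JoinRule("T1", "T2", "a", "b", ["T1.a", "T2.c"])

# Test the generated statement on a small made-up dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (a INTEGER);
    CREATE TABLE T2 (b INTEGER, c TEXT);
    INSERT INTO T1 VALUES (1), (2);
    INSERT INTO T2 VALUES (1, 'x'), (3, 'y');
""")
rows = conn.execute(rule.to_sql()).fetchall()
```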
Creating new codes with post-coordination
Some healthcare concepts do not have a specific code
SNOMED-CT introduces post-coordination as a compositional grammar
A post-coordinated code = a sequence of existing codes with operators
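As a rough illustration, a post-coordinated expression can be assembled from a focus concept refined by attribute/value pairs, loosely following the "focus : attribute = value" shape of the SNOMED CT compositional grammar. The helper name is hypothetical and the identifiers are placeholders, not real SNOMED CT codes; the full grammar has more operators than shown here.

```python
def post_coordinate(focus: str, refinements: dict) -> str:
    """Compose an expression from existing codes and the ':' and '=' operators."""
    if not refinements:
        return focus
    parts = ", ".join(f"{attr} = {value}" for attr, value in refinements.items())
    return f"{focus} : {parts}"


# Placeholder identifiers (ID_...) stand in for real concept codes.
expr = post_coordinate(
    "ID_FRACTURE |Fracture|",
    {"ID_SITE |Finding site|": "ID_FEMUR |Left femur|"},
)
```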