
User-oriented exploration of semi-structured datasets

Author: Barret, Nelly
Publisher: Zenodo
DOI: 10.5281/zenodo.17679803
Source: https://zenodo.org/records/17679803/files/phd-defense-slides.pdf
User-oriented exploration of semi-structured datasets
Nelly Barret
Inria Saclay and Institut Polytechnique de Paris
Supervised by Ioana Manolescu and Karen Bastien
March 15, 2024
Nelly Barret (Inria) Semi-structured Data Exploration March 15, 2024 1 / 100
Outline
1. Motivation: exploring semi-structured data
2. Overview of our approach
3. Abstra: first-sight overview of a dataset
4. Pathways: efficiently finding interesting paths
5. Systems developed
6. Conclusion
Motivation: exploring semi-structured data
Data exploration by non-technical users (NTUs)
Conflicts of Interest in the biomedical domain [ABB+21] w/ S. Horel
Is this dataset useful for the investigations?
How are authors connected to biomedical companies?
Semi-structured data exploration
Several semi-structured data models:
XML documents
JSON documents
RDF graphs
Property graphs
Semi-structured dataset exploration is hard: complex, irregular structure
Overview of our approach
Research contributions
Abstra: data overviews [BMU22,BMU24]
Lightweight Entity-Relationship diagrams
Compact yet meaningful data overviews
Ideal for first-sight dataset discovery
PathWays: interesting Named Entity connections [BGLM23b,BGLM23a,BGLM24]
Interesting entity paths in and across datasets
Complete set of NE-to-NE interesting connections
Ideal for exploring connections within and across datasets

Abstra: first-sight overview of a dataset
What does the dataset describe?
Real-world objects and relationships between them
Entity-Relationship models [RG03]
Need to compute them from the dataset!
What about semi-structured data models (nesting)?
Keep it simple and of controllable size
The Abstra approach
1. Integrate all data sources in a graph (ConnectionLens) [ABC+22]
2. Summarize the graph
3. Among summary nodes, identify entities and their attributes
4. In the summary, identify relationships between the entities
5. Propose a simple category for each entity (best-effort)
Background: from heterogeneous data to data graphs
ConnectionLens [ABC+22]:
1. Ingests any dataset into a directed graph
Generic, flexible, fine granularity
2. Extracts Named Entities (NEs) from all text nodes
date, email address, People, Place, Organization, ...
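
To make the ingestion step concrete, here is a minimal sketch of how a JSON document could be turned into a directed graph with fine-grained nodes. It is not ConnectionLens code: the function name `json_to_graph`, the `"item"` label for nested arrays, and the sample document are assumptions made for illustration, and Named Entity extraction is left out.

```python
from itertools import count


def json_to_graph(doc, root_label="root"):
    """Turn a parsed JSON value into a directed graph.

    Returns (labels, texts, edges):
      labels: node id -> label (value leaves are labeled "#val")
      texts:  value leaf id -> its text
      edges:  list of (source id, target id)
    """
    ids = count()
    labels, texts, edges = {}, {}, []

    def add(label, value, parent):
        node = next(ids)
        labels[node] = label
        if parent is not None:
            edges.append((parent, node))
        if isinstance(value, dict):
            for key, child in value.items():
                # A JSON array yields one child node per element, all with the same label.
                for item in (child if isinstance(child, list) else [child]):
                    add(key, item, node)
        elif isinstance(value, list):
            # Array without a key of its own (assumption: label its elements "item").
            for item in value:
                add("item", item, node)
        else:
            # Atomic value: one "#val" leaf holding the text.
            leaf = next(ids)
            labels[leaf] = "#val"
            texts[leaf] = str(value)
            edges.append((node, leaf))
        return node

    add(root_label, doc, None)
    return labels, texts, edges


# Hypothetical sample document.
doc = {"paper": {"title": "Abstra", "year": 2022,
                 "author": [{"name": "Alice"}, {"name": "Bob"}]}}
labels, texts, edges = json_to_graph(doc)
```

Every JSON key becomes an internal node and every atomic value a `#val` leaf, which is the granularity the later slides rely on when they show `#val` children under each attribute.
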
Data graph summarization
The summary (collection graph) G
Collection node for each equivalence class
Collection edge Cs→Ct if a data edge exists
Entity profile for each leaf collection node: reflects NEs in the leaves
[Figure: collection graph G with nodes paper, abstract, year, title, wB, hW, pIn, in, author, conf, name, date, email, affiliation, university, city, campus, and their #val leaves]
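
The construction of the collection graph can be sketched as a quotient of the data graph. The slides do not state the equivalence relation, so the one used below is an assumption: nodes are grouped by label, and `#val` leaves are grouped by the label of their parent. The function name `summarize` and the toy data are hypothetical.

```python
from collections import defaultdict


def summarize(labels, edges):
    """Build a quotient summary of a tree-shaped data graph.

    labels: node id -> label; edges: list of (source id, target id).
    Returns (members, collection_edges):
      members:          collection name -> set of data nodes it represents
      collection_edges: set of (source collection, target collection)
    """
    parent = {target: source for source, target in edges}

    def collection_of(node):
        # Assumed equivalence: same label; value leaves follow their parent.
        if labels[node] == "#val":
            return labels[parent[node]] + "/#val"
        return labels[node]

    members = defaultdict(set)
    for node in labels:
        members[collection_of(node)].add(node)
    # A collection edge exists as soon as one data edge connects the two classes.
    collection_edges = {(collection_of(s), collection_of(t)) for s, t in edges}
    return dict(members), collection_edges


# Two hypothetical papers; only the second one has a year.
labels = {0: "paper", 1: "title", 2: "#val",
          3: "paper", 4: "title", 5: "#val", 6: "year", 7: "#val"}
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (3, 6), (6, 7)]
members, collection_edges = summarize(labels, edges)
```

The irregular structure (a missing `year`) disappears in the summary: `paper → year` is a collection edge because at least one data edge supports it.
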
Identifying entities and relationships
Identifying entities in the collection graph G
[Figure: the collection graph G]
Which collections represent entities in the E-R diagram?
Which collections represent entity attributes?

Requirements and algorithm
We need an algorithm to identify entity roots and attributes of the E-R diagram
For complex, potentially cyclic, collection graphs
Greedy selection of few entities in G
1. Assign a score to each collection node
2. While less than E_max entity roots, or data coverage < cov_min
   1. Elect the next highest-scored eligible collection node as an entity root
   2. Compute its boundary, i.e., attribute set
   3. Update the collection graph to reflect the selection of an entity
   4. Recompute the scores
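
The greedy loop above can be sketched as follows. This is an illustration under stated assumptions, not the Abstra implementation: the score is a simplified stand-in (total size of the still-unallocated direct children), the boundary is the set of those children, and the loop is read as stopping once either limit (E_max roots or cov_min coverage) is reached. All names and sizes are hypothetical.

```python
def select_entities(children, size, e_max=3, cov_min=0.8):
    """Greedy entity selection on a collection graph.

    children: collection -> list of child collections
    size:     collection -> number of data nodes it represents
    Returns (entities, coverage), entities mapping each root to its boundary.
    """
    total = sum(size.values())
    free = set(size)      # collections not yet allocated to an entity
    entities = {}         # entity root -> boundary (attribute set)
    covered = 0

    def boundary(c):
        # Assumed boundary: direct children that are still unallocated.
        return {a for a in children.get(c, []) if a in free}

    def score(c):
        # Assumed score: data volume of the boundary.
        return sum(size[a] for a in boundary(c))

    while len(entities) < e_max and covered / total < cov_min:
        eligible = [c for c in free if boundary(c)]
        if not eligible:
            break
        # Scores are recomputed at every round because `free` has changed.
        root = max(eligible, key=lambda c: (score(c), c))
        attrs = boundary(root)
        entities[root] = attrs
        covered += size[root] + sum(size[a] for a in attrs)
        free -= attrs | {root}   # update: allocate the root and its boundary
    return entities, covered / total


# Hypothetical collection graph: paper -wB-> author.
children = {"paper": ["title", "year", "wB"],
            "wB": ["author"],
            "author": ["name", "email"]}
size = {"paper": 100, "title": 100, "year": 100, "wB": 100,
        "author": 150, "name": 150, "email": 120}
entities, coverage = select_entities(children, size)
```

On this toy input the first round elects `paper`, the update removes `wB` from the pool, and the second round elects `author`, after which the coverage threshold is met.
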
How to score a collection node?
Reflect the weight of this node and its structure in the dataset
1. w_desc_k, w_leaf_k: # descendants, leaf descendants, at depth k
× Not clear how to pick k
PageRank score of a collection graph node
[Figure: the reverse collection graph GR with PR edge weights (1 for most edges, 0.5 where a node has two outgoing edges)]
Collections distribute their score based solely on their connectivity
How to score a collection node?
1. w_desc_k, w_leaf_k: # descendants, leaf descendants, at depth k
2. w_DAG: dw bottom-up propagation on G (outside cycles)
3. w_PageRank: PageRank algorithm on G
4. w_dwPageRank: PageRank algorithm on G with dw-tuned PR edge weights
✓ Reflects both the topology and where actual data is
The data-weighted PageRank score
[Figure: the reverse collection graph GR with dw-tuned PR edge weights (the even 0.5/0.5 splits become 0.66/0.33 and 0.4/0.6)]
Scores shown on the nodes: paper .178, author .179, hW .158, wB .107, conf .067, pIn .063, in .056, affiliation .027, university .024; abstract, year, title, name, date, email, city and campus .011 each; #val leaves .006 each
Propagates scores across the collection graph
Works on cyclic collection graphs
The score reflects the topology and where the data is
A collection node distributes its weight
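
A minimal sketch of the idea, assuming a standard power-iteration PageRank in which each node splits its score over its outgoing edges in proportion to an edge weight. With all weights equal this is plain PageRank; using data-edge counts as weights gives a data-weighted variant. The function name, the damping factor of 0.85, the uniform handling of nodes without outgoing edges, and the toy edge counts are assumptions, not details taken from the slides.

```python
def pagerank(edge_weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edge_weights: {(source, target): weight}; returns node -> score.
    """
    nodes = sorted({n for edge in edge_weights for n in edge})
    out_total = {n: 0.0 for n in nodes}
    for (source, _), weight in edge_weights.items():
        out_total[source] += weight

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their score uniformly.
        dangling = sum(score[n] for n in nodes if out_total[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for (source, target), weight in edge_weights.items():
            # A node distributes its score proportionally to the edge weights.
            nxt[target] += damping * score[source] * weight / out_total[source]
        score = nxt
    return score


# Hypothetical node x with two outgoing edges in the reverse graph.
plain = pagerank({("x", "a"): 1, ("x", "b"): 1})   # 0.5 / 0.5 split
dw = pagerank({("x", "a"): 2, ("x", "b"): 1})      # 0.66 / 0.33 split
```

In the plain run `a` and `b` receive the same score; in the data-weighted run `a` ends up ahead because twice as many data edges lead to it.
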
How to compute an entity boundary?
Collections in G representing attributes of this entity
“Those that contribute to the entity’s weight”
The boundary may go far (for deep-structure entities)
Easy to define for w_desc_k, w_leaf_k, w_DAG. Example for w_desc_2
[Figure: the collection graph G with edge weights, showing the boundary of an entity]
Does not apply to PageRank-based scores
How to update the collection graph after selecting an entity?
Reflect the allocation of data nodes and edges to one entity
1. update_boolean
Collection nodes and edges in the boundary of the entity
Very efficient
Sufficient for w_desc_k, w_leaf_k, w_DAG
2. update_exact
Graph nodes and edges
Much more costly
Required for w_PageRank, w_dwPageRank

Exact graph update
[Figure: the collection graph G during an exact update]
Selected entities and their boundaries
[Figure: the collection graph G with the selected entities and their boundaries]
Finding relationships between entities
Relationship: a path from an entity to another
[Figure: the collection graph G with the selected entities and the paths between them]
paper → wB → author
paper → pIn → conf
author → hW → paper
conf → in → author
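
The four relationships listed above can be reproduced with a small path search, sketched here under the assumption that a relationship is a path that starts at one entity root and stops at the first entity root it reaches. The function name and the edge list (a subset of the figure's collection graph) are illustrative.

```python
from collections import defaultdict


def relationships(edges, entity_roots):
    """Enumerate paths from an entity root to the next entity root reached."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)

    found = []

    def walk(path):
        for nxt in successors[path[-1]]:
            if nxt in entity_roots:
                found.append(path + [nxt])   # stop at the first entity reached
            elif nxt not in path:            # avoid looping on cyclic graphs
                walk(path + [nxt])

    for root in sorted(entity_roots):
        walk([root])
    return found


edges = [("paper", "title"), ("paper", "year"),
         ("paper", "wB"), ("wB", "author"),
         ("paper", "pIn"), ("pIn", "conf"),
         ("author", "hW"), ("hW", "paper"),
         ("conf", "in"), ("in", "author"),
         ("author", "name")]
paths = relationships(edges, {"paper", "author", "conf"})
```

Attribute branches such as `paper → title` never reach an entity root, so they produce no relationship.
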
Entity classification
Assign a semantic category to each entity
Input: an entity E, categories K, semantic properties P
K: Person, ScientificPaper, Event, Website, Mountain, ...
P: {label:"address", domain:[Pers., Org.], range:[Place]}, ...
Output: a category for E
Algorithm:
Compare:
The common name of all nodes in the entity root (if it exists) with k ∈ K (conf, paper, author)
Its attribute names with p ∈ P (affiliation, email, ...)
Its entity profiles with p.range ∈ P
Each good match votes for one or few categories
Entity classification: example for the paper entity
[Figure: the paper entity with its attributes abstract, year and title]
Name       Similar to                   Votes for
paper      ResearchPublication (0.85)   ResearchPublication
           News (0.63)                  News

Attribute  Similar to                   Votes for
abstract   abstract (1.0)               ResearchPublication
           summary (0.92)               Book
           preface (0.47)
title      title (1.0)                  ResearchPublication
           honorific title (0.87)       Movie
                                        Person
year       year publication (0.85 + )   Event
                                        Book
                                        ResearchPublication, ...
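
The voting scheme can be sketched as below. The similarity scores on the slides (for example paper ~ ResearchPublication at 0.85) clearly come from a semantic measure; this sketch swaps in the purely lexical `difflib.SequenceMatcher` ratio so that it stays self-contained, which means it only matches near-identical names. The category list, the property-to-domain table, the threshold and the function names are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Lexical similarity in [0, 1] (stand-in for a semantic similarity)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(entity_name, attributes, category_names, property_domains,
             threshold=0.8):
    """Return (best category or None, vote counts)."""
    votes = Counter()
    # The entity name votes for the categories it resembles.
    for category in category_names:
        if similarity(entity_name, category) >= threshold:
            votes[category] += 1
    # Each attribute votes for the domains of the properties it resembles.
    for attribute in attributes:
        for prop, domains in property_domains.items():
            if similarity(attribute, prop) >= threshold:
                votes.update(domains)
    best = votes.most_common(1)[0][0] if votes else None
    return best, votes


# Hypothetical categories and semantic properties.
category_names = ["ResearchPublication", "Book", "Person", "Movie"]
property_domains = {
    "abstract": ["ResearchPublication"],
    "summary": ["Book"],
    "title": ["ResearchPublication", "Movie", "Book"],
    "honorific title": ["Person"],
}
best, votes = classify("paper", ["abstract", "title", "year"],
                       category_names, property_domains)
```

Here `abstract` and `title` each match one property exactly, so ResearchPublication collects two votes while Book and Movie collect one each.
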
Experimental evaluation
Entity selection quality with (w_dwPageRank, bound_fl−ac)
Dataset name  |C|  |ME|  |MR|  cov   ME              dmax  |MEi|
Mondial       168  5     8     0.85  City            3     3,152
                                     Province        3     1,455
                                     Country         4     231
                                     Organization    4     168
                                     River           4     135
PubMed        26   1     0     1.0   PubMedArticle   5     957
XMark1        136  5     10    0.91  Person          4     25,500
                                     Item            7     21,750
                                     Open Auction    8     12,000
                                     Closed Auction  8     9,750
                                     Category        2     1,000
XMark4        136  5     10    0.90  Person          4     102,000
                                     Item            7     87,000
                                     Open Auction    8     48,000
                                     Closed Auction  8     39,000
                                     Category        2     4,000
Wikimedia     59   2     0     1.0   Page            4     54,750
                                     Namespace       3     32
Abstra selects frequent, coherent and semantically central entities
Experimental evaluation: scalability
Our abstraction method scales up linearly in the data size

Related work
Data summarization
  Structural
    Quotient [GGM20,KC10,MS99] (the one we adopt to build G)
    Non-quotient [GW97]
  Pattern mining [ZLVK16]
  Statistical [HS12]
  Hybrid [RGSB17]
Schema inference
  XML [CGS11]
  JSON [BCGS19]
  RDF [GLSW22]
  PG [LBH21]
Data summarization and schema inference are tied to one data model
Schemas are often not suited to NTUs
[Figure: a JSON schema from social network data using [BCGS19]]
Pathways: efficiently finding interesting paths
Data is often used to find connections
NE-to-NE path enumeration
What makes a NE-to-NE path interesting?
Some paths connecting Person NEs to Organization NEs
[Person] ← #val ← Name ← Author → Affiliation → #val → [Organization]
[Person] ← #val ← Name ← Author ← Authors ← Article → Journal → #val → [Organization]
[Person] ← #val ← COI ← Article → Journal → #val → [NE] ← #val → [Organization]
Which paths are most interesting and deserve to be evaluated?

Some paths are unreliable: we face entity extraction errors
E.g., “John Hopkins University Hospital”, where “John Hopkins” is extracted as a person
False positives, or wrong entity type attribution, e.g., “THC” extracted as an org.
Some paths are structurally weak: we face information dilution
E.g., a paper has 50 authors
Path interestingness: based on edge reliability and edge force
1. Reliability (Ci ⇢ τ) of an extraction collection edge
The ratio of NEs having the type τ, and extracted from Ci
Path reliability: minimum extraction edge reliability
2. Force (Ci→Cj) of a structural collection edge
The inverse of the maximal source node out-degree among data edges represented by Ci→Cj
Path force: product of edge forces
3. Rank paths on their reliability, then their force
4. Take a top-k of those having ≥ θ
Some paths connecting Person NEs to Organization NEs, with the value of each edge
[Person] ←(1.0) #val ←(1.0) Name ←(1.0) Author (1.0)→ Affiliation (1.0)→ #val (0.91)→ [Organization]
Reliable; strong
[Person] ←(1.0) #val ←(1.0) Name ←(1.0) Author ←(0.02) Authors ←(1.0) Article (1.0)→ Journal (1.0)→ #val (0.41)→ [Organization]
Reliable; weak
[Person] ←(0.09) #val ←(1.0) COI ←(1.0) Article (1.0)→ Journal (1.0)→ #val (0.05)→ [NE] ←(0.09) #val (0.04)→ [Organization]
Not reliable; strong
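
The two measures and the ranking can be sketched with the edge values of the three example paths. Two readings are assumptions here: the edges touching a Named Entity are taken to be the extraction edges (all others structural), and the threshold θ is applied to path reliability. The function names, the path names, and the values k = 2 and θ = 0.3 are hypothetical.

```python
from math import prod


def edge_force(max_source_out_degree):
    """Force of a structural edge: inverse of the maximal source out-degree."""
    return 1.0 / max_source_out_degree


def path_reliability(extraction_reliabilities):
    """Path reliability: minimum reliability over its extraction edges."""
    return min(extraction_reliabilities)


def path_force(edge_forces):
    """Path force: product of the forces of its structural edges."""
    return prod(edge_forces)


def rank_paths(paths, k, theta):
    """Keep paths with reliability >= theta, rank by reliability then force."""
    kept = [p for p in paths if path_reliability(p["extraction"]) >= theta]
    kept.sort(key=lambda p: (path_reliability(p["extraction"]),
                             path_force(p["structural"])),
              reverse=True)
    return kept[:k]


paths = [
    {"name": "via affiliation",
     "extraction": [1.0, 0.91],
     "structural": [1.0, 1.0, 1.0, 1.0]},
    {"name": "via journal",
     "extraction": [1.0, 0.41],
     # One article with 50 authors: force 1/50 on the Authors-Author edge.
     "structural": [1.0, 1.0, edge_force(50), 1.0, 1.0, 1.0]},
    {"name": "via COI",
     "extraction": [0.09, 0.05, 0.09, 0.04],
     "structural": [1.0, 1.0, 1.0, 1.0]},
]
top = rank_paths(paths, k=2, theta=0.3)
```

With these settings the COI path is filtered out for its low reliability (0.04), and the affiliation path ranks ahead of the journal path, which is reliable but diluted by the 50-author article.
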
PathWays output: data paths as tables
Related work
Structured querying
  SQL, SPARQL, GQL [DFG+22]
Assisted struct. querying
  Interactive queries [DAB16]
  Guided query writing [ERAAL18,KKBS10]
  NL2SQL [KSHL20]
Keyword-based search
  Unidirectional [ABC+02,LOF+08]
  Bi-directional [ABC+22]
Path search in struct. queries
  SPARQL extensions: [ASMH18,AMSH18,AMM23]
  For PGs: [DFG+22]
Pathways users need no knowledge of the graph structure or values
Less intimidating to NTUs

Systems developed
Abstra for data abstraction:
https://team.inria.fr/cedar/projects/abstra/
65 Java core classes and 10K LOC
Demonstrated at CIKM 2022 [BMU22] (also BDA 2022)
PathWays for NE-to-NE paths:
https://team.inria.fr/cedar/projects/pathways/
18 Java core classes and 4K LOC
Demonstrated at ESWC 2023 [BGLM23b] (also BDA 2023)
ConnectionStudio for NTU data exploration:
https://connectionstudio.inria.fr/
4K Java LOC and 21K JavaScript LOC (w/ T. Galizzi, S. Ebel, M. Mohanty)
Demonstrated at CoopIS 2023 [BEG+23] (also BDA 2023)
ConnectionStudio software pile
All deployed using Maven, hundreds of unit tests, etc.
Help from T. Galizzi, M. Mohanty
Several rounds of re-engineering (ML model memory consumption, etc.)
ConnectionStudio
Pathways
Abstra
ConnectionLens, incl. [AMM23]
RDFQuotient (14K LOC)
OntoSQL (85K LOC)
Jena
Sys ems de eloped
A comprehensive data exploration tool for NTUs
ConnectionStudio: a data lake for ingesting, exploring and querying heterogeneous data
1. Data abstractions as E-R diagrams (Abstra)
2. NE-to-NE paths as tables (PathWays)
3. "Gentle introduction" to the data lake (w/ journalist input)
Demonstrated to journalists at DataJournos (40) and CFI (60)
ConnectionStudio interesting for a first look at the data.
Still maturing...
Conclusion
Takeaways and next steps
We introduced:
1. A unified view over heterogeneous semi-structured data models
2. Abstra: a dataset abstraction system for semi-structured data
3. PathWays: an entity-focused exploration system
4. ConnectionStudio: a comprehensive data lake exploration tool
Next steps:
  Generate PG schemas from abstractions [BEMM24]
  Migrate data graphs into PG graphs
  Enrich extracted NEs with RDF knowledge bases
References III
Nelly Barret, Antoine Gauquier, Jia Jean Law, and Ioana Manolescu.
PATHWAYS: entity-focused exploration of heterogeneous data graphs (demonstration).
In ESWC, 2023.
Nelly Barret, Antoine Gauquier, Jia Jean Law, and Ioana Manolescu.
Exploring heterogeneous data graphs through their entity paths.
Inf. Systems SUBM, 2024.
Nelly Barret, Ioana Manolescu, and Prajna Upadhyay.
ABSTRA: toward generic abstractions for data of any model (demonstration).
In CIKM, 2022.
Nelly Barret, Ioana Manolescu, and Prajna Upadhyay.
Computing generic abstractions from application datasets.
In EDBT, 2024.
Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani.
Schemas for safe and efficient XML processing.
In ICDE. IEEE Computer Society, 2011.
Gonzalo Diaz, Marcelo Arenas, and Michael Benedikt.
SPARQLByE: querying RDF data by example.
Proceedings of the VLDB Endowment, 9(13):1533–1536, 2016.
References IV
Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoc, Mingxi Wu, and Fred Zemke.
Graph pattern matching in GQL and SQL/PGQ.
In SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 2246–2258, 2022.
Ahmed El-Roby, Khaled Ammar, Ashraf Aboulnaga, and Jimmy Lin.
Sapphire: querying RDF data made simple.
arXiv preprint arXiv:1805.11728, 2018.
François Goasdoué, Pawel Guzewicz, and Ioana Manolescu.
RDF graph summarization for first-sight structure discovery.
The VLDB Journal, 29(5), April 2020.
Benoît Groz, Aurélien Lemay, Slawek Staworko, and Piotr Wieczorek.
Inference of shape graphs for graph databases.
In ICDT, volume 220, 2022.
Roy Goldman and Jennifer Widom.
DataGuides: enabling query formulation and optimization in semistructured databases.
In VLDB, 1997.
Katja Hose and Ralf Schenkel.
Towards benefit-based RDF source selection for SPARQL queries.
In Proceedings of the 4th International Workshop on Semantic Web Information Management, pages 1–8, 2012.
References V
Shahan Khatchadourian and Mariano P Consens.
ExpLOD: summary-based exploration of interlinking and RDF usage in the Linked Open Data Cloud.
In Extended Semantic Web Conference, pages 272–287. Springer, 2010.
Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu.
SnipSuggest: context-aware autocompletion for SQL.
Proceedings of the VLDB Endowment, 4(1):22–33, 2010.
Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee.
Natural language to SQL: Where are we today?
Proceedings of the VLDB Endowment, 13(10):1737–1750, 2020.
Hanâ Lbath, Angela Bonifati, and Russ Harmer.
Schema inference for property graphs.
In EDBT, 2021.
Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou.
EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data.
In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 903–914, 2008.
Tova Milo and Dan Suciu.
Index structures for path expressions.
In International Conference on Database Theory, pages 277–295. Springer, 1999.
Raghu Ramakrishnan and Johannes Gehrke.
Database Management Systems (3rd edition).
McGraw-Hill, 2003.
References VI
Matteo Riondato, David García-Soriano, and Francesco Bonchi.
Graph summarization with quality guarantees.
Data Mining and Knowledge Discovery, 31:314–349, 2017.
Mussab Zneika, Claudio Lucchese, Dan Vodislav, and Dimitris Kotzinos.
Summarizing linked data RDF graphs using approximate graph pattern mining.
In 19th International Conference on Extending Database Technology, 2016.
Data-acyclic flooding boundary
[Figure: collection graph with nodes mailbox, email, date, content, list, item, text and #val leaves]
The boundary is unca ed due to cyclic collection edges
Entity classification time
The classification time is composed of:
Loading the Word2Vec semantic model
  Constant, 4-8 seconds
Comparing entity attributes with semantic properties
  Varies with the number of entities and their number of attributes
  May vary in a generated dataset of different sizes (different entity roots)
Computing entity profiles
  Linear in the input size
RDF quotient graph summarization [GGM20]
Source clique: set of outgoing properties co-occurring together on at least one node
Target clique: set of incoming properties co-occurring together on at least one node
Properties "a", "b", "d" are in the same source clique
Properties "a" and "e" are in the same target clique
(c) Pawel Guzewicz
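The clique definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the RDFQuotient implementation: the toy triples and the function name `property_cliques` are hypothetical, and the sketch assumes that co-occurrence is closed transitively (if "a" and "b" share a node, and "b" and "d" share another node, all three end up in one clique), which is how the slide example groups "a", "b" and "d".

```python
from collections import defaultdict


def property_cliques(triples, outgoing=True):
    """Group properties into cliques with a small union-find.

    outgoing=True  -> source cliques (properties sharing a subject node)
    outgoing=False -> target cliques (properties sharing an object node)
    """
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Collect, for every node, the properties it carries on the chosen side.
    props_on_node = defaultdict(set)
    for subject, prop, obj in triples:
        props_on_node[subject if outgoing else obj].add(prop)

    # Properties co-occurring on one node are merged into the same clique.
    for props in props_on_node.values():
        first, *rest = sorted(props)
        find(first)
        for other in rest:
            parent[find(other)] = find(first)

    cliques = defaultdict(set)
    for prop in list(parent):
        cliques[find(prop)].add(prop)
    return sorted(sorted(c) for c in cliques.values())


# Hypothetical toy graph, chosen to reproduce the groupings named on the slide.
TRIPLES = [
    ("n1", "a", "x1"),
    ("n1", "b", "x2"),
    ("n2", "b", "x3"),
    ("n2", "d", "x4"),
    ("n3", "e", "x1"),
]

source_cliques = property_cliques(TRIPLES, outgoing=True)
target_cliques = property_cliques(TRIPLES, outgoing=False)
print(source_cliques)
print(target_cliques)
```

In this toy graph, "a" and "d" never appear on the same node; they land in the same source clique only through "b", which is the case the transitive-closure assumption is meant to cover.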
Strong summary [GGM20]
Strong S summary:
Two nodes are S equivalent if they have both the same source and target cliques
[Figure panels: source and target cliques of each node; strong summary]
(c) Pawel Guzewicz
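As a rough illustration of S equivalence, the sketch below groups nodes by their (source clique, target clique) pair and copies each edge onto the resulting classes. It is a simplified reading of the definition, not the algorithm of [GGM20]: the triples, the clique identifiers ("SC1", "TC1", ...) and the function name are hypothetical, and the clique assignment of each property is taken as given input.

```python
from collections import defaultdict


def strong_summary(triples, source_clique_of, target_clique_of):
    """Return (equivalence classes, summary edges) for S equivalence."""
    source_clique, target_clique, nodes = {}, {}, set()
    for subject, prop, obj in triples:
        nodes.update((subject, obj))
        # All outgoing (resp. incoming) properties of a node share one clique,
        # so any of them identifies the node's source (resp. target) clique.
        source_clique[subject] = source_clique_of[prop]
        target_clique[obj] = target_clique_of[prop]

    # A node without outgoing (resp. incoming) edges gets None on that side.
    signature = {n: (source_clique.get(n), target_clique.get(n)) for n in nodes}

    classes = defaultdict(set)
    for node, key in signature.items():
        classes[key].add(node)

    # Quotient graph: one edge per distinct (class, property, class) triple.
    edges = {(signature[s], p, signature[o]) for s, p, o in triples}
    return dict(classes), edges


# Hypothetical input: a toy graph and the clique of each property.
TRIPLES = [
    ("n1", "a", "x1"),
    ("n1", "b", "x2"),
    ("n2", "b", "x3"),
    ("n2", "d", "x4"),
    ("n3", "e", "x1"),
]
SOURCE_CLIQUE_OF = {"a": "SC1", "b": "SC1", "d": "SC1", "e": "SC2"}
TARGET_CLIQUE_OF = {"a": "TC1", "e": "TC1", "b": "TC2", "d": "TC3"}

classes, edges = strong_summary(TRIPLES, SOURCE_CLIQUE_OF, TARGET_CLIQUE_OF)
for key, members in sorted(classes.items(), key=str):
    print(key, sorted(members))
```

Here n1 and n2 collapse into one summary node, as do x2 and x3, so the five input triples yield four summary edges.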
Typed-strong summary [GGM20]
Typed-strong TS summary:
Two typed nodes are TS equivalent if they have the same type set
Two untyped nodes are TS equivalent if they have both the same source and target cliques
[Figure panels: source and target cliques of each node + an RDF type; typed-strong summary]
(c) Pawel Guzewicz
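The two TS rules amount to choosing a different grouping key depending on whether a node has types. The following sketch shows only that choice; the node names, the "Author" type and the clique signatures are hypothetical, and the signatures are assumed to be precomputed.

```python
from collections import defaultdict


def typed_strong_classes(nodes, types_of, clique_signature):
    """Partition nodes under TS equivalence.

    types_of: node -> set of RDF types (missing or empty for untyped nodes)
    clique_signature: node -> (source clique, target clique), used only
    for untyped nodes
    """
    classes = defaultdict(set)
    for node in nodes:
        types = types_of.get(node)
        if types:
            key = ("typed", frozenset(types))           # rule 1: same type set
        else:
            key = ("untyped", clique_signature[node])   # rule 2: same cliques
        classes[key].add(node)
    return dict(classes)


# Hypothetical input: n1 and n3 are typed, the other nodes are not.
NODES = ["n1", "n2", "n3", "x2", "x3"]
TYPES_OF = {"n1": {"Author"}, "n3": {"Author"}}
CLIQUE_SIGNATURE = {
    "n1": ("SC1", None),
    "n2": ("SC1", None),
    "n3": ("SC2", None),
    "x2": (None, "TC2"),
    "x3": (None, "TC2"),
}

ts_classes = typed_strong_classes(NODES, TYPES_OF, CLIQUE_SIGNATURE)
for key, members in sorted(ts_classes.items(), key=str):
    print(key, sorted(members))
```

In this example n1 and n3 are grouped because they share the type set, although their cliques differ, while n2 (same cliques as n1 but untyped) stays in a class of its own.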
Disagreement between Flair and ChatGPT
False Flair positives:
  Flair identifies "Av. Peter Henry Rolfs [person] 36570-900 Vicosa"
Flair misled by capitalization:
  Flair identifies "Claudin-7b [person]" (but not ChatGPT)
Different token allocation:
  "University of Alabama [org.]", "Birmingham [loc.]"
  "University of Alabama, Birmingham [loc.]"
Missed non-English spelling/names:
  ChatGPT finds "Antonio González [person]"
  ChatGPT finds "Yoshida, Sakyo-ku, Kyoto 606-8501, Japan [loc.]"
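One way to sort such disagreements into categories is to compare the character spans returned by the two extractors. The sketch below is only an illustration under stated assumptions: it does not call Flair or ChatGPT, the annotations are typed in by hand from the slide examples, the names `tool_a` and `tool_b` are placeholders (the slide does not say which tool produced which segmentation), and the category names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def compare_annotations(tool_a, tool_b):
    """Classify span-level (dis)agreements between two NE extractors.

    Each annotation is (start, end, label), with end exclusive.
    """
    report = {
        "agree": [],              # same span, same label
        "label_mismatch": [],     # same span, different label
        "boundary_mismatch": [],  # overlapping spans with different bounds
        "only_a": [],             # found by tool A only
        "only_b": [],             # found by tool B only
    }
    matched_b = set()
    for a in tool_a:
        partners = [b for b in tool_b if overlaps(a, b)]
        if not partners:
            report["only_a"].append(a)
            continue
        matched_b.update(partners)
        for b in partners:
            if a[:2] != b[:2]:
                kind = "boundary_mismatch"
            elif a[2] == b[2]:
                kind = "agree"
            else:
                kind = "label_mismatch"
            report[kind].append((a, b))
    report["only_b"] = [b for b in tool_b if b not in matched_b]
    return report


def span(text, fragment, label):
    """Helper: locate a fragment in the text and attach a label to it."""
    start = text.index(fragment)
    return (start, start + len(fragment), label)


# "Different token allocation" example from the slide.
text = "University of Alabama, Birmingham"
tool_a = [span(text, "University of Alabama", "org"),
          span(text, "Birmingham", "loc")]
tool_b = [span(text, text, "loc")]
allocation_report = compare_annotations(tool_a, tool_b)

# "Misled by capitalization" example: one tool tags the string, the other does not.
gene = "Claudin-7b"
capitalization_report = compare_annotations([span(gene, gene, "person")], [])

print(allocation_report["boundary_mismatch"])
print(capitalization_report["only_a"])
```

The first call reports two boundary mismatches (both spans of one tool overlap the single span of the other), and the second reports an entity found by one tool only.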