1
Multimodal data without borders: integration and exploration to the rescue
Nelly Barret
Assistant Professor at INSA Lyon and LIRIS
October 27th, 2025
2/56
SHORT BIO
2021-2024: PhD @ Inria Saclay and Ecole Polytechnique
• User-oriented exploration of semi-structured data
2024-2025: Post-doc @ Politecnico di Milano (Italy)
• FL analyses over multimodal healthcare data
Sept. 2025: Assistant Professor @ INSA Lyon and LIRIS
• Data management and Federated Learning
INTRODUCTION
3/56
Motivation
4/56
MULTIMODAL DATA IS EVERYWHERE
Varied sources, actors, and standards → multimodality is everywhere!
How to help domain experts make sense of various multimodal data?
Multimodal data usage is hard: different formats, schemas, granularities, …
6/56
AGENDA
1. Motivation: multimodal data is everywhere
2. PhD: exploring unknown semi-structured datasets
3. Post-doc: healthcare analytics across hospitals
4. Conclusions
7/56
PhD: exploring unknown semi-structured datasets
8/56
WHAT DOES THE DATASET DESCRIBE?
•Real-world objects and relationships between them
•Traditional setting: Entity-Relationship models [RG03]
•Need to compute them from the dataset!
•What about semi-structured data models (nesting)?
•Keep it simple and of controllable size
PhD: EXPLORING SEMI-STRUCTURED DATASETS
9/56
PHD PROBLEM STATEMENT AND CONTRIBUTIONS
How to facilitate user exploration of unknown heterogeneous semi-structured datasets?
1. Abstra: semi-structured data overviews [BMU22, BMU24]
•Automatically compute lightweight Entity-Relationship diagrams
•Ideal for first-sight dataset discovery
2. PathWays: interesting Named Entity connections [BGLM23b, BGLM23a, BGLM24]
•Compute and rank entity paths in and across datasets
•Ideal for exploring connections within and across sources
16/56
QUOTIENT SUMMARIZATION ACROSS DATA MODELS
Each data model has its own syntax: XML, RDF, JSON, and property graphs (PG).
17/56
SUMMARIZATION BASED ON SAME-KIND NODES
We identify node kinds in each model following best practices of data design:
•XML: elements with the same label (or type)
•JSON: nodes on the same path from the root
•RDF [GGM20]: depending on node type(s) or, if absent, incoming and outgoing properties
•PG: adaptation of the above [GGM20]
We obtain a partition over the graph: a set of equivalence classes
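The JSON rule above (nodes on the same path from the root belong to the same kind) can be illustrated with a minimal sketch. The function name `partition_by_path`, the path notation (`$` for the root, `.key` for object members, `[]` for array elements) and the sample document are assumptions made for this example, not Abstra's actual implementation.

```python
import json
from collections import defaultdict

def partition_by_path(value, path="$", classes=None):
    """Group JSON nodes into equivalence classes keyed by their path from the root."""
    if classes is None:
        classes = defaultdict(list)
    classes[path].append(value)
    if isinstance(value, dict):
        for key, child in value.items():
            partition_by_path(child, f"{path}.{key}", classes)
    elif isinstance(value, list):
        # All elements of an array share one path (assumed convention: "[]").
        for child in value:
            partition_by_path(child, f"{path}[]", classes)
    return classes

# Made-up sample document, for illustration only.
doc = json.loads(
    '{"papers": [{"title": "A", "authors": ["x", "y"]},'
    ' {"title": "B", "authors": ["z"]}]}'
)
classes = partition_by_path(doc)
for path, members in classes.items():
    print(path, len(members))
```

Each key of `classes` is one equivalence class; both paper objects fall into the same class because they sit on the same path, whatever their content.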
18/56
THE SUMMARY (COLLECTION GRAPH) 𝒢
•Collection node: one for each equivalence class
•Collection edge: C_s → C_t if a data edge exists between a node of C_s and a node of C_t
•Entity profile of each leaf collection node: reflects NEs in the leaves
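Given a partition of the data nodes, the collection graph can be derived in one pass over the data edges. The sketch below is illustrative only: `build_collection_graph` and the toy node identifiers are hypothetical, and counting how many data edges support each collection edge is an added assumption (convenient for later data-weighted scoring), not something the slide states.

```python
from collections import Counter

def build_collection_graph(node_class, data_edges):
    """Summarize a data graph into collection nodes and collection edges.

    node_class maps each data node to its equivalence class;
    data_edges is an iterable of (source, target) data nodes.
    """
    collection_nodes = set(node_class.values())
    # A collection edge exists as soon as one data edge links the two classes;
    # the counter value records how many data edges support it (assumption).
    collection_edges = Counter(
        (node_class[source], node_class[target]) for source, target in data_edges
    )
    return collection_nodes, collection_edges

# Toy data graph: two papers pointing to three authors.
node_class = {"p1": "paper", "p2": "paper", "a1": "author", "a2": "author", "a3": "author"}
data_edges = [("p1", "a1"), ("p1", "a2"), ("p2", "a3")]
collection_nodes, collection_edges = build_collection_graph(node_class, data_edges)
print(collection_nodes, dict(collection_edges))
```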
19/56
IDENTIFYING ENTITIES IN THE COLLECTION GRAPH 𝒢
Which collections represent entities in the E-R diagram?
Which collections represent entity attributes?
20/56
REQUIREMENTS AND ALGORITHM
We need an algorithm to identify entity roots and attributes of the E-R diagram
§For complex, potentially cyclic, collection graphs
Greedy selection of few entities in 𝒢:
1. Assign a score to each collection node
2. While fewer than E_max entity roots or data coverage < cov_min:
a. Elect the next highest-scored eligible collection node as an entity root
b. Compute its boundary (set of attributes)
c. Update the collection graph to reflect the selection of an entity
d. Recompute the scores
21/56
HOW TO SCORE A COLLECTION NODE?
Objective: reflect the weight of this node and its structure in the dataset
•w_desc_k and w_leaf_k: #descendants, #leaf descendants, at depth k
Not clear how to pick k
•W_DAG: Directed Acyclic Graph (DAG) rooted in each node
Does not work on cyclic graphs
•W_PR: PageRank algorithm on 𝒢
The score is solely based on the position of the node in the graph
•W_dw-PR: PageRank algorithm on 𝒢 with dw-tuned PR edge weights
Reflects both the topology and where actual data is
22/56
THE DATA-WEIGHTED PAGERANK SCORE
The original collection graph 𝒢
23/56
THE DATA-WEIGHTED PAGERANK SCORE
The reverse collection graph 𝒢0
24/56
THE DATA-WEIGHTED PAGERANK SCORE
The reverse collection graph 𝒢0 with dw-tuned PR edge weights
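To make the idea concrete, here is a minimal sketch of a weighted PageRank run on a reversed collection graph. It is an illustration under stated assumptions, not the scoring used by Abstra: edge weights are taken to be data-edge counts, the damping factor 0.85 and the fixed iteration count are conventional defaults, and the function names and the toy graph are hypothetical.

```python
def reverse_edges(edges):
    """Flip every weighted edge (source, target) into (target, source)."""
    return {(target, source): weight for (source, target), weight in edges.items()}

def weighted_pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank where a node splits its rank by edge weight."""
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank

# Toy collection graph; weights are assumed data-edge counts.
data_edge_counts = {
    ("paper", "author"): 250,
    ("paper", "title"): 100,
    ("author", "name"): 250,
    ("conference", "name"): 50,
}
nodes = ["paper", "author", "title", "name", "conference"]
scores = weighted_pagerank(nodes, reverse_edges(data_edge_counts))
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Running on the reversed graph lets rank flow from leaves up to their ancestors, so collections that sit above a lot of data end up with higher scores; the weights decide how a shared child (here `name`) splits its rank between its parents.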
25/56
DATA-WEIGHT PAGERANK SCORES IN 𝒢
W_dw-PR score computation on 𝒢0 with dw-tuned PR edge weights
32/56
FINDING RELATIONSHIPS BETWEEN ENTITIES
Paths between entity roots:
•paper → wB → author
•paper → pIn → conf
•author → hW → paper
•conf → in → author
Remaining task: classify each entity into a semantic category
33/56
ABSTRA OUTPUT: A LIGHTWEIGHT E-R DIAGRAM
318 person (Person)
•person@id (100 %)
•phone (49 %)
•creditcard (49 %)
•homepage (47 %)
•address (46 %)
  •province (52 %)
  •zipcode (100 %)
  •country (100 %)
  •city (100 %)
  •street (100 %)
•emailaddress (100 %)
•name (100 %)
150 open_auction (Product)
•privacy (56 %)
•interval (100 %)
  •end (100 %)
  •start (100 %)
•type (100 %)
•current (100 %)
•reserve (51 %)
•initial (100 %)
•open_auction@id (100 %)
•quantity (100 %)
watches.watch@open_auction
12 category (Thing)
•category@id (100 %)
•description (100 %)
  •text (73 %)
  •parlist (27 %)
    •listitem (291 %)
      •text (87 %)
•name (100 %)
profile.interest@category
seller@person
annotation.author@person
bidder.personref@person
270 item (schema:how_to_item)
•mailbox (64 %)
  •mail (101 %)
    •date (100 %)
    •to (100 %)
    •from (100 %)
    •text (100 %)
•item@featured (9 %)
•item@id (100 %)
•shipping (94 %)
•description (100 %)
  •text (73 %)
  •parlist (27 %)
    •listitem (291 %)
      •text (87 %)
•payment (94 %)
•name (100 %)
•quantity (100 %)
•location (100 %)
itemref@item
incategory@category
120 closed_auction (Product)
•price (100 %)
•type (100 %)
•date (100 %)
•quantity (100 %)
seller@person
buyer@person
annotation.author@person
itemref@item
35/56
QUICK OVERVIEW OF THE EXPERIMENTAL EVALUATION
Main semi-structured data models: 8 JSON, 7 RDF, 5 XML, 3 PG
10 synthetic, 13 real-world / 5M to 14M nodes
Collection graphs: 26 to 4.8K collections, 14/23 have cycles
Our abstraction method scales up linearly in the data size
Abstra selects frequent, coherent and semantically central entities
36/56
Post-doc: healthcare analytics across hospitals
37/56
WHAT DOES HEALTHCARE DATA HAVE TO REVEAL?
•Little cooperation/normalization between centers
•Very little patient data for rare diseases
•Traditional approach: consolidated warehouse [KR13]
•Need to provide decentralized and federated solutions
•Need to leverage expert knowledge + automation
POST-DOC: HEALTHCARE ANALYTICS ACROSS HOSPITALS
38/56
POST-DOC PROBLEM STATEMENT AND CONTRIBUTIONS
How to enable federated analyses of healthcare data across institutions and national borders?
1. I-ETL: an ETL to build interoperable healthcare databases [BBBP25]
§Provides a general and extensible metadata and data model
§Assesses interoperability along the ETL pipeline
§Produces an interoperable warehouse at each medical center
2. A general catalogue for exploring healthcare silos [BBCP+25]
§Facilitates easy exploration and federated learning algorithms design
§Automatically profiles underlying silos
39/56
EXISTING WORKS
Data Platforms: EHDEN [PdGdK+23], OHDSI [HDS+15], UMG-MeDIC [PSS+23], …
Conceptual models: OMOP [SRR+10], FHIR, …
Data Catalogues: EHDEN (pre-defined statistics), OHDSI (medicine-specific interactions), …
• Lack of generality → IT and experts’ knowledge and effort
• Use-case tailored approaches → from scratch for every project
• Existing models hardly fit existing data
40/56
THE BETTER APPROACH
1. Analyze datasets and extract their metadata
2. Create an interoperable database in each medical center
3. Assess interoperability along the pipeline
4. Allow federated analyses of data across centers through a catalogue
41/56
INTEROPERABILITY AS PART OF FAIR PRINCIPLES
FAIR principles are guidelines for good data management [WDA+16]:
•Findable: search for (indexed) resources based on identifiers
•Accessible: access data with standard protocols, even after data dies
•Interoperable: integrate and refer to datasets following FAIR principles
•Reusable: reuse datasets in other settings using provenance, etc.
48/56
MODELING AND PROFILING HEALTHCARE SILOS
Silo profiling [BBP+25]
49/56
OUR CATALOGUE DATA MODEL
Profiled entities:
•Dataset: a file
•Feature: any variable captured by a dataset
Other entities:
•Station (hospital)
•Resource (hospital setup)
51/56
PROFILING SILOS FOR INDIVIDUAL CATALOGUES
11/09/2025 Healthcare silos unlocked - Nelly Barret et al.
1. For each hospital:
§Get station information
§List its computational resources
2. For each dataset, compute its DatasetProfile with:
§Statistics
§Information from experts
3. For each feature:
§Compute its FeatureProfile: metadata, statistics, aggregates
§Extract its FeatureDomain from metadata
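As a rough illustration of step 3, the sketch below computes a small profile for each feature of a tabular dataset. The profile fields (`count`, `missing`, `kind`, `min`, `max`, `mean`, `distinct`), the function names, and the sample rows are all assumptions chosen for this example; the actual FeatureProfile of the catalogue may contain different information.

```python
from statistics import mean

def profile_feature(name, values):
    """Compute simple statistics for one feature (column) of a dataset."""
    present = [v for v in values if v is not None]
    profile = {"feature": name, "count": len(values), "missing": len(values) - len(present)}
    numeric = present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        profile.update(kind="numeric", min=min(present), max=max(present), mean=mean(present))
    else:
        profile.update(kind="categorical", distinct=len(set(present)))
    return profile

def profile_dataset(rows):
    """Profile every feature appearing in a list of records (dicts)."""
    features = sorted({key for row in rows for key in row})
    return [profile_feature(f, [row.get(f) for row in rows]) for f in features]

# Made-up synthetic records, for illustration only.
rows = [
    {"age": 4, "gene": "ABC1"},
    {"age": 6, "gene": None},
    {"age": 8, "gene": "XYZ2"},
]
profiles = profile_dataset(rows)
for profile in profiles:
    print(profile)
```

Only aggregates leave the function, which matches the spirit of a catalogue that describes a silo without exposing its individual records.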
53/56
BUILDING THE GLOBAL CATALOGUE
Combine individual catalogues of each silo:
Union of all the individual catalogues (one per hospital)
§Easy thanks to our general catalogue model
§Allows implementing visualisations of all centers at once
When hospitals update their data:
1. Recompute the underlying catalogue
2. Replace the old with the new individual catalogue
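The union and the update procedure can be sketched in a few lines. This is a simplified illustration: catalogues are modelled as plain lists of dictionaries, and the names `build_global_catalogue`, `replace_catalogue`, `hospital_a`, `hospital_b` and the entry fields are hypothetical.

```python
def build_global_catalogue(individual_catalogues):
    """Union of the individual catalogues, tagging each entry with its station."""
    return [
        {**entry, "station": station}
        for station, entries in sorted(individual_catalogues.items())
        for entry in entries
    ]

def replace_catalogue(individual_catalogues, station, recomputed):
    """Swap in one hospital's recomputed catalogue, leaving the others untouched."""
    updated = dict(individual_catalogues)
    updated[station] = recomputed
    return updated

# One individual catalogue per hospital (made-up entries).
individual = {
    "hospital_a": [{"dataset": "visits.csv", "feature": "age"}],
    "hospital_b": [{"dataset": "labs.csv", "feature": "gene"}],
}
global_catalogue = build_global_catalogue(individual)

# hospital_a updates its data: recompute its catalogue and replace the old one.
individual = replace_catalogue(
    individual,
    "hospital_a",
    [
        {"dataset": "visits.csv", "feature": "age"},
        {"dataset": "visits.csv", "feature": "sex"},
    ],
)
refreshed_catalogue = build_global_catalogue(individual)
print(len(global_catalogue), len(refreshed_catalogue))
```

Because every individual catalogue follows the same model, the global one is a plain concatenation, and an update only touches the entries of the hospital concerned.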
54/56
I-ETL AND THE CATALOGUE AT WORK
3 use-cases on genetic rare diseases:
1. Understand paediatric intellectual disability
2. Correlate eye dystrophies and gene mutations
3. Understand autistic patients' self harm
3 databases built with synthetic data with I-ETL
The catalogue interface is up and helps experts design FL algorithms
Trains are currently being developed for UC1
55/56
PHD AND POST-DOC LESSONS LEARNT
Existing works: lack of generality, approach per format or use-case
Abstracting current approaches/models is promising:
§Reuse, use case-agnostic pipelines, various settings
Balance generalization vs. domain knowledge
Hard to collect and translate experts' knowledge / wishes
CONCLUSION
56/56
IF YOU ARE INTERESTED
Abstra: team.inria.fr/cedar/projects/abstra
PathWays: team.inria.fr/cedar/projects/pathways
→ ConnectionStudio: connectionstudio.inria.fr
Better: https://www.better-health-project.eu/