A reproducible and scalable pipeline for processing administrative health claims data
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA
US-RSE’2025 Research Software Engineering Conference
Mahima Kaur, Shreya Nalluri, James C. Kitch, Tinashe M. Tapera, Jonathan Gilmour, Michelle Audirac, Danielle Braun
Motivation
Briefly: 70+ projects targeting climate adaptation solutions, 200+ publications, 8,000+ dataset downloads, 30,000+ software downloads
20+ Principal investigators
20+ Post-Postdoctoral fellows
20+ PhD & Master’s students
10+ Research scientists & collaborators
5+ Visiting students
5+ Staff members
Team overview
• Maintain and integrate large climate and health data inventories.
• Build end-to-end, open-source pipelines for scalable, reproducible analytics.
• Deliver containerized workflows for CMS and other large datasets.
• Share FAIR, analysis-ready datasets via Harvard Dataverse and open code on GitHub.
• Develop tools for machine learning and causal inference research.
• Drive impactful, policy-relevant science in climate and health.
• Access: purchased annually and delivered in large batches
• Scale: covers >60M beneficiaries in total
• Breadth: contains enrollment, demographics, utilization, prescriptions and admissions data.
• Research impact:
  • Used extensively in epidemiology, health economics, and outcomes research.
  • Supports policy evaluation (payment reforms, coverage expansions).
https://data.cms.gov/infographic/medicare-beneficiaries-at-a-glance
Background
Example dataset: CMS Medicare data
• Inconsistent data quality across batches
• Privacy and confidentiality limitations
• Schema drift & structural inconsistencies
• Lack of interoperability with FAIR standards
• Scale & performance bottlenecks
• Delayed access to analysis-ready data
Challenges for the research community using CMS data
Challenges
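Of the challenges listed, schema drift is the easiest to illustrate in a few lines. The sketch below is only an illustration, not the pipeline's actual code: the batch years, column names, types, and alias table are all hypothetical and do not reflect real CMS file layouts. It normalizes column names across two yearly batches and reports columns that were added, dropped, or changed type.

```python
from typing import Dict, List

# Hypothetical column layouts for two yearly batches (NOT real CMS layouts).
BATCH_SCHEMAS: Dict[str, Dict[str, str]] = {
    "2015": {"bene_id": "str", "zip": "str", "age": "int"},
    "2016": {"BENE_ID": "str", "zip_code": "str", "age": "str"},
}

# Hypothetical alias table mapping batch-specific names to one canonical name.
ALIASES: Dict[str, str] = {"zip_code": "zip"}


def canonical_schema(schema: Dict[str, str]) -> Dict[str, str]:
    """Lower-case column names and map known aliases to canonical names."""
    return {
        ALIASES.get(name.lower(), name.lower()): dtype
        for name, dtype in schema.items()
    }


def schema_drift(reference: Dict[str, str], other: Dict[str, str]) -> Dict[str, List[str]]:
    """Report columns that were added, dropped, or changed type."""
    ref, oth = canonical_schema(reference), canonical_schema(other)
    shared = set(ref) & set(oth)
    return {
        "added": sorted(set(oth) - set(ref)),
        "dropped": sorted(set(ref) - set(oth)),
        "retyped": sorted(c for c in shared if ref[c] != oth[c]),
    }


report = schema_drift(BATCH_SCHEMAS["2015"], BATCH_SCHEMAS["2016"])
print(report)
```

In this toy case the renamed columns are reconciled by the alias table, so only the type change on `age` is flagged. A real harmonization step would likely need a much larger, curated alias table.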
Solution
Gap: Security restrictions prevent sharing of pre-processed datasets → hinders reuse and collaboration
Our solution: Online Analytical Processing (OLAP) server-less health data pipeline; a modular, self-service and containerized data pipeline → producing analysis-ready datasets without redundant engineering
Generalizable data pipeline
Modular data pipeline
• Independent, loosely-coupled stages; easier debugging & faster iteration
• Parallelizable modules; more efficient, cost-effective execution
• Built-in validation & provenance; traceable, reliable outputs
• Data products are reusable & self-contained from the start
• Consumers can interact at different levels of granularity
Building blocks
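To make the idea of loosely-coupled stages with built-in validation and provenance concrete, here is a minimal sketch using only the Python standard library. It is an assumption-laden illustration, not the pipeline's implementation: the `Stage`, `DataProduct`, `fingerprint`, and `run_pipeline` names, the two example stages, and the toy records are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


@dataclass
class Stage:
    """One independent pipeline step with its own validation check."""
    name: str
    transform: Callable[[Rows], Rows]
    validate: Callable[[Rows], bool]


@dataclass
class DataProduct:
    """Rows plus a provenance trail describing how they were produced."""
    rows: Rows
    provenance: List[Dict[str, str]] = field(default_factory=list)


def fingerprint(rows: Rows) -> str:
    """Short, stable content hash so each stage output can be traced."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_pipeline(rows: Rows, stages: List[Stage]) -> DataProduct:
    """Run stages in order, validating and recording each output."""
    product = DataProduct(rows=rows)
    for stage in stages:
        out = stage.transform(product.rows)
        if not stage.validate(out):
            raise ValueError(f"validation failed in stage {stage.name!r}")
        product.rows = out
        product.provenance.append({"stage": stage.name, "sha256_12": fingerprint(out)})
    return product


# Hypothetical stages operating on toy records (not real claims data).
drop_missing = Stage(
    name="drop_missing_ids",
    transform=lambda rows: [r for r in rows if r.get("bene_id")],
    validate=lambda rows: all(r.get("bene_id") for r in rows),
)
add_age_group = Stage(
    name="add_age_group",
    transform=lambda rows: [
        {**r, "age_group": "65-74" if r["age"] < 75 else "75+"} for r in rows
    ],
    validate=lambda rows: all("age_group" in r for r in rows),
)

raw = [
    {"bene_id": "A1", "age": 70},
    {"bene_id": None, "age": 81},
    {"bene_id": "B2", "age": 82},
]
product = run_pipeline(raw, [drop_missing, add_age_group])
print(product.rows)
print(product.provenance)
```

Because each stage only sees rows in and rows out, a stage can be tested, replaced, or rerun on its own, and the provenance list gives consumers a record of which steps produced a given data product.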
Parquet warehouse + DuckDB engine
Why Parquet?
• Columnar format with high compression
• Embedded schema
• Scales to very large historical datasets
• Supported across Python, R, Spark, SQL engines
Why DuckDB?
• Parallel execution, blazing speed
• Columnar + vectorized query execution reduces CPU cycles
• Optimized for OLAP queries (aggregations, joins, scans)
• No external server setup needed
• Handles larger-than-memory datasets directly from Parquet
• Runs anywhere (HPC, cloud, local machines)
https://blog.dataengineerthings.org/i-spent-8-hours-learning-parquet-heres-what-i-discovered-97add13 b28
Open science & FAIR data infrastructure
Execution & orchestration
Data processing & storage
Reproducibility & portability
Open-Source Tech Stack
Key takeaways
Serverless, containerized pipeline designed to standardize and deliver AI/ML-ready datasets at unprecedented speed.
• Accelerates research across key domains, including environmental health, hospitalizations, causal inference, and spatial epidemiology.
• Streamlines data workflows for large-scale epidemiology teams, enabling seamless collaboration across research labs.
• Modular architecture eliminates duplication, reduces computational waste, and ensures optimal resource utilization.
• Fully aligned with FAIR principles and Open Science best practices, supporting transparency, reproducibility, and data reusability.
AI/ML-Ready Data Pipeline for Scalable Scientific Research
Thanks for listening.
This work used the computing resources of Harvard University Research Computing and Data Services (HURC) and Harvard FAS Research Computing (FASRC).
Funding Sources:
• Air Pollution, Heat, Cold, and Health: Disparities in the Rural South R01MD016054, supplement R01MD016054 S2
• National Cohort Studies of Alzheimer's Disease, Related Dementias and Air Pollution R01AG066793, supplement #1 R01AG066793 S1, supplement #2 R01AG066793 S2
• Characterizing the link between multiple environmental exposures and Parkinson's disease exacerbation R01ES034373
• The confluence of extreme heat/cold on the health and longevity of an Aging Population with Alzheimer’s (and related Dementia) R01AG074372
National Studies on Air Pollution and Health (NSAPH): https://hsph.harvard.edu/research/air-pollution-health/
NSAPH Harvard Dataverse: https://dataverse.harvard.edu/dataverse/nsaph
NSAPH GitHub: https://github.com/NSAPH-Data-Processing