Utz-Uwe Haus, Head of HPE EMEA Research Lab
2025-09-02, REX-IO Workshop @ CLUSTER25, Edinburgh
Data centric workflows: The Maestro Middleware in the Destination Earth Twin Engine
Agenda
01 Maestro background and fundamentals
02 Destination Earth Climate Twins
03 Emergency Checkpointing
04 Maestro 0.5 feature update
05 Fault tolerance design (WIP)
06 Raw performance
Maestro
https://gitlab.com/maestro-data/maestro-core
2020 motivating use case: Operational Weather Prediction Workflow
[Diagram: data acquisition; initial construction of the atmosphere; high-resolution forecast; ensemble forecast; products generation (57 millions/day); Fields DB; ~150 TB/day; R/W archive; current data movement; time critical: 1h]
Today's bottleneck
• Data movement between forecast stages and product generation
• Archiving via I/O aggregator nodes into PFS
• Each product generation job is reading from PFS
Vision
• Speed up data-movement to the Pgen step (the 42 TiB)
• Exploit multiple storage technologies
• More flexible dependencies
Application coupling
Traditional: Central data repository (PFS, Database) or tightly integrated coupling framework (MPMD or split Comm)
Maestro-enabled: Data object 'Marketplace', peer-to-peer data transfer, cross-process
Overview/Architecture: Maestro in a nutshell
[Architecture diagram labels: Applications 1 to 3; Core Data Object API: declare, offer/withdraw or require/demand, dispose; CDO; offer, require+demand, give; Maestro Data Management with CDO Pool, Scope Object, and Maestro Data Transformation; Unified Memory-storage API: mamba library; Maestro System Model (Sys, Cost); Resources: DRAM, HBM, NVRAM, SSD, PFS, GPU, FPGA]
CDO (Core Data Object)
It is at the heart of Maestro's design and is used to encapsulate data and metadata. Supports dependencies.
OFFER+WITHDRAW
Applications OFFER CDOs to the management pool. Maestro manages the data until WITHDRAW occurs.
REQUIRE+DEMAND
When an application REQUIREs a CDO, Maestro makes the data available. A DEMAND hands over the resources containing the data and relinquishes all control of it.
SCOPE OBJECT
Captures information about scope, size, access relations and schedules of the data to enable efficient movement and/or transformation.
MAESTRO SYSTEM MODEL
Computes the cost of moving, transforming or copying data at the CDO level.
SYS
Interface to every memory level, enabling core functionality for that memory via the mamba library.
Producer side
cdo = mstro_cdo_declare("name")
  - same name = same object
mstro_cdo_attribute_add(cdo, key, val)
  - Important: size, layout, (distribution), data reference
  - Optional: user-defined attributes
mstro_cdo_offer(cdo)
  - At this point all other workflow participants can access cdo
mstro_cdo_withdraw(cdo)
  - may block, async variant available
mstro_cdo_dispose(cdo)

Consumer side
cdo = mstro_cdo_declare("name")
  - same name = same object
mstro_cdo_attribute_add(cdo, key, val)
  - Important: size, layout, (distribution), data reference
  - Optional: user-defined attributes
mstro_cdo_require(cdo)
  - At this point a reference to a suitable source of the CDO will be established
mstro_cdo_demand(cdo)
  - may block, async variant available
mstro_cdo_dispose(cdo)
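The producer and consumer call sequences above can be mimicked with a small in-memory toy model. The sketch below is not the Maestro API: `ToyPool` and its methods are hypothetical names that only mirror the described semantics (same name means same object, and a DEMAND may block until a matching OFFER exists). It collapses REQUIRE and DEMAND into one blocking call and ignores attributes and transport.

```python
import threading


class ToyPool:
    """Hypothetical in-memory stand-in for a CDO pool (not the Maestro API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._offered = {}  # CDO name -> data currently offered

    def offer(self, name, data):
        # Producer side: make the named CDO visible to other participants.
        with self._cond:
            self._offered[name] = data
            self._cond.notify_all()

    def withdraw(self, name):
        # Producer side: take the CDO back out of the pool.
        with self._cond:
            return self._offered.pop(name, None)

    def demand(self, name, timeout=None):
        # Consumer side: block until the named CDO has been offered.
        with self._cond:
            ok = self._cond.wait_for(lambda: name in self._offered, timeout)
            if not ok:
                raise TimeoutError(f"CDO {name!r} was not offered in time")
            return self._offered[name]


pool = ToyPool()
result = []

consumer = threading.Thread(
    target=lambda: result.append(pool.demand("field", timeout=5))
)
consumer.start()                  # the consumer may have to wait for the offer
pool.offer("field", [1.0, 2.0])   # producer offers under the same name
consumer.join()
pool.withdraw("field")
print(result)
```

The condition variable stands in for the pool manager's matching of offers to demands; in Maestro the data itself moves peer-to-peer, which this toy does not model.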
Low intrusiveness
• Batch up CDOs in a CDO Group (for batched OFFERs)
Alternatives
• Subscribe to pool events (like offer, require, withdraw) and act on them
• Create CDO Group based on attributes (think: SQL SELECT) and iterate on them
Alternatives
Data- and Memory-aware workflows with maestro
Applications coupling bypassing filesystem intermediary.
Pool events allow the implementation of useful workflow components.
No programming paradigm or memory management layer enforced, but utilizes and can take advantage of Mamba memory management library (https://gitlab.com/cerl/mamba)
Transport: Peer-to-Peer
The Pool Manager is just the messenger
Climate DT
Workflow Data Notification Integration (WIP)
https://doi.org/10.1016/j.jemets.2025.100015
Emergency Checkpointing Application
Work in progress
Emergency Checkpointing Applications
• KAUST ACC project
• Quickly move data from application to safety when receiving a SIGTERM
• Avoid losing progress/important data on abnormal termination of an application
• Data needs to be pre-registered for backup
• Leverage Maestro distributed CDOs to backup/restore distributed data (no serialization)
• Maestro librarian component archives data to storage and stages it on request.
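The pre-registration idea in the bullets above can be illustrated with a minimal, single-process sketch. Everything here is an assumption for illustration: `register_backup`, `restore`, `install_handler`, and the JSON files in a temporary directory are hypothetical stand-ins, and unlike the Maestro-based approach (which moves distributed CDOs without serialization) this toy serializes to local files.

```python
import json
import os
import signal
import tempfile

# Hypothetical registry of pre-registered data: name -> callable returning current state.
_registered = {}
BACKUP_DIR = tempfile.mkdtemp(prefix="emergency_backup_")


def register_backup(name, getter):
    """Pre-register data so the signal handler knows what to save."""
    _registered[name] = getter


def _on_sigterm(signum, frame):
    # Write every registered item to its own file; keep the handler short.
    for name, getter in _registered.items():
        with open(os.path.join(BACKUP_DIR, name + ".json"), "w") as fh:
            json.dump(getter(), fh)


def restore(name):
    """Read back a previously saved item."""
    with open(os.path.join(BACKUP_DIR, name + ".json")) as fh:
        return json.load(fh)


def install_handler():
    # Call once from the main thread of the real application.
    signal.signal(signal.SIGTERM, _on_sigterm)


state = {"step": 42, "values": [0.5, 1.5]}
register_backup("solver_state", lambda: state)
_on_sigterm(signal.SIGTERM, None)  # simulate SIGTERM delivery without killing the process
print(restore("solver_state"))
```

The handler is invoked directly here so the example can run to completion; a real application would call `install_handler()` and let the batch system deliver the signal.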
[Diagram labels: Maestro Pool Manager; Wrapper API; M × N distribution; Register_backup(); Restore(); Maestro librarian; Worker job; events/commands; SIGTERM; Local memory; Storage]
A. Esposito, C. Haine and A. Mohammed, "Emergency Backup for Scientific Applications," 2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck), Dallas, TX, USA, 2022, pp. 1-8, doi: 10.1109/SuperCheck56652.2022.00008.
Benchmarks
[Figure: Bandwidth (GB/s) versus Total data size (GB); series MEB, Model, OSU; panels Slingshot 10 and Slingshot 11]
Model: $\mathrm{bandwidth}(s) = \dfrac{s\,b}{\lambda b + s}$
Slingshot 10: s = data size, b = 12.5 GB/s (nominal bandwidth), λ = 45 µs (latency)
Slingshot 11: s = data size, b = 25 GB/s (nominal bandwidth), λ = 7 µs (latency incurred by 7 msgs)
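The model above is simple enough to evaluate directly. The following sketch plugs in the two parameter sets quoted on the slide; the function name, the dictionary names, and the chosen message sizes are illustrative assumptions, not part of the benchmark.

```python
def model_bandwidth(size_bytes, nominal_bw, latency_s):
    """Effective bandwidth s*b / (lambda*b + s), in the units of nominal_bw."""
    return size_bytes * nominal_bw / (latency_s * nominal_bw + size_bytes)


GB = 1e9
slingshot10 = dict(nominal_bw=12.5 * GB, latency_s=45e-6)
slingshot11 = dict(nominal_bw=25.0 * GB, latency_s=7e-6)

for size in (1e6, 1e7, 1e8, 2e8):  # data sizes in bytes (illustrative)
    bw10 = model_bandwidth(size, **slingshot10) / GB
    bw11 = model_bandwidth(size, **slingshot11) / GB
    print(f"{size / GB:6.3f} GB  SS10 {bw10:6.2f} GB/s  SS11 {bw11:6.2f} GB/s")
```

The expression is the data size divided by a transfer time of λ + s/b, so the modelled bandwidth reaches half of the nominal value at s = λb and approaches b for large transfers.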
Feature updates
Status overview
Maestro 0.4 (Sep 2023)
Major changes:
✓ Revamp maestro core threading model
✓ Support maestro core thread pinning for operations, transport, and fabric threads
✓ Memory and bug fixes
✓ Read the Docs documentation
Maestro 0.5-27-gcbb6000e (August 2025)
Major changes:
✓ Python interface to maestro-core
✓ Retire CentOS in CI and use Rocky instead
✓ OFI threads build their own private endpoints, i.e. less locking between threads
✓ OFI and operation threads are NUMA-aware
✓ Update to mamba 0.2.1
New Features
As of 0.5-27
720 commits, +41,810/-7,389, 14 new tests and examples since 0.3 (End of Horizon Europe project)
New features:
✓ A visualization tool for maestro logs
✓ Support for transport of large CDOs with fragmentation
✓ CXI/Slingshot 11 support
✓ Inline transport for small sized CDOs
✓ New librarian component as an example
✓ Support for OFI multi-recv
✓ OpenFAM transport method
✓ GPU memory support
Python interface
• SWIG Interface (maestro-py.i)
  • Type mappings: Convert between Python and C data types
  • Exception handling: Transform C status codes into Python exceptions
  • Memory management: Handle allocation/deallocation across language boundaries
  • Object wrapping: Create Python objects for opaque C handles
• Python Module (maestro_core)
  • Direct access to all public Maestro C API functions
  • Pythonic error handling through exceptions
  • Automatic memory management
  • Type-safe attribute handling
import maestro_core as M
import numpy as np
import _mamba as mamba

def data_producer(pm_info):
    M.mstro_init("numpy_workflow", "producer", 0)
    M.mstro_pm_attach(pm_info)
    # Create a 2D numpy array
    data_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7],
                           [8, 9, 10, 11], [12, 13, 14, 15]],
                          dtype='double')
    # Wrap numpy array in a Mamba array
    mamba_array = mamba.new_array(data_array)
    mamba_array.describe()
    # Create CDO and attach the Mamba array
    cdo = M.mstro_cdo_declare("scientific_data", None)
    M.mstro_cdo_attribute_set(cdo,
                              M.MSTRO_ATTR_CORE_CDO_MAMBA_ARRAY,
                              mamba_array)
    # Make data available
    M.mstro_cdo_seal(cdo)
    M.mstro_cdo_offer(cdo)
    print("Producer: Offered CDO with numpy array data")
    M.mstro_cdo_withdraw(cdo)
    M.mstro_cdo_dispose(cdo)
    M.mstro_finalize()
Multithreading model
• Thread Teams:
  • Generic infrastructure for managing groups of worker threads
  • Each thread has its own FIFO queue of work items
• OFI Thread Team: Specialized thread team for OpenFabrics Interface operations
• Pool Operations Thread Team: Specialized thread team for maestro pool operations
• Pool Manager (PM): Server-side component managing resource pools
• Pool Client (PC): Client-side component for pool operations
• NUMA Awareness: Round-robin work distribution with NUMA-aware scheduling
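The thread-team idea (one FIFO queue per worker, work handed out round-robin) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not maestro-core's implementation: `ThreadTeam`, `submit`, and `shutdown` are hypothetical names, and NUMA placement and thread pinning are not modelled.

```python
import queue
import threading


class ThreadTeam:
    """Toy thread team: one FIFO queue per worker, round-robin submission."""

    def __init__(self, num_threads):
        self._queues = [queue.Queue() for _ in range(num_threads)]
        self._next = 0
        self._threads = [
            threading.Thread(target=self._run, args=(q,)) for q in self._queues
        ]
        for t in self._threads:
            t.start()

    @staticmethod
    def _run(q):
        while True:
            item = q.get()
            if item is None:  # sentinel: shut this worker down
                break
            func, args = item
            func(*args)

    def submit(self, func, *args):
        # Round-robin over the per-thread queues (single submitter assumed).
        idx = self._next
        self._queues[idx].put((func, args))
        self._next = (idx + 1) % len(self._queues)
        return idx

    def shutdown(self):
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()


results = []
lock = threading.Lock()


def record(op):
    with lock:
        results.append(op)


team = ThreadTeam(2)
placement = [team.submit(record, op) for op in ("join", "declare", "offer", "leave")]
team.shutdown()
print(placement, sorted(results))
```

Because each worker only drains its own queue, items submitted to the same worker keep their FIFO order, while ordering across workers is not guaranteed.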
[Diagram labels: Pool Manager (PM): OFI thread, Pool OP thread, Transport thread; Client (PC): OFI thread, Pool OP thread, App thread]
Synthetic Performance benchmarks
Handling CDOs (declare/offer)
[Plots: single processing thread; two processing threads]
A: Number of attributes
S: Size of attributes in bytes
#nodes: Clients talking to the PM
Almost 120k CDO-ops/s with 2 threads for a single Maestro pool manager.
Synthetic Performance benchmarks
Transport Bandwidth
Aggregated bandwidth: 300 GB/s at 110k msg/s processed by a single thread of the pool manager.
Wire speed up to 8 consumer nodes.
Model bandwidth based on #msg/s on the PM: (3 msg/CDO) × CDO_size
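One way to read the message-rate model is that each CDO transfer costs the pool manager 3 messages, so the achievable aggregate bandwidth is the PM's message rate divided by 3, times the CDO size. The sketch below encodes that reading; it is an interpretation of the slide, and the function names and the 8 MiB example size are assumptions rather than reported benchmark settings.

```python
def pm_model_bandwidth(msgs_per_s, cdo_size_bytes, msgs_per_cdo=3):
    """Aggregate bytes/s if every CDO costs `msgs_per_cdo` PM messages."""
    cdos_per_s = msgs_per_s / msgs_per_cdo
    return cdos_per_s * cdo_size_bytes


def cdo_size_for(bandwidth_bytes_per_s, msgs_per_s, msgs_per_cdo=3):
    """CDO size (bytes) needed to reach a target bandwidth at a given message rate."""
    return bandwidth_bytes_per_s * msgs_per_cdo / msgs_per_s


# Bandwidth in GB/s for a hypothetical 8 MiB CDO at 110k msg/s.
print(pm_model_bandwidth(110_000, 8 * 2**20) / 1e9)
# CDO size in MB that would correspond to 300 GB/s at 110k msg/s.
print(cdo_size_for(300e9, 110_000) / 1e6)
```

Under this reading, 300 GB/s at 110k msg/s corresponds to CDOs of roughly 8 MB each, because the pool manager only handles the small control messages while the payload travels peer-to-peer.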
© 2025 Hewlett Packard Enterprise Development LP
Thank You
https://gitlab.com/maestro-data/maestro-core
backup
Multithreading model: OFI Threads
• Specialized thread teams for OpenFabrics Interface operations
• Each thread manages its own set of OFI endpoints
• Operation queues for network operations (send, receive, RDMA)
• Memory pools for efficient buffer management
[Diagram labels: OFI thread (ID/locality/handle); discover and build EPs; OP queue processing; CQ processing; OP queue (fi_read, fi_send, fi_mr_reg, …); mem pool (msg envelope, msg context, send buffers, recv buffers); OP maker; push Pool OPs]
Multithreading model: Pool OP Threads
• Handle server/client-side pool operations
• Operation Queues: Per-thread FIFO queues of work items
• Pool Operation Engine:
  • State machine for multi-step operations
  • Steps could block/resume, wait for state change/event, or OFI op completion
  • May skip steps when needed
• Event Domain: Asynchronous completion handling
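The block/resume behaviour of a multi-step operation engine can be sketched with Python generators, where each `yield` marks a point at which a step would block. This is a conceptual toy, not maestro-core code: the step names are borrowed from the diagram labels, but their order, the `acks_needed` parameter, and the functions `join_operation` and `run_engine` are assumptions made for illustration.

```python
from collections import deque


def join_operation(log, acks_needed=2):
    """Hypothetical multi-step 'join' operation written as a generator.

    Each `yield` marks a point where the step would block; the engine
    resumes the operation later.
    """
    log.append("handle_join")
    yield "wait_ofi_completion"   # e.g. waiting for a send to complete
    log.append("send_welcome")
    for _ in range(acks_needed):  # skipped entirely when no acks are needed
        yield "wait_ack"
    log.append("handle_acks")
    log.append("notify_event")


def run_engine(operations):
    """Process a FIFO queue of operations until all are complete."""
    op_queue = deque(operations)
    resumes = 0
    while op_queue:
        op = op_queue.popleft()   # fetch
        try:
            next(op)              # process one step
        except StopIteration:
            continue              # complete: drop the operation
        op_queue.append(op)       # would block: re-queue and resume later
        resumes += 1
    return resumes


log_a, log_b = [], []
resumes = run_engine([join_operation(log_a), join_operation(log_b, acks_needed=0)])
print(log_a, resumes)
```

Re-queuing a blocked operation at the tail lets the thread make progress on other operations instead of waiting, which is the point of expressing the steps as a state machine.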
[Diagram labels: OP thread (ID/locality/handle); OP queue (join, leave, declare, …); Pool OP engine: fetch, process step (OFI OP?), next step?, would block?, complete?; handle declare stack, handle leave stack, handle join stack (notify_event, handle_acks, handle_join, send_welcome, …); push OFI OPs]
Multithreading model
[Diagram: the OFI thread team and the Pool OP thread team side by side, repeating the per-thread labels of the two preceding diagrams; operations are pushed to the same locality]