LLNL-PRES-2010660
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Optimizing I/O for an Exascale Implicit Kinetic Plasma Simulation using the Rabbit Storage System
Ian Lumsden, Hariharan Devarajan, Izzet Yildirim, Stefano Markidis, Andong Hu, Ivy Peng, Luca Pennati, Dewi Yokelson, Stephanie Brink, Olga Pearce, Tom Scogland, Bronis R. de Supinski, Gian Luca Delzanno, Anthony Kougkas, Xian-He Sun, Michela Taufer

Exascale Computing Powers Breakthrough Scientific Discovery
Running on El Capitan, iPIC3D can achieve fully kinetic, three-dimensional simulation of small-to-medium planetary magnetospheres (e.g., Mercury) with realistic parameters, which was previously computationally prohibitive
Check out our paper on arXiv for more information:

The Exascale Bottleneck: Fast Computing, Slow Data
Exascale systems speed up computation so much that I/O becomes an increasingly large bottleneck (e.g., 33% of runtime for iPIC3D with 32 nodes)
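To put the 33% figure in perspective, the sketch below applies Amdahl's law to estimate how much faster a whole run could get if only I/O were accelerated. It is an illustration rather than a result from this work: the function name and the example I/O speedups are hypothetical, and the only number taken from the slide is the 33% I/O share.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's-law estimate of end-to-end speedup when only I/O gets faster."""
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be in [0, 1]")
    if io_speedup <= 0.0:
        raise ValueError("io_speedup must be positive")
    # Compute time is unchanged; I/O time shrinks by io_speedup.
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


IO_FRACTION = 0.33  # from the slide: I/O share of iPIC3D runtime with 32 nodes

# Hypothetical I/O speedups; float("inf") models I/O that costs nothing.
for s in (2.0, 10.0, float("inf")):
    print(f"I/O sped up {s}x -> whole run sped up {overall_speedup(IO_FRACTION, s):.2f}x")
```

Under this simple model, even free I/O caps the end-to-end gain at 1 / (1 - 0.33), so the I/O share sets the ceiling on what storage tuning alone can deliver.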

Closing the I/O Gap: Rack-Level I/O Acceleration with Rabbit Storage
The Rabbit storage system is a rack-level, software-defined I/O accelerator engineered to bridge the gap between compute and storage performance on exascale systems
[Figure: storage architecture of Sierra (GPFS global storage) compared with El Capitan/Tuolumne (Lustre global storage). Labels: Compute Nodes, Rabbit Nodes, SSDs, Global Storage]

Closing the I/O Gap: Burst Buffer-Style Storage with Rabbit XFS
Rabbit XFS provides highly scalable node-local storage, similar to burst buffers on systems like Sierra and Frontier
[Figure: Compute Nodes, Rabbit Nodes (SSDs, Rabbit XFS), and Global Storage (Lustre)]

Closing the I/O Gap: Fast Shared Storage with Rabbit Lustre
Rabbit Lustre provides fast, easy-to-use shared storage supporting both collective and non-collective I/O across processes
[Figure: Compute Nodes, Rabbit Nodes (SSDs, Rabbit XFS and Rabbit Lustre), and Global Storage (Lustre)]

Closing the I/O Gap: Hardware Locality of Storage
Rabbits accelerate I/O by moving both local and shared storage close to compute nodes
[Figure: Hardware Configuration. Labels: Rabbit Node (AMD EPYC CPU, NICs, two PCIe switches, SSDs) and Compute Nodes (AMD MI300A APU, NICs), linked by PCIe]

Closing the I/O Gap: Dynamic Software Provisioning of Storage
Rabbits accelerate I/O by providing dynamic storage provisioning through a cloud-like software stack
[Figure: Hardware Configuration. Labels: Rabbit Node (AMD EPYC CPU, NICs, two PCIe switches, SSDs) and Compute Nodes (AMD MI300A APU, NICs), linked by PCIe]
Rabbit Node Software Stack (labels as listed in the figure):
• NVMe Devices
• Switchtec and NVMe Drivers
• XFS, Lustre, GFS2
• Kernel
• TOSS
• Kubernetes (Kubelet + CRI-O)
• Config Management, NVMe Control
• Containers
• Calico Overlay Network
• Lustre CSI
• Core Kubernetes Services
• User Containers
• Orchestration Services, Data Movement Services

Our Contributions
Optimizing applications for Rabbit is challenging due to diverse I/O behaviors
Our Solution:
We map iPIC3D’s I/O phases to Rabbit configurations for phase-specific optimization:
▪ Analyze iPIC3D’s I/O to identify dominant phases and quantify performance
▪ Identify optimal storage backends for iPIC3D’s I/O phases by converting the characterization of each phase into proxy configurations for IOR
▪ Reconfigure iPIC3D using the optimal storage backends to optimize iPIC3D’s I/O

Understanding the I/O Phases of iPIC3D
                           Restart Data   Field Data   Moment Data
Number of Files            128            1            1
Processes per File         1              128          128
Total I/O per File (MB)    115074         128          64
Transfer Size (MB)         498.2 ± 0.2    1.0 ± 0.5    0.5 ± 0.0
Percent I/O Time           93.564%        0.007%       0.006%
I/O Bandwidth (MB/s)       2029.4         27.4         26.4
We use Caliper and Perfetto to examine the software layers of iPIC3D and identify 3 phases of I/O writes
We use DFTracer and DFAnalyzer to characterize each phase
There are 3 I/O phases in iPIC3D:
• Restart Data Phase: large, sequential ADIOS2 writes with 1 file per process
• Field Data Phase: small, collective MPI-IO writes to 1 shared file
• Moment Data Phase: small, collective MPI-IO writes with tiny transfer sizes to 1 shared file
All runs of iPIC3D were performed on LLNL’s Tuolumne supercomputer
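As a rough illustration of how the characterization above can be read programmatically, the sketch below stores a few of the reported per-phase numbers and derives the access pattern and the dominant phase. The `Phase` class and `access_pattern` helper are hypothetical names introduced here, and the one-process-per-file rule is a simplifying assumption, not the DFAnalyzer method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    num_files: int
    procs_per_file: int
    percent_io_time: float  # share of total I/O time, in percent


# Values copied from the characterization table on the slide.
PHASES = [
    Phase("restart", num_files=128, procs_per_file=1, percent_io_time=93.564),
    Phase("field", num_files=1, procs_per_file=128, percent_io_time=0.007),
    Phase("moment", num_files=1, procs_per_file=128, percent_io_time=0.006),
]


def access_pattern(phase: Phase) -> str:
    """Simplified rule (assumption): one writer per file means file-per-process."""
    return "file-per-process" if phase.procs_per_file == 1 else "shared-file"


# The dominant phase is the one that accounts for the most I/O time.
dominant = max(PHASES, key=lambda p: p.percent_io_time)

for p in PHASES:
    print(f"{p.name}: {access_pattern(p)}, {p.percent_io_time}% of I/O time")
print(f"dominant phase: {dominant.name}")
```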

Converting Per-Phase I/O Characterization into Proxies
Restart Data: ior -b 16384m -t 512m -i 10 -F -w -m
Field Data: ior -b 16m -t 1m -i 10 -a MPIIO -c -w -m
Moment Data: ior -b 4m -t 512k -i 10 -a MPIIO -c -w -m
IOR Parameters          Restart Data   Field Data   Moment Data
Block Size (MiB)        16384          16           4
Transfer Size (MiB)     512            1            0.5
Number of Iterations    10             10           10
File-per-Process?       Yes            No           No
Collective I/O?         No             Yes          Yes
API                     POSIX          MPI-IO       MPI-IO
We study each of iPIC3D’s I/O phases independently by converting each phase’s characterization into a proxy configuration for the IOR benchmark
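The conversion from proxy parameters to an IOR invocation can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `IorProxy` class and `ior_args` function are hypothetical, the parameter values come from the slide, and only standard IOR options are used (`-b`, `-t`, `-i`, `-a`, `-F`, `-c`, `-w`, `-m`), passed space-separated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IorProxy:
    block_size: str        # IOR -b, e.g. "16384m"
    transfer_size: str     # IOR -t, e.g. "512m"
    iterations: int        # IOR -i
    file_per_process: bool  # IOR -F
    collective: bool       # IOR -c
    api: str               # IOR -a; POSIX is IOR's default


def ior_args(p: IorProxy) -> List[str]:
    """Build an IOR write-only command line for one proxy configuration."""
    args = ["ior", "-b", p.block_size, "-t", p.transfer_size, "-i", str(p.iterations)]
    if p.api != "POSIX":
        args += ["-a", p.api]
    if p.file_per_process:
        args.append("-F")
    if p.collective:
        args.append("-c")
    args += ["-w", "-m"]  # write test, separate files per repetition
    return args


# Proxy parameters as listed on the slide.
PROXIES = {
    "restart": IorProxy("16384m", "512m", 10, True, False, "POSIX"),
    "field": IorProxy("16m", "1m", 10, False, True, "MPIIO"),
    "moment": IorProxy("4m", "512k", 10, False, True, "MPIIO"),
}

for name, proxy in PROXIES.items():
    print(f"{name}: {' '.join(ior_args(proxy))}")
```

In practice the resulting argument list would be handed to an MPI launcher with the test file placed on the storage backend under study; those details are site-specific and omitted here.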

Identifying Optimal Storage Backends for I/O Phases using IOR
[Figure: IOR results, with panels for Restart Data, Field Data, and Moment Data]
Based on our IOR results, we observe:
• Rabbit XFS is best suited for the file-per-process restart data.
• Rabbit Lustre delivers optimal performance for small MPI-IO collective writes used in field data.
• Rabbit Lustre performs best for MPI-IO collective writes with tiny transfer sizes used in moment data.
All runs of iPIC3D were performed on LLNL’s Tuolumne supercomputer

Reconfiguring iPIC3D Using Optimal Storage Backends from IOR
iPIC3D Phase    Optimal Backend (from IOR)
Restart Data    Rabbit XFS
Field Data      Rabbit Lustre
Moment Data     Rabbit Lustre
We reconfigure iPIC3D to use the optimal storage backends identified with IOR
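One way to picture this reconfiguration is a per-phase lookup that routes each output type to a directory on its chosen backend. The sketch below is an assumption-laden illustration: the environment variable names `RABBIT_XFS_DIR` and `RABBIT_LUSTRE_DIR` and the `output_dir` helper are hypothetical, and neither reflects iPIC3D's actual configuration mechanism or the Rabbit software's interface.

```python
import os
from pathlib import Path
from typing import Mapping

# Phase-to-backend mapping as reported on the slide.
PHASE_BACKEND = {
    "restart": "rabbit-xfs",
    "field": "rabbit-lustre",
    "moment": "rabbit-lustre",
}

# Hypothetical environment variables holding each backend's mount point.
BACKEND_ENV = {
    "rabbit-xfs": "RABBIT_XFS_DIR",
    "rabbit-lustre": "RABBIT_LUSTRE_DIR",
}


def output_dir(phase: str, env: Mapping[str, str] = os.environ, fallback: str = ".") -> Path:
    """Return the directory a phase should write to, based on its backend."""
    backend = PHASE_BACKEND[phase]
    root = env.get(BACKEND_ENV[backend], fallback)
    return Path(root) / phase


# Example with made-up mount points.
example_env = {"RABBIT_XFS_DIR": "/mnt/rabbit-xfs", "RABBIT_LUSTRE_DIR": "/mnt/rabbit-lustre"}
for phase in PHASE_BACKEND:
    print(f"{phase} -> {output_dir(phase, example_env)}")
```

Keeping the mapping in one table makes it easy to revisit if new IOR measurements point to a different backend for a phase.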