Comprehensive Performance Analysis of Portals4 Communication Primitives on BXI Hardware

Author: Bartelheimer, Niklas; Neuwirth, Sarah M.
Publisher: Zenodo
DOI: 10.5281/zenodo.17290213
Source: https://zenodo.org/records/17290213/files/2025-MASCOTS-Bartelheimer-final.pdf
Comprehensive Performance Analysis of Portals4
Communication Primitives on BXI Hardware
Niklas J. Bartelheimer
Johannes Gutenberg University Mainz, Germany
Sarah M. Neuwirth
Johannes Gutenberg University Mainz, Germany
Abstract: This paper presents a comprehensive performance
analysis of the BullSequana eXascale Interconnect (BXI) using
the Portals4 programming model. The main contributions
include: (1) the design and implementation of PtlBench,
a Portals4 microbenchmark suite that evaluates low-level
features such as bandwidth, latency, cache effects, and triggered
operations; (2) a detailed comparison of Portals4-compatible MPI
implementations (OpenMPI, ParaStationMPI) and our custom
Portals4 device for the PGAS library GPI-2, covering
point-to-point, one-sided, and collective operations; and (3)
application-level analysis using the Himeno and SSCA1 benchmarks
to assess the impact on different communication patterns. These
results provide valuable information on BXI’s capabilities and
limitations for real-world HPC workloads and communication models.
Index Terms: MPI, GASPI, PGAS, BXI, Portals4, Performance
Study, Benchmarking, Performance Analysis
I. INTRODUCTION
High-performance computing (HPC) systems rely on
efficient interconnects to enable fast, scalable communication
across thousands of nodes. As workloads become increasingly
complex, from scientific simulations to AI-driven applications,
network performance becomes a critical bottleneck in
achieving exascale computing capabilities. Interconnect
technologies must provide high bandwidth, low latency, and
efficient synchronization while minimizing CPU overhead to
optimize computing resources.
One emerging solution is the BullSequana eXascale
Interconnect (BXI) [1], designed to meet the requirements of recent
European supercomputing initiatives such as the European
Pilot for Exascale (EUPEX) [2]. BXI utilizes Portals4 [3],
an event-driven low-level communication API that supports
both the Partitioned Global Address Space (PGAS) model [4]
and the Message Passing Interface (MPI). By offloading
communication processing to the hardware, Portals4 aims to
improve scalability and efficiency for various HPC workloads.
Despite its promising design, the real-world performance
characteristics of BXI and its compatibility with
Portals4-based libraries have not yet been sufficiently explored.
In order to fully leverage BXI in future exascale systems, a deeper
performance analysis is required to evaluate the strengths and
limitations of BXI in different programming models.
Traditional high-performance interconnects, including
InfiniBand, Cray’s Slingshot and Fujitsu’s Tofu interconnect,
each have distinct advantages, but also come with trade-offs.
BXI, on the other hand, uses a connectionless,
hardware-accelerated model based on Portals4 that tries to strike
a balance between efficiency and flexibility. However, existing
studies lack comprehensive evaluations comparing Portals4’s
performance on BXI with other interconnect solutions. In addition,
little research has explored how well Portals4-based MPI
implementations (e.g., OpenMPI, ParaStationMPI) and PGAS
runtimes (e.g., GPI-2) perform on BXI hardware.
To address these gaps, this paper presents an in-depth
performance study of Portals4 on BXI hardware with three
main contributions: (1) the development of PtlBench, a
dedicated Portals4 microbenchmark suite that enables detailed
analysis of key features and different configurations; (2) a
comparison of Portals4-compatible MPI implementations
(OpenMPI and ParaStationMPI) and our custom Portals4
backend for the GPI-2 PGAS runtime, focusing on the
efficiency of point-to-point, one-sided, and collective
communication; and (3) application-level evaluations using
the Himeno and SSCA1 benchmarks, which reflect real-world
communication patterns, highlighting the practical
implications of Portals4’s design on BXI for HPC use cases.
II. BACKGROUND AND RELATED WORK
This section provides the necessary background on Portals4,
BXI, and the PGAS programming model. We also review
related benchmark suites and prior performance studies.
A. Portals4 Network API
Portals [3] is a low-level network API designed for
efficient and scalable programming. The latest version, Portals4,
supports both PGAS and MPI models. To improve scalability,
Portals4 uses a reliable, connectionless architecture that avoids
the complexity of connection-oriented networks like
InfiniBand (IB) [5] and simplifies connection setup and shutdown.
Portals4 provides a comprehensive set of communication
primitives, including one-sided put/get operations and matching
semantics for efficient two-sided communication. Libraries
such as Sandia-OpenSHMEM [6] and OpenMPI [7] leverage
Portals, and a reference implementation [8] supports both IB
networks and shared memory systems.
B. BullSequana eXascale Interconnect V2 (BXI)
Designed as a next-generation HPC interconnect, BXI
integrates hardware offloading, adaptive routing [9] and high-radix
switching to support large-scale parallel workloads. Unlike
traditional interconnects, BXI is designed on the basis of
Portals4 and enables low-latency, high-bandwidth
communication without the need for explicit connection management.
It consists of two primary hardware components: a network
interface card (NIC) optimized for PGAS and MPI operations
and a 48-port high-radix switch providing up to 100 Gbps per
port and a total bidirectional bandwidth of 9600 Gbps. This
flexibility allows BXI to scale up to 64,000 nodes, ensuring
high efficiency and Quality of Service (QoS) via Reliability,
Availability, and Serviceability (RAS) features [1].
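The per-port and aggregate bandwidth figures are related by simple arithmetic. The following Python sketch reproduces them; the helper names are hypothetical and not part of any BXI tooling, and decimal units are assumed (1 Gbit = 10^9 bit).

```python
# Illustrative arithmetic only; the helper names are hypothetical.

def switch_bandwidth_gbps(ports: int, gbps_per_port: int) -> int:
    """Aggregate bidirectional bandwidth: every port sends and receives."""
    return ports * gbps_per_port * 2


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbps / 8


print(switch_bandwidth_gbps(48, 100))   # aggregate for the 48-port switch
print(gbps_to_gbytes_per_s(100))        # per-port rate in GB/s
```

With 48 ports at 100 Gbps in each direction this gives 9600 Gbps, and a single 100 Gbps link corresponds to 12.5 GB/s.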
C. PGAS Model and GASPI
The Partitioned Global Address Space (PGAS) model [4]
offers an alternative to MPI by providing a shared-memory
abstraction for distributed nodes while allowing fine-grained
control over memory locality. This model simplifies data
access patterns and reduces synchronization costs, making it
attractive for large-scale applications. PGAS is implemented
in various forms, including extensions such as Unified Parallel
C (UPC), Co-Array Fortran (CAF), and libraries such as
OpenSHMEM [10], although there is no strict definition. The
Global Address Space Programming Interface (GASPI) [11],
[12] is a PGAS-based communication API that uses
asynchronous, one-sided communication with explicit remote reads
and writes. Unlike MPI, GASPI avoids bulk-synchronous
operations, allowing for a more efficient overlapping of
communication and computation. Its only implementation, GPI-2 [13],
supports notified communication and dynamic memory
allocation, making it an ideal candidate for Portals4-based systems.
D. Related Work
Two areas of related work are particularly relevant to this
paper: (1) benchmarks and (2) performance studies.
Benchmarks are generally categorized into microbenchmarks and
application benchmarks [14]. There are many microbenchmark
suites for parallel programming models. For example, the OSU
Micro-Benchmarks (OMB) [15] suite is widely used to
evaluate MPI performance and also includes PGAS
microbenchmarks for OpenSHMEM, UPC, and UPC++. Other notable
PGAS microbenchmark suites include the UPC Operations
Microbenchmarking Suite (UOMS) [16] and the GASPI
Benchmark Suite [17]. In addition, the OpenSHMEM Benchmark
Suite (OBS) [10] offers microbenchmarks, application kernels
and applications specifically for OpenSHMEM.
Network interconnect performance studies are essential for
understanding data transmission efficiency, which directly
impacts computing system scalability and performance. For
example, Kalia et al. [18] examine how NIC architecture
affects RDMA-based system performance. De Sensi et al. [19]
conduct an experimental analysis of the Slingshot interconnect
to guide researchers and system administrators. In addition, Li
et al. [20] present an overview of modern interconnects in data
centers and HPC clusters, along with representative
benchmarks. These studies form the basis of our research. However,
Portals4-based interconnects, particularly BXI, remain
underexplored, with limited studies assessing their impact on MPI
and PGAS-based workloads. This paper aims to fill this gap.
[Figure 1: diagram of an Initiator (MD, EQ, CT) and a Target with a
Non-Matching NI (LEs) and a Matching NI (MEs), each with EQ and CT,
connected through the Network. Steps: (1) PtlPut, (2) Data,
(3) Acknowledgement (optional).]
Fig. 1. Illustration of a PtlPut operation, including the data structures needed
for matching and non-matching communication.
III. PTLBENCH: PORTALS4 MICROBENCHMARK SUITE
A detailed evaluation of BXI performance requires low-level
access to Portals4 communication primitives. While the BXI
software stack includes Ptlperf, a Server-Client benchmark
offering basic metrics, it is not suitable for comprehensive
performance studies. The Sandia Portals4 reference
implementation [8] provides three microbenchmarks: NETPipe,
Message Rate, and Round-Trip Time (RTT), which respectively
measure point-to-point throughput, simulate real-world
message traffic (inspired by the Sandia MPI Microbenchmark
Suite [21]), and implement a ping-pong latency test. These
tools use the Process Management Interface (PMI) for process
orchestration and support job schedulers like Slurm. However,
these benchmarks offer limited Portals4 API coverage and
do not collect per-iteration data needed for detailed statistical
analysis. To overcome these limitations, we develop a new
microbenchmark suite tailored specifically for Portals4. This
section begins with an analysis of key Portals4 components
via the execution path of a PtlPut operation. We then describe
the architecture and functionality of our benchmark suite.
A. Exemplary Breakdown of a Portals4 Data Transfer
To illustrate data movement in Portals4, we use the PtlPut
operation as an example. Figure 1 schematically illustrates
the process. It begins with the Initiator issuing a PtlPut
request to the Target. On the Target side, the request is
processed by either the Matching Network Interface (MNI) or
Non-Matching Network Interface (NMNI). MNIs use Match
Bits (MB) in the message to locate the appropriate memory
address, while NMNIs ignore MBs entirely. Each Logical
Network Interface (LNI), representing per-process access to
the hardware, abstracts a Physical Network Interface (PNI),
with up to four LNIs per PNI.
On the Target side, memory access occurs through a Portal,
identified by an index in the Portal Table (PT). Each portal
maintains a linked list of Matching Entries (MEs) or List
Entries (LEs), which define accessible memory regions. These
regions may be registered, specifying exact buffer details,
or unregistered, covering the entire virtual address space by
setting the start field to NULL and length to PTL_SIZE_MAX.
BXI supports up to 2^48 − 1 addresses for unregistered memory.
Match bits direct MNIs to the correct ME (red diamonds
in Fig. 1), while NMNIs default to the first LE. Persistent LEs
are used in simple cases; in dynamic scenarios, the first LE
is unlinked after use to support sequential access. Each portal
maintains both a Priority List (shown) and an Overflow List
(not shown). To monitor activity, PT entries can be associated
with an Event Queue (EQ) for Full Events (FEs), or with Event
Counters (CTs) for more granular tracking.
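The target-side lookup described above can be illustrated with a toy model. The Python sketch below is not the Portals4 API: the `Entry` class and the `deliver` function are hypothetical names, and the comparison rule (bits set in `ignore_bits` are excluded from the match) reflects our reading of the Portals4 matching semantics rather than anything measured in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Toy stand-in for an LE/ME; not the Portals4 data structure."""
    name: str
    match_bits: int = 0
    ignore_bits: int = 0
    use_once: bool = False   # non-persistent entries are unlinked after use


def bits_match(entry: Entry, incoming: int) -> bool:
    # Bits set in ignore_bits are excluded from the comparison.
    return ((incoming ^ entry.match_bits) & ~entry.ignore_bits) == 0


def deliver(entries: List[Entry], incoming: int, matching: bool) -> Optional[Entry]:
    """Return the entry that receives the message, or None if nothing fits."""
    for index, entry in enumerate(entries):
        if not matching or bits_match(entry, incoming):
            if entry.use_once:
                entries.pop(index)   # unlink the entry after its single use
            return entry
    return None


# Matching NI: the match bits select the entry.
mes = [Entry("tag-7", match_bits=7), Entry("wildcard", ignore_bits=0xFF)]
print(deliver(mes, 7, matching=True).name)
print(deliver(mes, 9, matching=True).name)

# Non-matching NI: the first entry is always used; a use-once entry is
# unlinked so that the next message reaches the following entry.
les = [Entry("first", use_once=True), Entry("second")]
print(deliver(les, 123, matching=False).name)
print(deliver(les, 123, matching=False).name)
```

The model only walks a single priority list; the overflow list and event delivery are deliberately left out.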
On the Initiator side, a Memory Descriptor (MD) defines
the source buffer for put operations or the destination for get
operations, and can reference either registered or unregistered
memory. Offsets specified in Portals4 calls define the precise
region involved in communication. Once the PtlPut completes,
the Target may optionally return a PTL_EVENT_ACK, which
is recorded by the Initiator via EQ or CT [22].
B. Design and Implementation of PtlBench
Unlike traditional low-level network benchmarks that rely
on a Server-Client communication model, our Portals4
microbenchmark suite, PtlBench, uses MPI for orchestration.
MPI was selected over PMI or PMIx due to its portability, ease
of use, and rich feature set. Its compatibility with
Portals4-enabled NICs enables seamless communication between
benchmark processes, while collective operations like barriers
provide efficient synchronization. PtlBench includes five
distinct microbenchmarks, each designed to evaluate a specific
aspect of the Portals4 API. The following sections describe
their implementation, key features, and intended use cases.
1) ptl_bench: The ptl_bench benchmark measures
bandwidth and latency of PtlPut and PtlGet operations,
modeled after OMB’s one-sided communication benchmarks.
It supports both non-matching and matching NIs, and allows
evaluation of event handling via either FEs or CTs. To assess
cache effects, a custom cache saturation function is used to
simulate cold-cache conditions by populating memory with
random values. This enables controlled comparisons between
cold and hot cache scenarios. On the target side,
communication uses either a persistent LE or ME, where “persistent”
denotes reuse of the same list entry throughout the benchmark.
2) ptl_me_none_persistent: The ptl_me_none_persistent
benchmark measures the latency and bandwidth of Portals4’s
matching network interface, focusing on communication
patterns representative of MPI’s send-receive semantics. In
contrast to ptl_bench, which relies on persistent list entries
regardless of matching behavior, this benchmark emulates
more dynamic, per-message matching scenarios.
It uses two portal indices: one for the command channel
and another for data transfer. On the target side, multiple list
entries are created, each associated with a distinct memory
buffer as specified by the window_size parameter. The
initiator begins transmission after receiving a notification via
the command channel. Messages are matched to list entries
using configured match bits, and upon successful
transmission, indicated by full events such as PTL_EVENT_PUT
or PTL_EVENT_GET, the corresponding list entries are
removed. This process repeats for each iteration.
3) ptl_memory_bench: This benchmark evaluates the BXI
NIC’s virtual-to-physical address translation performance by
measuring page fault latency in both local and remote access
scenarios. It supports both one-sided and ping-pong
communication patterns, using PtlPut and PtlGet operations.
Memory is allocated using mmap to create page-sized virtual
segments, relying on deferred physical allocation. This enables
configuration of whether page faults are handled by the NIC or
the host. Placement follows a first-touch policy, where initial
access triggers physical allocation. Pages are classified as Hot
(pre-allocated) or Cold (unallocated). Messages are constrained
to 4 kB or less, with transfers targeting random offsets within
each page. The ping-pong pattern mirrors the one-sided setup,
enabling comparative analysis of translation overhead.
4) ptl_ping_pong: This benchmark measures Round-Trip
Time (RTT) using a ping-pong communication scheme. It
highlights Portals4’s Triggered Operations, in which standard
calls like PtlPut and PtlGet are deferred until a CT
reaches a defined threshold, allowing the NIC to execute
the operation autonomously, bypassing CPU involvement. The
benchmark compares RTT between standard PtlPut and its
triggered variant PtlTriggeredPut, and also quantifies the
additional setup latency introduced by the trigger mechanism.
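The deferral mechanism behind triggered operations can be sketched as a counter that releases registered actions once it reaches a threshold. The Python model below is purely illustrative: `EventCounter`, `arm`, and `increment` are hypothetical names, not Portals4 calls, and on real hardware the release is performed by the NIC rather than by host code.

```python
class EventCounter:
    """Toy counting event that releases deferred actions at a threshold."""

    def __init__(self):
        self.success = 0
        self._armed = []            # (threshold, action) pairs

    def arm(self, threshold, action):
        # Setup phase: register the deferred operation.
        self._armed.append((threshold, action))
        self._release()

    def increment(self, amount=1):
        # Called when an operation tied to this counter completes.
        self.success += amount
        self._release()

    def _release(self):
        # Run every action whose threshold has been reached, exactly once.
        ready = [a for t, a in self._armed if t <= self.success]
        self._armed = [(t, a) for t, a in self._armed if t > self.success]
        for action in ready:
            action()


log = []
ct = EventCounter()
ct.arm(2, lambda: log.append("triggered put issued"))
ct.increment()      # counter = 1, below the threshold
ct.increment()      # counter = 2, the deferred put is released
print(log)
```

The `arm` call corresponds to the setup phase whose cost the benchmark measures separately from the round trip itself.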
5) ptl_get_ni_props: As a hardware implementation of
Portals4, BXI imposes inherent limitations on available
resources. Portals4 allows customization of these limits by
providing a pointer to a ptl_ni_limits_t structure
during NI creation. This structure can be initialized with
INT_MAX, LONG_MAX, or zero before proceeding with NI
initialization. The actual resource limits imposed by the
Portals4 implementation are then retrieved from a separate
structure after initialization.
IV. COMPREHENSIVE PERFORMANCE EVALUATION
To evaluate the real-world viability of Portals4 on BXI, we
conducted a comprehensive, multi-phase performance study.
This section presents our methodology and results,
spanning low-level microbenchmarking with PtlBench,
comparative analysis of MPI, specifically OpenMPI (OMPI) and
ParaStationMPI (PSMPI), both running over Portals4, and the
PGAS runtime GPI-2, as well as application-level evaluations
using two representative HPC workloads.
A. Test System Setup
The evaluation was performed on the DEEP System [23], an
MSA [24] prototype featuring a range of computing modules.
Our research focused on the BXI Module (BM), which
comprises four nodes, each equipped with an Intel Xeon Gold 5122
CPU, 48 GB of RAM, and a 100 Gbit BXI v2 interconnect. At
the time of this writing, this was the only publicly available test
system equipped with BXI. The nodes operate on Rocky Linux
8.10 Green Obsidian. Communication over the BXI interface
utilized the pre-installed Portals4 version 2.1.9 by Eviden/Bull.
[Figure 2: six plot panels.]
(a) Bandwidth results of PtlPut and PtlGet (Message Size [B] vs.
Bandwidth [MB/s]; reference lines: 100 Gb/s limit and estimated limit).
(b) Hot cache speedup of PtlPut and PtlGet (Message Size [B] vs.
Hot Cache Speedup).
(c) Bandwidth of persistent and non-persistent MEs (Message Size [B]
vs. Bandwidth [MB/s]).
(d) Latency results of PtlPut and PtlGet for different
event tracking types (message size: 1024 bytes).
(e) Round-Trip Time of PtlPut and PtlTriggeredPut
(message sizes: 64 bytes and 1024 bytes).
(f) Latency results of PtlPut and PtlGet targeting
Hot and Cold pages (page state given as local − remote).
Fig. 2. Overview of performance results measured with PtlBench.
B. Benchmark Methodology
Our performance study follows a hierarchical, multi-stage
methodology to systematically evaluate the performance of the
BXI under the Portals4 communication model. The
methodology is structured in three phases.
1) Portals4 Microbenchmarking on BXI with PtlBench:
The first phase focuses on low-level communication
benchmarking using our PtlBench suite¹ (see Section III).
Experiments systematically varied message sizes (1 byte to
4 MB), cache states (hot vs. cold), communication types
(PtlPut vs. PtlTriggeredPut), and memory allocation
(pre-allocated vs. on-demand) to isolate key performance
characteristics of Portals4 on BXI. Each benchmark was
executed 100 times (referred to as experiments), with each
experiment consisting of 10 warm-up and 1000 timed
iterations. The per-experiment median was calculated from
iteration results, and the overall median was derived from the
100 experiment medians. For benchmarks involving multiple
message sizes, this procedure was applied independently
for each size. Bandwidth measurements were performed by
sending 64 messages per iteration within a message window.
All benchmarks allocated a single memory buffer per message
size, aligned to page boundaries using posix_memalign,
or mmap in the case of ptl_memory_bench. As the test
system featured a single CPU with a single NUMA domain,
no process binding was applied.
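The aggregation scheme described above (a median per experiment, then a median over the experiment medians) can be sketched in a few lines of Python. This is not PtlBench code: the function names are hypothetical, the latency samples are synthetic, and the bandwidth helper assumes that MB/s means 10^6 bytes per second and that one iteration times a complete 64-message window.

```python
import random
import statistics

WARMUP = 10          # untimed iterations per experiment
ITERATIONS = 1000    # timed iterations per experiment
EXPERIMENTS = 100    # repetitions of the whole benchmark
WINDOW = 64          # messages sent per bandwidth iteration


def experiment_median(sample, warmup=WARMUP, iterations=ITERATIONS):
    """Discard warm-up samples, then take the median of the timed ones."""
    for _ in range(warmup):
        sample()
    return statistics.median([sample() for _ in range(iterations)])


def overall_median(sample, experiments=EXPERIMENTS):
    """Median of the per-experiment medians."""
    return statistics.median(
        [experiment_median(sample) for _ in range(experiments)]
    )


def bandwidth_mb_per_s(message_size, seconds_per_iteration, window=WINDOW):
    """Bytes moved in one window divided by its duration, in MB/s (10**6 B)."""
    return message_size * window / seconds_per_iteration / 1e6


# Synthetic, right-skewed latencies in microseconds stand in for real timings.
rng = random.Random(0)
latency_us = overall_median(lambda: 2.5 + rng.expovariate(10.0))
print(f"overall median latency: {latency_us:.3f} us")
print(f"bandwidth: {bandwidth_mb_per_s(1_000_000, 0.064):.1f} MB/s")
```

Taking medians at both levels keeps the occasional slow iteration from dominating the reported value, which matters for right-skewed latency distributions.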
2) Comparative Evaluation of MPI and PGAS Communication
Models: The second phase evaluates the communication
efficiency of Portals4-based MPI implementations (OpenMPI
5.0.4, ParaStationMPI 5.9.2) and the GPI-2 PGAS runtime,
using our custom GASPI Benchmark Suite (GBS) [17] inspired
by the OSU Micro-Benchmarks (OMB) and fine-tuned with
the insights from Sections IV-C and IV-D. The analysis
covers point-to-point (two-sided and one-sided), one-sided RMA
synchronization, and the Allreduce collective operation, with
performance measured across message sizes from 1 byte to 4 MB
and multiple process counts to capture scalability trends. All
libraries are compiled with GCC 12.3.0, and MPI variants are
built using their respective compiler wrappers. GPI-2, which
lacks native Portals4 support, is extended with a custom
Portals4 backend¹ developed for this study [25]. Each benchmark
scenario consists of 100 experiments with at least 1,000 inner
iterations per run, enabling statistically robust comparisons.
Median results and boxplots are used to present findings.
¹ https://github.com/NeuwirthLab/PtlBench
3) Application-Level Performance Evaluation: Phase three
connects low-level benchmark results with real-world HPC
workloads by evaluating Portals4’s impact on BXI through
two representative applications: the Himeno and SSCA1
benchmarks. The Himeno benchmark, based on the Jacobi method
for solving Poisson equations, features a highly
parallelizable stencil computation with a high
communication-to-computation ratio, ideal for assessing structured
point-to-point messaging efficiency. In contrast, SSCA1 [10]
implements the Smith-Waterman sequence alignment algorithm
with dynamic programming and gap scoring, stressing
fine-grained memory access and frequent small-message
communication, issuing multiple Puts and Gets per iteration. Both
benchmarks are executed using OpenMPI, ParaStationMPI, and our
custom Portals4-enabled GPI-2 backend, under two- and four-node
configurations. Metrics such as execution time, communication
overhead, and synchronization efficiency are collected under
varying process decompositions.
C. Portals4 Microbenchmarking with PtlBench
We evaluated Portals4’s performance on BXI using a series
of experiments. Bandwidth measurements for PtlPut and
PtlGet with event counters are shown in Figure 2(a). While
BXI v2 advertises a theoretical maximum bandwidth of 12.5
GB/s (solid black line), our results aligned more closely with
the 11 GB/s saturation point predicted by Derradji et al. [1]
(dashed line). Figure 2(b) illustrates latency improvements
under hot cache conditions relative to cold cache scenarios. A
speed-up of 1.0 indicates no improvement. For small messages
(<512 bytes), PtlPut latency improves by over 30%, with
messages <64 bytes showing gains exceeding 20%. The dip at
64 bytes corresponds to the NIC’s transition from Programmed
I/O (PIO) to Direct Memory Access (DMA). Cache effects
significantly impact performance: cold cache conditions
introduce higher latency due to frequent cache misses in the
Portals4 library, especially for data structures like memory
descriptors. These effects reflect real-world scenarios, where
such structures are often evicted during compute-intensive
phases. In contrast, hot cache conditions reduce latency,
except for large messages that better tolerate cache delays.

TABLE I
LIMITS OF THE BXIV2 HARDWARE RESOURCES.
Parameter             Value     Parameter               Value  Parameter              Value
max_entries           4080      max_unexpected_headers  16319  max_mds                1024
max_eqs               960       max_cts                 1024   max_pt_index           255
max_iovecs            0         max_list_size           16582  max_triggered_ops      7167
max_msg_size          67108864  max_atomic_size         1024   max_fetch_atomic_size  64
max_waw_ordered_size  2048      max_war_ordered_size    64     max_volatile_size      64
Using the ptl_me_none_persistent benchmark, we
analyzed the impact of tag-matching semantics on
point-to-point communication. Figure 2(c) compares bandwidth for
persistent versus non-persistent matching list entries. The
persistent configuration (green curve) approaches saturation
quickly for both PtlGet and PtlPut, while the
non-persistent setup (orange curve) incurs additional overhead due
to a rendezvous-like protocol, in which the target pre-posts
64 entries (the window size) before notifying the initiator,
resulting in a noticeably flatter slope.
Portals4 supports two mechanisms for tracking network
events: Full Events (FEs) and Event Counters (CTs).
Figure 2(d) shows latency comparisons for PtlGet and
PtlPut. Boxplots reveal a right-skewed distribution, with
most latencies clustered near the lower quartile. Delays, likely
due to congestion, retransmissions, or NIC queue saturation,
affect only a small subset of messages. The median latency
difference between CTs and FEs is roughly 0.05 µs for both
operations. FEs exhibit more outliers (black dots) due to
interrupt-driven handling with PtlEQWait/PtlEQPoll, whereas
CTs rely on busy-waiting with PtlCTWait/PtlCTPoll,
which may account for reduced variance.
The next experiment evaluates the achievable Round-Trip
Time (RTT) of ping-pong communication using PtlPut
and PtlTriggeredPut. Unlike PtlPut, which executes
immediately, PtlTriggeredPut introduces a setup
phase to arm the trigger at a specified CT value, which adds
latency. Figure 2(e) shows RTT results for 64-byte
(orange) and 1024-byte (turquoise) messages, comparing
PtlTriggeredPut with and without setup overhead. For
small messages, PtlTriggeredPut achieves a lower RTT
than PtlPut, though this advantage diminishes slightly
when setup latency is included. For large messages, the setup
overhead becomes negligible, and PtlTriggeredPut
consistently outperforms PtlPut.
We also analyzed the impact of page faults managed by the
BXI NIC. Figure 2(f) depicts the latency of local and remote
page faults in PtlPut and PtlGet, distinguishing hot
(pre-allocated) and cold (unallocated) pages. For PtlPut,
remote page state has little impact unless the local page is
hot. Cold local pages increase latency due to NIC-triggered
faults, indicating that performance is primarily governed by
Portals4’s memory descriptor state. In contrast, PtlGet is
more sensitive to both local and remote page states. In the
cold-cold case, its latency nearly doubles that of PtlPut,
reflecting the need to resolve both remote portal and local
memory descriptor information.
D. BXI Limitations and Properties
The following outlines BXI’s limitations and features as
observed during our evaluation. As a hardware implementation
of Portals4, BXI imposes inherent resource constraints.
Portals4 permits querying and customizing these limits through
the ptl_ni_limits_t structure during NI creation. We
developed a test program that initializes this structure with
INT_MAX, LONG_MAX, or zero, then retrieves the actual
values after initialization. The results are shown in Table I.
Key limits include allocable resources such as EQs
and CTs. Some parameters indicate extended
functionality, such as max_iovecs, which defines support for
IOVECs (similar to Scatter-Gather Lists). Although BXI
sets this value to zero, implying no support, code
inspection revealed a partial, non-Portals4-compliant
implementation. The max_msg_size limits messages to 64 MiB,
relatively small compared to InfiniBand’s 2 GiB. Portals4
Triggered Operations (TOs), which enable deferred
communication based on counter thresholds, are partially
supported on BXI, allowing up to 7,167 outstanding operations
(max_triggered_ops). The max_volatile_size
parameter specifies the maximum size of put or atomic
operations using the PTL_MD_VOLATILE flag, enabling PIO for
reduced latency in small transfers, as discussed in Section IV.
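The query pattern described above (request very large limits, then read back what was granted) can be mimicked with a small Python model built from the Table I values. This is a sketch, not Portals4 code: `granted_limits` is a hypothetical helper, and the policy that an implementation never grants more than the hardware offers is an assumption made for illustration.

```python
INT_MAX = 2**31 - 1
MIB = 2**20

# Values reported in Table I for the BXI v2 NIC.
BXI_LIMITS = {
    "max_entries": 4080, "max_unexpected_headers": 16319, "max_mds": 1024,
    "max_eqs": 960, "max_cts": 1024, "max_pt_index": 255,
    "max_iovecs": 0, "max_list_size": 16582, "max_triggered_ops": 7167,
    "max_msg_size": 67108864, "max_atomic_size": 1024,
    "max_fetch_atomic_size": 64, "max_waw_ordered_size": 2048,
    "max_war_ordered_size": 64, "max_volatile_size": 64,
}


def granted_limits(desired, hardware):
    """Assumed policy: never grant more than the hardware offers."""
    return {
        name: min(desired.get(name, INT_MAX), limit)
        for name, limit in hardware.items()
    }


# Ask for the maximum everywhere and read back what is granted.
desired = {name: INT_MAX for name in BXI_LIMITS}
actual = granted_limits(desired, BXI_LIMITS)

print(actual["max_msg_size"] // MIB)           # message size limit in MiB
print((2 * 2**30) // actual["max_msg_size"])   # ratio of 2 GiB to that limit
```

The two printed values confirm that 67108864 bytes is exactly 64 MiB and that the 2 GiB figure quoted for InfiniBand is 32 times larger.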
These and other capabilities are grouped into three feature
categories defined by Portals4 in the NI limits structure.
PTL_TARGET_BIND_INACCESSIBLE indicates that not
all memory described by LEs, MEs, or MDs needs to be
allocated or accessible by the application. BXI supports this via a
hardware-based virtual-to-physical address translation cache.
[Figure 3: six plot panels.]
(a) MPI point-to-point bandwidth results (OMPI, PSMPI; Message
Size [B] vs. Bandwidth [MB/s]).
(b) Put/Get bandwidth results (GPI-2 read/write, OMPI get/put,
PSMPI get/put; Message Size [B] vs. Bandwidth [MB/s]).
(c) Put/Get latency results (Message Size [B] vs. Latency [us]).
(d) Allreduce latency results using one process per
node using two and four nodes (Number of Reduction Elements vs.
Latency [us]).
(e) Performance in MFLOPS of the Himeno benchmark
using two different domain decompositions.
(f) Runtime results [s] of SSCA1 for 2 and 4
processes using one process per node.
Fig. 3. OpenMPI, ParaStationMPI and GPI-2 performance results.
PTL_TOTAL_DATA_ORDERING defines the ordering
behavior of messages and enforces strict ordering for short
messages, as defined by max_waw_ordered_size and
max_war_ordered_size. BXI implements this behavior
internally, disregarding user-specified limits.
PTL_COHERENT_ATOMICS, which guarantees coherence
between Portals4 and processor atomic operations, is not
supported by the current BXI hardware.
Our source code analysis revealed that functions like
PtlBundleStart and PtlBundleEnd are implemented
as no-ops. These functions, designed to optimize memory
synchronization, particularly for high-rate, strided remote memory
access, are critical to maintaining performance in demanding
communication patterns [22].
E. MPI Point-to-Point Communication
Point-to-point communication is fundamental to many
HPC applications, making its performance a key focus in our
evaluation. We measured the bandwidth of both OMPI and
PSMPI, with results shown in the top half of Figure 3(a). The
bandwidth curves highlight the protocol switch between the
Eager and Rendezvous protocols under default settings. For
OMPI, this switch occurs at message sizes above 16,384 bytes,
while PSMPI transitions at 32,768 bytes. The Eager protocol
sends messages immediately, even if the corresponding
receive has not yet been posted. This approach favors small
messages, where the receiver is likely to be ready and can
process data promptly. In contrast, the Rendezvous protocol,
designed for large messages, uses a handshake mechanism:
the sender first notifies the receiver, which then allocates a
buffer and signals readiness before data transfer begins. This
added coordination introduces latency, leading to noticeable
performance drops for both OMPI and PSMPI at the transition
point. It should be noted that PSMPI is optimized for
MSA-aware communication. As such, its performance reflects
trade-offs aligned with the restrictions and requirements of
the Modular Supercomputing Architecture (MSA) [24].
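The protocol switch can be expressed as a simple size check. The Python sketch below uses the two default thresholds reported above; the function name, the step labels, and the treatment of the boundary (a message of exactly the threshold size is sent eagerly) are assumptions made for illustration and do not describe the internals of either MPI library.

```python
# Default switch points in bytes as reported in the text.
# Boundary handling (<=) is an assumption of this sketch.
EAGER_LIMIT = {"OMPI": 16_384, "PSMPI": 32_768}


def protocol_steps(implementation: str, message_size: int) -> list:
    """Return the message exchange needed to deliver one payload."""
    if message_size <= EAGER_LIMIT[implementation]:
        # Eager: the payload is sent right away, posted receive or not.
        return ["sender: data"]
    # Rendezvous: handshake first, payload last.
    return ["sender: request to send", "receiver: ready", "sender: data"]


for size in (1_024, 16_385, 65_536):
    for impl in ("OMPI", "PSMPI"):
        print(impl, size, len(protocol_steps(impl, size)), "step(s)")
```

The extra two steps in the rendezvous path are the coordination cost that appears as a bandwidth drop right after the switch point.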
F. GPI-2 and MPI One-Sided Communication
Since GASPI, and by extension GPI-2, follows the
PGAS communication model, we evaluate its performance
using our Portals4 device and compare it to the one-sided
communication routines provided by OMPI and PSMPI.
While the PGAS model is designed for a global address
space, MPI-based one-sided communication relies on Access
Epochs and requires explicit synchronization. For this
evaluation, we use the Active Target Communication model
with the MPI_Win_fence access pattern. This method
places a fence at the beginning of the communication phase
to mark the start of an access epoch and another at the end to
ensure that all memory operations are completed both locally
and remotely before continuing execution.
Figure 3(b) illustrates the bandwidth results for Put and Get
operations using one-sided communication. As expected,
GPI-2 consistently outperforms both MPI implementations,
benefiting from its relaxed memory model and efficient asynchronous
semantics. OMPI performs competitively, approaching GPI-2’s
bandwidth in many cases, while PSMPI falls short, failing to
achieve comparable throughput. Notably, PSMPI’s bandwidth
progression for both Put and Get operations resembles that
of point-to-point communication, including a visible protocol
switch. This behavior suggests that PSMPI internally emulates
one-sided operations using point-to-point mechanisms, which
may contribute to its reduced performance in this context.
[Figure 4: two plot panels, Flush (left) and Flush_local (right);
Message Size [B] vs. Latency [us] for OMPI get, OMPI put, PSMPI get,
and PSMPI put.]
Fig. 4. MPI one-sided passive target communication latency using flush
(left) and flush_local (right).
Figure 3(c) shows the corresponding latency results. The
results follow a similar trend as the bandwidth data, with
OMPI and GPI-2 achieving the lowest latencies. For small
message sizes, GPI-2 outperforms PSMPI by a factor of
approximately 3.5× for Get operations and 2× for Put
operations. OMPI’s latency profile reflects the absence of
PTL_MD_VOLATILE usage, which would otherwise reduce
latency for small messages. In GPI-2, the latency data indicates
a switch from PIO to DMA transfers for Put operations at
message sizes above 256 bytes. For Get operations, this
switch happens earlier, at sizes exceeding 64 bytes. These
thresholds align with the max_volatile_size NI limit,
suggesting that PIO is used only below this limit.
We also observed inconsistencies in how MPI implementations
interpret the specification regarding flush behavior
on MPI memory windows. Figure 4 shows the latency of
MPI-based Put and Get operations using MPI_Win_flush
(left) and MPI_Win_flush_local (right) for synchronization.
According to the MPI specification, MPI_Win_flush
guarantees that all RMA operations targeting a specific
rank's window are completed both locally and remotely,
while MPI_Win_flush_local only ensures local
completion. The latency results confirm this distinction.
OMPI and PSMPI perform similarly for MPI_Win_flush,
consistent with the earlier latency trends. However, for
MPI_Win_flush_local, PSMPI exhibits a substantial
improvement, whereas OMPI's performance remains unchanged.
Source code analysis revealed that OMPI's Portals4 module
enforces remote completion even for local flushes, effectively
treating both flush modes the same. To verify whether this
behavior is specific to PSMPI, we repeated the experiments
using vanilla MPICH 4.1.0 over InfiniBand. The results
matched those of PSMPI, confirming that this behavior is
inherited from the underlying MPICH implementation.
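The difference between the two flush modes, and the effect of enforcing remote completion for local flushes, can be captured in a toy cost model. This is only an illustrative sketch: the function `flush_wait`, its parameters, and the additive cost model are assumptions, and the cost values are hypothetical inputs in arbitrary units, not measurements.

```python
def flush_wait(mode, local_cost, remote_cost, local_implies_remote=False):
    """Modelled wait of one flush call, in arbitrary time units.

    mode is "flush" or "flush_local". local_implies_remote=True models a
    library that enforces remote completion even for local flushes.
    """
    if mode not in ("flush", "flush_local"):
        raise ValueError(f"unknown flush mode: {mode}")
    if mode == "flush" or local_implies_remote:
        return local_cost + remote_cost  # wait for local and remote completion
    return local_cost                    # wait for local completion only


# Hypothetical costs: 1 unit until the source buffer is reusable,
# 4 more units until the target has acknowledged the operation.
separate_modes = flush_wait("flush_local", 1, 4)
merged_modes = flush_wait("flush_local", 1, 4, local_implies_remote=True)
```

With `local_implies_remote=True` the local flush costs as much as a full flush, which matches the pattern in which latency stays unchanged between the two modes.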
G. Collective Operations
Collective operations facilitate communication and
synchronization among groups of processes in parallel
programs. The GASPI specification defines two collective
operations: Barrier and Allreduce. A Barrier ensures that all
participating processes reach a synchronization point before
continuing, while Allreduce performs a global reduction
(e.g., computing the maximum value across processes) and
distributes the result to all participants.
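The Allreduce semantics described above can be shown on plain lists. This is a single-process sketch of the result only, with no communication; the function name `allreduce` and the sample values are illustrative assumptions.

```python
from functools import reduce


def allreduce(contributions, op=max):
    """Element-wise reduction over one vector per process.

    Returns one copy of the reduced vector per participating process.
    """
    reduced = [reduce(op, column) for column in zip(*contributions)]
    return [list(reduced) for _ in contributions]


# Three processes, two reduction elements each, maximum as the operation.
results = allreduce([[1, 7], [5, 2], [3, 9]])
```

Every process ends up with the same vector, here the element-wise maximum of all contributions.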
Figure 3(d) shows the latency results for the Allreduce
operation with reduction elements ranging from 1 to 128. The
limit of 128 reduction elements was chosen on the premise
that GPI-2 in its default configuration only supports 255
reduction elements. 128 is the largest power of two that is
contained in this interval. The benchmark was conducted
using one process per node on both two-node and four-node
configurations. PSMPI achieves the best performance for
reduction elements up to two, after which latency increases but
remains relatively stable up to 128 elements, consistent across
both configurations. OMPI demonstrates the best overall
performance, maintaining near-constant latency across all
tested counts. Its Portals4 module implements the Allreduce
algorithm using triggered operations, a feature whose
advantages were highlighted in Section IV-C. GPI-2, except
for a single reduction element, shows the highest latency,
with a linear increase as the number of reduction elements
grows. This is because the Allreduce operation is not natively
implemented within the Portals4 device abstraction but is built
in the upper layer of the GPI-2 software stack. This highlights
potential opportunities for optimization in the selection and
implementation of the Allreduce algorithm within GPI-2.
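The choice of 128 as the upper bound follows from simple arithmetic on the 255-element default limit. The sketch below reproduces it; the constant name is hypothetical, and the power-of-two sweep is an assumption that the text implies but does not state explicitly.

```python
GPI2_DEFAULT_MAX_ELEMS = 255  # default limit stated in the text

# Largest power of two that does not exceed the limit.
largest_pow2 = 1 << (GPI2_DEFAULT_MAX_ELEMS.bit_length() - 1)

# Assumed sweep: all powers of two from 1 up to that bound.
element_counts = [1 << k for k in range(largest_pow2.bit_length())]
```

The next power of two, 256, would exceed the default limit by a single element.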
H. Application-level Performance Results
1) Himeno Benchmark: For our evaluation, we used the
MPI-based static memory allocation implementation, the only
publicly available C version. We modified it to support
dynamic memory allocation, added OpenMP pragmas for
multithreading, and integrated GPI-2 communication primitives.
The MPI version utilizes vector data types, but OMPI and
PSMPI could not effectively leverage these over BXI due
to the Portals4 IOVEC implementation's non-conformance
with the specification. Consequently, we replaced vector types
with a sequence of MPI_Isend and MPI_Irecv calls.
Additionally, the original benchmark is designed to run for
a fixed duration (e.g., one minute); we modified it to execute
a fixed number of iterations for reproducibility.
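Replacing a vector datatype with individual transfers amounts to enumerating its contiguous blocks. The sketch below lists those blocks for a strided layout with `count` blocks of `blocklength` elements placed `stride` elements apart, which is the layout an MPI vector type describes; the function name `vector_blocks` and the sample values are illustrative, and the actual changes made to the benchmark are not shown in the text.

```python
def vector_blocks(count, blocklength, stride):
    """Return (offset, length) in elements for each contiguous block."""
    if blocklength > stride:
        raise ValueError("blocks would overlap")
    return [(i * stride, blocklength) for i in range(count)]


# Each block would be handed to one non-blocking send or receive.
blocks = vector_blocks(count=3, blocklength=2, stride=5)

# Gathering the blocks from a flat buffer yields the strided elements.
buffer = list(range(15))
gathered = [buffer[off:off + length] for off, length in blocks]
```

The number of messages grows with `count`, so this replacement trades one datatype-based transfer for `count` smaller ones.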
Figure 3(e) shows performance in Mega Flops per Second
(MFLOPS) for a 3D domain of size 256x256x512. The
upper plot reports results for a decomposition along the
X-dimension, with two processes (one per node). The GPI-2
implementation achieves the highest performance, peaking at
around 9,950 MFLOPS, followed closely by OMPI, which
is approximately 100 MFLOPS lower based on the median.
PSMPI's performance is slightly below that of OMPI. The
lower plot shows results for a domain further partitioned along
the Y-dimension, effectively doubling the number of processes
and nodes. The ranking of the communication libraries
remains unchanged, with performance nearly doubling as
expected. This is attributed to GPI-2's efficient notified
one-sided communication. The performance gap between
OMPI and PSMPI is discussed further in Section IV-E.
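The two decompositions can be made concrete by computing the per-process subdomain size. This is a sketch under assumptions: the dimensions are taken to be listed in X, Y, Z order, the split is an even block decomposition, and halo cells are ignored; the function name `local_extent` is hypothetical.

```python
def local_extent(global_dims, proc_grid):
    """Per-process subdomain size for an even block decomposition."""
    if any(g % p for g, p in zip(global_dims, proc_grid)):
        raise ValueError("global size must divide evenly")
    return tuple(g // p for g, p in zip(global_dims, proc_grid))


domain = (256, 256, 512)                       # assumed X, Y, Z order
two_procs = local_extent(domain, (2, 1, 1))    # split along X only
four_procs = local_extent(domain, (2, 2, 1))   # split along X and Y
```

Each additional split halves the local work per process, which is consistent with the near doubling of aggregate performance when the process count doubles.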
2) Scalable Synthetic Compact Applications 1 (SSCA1):
We extended the SSCA1 codebase, which uses C macros to
abstract put and get semantics across communication libraries,
to support GPI-2. Unlike MPI's one-sided routines, which
allow source data to reside in stack-allocated memory, GPI-2
requires data to be placed in a designated memory segment.
To comply with this constraint, our GPI-2 implementation
uses a dedicated message buffer, requiring an explicit copy
of each data item before transmission.
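The staging step can be sketched as a copy into a pre-allocated buffer followed by a transfer that refers to the buffer by offset. This is an illustrative model, not GPI-2 code: the class `Segment` and the function `stage` are hypothetical stand-ins for a registered segment and the copy performed before each transmission.

```python
class Segment:
    """Stand-in for a registered communication segment (hypothetical)."""

    def __init__(self, size):
        self.buf = bytearray(size)


def stage(segment, offset, payload):
    """Copy `payload` into the segment; return (offset, length) to transmit."""
    end = offset + len(payload)
    if offset < 0 or end > len(segment.buf):
        raise ValueError("payload does not fit into the segment")
    segment.buf[offset:end] = payload   # the explicit extra copy
    return offset, len(payload)


seg = Segment(16)
local_value = b"abc"                    # data living outside the segment
descriptor = stage(seg, 4, local_value)
```

The copy is paid once per data item, so its relative cost is highest when items are small and transfers are frequent.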
Figure 3(f) shows the runtime results for the SSCA1
benchmark executed with two and four processes on two
and four nodes, respectively. The results indicate that
OMPI consistently achieves the lowest runtimes, while both
GPI-2 and PSMPI exceed 120 seconds in the four-node
configuration. These results reaffirm the performance gap
between OMPI and PSMPI discussed in Section IV-F.
GPI-2's comparatively weaker performance is likely due to
its message buffering strategy, which introduces additional
overhead. Given SSCA1's reliance on frequent small put and
get operations, the cost of copying data into message buffers
significantly impacts overall runtime.
V. CONCLUSION AND OUTLOOK
This work introduced a Portals4 microbenchmark suite to
analyze BXI performance. We compared the performance of
Portals4-compatible MPI implementations with our custom
Portals4-based backend for the GPI-2 communication library.
For point-to-point communication, OMPI and PSMPI
performed well, with OMPI showing a slight edge. In one-sided
communication, GPI-2 outperformed both MPI implementations,
though OMPI came close, while PSMPI's reliance on
point-to-point communication introduced significant overhead.
For collective operations, GPI-2 showed strong Barrier
performance but weaker Allreduce results. PSMPI excelled in
Allreduce for small reduction element counts, surpassing OMPI.
Overall, Portals4 over BXI demonstrates promising
performance, with simpler implementation compared to InfiniBand.
PSMPI, as the only MSA-aware MPI implementation, offers
additional features but could benefit from improved OSC
design for better resource utilization on homogeneous systems.
Future work will include extending the microbenchmark
suite to cover all Portals4 functionalities. For GPI-2, we
plan to explore accelerator-to-accelerator communication over
Portals4 and develop more efficient Allreduce algorithms.
ACKNOWLEDGMENT
This research is supported by EUPEX, which has received
funding from the European High-Performance Computing
Joint Undertaking (JU) under GA No 101033975. The JU
receives support from the European Union's Horizon 2020
research and innovation program, France, Germany, Italy,
Greece, United Kingdom, Czech Republic, and Croatia. The
authors would also like to thank Grégoire Pichon from Eviden
for his support and valuable comments.
REFERENCES
[1] S. De adji, T. Pal e -Sollie , J.-P. Panzie a, A. Poudes, and F. W.
A os, “The bxi in e connec a chi ec u e,” in 2015 IEEE 23 d Annual
Symposium on High-Pe o mance In e connec s, pp. 18–25, IEEE, 2015.
[2] Eu opean Pilo o Exascale. h ps://eupex.eu/abou - he-p ojec /. ac-
cessed 08-Augus -2024.
[3] Sandia Na ional Labo a o ies, “Po als 4.0.” h ps://www.sandia.go /
po als/po als-4-0/. Online; accessed: 2024-07-19.
[4] M. De Wael, S. Ma , B. De F aine, T. Van Cu sem, and W. De Meu e ,
“Pa i ioned Global Add ess Space Languages,” ACM Compu . Su .,
ol. 47, May 2015.
[5] In iniBand T ade Associa ion, “In iniBand.” h ps://www.in iniband a.
o g. Online; accessed: 2024-08-02.
[6] B. W. Ba e , S. Smi h, J. Dinan, K. Seage , and R. E. G an , “Sandia
OpenSHMEM.” h ps://www.os i.go /se le s/pu l/1312730, 2016.
[7] E. Gab iel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Donga a, J. M.
Squy es, V. Sahay, P. Kambadu , B. Ba e , A. Lumsdaine, R. H.
Cas ain, D. J. Daniel, R. L. G aham, and T. S. Woodall, “Open MPI:
Goals, concep , and design o a nex gene a ion MPI implemen a ion,”
in P oceedings, 11 h Eu opean PVM/MPI Use s’ G oup Mee ing, 2004.
[8] Sandia Na ional Labo a o ies, “Po als4 Re e ence Implemen a ion.”
h ps://gi hub.com/sandialabs/po als4. Gi Hub eposi o y.
[9] P. Vign´
e as and J.-N. Quin in, “The bxi ou ing a chi ec u e o exas-
cale supe compu e ,” The Jou nal o Supe compu ing, ol. 72, no. 12,
pp. 4418–4437, 2016.
[10] T. Naugh on, F. Ade hold , M. Bake , S. Pophale, M. G. Venka a, and
N. Imam, “Oak Ridge OpenSHMEM Benchma k Sui e,” in Wo kshop on
OpenSHMEM and Rela ed Technologies, pp. 202–216, Sp inge , 2018.
[11] D. G ¨
unewald and C. Simmendinge , “The GASPI API speci ica ion and
i s implemen a ion GPI 2.0,” in 7 h In e na ional Con e ence on PGAS
P og amming Models, ol. 243, p. 52, 2013.
[12] C. Simmendinge , M. Rahn, and D. G uenewald, “The GASPI API: A
Failu e Tole an PGAS API o Asynch onous Da a low on He e oge-
neous A chi ec u es,” in Sus ained Simula ion Pe o mance 2014 (M. M.
Resch, W. Bez, E. Foch , H. Kobayashi, and N. Pa el, eds.), (Cham),
pp. 17–32, Sp inge In e na ional Publishing, 2015.
[13] F aunho e ITWM, “GPI-2: Global Add ess P og amming In e ace 2.”
h ps://gi hub.com/cc-hpc-i wm/GPI-2. accessed 24-Augus -2022.
[14] S. Neuwi h and A. K. Paul, “Pa allel I/O E alua ion Techniques and
Eme ging HPC Wo kloads: A Pe spec i e,” in 2021 IEEE In e na ional
Con e ence on Clus e Compu ing (CLUSTER), pp. 671–679, 2021.
[15] D. K. Panda, “OSU Mic o-Benchma ks (OMB).” h ps://m apich.cse.
ohio-s a e.edu/benchma ks/. Online; accessed 04-Augus -2024.
[16] D. A. Mall´
on, Design o Scalable PGAS Collec i es o NUMA and
Manyco e Sys ems. PhD hesis, Uni e si y o A Co u˜
na, Spain, 2014.
[17] N. Ba elheime and S. Neuwi h, “Towa d Rep oducible Benchma king
o PGAS and MPI Communica ion Schemes,” in IEEE 29 h In e na-
ional Con e ence on Pa allel and Dis ibu ed Sys ems (ICPADS), 2023.
[18] A. Kalia, M. Kaminsky, and D. G. Ande sen, “Design guidelines o
high pe o mance dma sys ems,” in 2016 USENIX Annual Technical
Con e ence (USENIX ATC 16), pp. 437–450, 2016.
[19] D. De Sensi, S. Di Gi olamo, K. H. McMahon, D. Rowe h, and
T. Hoe le , “An in-dep h analysis o he slingsho in e connec ,” in SC20:
In e na ional Con e ence o High Pe o mance Compu ing, Ne wo king,
S o age and Analysis, pp. 1–14, IEEE, 2020.
[20] Y. Li, H. Qi, G. Lu, F. Jin, Y. Guo, and X. Lu, “Unde s anding
ho in e connec s wi h an ex ensi e benchma k su ey,” BenchCouncil
T ansac ions on Benchma ks, S anda ds and E alua ions, ol. 2, no. 3,
p. 100074, 2022.
[21] Sandia Na ional Labo a o ies, “Sandia MPI Mic obenchma k Sui e.”
h ps://gi hub.com/sandialabs/SMB. Gi Hub eposi o y.
[22] B. W. Ba e , R. B igh well, R. E. G an , W. Schonbein, S. Hemme ,
K. Ped e i, K. Unde wood, R. Riesen, M. Ba be, L. H. S. Filho,
A. Ra cho , and A. B. Maccabe, “The Po als 4.3 Ne wo k P og amming
In e ace,” Tech. Rep. SAND2022-8810, SNL, 2022.
[23] Fo schungszen um Juelich, “DEEP Tes clus e – Sys em O e iew.”
h ps://deep ac.zam.k a-juelich.de:8443/ ac/wiki/Public/Use Guide/
Sys em o e iew. accessed 10-18-2023.
[24] S. Neuwi h, “Modula Supe compu ing and i s Role in Eu ope’s Exas-
cale Compu ing S a egy,” PoS, ol. LATTICE2022, p. 245, 2023.
[25] N. Ba elheime and S. Neuwi h, “Le e aging Po als4 Mic obench-
ma ks o Enhance GASPI Pe o mance on BXI Ne wo ks,” in 2024 IEEE
In e na ional Con e ence on Clus e Compu ing Wo kshops (CLUSTER
Wo kshops), pp. 176–177, 2024.