
Using objects for data storage in high-performance I/O

Author: Jackson, Adrian
Publisher: Zenodo
DOI: 10.5281/zenodo.17654701
Source: https://zenodo.org/records/17654701/files/REXIO2025-ExpertTalk2-Jackson.pdf
USING OBJECTS FOR STORAGE IN
HIGH-PERFORMANCE I/O
Adrian Jackson
Andy Turner
EPCC
*Nicolau Manubens
ECMWF
Storage
•Lots of ways to store data on storage devices
•Filesystems have two components:
•Data storage
•Indexing
•Data stored in blocks
•Chunks of data physically stored on hardware somewhere
•Indexing is used to associate names with blocks
•File names are the index
•Files may consist of many blocks
•Variable sized nature of files makes this a hard problem to solve
Filesystems
[Figure: inodes pointing to blocks or sectors]
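The split between data storage and indexing can be illustrated with a toy model. This is a minimal sketch only: the class name `ToyBlockStore`, the 4-byte block size and the file names are made up for illustration and do not correspond to any real filesystem.

```python
BLOCK_SIZE = 4  # bytes per block; tiny so the example is easy to follow


class ToyBlockStore:
    """Toy model: a flat list of fixed-size blocks plus a name -> block-list index."""

    def __init__(self):
        self.blocks = []   # the "data storage" component
        self.index = {}    # the "indexing" component: file name -> block numbers

    def write(self, name, data):
        """Cut the data into blocks and record which blocks belong to the name."""
        block_ids = []
        for start in range(0, len(data), BLOCK_SIZE):
            self.blocks.append(data[start:start + BLOCK_SIZE])
            block_ids.append(len(self.blocks) - 1)
        self.index[name] = block_ids

    def read(self, name):
        """Follow the index and join the blocks back together."""
        return b"".join(self.blocks[i] for i in self.index[name])


store = ToyBlockStore()
store.write("a.dat", b"0123456789")   # 10 bytes spread over several blocks
store.write("b.dat", b"xy")           # a short file still occupies a block
```

The last block of `a.dat` is only half full, which is one way variable-sized files lead to wasted storage.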
•Metadata required to match data to storage
•Files and directories
•Expandable files require management of blocks of storage
•Can lead to wasted storage
•Can lead to resource limits
•Can require metadata operations that reduce “performance”
•Files wrap data in ways that may not be related to how data is created or accessed
•Locking/sharing can be hard to achieve efficiently
•Can require kernel space operations
•Performance is fine for slow storage
•Requires bulk operations for high performance
Challenges of filesystems
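The need for bulk operations can be sketched as write aggregation: many small records are buffered and pushed to storage as a few large writes. This is an illustrative sketch, not anything from the talk; `AggregatingWriter` and the 1024-byte capacity are hypothetical, and an in-memory `io.BytesIO` stands in for the storage device.

```python
import io


class AggregatingWriter:
    """Buffers small records and issues one large write when the buffer fills."""

    def __init__(self, sink, capacity):
        self.sink = sink
        self.capacity = capacity
        self.pending = bytearray()
        self.write_calls = 0   # number of writes actually issued to the sink

    def write(self, record):
        self.pending += record
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write(bytes(self.pending))
            self.write_calls += 1
            self.pending.clear()


sink = io.BytesIO()
writer = AggregatingWriter(sink, capacity=1024)
for _ in range(1000):
    writer.write(b"x" * 8)     # 1000 small 8-byte records
writer.flush()                 # push out whatever is left in the buffer
```

Here 1000 application-level writes become 8 writes to the sink, at the cost of holding data in memory until the buffer fills.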
•General vs specific hardware constraints
•GPU is a specialised hardware version of a CPU – provides higher performance for
reduced semantic flexibility in programs
•Newer hardware provides more flexibility
•Something like direct addressability with reduced hardware overheads
•Access patterns for creating data may be different to consuming data
•Many processes working on the same “data structures”
•Metadata and searching becoming increasingly important for scientific datasets
•More emphasis on read-often datasets
•Optimising everything in a single system setup is hard
•Contention on shared systems can be a big issue
Challenges of I/O for applications

•BSP (bulk synchronous parallel) I/O
•Required for high performance on storage
•Not necessarily what you would “naturally” do for an application
•Workflows increasingly important
Ideal I/O
Actual I/O
ARCHER2 Lustre IO pattern
analysis using Darshan
•Another way to split up the storage hardware that is available…
•…and manage the data being stored
•Databases are one possible option
•Requires a different model for shared access systems than is traditionally used for
databases
•Often not well designed for slice accessing or shared access to the same dataset
•ACID-like properties can be as restricting as POSIX approaches
•Object storage is an alternative approach
•Deconstructed database
Object storage as an alternative
Lustre vs DAOS

Ceph vs DAOS vs Lustre
Storage comparison – 16 storage nodes (or 16 + 1) GCP
Ceph vs Lustre vs DAOS: Small objects
•Testing under “production” conditions is important
•Redundancy important for production operations
•Even if backup not common
•Potential to move away from redundancy if sites/groups/users want
•Benefits of configurable redundancy
•Two types of redundancy
•Erasure coding
•Replication
DAOS under operational conditions
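The difference between the two types of redundancy can be sketched with the simplest possible 2+1 scheme: two data shards plus one XOR parity shard. This is an assumption made for illustration; the slides do not say which code DAOS uses internally, and the function names below are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def ec_2p1_encode(data):
    """Split data into two equal halves and add one XOR parity shard."""
    if len(data) % 2:
        data += b"\0"                      # pad to an even length
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    return [d0, d1, xor_bytes(d0, d1)]     # shards: data, data, parity


def ec_2p1_recover(shards):
    """Rebuild the single missing shard (marked None) from the other two."""
    missing = shards.index(None)
    a, b = [s for s in shards if s is not None]
    repaired = list(shards)
    repaired[missing] = xor_bytes(a, b)
    return repaired


payload = b"objectdata"                    # 10 bytes
shards = ec_2p1_encode(payload)
damaged = [shards[0], None, shards[2]]     # lose one of the three shards
restored = ec_2p1_recover(damaged)

# Bytes stored per byte of user data under each scheme.
ec_overhead = sum(len(s) for s in shards) / len(payload)
replication_overhead = 2 * len(payload) / len(payload)
```

Both schemes survive the loss of one copy or shard, but 2+1 erasure coding stores 1.5 bytes per byte of data where 2 x replication stores 2, in exchange for parity computation on write and on recovery.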
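Nothing vs Erasure Coding (2+1) - IOR
[Figure: Write and Read panels comparing no redundancy with EC 2+1]

Nothing vs Redundancy (factor 2) - IOR
[Figure: Write and Read panels comparing no redundancy with 2 x replication]
Erasure vs Redundancy
[Figure: Write and Read panels comparing 2 x replication with EC 2+1 (2+1 erasure coding)]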
Access interfaces
[Figure: DAOS access interfaces. Labels: POSIX I/O / “Files” (FUSE & Interception); S3 (Radosgw); Block / NVMe-oF (SPDK DAOS bdev); Python (pydaos); Hadoop (Connector); MPI-IO (DAOS ROMIO); HDF5 (DAOS VOL); SEGY; FDB; ROOT; DAQ; libdfs (Parallel Filesystem); libdaos (key-value-array interface); AI/Analytics/Scientific Workflow; Compute Instances (GPGPU, CPU)]
1. Userspace DFS library with API like POSIX
○Requires application changes
○Low latency & high concurrency
○No caching
2. DFUSE daemon to support POSIX API
○No application changes
○VFS mount point & high latency
○Caching by Linux kernel
3. DFUSE + Interception library
○No application changes
○2 flavors using LD_PRELOAD
○libioil
■(f)read/write interception
■Metadata via dfuse
○libpil4dfs
■Data & metadata interception
■Aim at delivering same performance as #1
w/o any application change
■Mmap & binary execution via fuse
[Figure: DAOS client I/O paths, numbered 1, 2, 3a and 3b to match the options listed above. Labels: Application/Framework; Interception Library (libpil4dfs, libioil); DFS - DAOS Filesystem (libdfs); DAOS Library (libdaos); dfuse; Linux Kernel (system calls); DAOS Storage Engine (RPC, RDMA); single process address space; kernel bypass; data & metadata; data]
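Option 3 relies on LD_PRELOAD, so an unmodified application can be launched with the interception library injected through its environment. The sketch below only shows that mechanism. The library path is a hypothetical install location, and both helper function names are made up for illustration.

```python
import os
import subprocess

# Hypothetical install location; adjust to wherever the DAOS client libraries live.
PIL4DFS = "/usr/lib64/libpil4dfs.so"


def interception_env(library_path, base_env=None):
    """Return a copy of the environment with the library prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = library_path if not existing else f"{library_path}:{existing}"
    return env


def run_with_interception(argv, library_path=PIL4DFS):
    """Launch an unmodified application so its POSIX I/O calls can be intercepted."""
    return subprocess.run(argv, env=interception_env(library_path), check=True)
```

A call such as `run_with_interception(["./my_app", "/mnt/dfuse/output"])` would then start the application against a dfuse mount point; both the application name and the mount path here are placeholders.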
•For a specific benchmark run configured with contention across processes
on indexing Key-Values:
•20 GiB/s write
•13 GiB/s read
•Tweaking the benchmark configuration to have all processes operate on
separate Key-Values:
•35 GiB/s write
•68 GiB/s read
•This may not be trivial or possible for all applications, but if the design can
support it then this improves performance
Approach/recommendations: Key-Value contention
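The change between the two benchmark configurations can be sketched as a naming decision: either every writer updates one shared index Key-Value, or each writer gets its own. This is a toy model under stated assumptions; the Key-Value names and helper functions are illustrative and do not come from the benchmark.

```python
from collections import Counter


def index_kv_for(rank, per_process):
    """Name of the index Key-Value a writer uses (names are illustrative only)."""
    return f"index.{rank}" if per_process else "index.shared"


def writers_per_kv(n_ranks, per_process):
    """Count how many writer processes update each index Key-Value."""
    return Counter(index_kv_for(rank, per_process) for rank in range(n_ranks))


shared = writers_per_kv(8, per_process=False)    # all ranks hit one Key-Value
separate = writers_per_kv(8, per_process=True)   # one Key-Value per rank
```

With separate Key-Values no two writers touch the same object, which removes the contention, but readers then need some way to know which Key-Values to consult.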

•Avoid communications on/with the server where possible
•Cache objects locally in DRAM if possible
•Use daos_array_open_with_attr to avoid daos_array_create calls
•Only supported for DAOS_OT_ARRAY_BYTE, not for DAOS_OT_ARRAY
•Warning: the cell size and chunk size attributes need to be provided consistently on any
future daos_array_open_with_attr to avoid data corruption
•daos_array_get_size calls can be expensive
•Can store array size in our indexing Key-Values
•Can manually calculate
•Also possible to infer the size by reading with over-allocation:
•use DAOS_OT_ARRAY_BYTE, over-allocate the read buffer, and read without querying the
size. The actual read size (short_read) will be returned
•daos_cont_alloc_oids is expensive, call it just once per writer process
•Required to generate object IDs to use in calls, but can generate many at once
Approach/recommendations
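Two of these recommendations can be modelled without a DAOS installation. The sketch below is a stand-in, not DAOS code: `OidAllocator` and `fake_reserve` are hypothetical and only mimic reserving a range of object IDs in one expensive call, and `io.BytesIO.readinto` stands in for a byte-array read that reports how many bytes it actually returned.

```python
import io


class OidAllocator:
    """Reserves a block of object IDs in one expensive call, then hands them out locally."""

    def __init__(self, reserve, batch):
        self._reserve = reserve    # stand-in for the expensive server-side allocation
        self._batch = batch
        self._next = 0
        self._end = 0

    def next_oid(self):
        if self._next == self._end:            # local range exhausted (or never filled)
            self._next = self._reserve(self._batch)
            self._end = self._next + self._batch
        oid = self._next
        self._next += 1
        return oid


server_calls = []      # records every trip to the hypothetical server
_first_free = [0]


def fake_reserve(n):
    """Hypothetical server: returns the first ID of a freshly reserved range of n IDs."""
    server_calls.append(n)
    first = _first_free[0]
    _first_free[0] += n
    return first


alloc = OidAllocator(fake_reserve, batch=2000)
oids = [alloc.next_oid() for _ in range(2000)]   # e.g. one ID per field written


def read_without_size_query(stream, max_size):
    """Read into an over-allocated buffer; the returned count gives the real size."""
    buf = bytearray(max_size)
    n = stream.readinto(buf)
    return bytes(buf[:n])


data = read_without_size_query(io.BytesIO(b"field-bytes"), max_size=4096)
```

The batch size has to be chosen up front; choosing it at least as large as the number of objects a writer will create keeps the expensive call to one per process.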
•Creating several containers (starting at ~300) in a DAOS pool
reduces performance
•Opening the same container from all processes is expensive
•this happens even if only a few containers exist in the DAOS pool
•e.g. out of 20 seconds taken by a process to write 2000 fields, 1.5
seconds were spent just to open one container
•we observed this starting at ~200 parallel processes
•Sharing handles using MPI is the way to fix this
•Opening more than one container per process is very expensive
•e.g. out of 30 seconds taken by a process to read 2000 fields, 6 seconds
were spent just to open two containers
Approach/recommendation
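Handle sharing follows an open-once, serialise, broadcast pattern. In the DAOS C API the serialisation step is understood to be done with daos_cont_local2global and daos_cont_global2local. The sketch below does not call DAOS or MPI: `FakePool`, the handle contents and the loop that stands in for an MPI broadcast are all hypothetical.

```python
import pickle


class FakePool:
    """Stand-in for a storage pool that counts expensive container opens."""

    def __init__(self):
        self.open_calls = 0

    def open_container(self, label):
        self.open_calls += 1
        return {"label": label, "cookie": 42}     # hypothetical handle contents


def open_on_root_and_share(pool, label, n_ranks):
    """Rank 0 opens the container once; every other rank rebuilds a handle from bytes."""
    handle = pool.open_container(label)           # done by rank 0 only
    blob = pickle.dumps(handle)                   # serialise to a shareable form
    # An MPI broadcast of `blob` would go here; a loop over ranks stands in for it.
    return [handle] + [pickle.loads(blob) for _ in range(1, n_ranks)]


pool = FakePool()
handles = open_on_root_and_share(pool, "fields", n_ranks=200)
```

The server sees a single open regardless of the number of processes, which is the property that matters once runs reach hundreds of parallel processes.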
•daos_key_value_list is expensive
•daos_array_open_with_attr, daos_kv_open and
daos_array_generate_oid are very cheap (no RPC)
•Normal daos_array_open is expensive
•daos_cont_alloc_oids is expensive
•daos_kv_put and daos_kv_get are generally cheap
•Value size impacts this
•daos_obj_close, daos_cont_close and daos_pool_disconnect are
cheap
•Server configuration to use available networks/sockets/etc… is important for performance
•Just like any storage system or application
Approach/recommendations
Ceph
Designing for object stores

type, public, bind(c) :: daos_array_stbuf_t
  integer(kind=daos_size_t) :: st_size
  integer(kind=daos_epoch_t) :: st_max_epoch
end type daos_array_stbuf_t
interface
  ! Binding to the C function daos_array_create
  integer(kind=c_int) function daos_array_create(coh, oid, th, &
      cell_size, chunk_size, oh, ev) bind(c,name="daos_array_create")
    import :: c_int
    import :: daos_handle_t
    import :: daos_obj_id_t
    import :: daos_size_t
    import :: daos_event_t
    ! Handles, object ID and sizes are passed by value
    type(daos_handle_t), value, intent(in) :: coh
    type(daos_obj_id_t), value, intent(in) :: oid
    type(daos_handle_t), value, intent(in) :: th
    integer(kind=daos_size_t), value, intent(in) :: cell_size
    integer(kind=daos_size_t), value, intent(in) :: chunk_size
    ! Array handle and event are passed by reference
    type(daos_handle_t), intent(inout) :: oh
    type(daos_event_t), intent(inout) :: ev
  end function daos_array_create
end interface
Nascent Fortran Interface
•DAOS has an MPI-I/O implementation
•ROMIO in mpich 3.4.2
•Creates DAOS objects from MPI-I/O function calls…
•…but does not target Array objects
•Working on creating array objects from MPI-I/O specifications
•Automatically mapping object sizes and extents, I/O sizes and extents, from MPI
datatype and MPI-I/O pattern of applications
•Very much a WiP at the moment
•Simplifies porting “traditional” parallel I/O to DAOS more efficiently
•Allows starting to tweak the setup but still staying with MPI-I/O
MPI I/O DAOS automated object creation
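The slides do not describe how the mapping is derived, so the following is only a guess at the kind of calculation involved. It takes a 1-D block decomposition, of the sort an MPI datatype might describe, and derives per-rank extents plus a candidate chunk size for a single array object. The function names and the rule "chunk size = largest per-rank extent" are assumptions, not the work-in-progress implementation.

```python
def block_extents(total_cells, n_ranks):
    """Contiguous (offset, count) per rank for a 1-D block decomposition."""
    base, extra = divmod(total_cells, n_ranks)
    extents = []
    offset = 0
    for rank in range(n_ranks):
        count = base + (1 if rank < extra else 0)   # first `extra` ranks get one more
        extents.append((offset, count))
        offset += count
    return extents


def array_layout(total_cells, cell_size, n_ranks):
    """Hypothetical mapping: one array object, chunk size = largest per-rank extent."""
    extents = block_extents(total_cells, n_ranks)
    chunk_cells = max(count for _, count in extents)
    return {"cell_size": cell_size, "chunk_size": chunk_cells, "extents": extents}


layout = array_layout(total_cells=10, cell_size=8, n_ranks=4)
```

Matching the chunk size to what one rank writes is one plausible way to keep each rank's I/O within a single chunk; other choices may suit other access patterns better.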
•Object storage can provide high performance
•DAOS: 90+ GB/s per server is possible
•Hardware and configuration dependent, just like all I/O
•Built in replication and redundancy under your/user control
•Different interfaces available
•Filesystem for zero cost porting
•Simple file-like access for slightly improved performance at little effort
•Programming APIs for full functionality
•Object store interface enables changing I/O granularity/patterns for bigger benefits
•High performance when moving to more “realistic” deployment approaches
•NVMe, replications/erasure coding
•Provided interfaces are generally performant
•But porting to the libdaos layer has benefits/potential, with an associated development
cost
Summary