Reinforced Fairness-Aware Multi-Agent Self-Organization for 6G Radio Access Network Orchestration
Elham Hashemi Nezhad, Antonio Di Maio, Torsten Braun
University of Bern, Bern, Switzerland
{elham.hasheminezhad, antonio.dimaio, torsten.braun}@unibe.ch
Abstract—The orchestrators' deployment problem presents numerous challenges in 6G Radio Access Networks due to their large scale, dynamic conditions, and variable user demands. Most works propose single- or hierarchical-orchestrator solutions, which offer poor resiliency, high signaling overhead, and slow adaptation to variable network dynamics. To tackle these challenges, we propose an online, data-driven, fully decentralized, Multi-Agent Reinforcement Learning (MARL)-based self-organization orchestrator deployment system for 6G networks, which jointly optimizes the tradeoff between user throughput and fairness based on time-varying system conditions. In the proposed approach, a flexibly variable number of decentralized, cooperative, peer self-organization agents autonomously adapt their associated orchestrator's deployment location and activity to optimize network operation, without requiring centralized coordination. Simulations show improvements of up to 77% in user throughput and over 200% in fairness compared to Hierarchical and Single Orchestrator baselines in a broad range of realistic scenarios.
Index Terms—6G Networks, Self-Organization (SO), Decentralized Orchestration, Multi-Agent Reinforcement Learning (MARL), Open-Radio Access Network (O-RAN)
I. INTRODUCTION
The Open-Radio Access Network (O-RAN) architecture enables the decoupling of physical and control layers, improving the scalability and adaptability of the network to different use cases, such as 5G and beyond. The O-RAN Alliance has introduced the RAN Intelligent Controller (RIC) as a key architectural component that provides a centralized network abstraction, enabling operators to implement customized control functions in the Radio Access Network (RAN). The RIC exists in two forms: the Non-Real-Time (Non-RT) RIC, which integrates with the network orchestrator and operates on a time scale longer than 1 s, and the Near-Real-Time (Near-RT) RIC, which manages control loops with RAN nodes on a time scale between 10 ms and 1 s [1]–[5]. Various studies [6]–[8] have explored the Controller Placement Problem (CPP) in different architectures, including Software-Defined Network (SDN) and O-RAN. However, the problem of orchestrator organization to manage controllers remains NP-hard and has received limited attention, particularly within the context of O-RAN. Some studies [9]–[11] present an additional management layer for orchestrator deployment, which brings a single point of failure and overhead communication to the centralized management. In contrast, we propose a decentralized Self-Organization (SO) approach, where orchestrators autonomously manage their placement, eliminating dependence on centralized control and enhancing adaptability through localized, real-time decision-making. Our design aligns with the 6G Self-Organizing Networks (SONs) vision, which seeks to overcome scalability, stability, security, and cost challenges via autonomous network restructuring. A SON is a network that autonomously configures, optimizes, and heals itself using automated mechanisms to improve performance, reduce manual intervention, and adapt to changing conditions [12].
In our previous work, we proposed a decentralized orchestration model utilizing Multi-Agent Reinforcement Learning (MARL) systems to tackle CPP. That work highlighted the significance of decentralized orchestration (i.e., orchestrator domains) for deploying distributed Near-Real-Time RICs (i.e., controller domains) within the O-RAN architecture to maximize user Packet Delivery Ratio (PDR), especially when compared to centralized RAN Intelligent Controller orchestration under similar network conditions.
We adopt the 6G network architecture (Figure 1), which is composed of extreme edge (i.e., User Equipments (UEs)), RAN, edge, core, and central cloud, and is built on three key components: Management and Orchestration Framework (MOF), Cloud Continuum Framework (CCF), and Artificial Intelligence and Machine Learning Framework (AIMLF). A single Master Service Orchestrator (MSO) manages multiple Distributed Service Orchestrators (DSOs), which control Network Services (NSs) consisting of Virtual Network Functions (VNFs), including controller software modules. In alignment with the O-RAN architecture, we interpret VNFs in the edge domain as RICs that manage and optimize RAN operations, and are deployed by DSOs.
In this paper, we enhance our previous 6G network architecture by replacing the hierarchical orchestration model with a decentralized self-organization (SO) approach. Using a MARL system, orchestrators autonomously manage their placement, improving system resilience while maximizing user throughput and enhancing fairness. This goal is achieved through adaptive workload distribution within the core network, without relying on a centralized management layer. This paper addresses two research questions:
1) How can we optimize the problem of orchestrators' deployment to reduce centralized overhead and maximize user throughput even with a highly dense UE population?
2) How can 6G network bandwidth be orchestrated to ensure fair allocation among users under dynamic network conditions?
To address these questions, we propose DEcentralized Reinforced RAN Intelligent Controller orchestration (DERRIC) Self-Organization orchestration through MARL in O-RAN (DERRIC-SO) to optimize the problem of orchestrator placement. This paper provides the following contributions.
1) By enabling orchestrators to autonomously manage their placement and actions using MARL, the network can adapt in real time to dynamic conditions, reduce dependence on centralized control, and enhance scalability, resilience, and responsiveness of orchestration.
2) Decentralized self-organization allows orchestrators to balance controller loads and user assignments more effectively, reducing performance disparities and enhancing fairness in throughput distribution, particularly under varying network loads and user densities.
II. RELATED WORK
We present a comprehensive analysis of related works, emphasizing how they address the orchestrators' deployment problem.
A. Decentralized Orchestration
Several studies [13]–[15] have explored centralized orchestration approaches to enhance core network performance and coordinate management across O-RAN and SDN architectures. Centralized orchestration introduces vulnerability to a single point of failure, creates scalability bottlenecks as network demands grow, and increases control-plane latency due to centralized decision-making. In contrast, decentralized orchestration enhances network resilience and scalability by distributing control functions, thereby mitigating congestion and accommodating more connected users efficiently.
B. Orchestration Organization
A number of works [16]–[18] employ a hierarchical orchestration model, where a top-level orchestrator coordinates lower-level orchestrators to address the adaptability of network management and orchestration. While such external management systems can enhance security and trust, they also introduce coordination overhead, increase communication latency, and create upper-layer dependency, resulting in a single point of failure. These limitations ultimately reduce system resilience, hinder scalability, and slow real-time adaptability in dynamic network environments.
C. Self-Organization
Moura et al. [19] and Lyu et al. [20] propose self-optimization strategies for orchestrator placement based on workload and multi-timescale tuning. However, their methods rely on fixed parameters and lack adaptability in highly dynamic networks. In contrast, DERRIC-SO empowers orchestrators to autonomously manage their lifecycle through duplication, relocation, and termination using Reinforcement Learning (RL), enabling continuous learning and real-time decision-making. Duplicating a single orchestrator, rather than deploying three or more, allows the agent to act more efficiently and reach convergence faster due to the smaller action space. By intelligently relocating orchestrators closer to areas of demand, our method improves adaptability, balances controller workloads, and enhances fairness in user throughput.
Table I categorizes the relevant works according to the proposed method, architecture, controller, and orchestration paradigms.

Fig. 1. Proposed 6G Network Architecture

TABLE I
RELATED WORKS
Works       Orchestration   Method             Centralized   Adaptive       Fairness
            Architecture                       Overhead      Intelligence   Index
[13]–[15]   Centralized     Data-driven        ✓             ✓              ×
[9]         Decentralized   Blockchain         ✓             ×              ×
[16], [17]  Decentralized   Hierarchical       ✓             ✓              ×
[19], [20]  Decentralized   Self-optimization  ×             ×              ×
Proposed    Decentralized   Self-organization  ×             ✓              ✓
III. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model
We assume that the system operates within a RAN deployment, adhering to the O-RAN specifications. The network topology is modeled as an undirected graph $G = (V, E)$, where $V = \{v_1, \dots, v_{|V|}\}$ represents a set of infrastructure devices containing physical communication and processing devices such as fixed Base Stations (BSs) (e.g., gNodeBs), Multi-access Edge Computing (MEC) servers, and large cloud data centers. Moreover, we consider a set of edges $E = \{e_1, \dots, e_{|E|}\}$ representing a set of physical links between two nodes in $V$, including wireless and wired links. We assume time is discretized into time steps $t \in \mathcal{T}$, where $\mathcal{T} \subset \mathbb{N}$ denotes the set of all time steps during which the system operates. Each link in the link set $E$ is characterized by link latency $L_{ij}(t)$, which is determined by the physical length of the link, the device transmission rate, and the congestion levels experienced during packet transmission at time $t$. For any pair of nodes $i, j \in V$, we define $r_{ij} \subseteq E$ as the set of edges forming the multi-hop path between nodes $i$ and $j$. The end-to-end latency $L(r_{ij})(t) = \sum_{(k,l) \in r_{ij}} L_{kl}(t)$ is defined as the sum of link latencies along the path $r_{ij}$ at time step $t \in \mathcal{T}$.
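As an illustration of the latency model, the following minimal sketch computes end-to-end latencies as sums of link latencies and uses them to build a Voronoi-like partition of nodes around a set of hosts, which is the clustering rule used for orchestrator domains in the next paragraph. It assumes that the multi-hop path between two nodes is the lowest-latency one (the text does not fix a routing rule), and the function names, the dictionary-based link representation, and the toy topology are hypothetical.

```python
import heapq


def end_to_end_latency(links, source):
    """Lowest end-to-end latency from `source` to every reachable node.

    `links` maps an undirected edge (i, j) to its link latency L_ij(t) at one
    fixed time step. A path latency is the sum of its link latencies.
    Assumption: the path between two nodes is the lowest-latency one.
    """
    adjacency = {}
    for (i, j), latency in links.items():
        adjacency.setdefault(i, []).append((j, latency))
        adjacency.setdefault(j, []).append((i, latency))  # undirected graph

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, latency in adjacency.get(node, []):
            candidate = dist + latency
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


def assign_domains(links, orchestrator_hosts):
    """Voronoi-like partition: each node joins its lowest-latency host."""
    latency_from = {m: end_to_end_latency(links, m) for m in orchestrator_hosts}
    nodes = {n for edge in links for n in edge}
    domains = {m: set() for m in orchestrator_hosts}
    for node in nodes:
        reachable = [m for m in orchestrator_hosts if node in latency_from[m]]
        if reachable:
            closest = min(reachable, key=lambda m: latency_from[m][node])
            domains[closest].add(node)
    return domains


# Hypothetical 4-node line topology, latencies in milliseconds.
links = {("a", "b"): 2.0, ("b", "c"): 3.0, ("c", "d"): 1.0}
latency = end_to_end_latency(links, "a")
domains = assign_domains(links, ["a", "d"])
```

In this toy topology, nodes `a` and `b` fall into the domain of host `a`, while `c` and `d` fall into the domain of host `d`.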
In this architecture, orchestrators and controllers are stateful, virtualized, and migratable software modules that can be deployed on a set of 6G network devices (i.e., cloud servers and gNodeBs). Both controller and orchestrator agents can operate on the same network topology graph $G$. The system model has a set $\mathcal{O}(t)$ of orchestrators at time step $t$ that partition the network into a Voronoi-like [21] set of contiguous orchestrator domains. Orchestrators are migrated across a set $M \subseteq V$ of orchestrator hosts, denoted as $\{m_1, \dots, m_{|M|}\}$, which are the nodes with the highest available infrastructure resources for orchestrator operations. These specialized hardware devices are capable of running orchestration software modules efficiently. Each orchestrator $o \in \mathcal{O}(t)$ has a time-varying logical deployment location $p_o(t) \in M$ on the network topology at time step $t$; both the locations and the number of orchestrators are optimized through a self-organization process. Based on the current deployment location of the orchestrator, an orchestrator domain $D_o$ is created by clustering the RAN nodes in $V$ with the lowest latency to the orchestrator node. Within its domain $D_o \subseteq V$, each orchestrator $o \in \mathcal{O}(t)$ is also responsible for deploying a time-varying set of controllers $C_o(t)$ to regulate user metrics such as transmission power $P_u(t)$ for a set $U_c(t)$ of UEs which are connected to BSs at time $t$ in the controller domain $D_c$. We assume the UEs can move in the scenario and connect to the closest BS. In our model, there are $K$ BSs, each with finite capacity to connect UEs to the network, where users are assumed to be mobile and may change their association with base stations over time. This capacity is governed by hardware constraints and regulatory policies that influence the allocation of physical parameters such as transmission power and bandwidth. To quantify the network performance from the user's perspective, we model link capacity (i.e., user-BS link) based on the Shannon-Hartley theorem [22], and each BS allocates the maximum achievable data rate in a communication channel subject to noise and interference to its connected users. We consider a scenario where users transmit data to the base station at the maximum capacity of their channels. Accordingly, the user data throughput $T_u(t)$ [bit/s] at time step $t$ is modeled using Shannon's capacity theorem, as shown in Equation 1a.
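$$T_u(t) = B_u(t) \cdot \log_2\left(1 + \frac{\rho_u(t)}{I_u(t) + N}\right) \tag{1a}$$
$$I_u(t) = \sum_{k=1}^{K} P_k(t) \cdot \left(\frac{c}{4\pi\nu\, d_{uk}(t)}\right)^{\alpha} \tag{1b}$$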
Here, $B_u(t)$ [Hz] represents the bandwidth allocated to user $u$ by its BS. The Signal to Interference and Noise Ratio (SINR) experienced by user $u$ is the ratio between the received signal power $\rho_u(t)$ and the sum of the interference power $I_u(t)$ and the ambient thermal noise $N$. The interference power at user $u$ (Equation 1b) follows the Free-Space Path Loss model [23] applied to the transmission powers $P_k(t)$ of the $K$ BSs, where $c$ is the speed of light, $\nu$ is the frequency of the radio wave, $d_{uk}(t)$ is the distance between BS $k$ and user $u$, and $\alpha > 0$ is the path loss exponent. We also define the average total throughput as $\bar{T} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{1}{|U|} \sum_{u \in U} T_u(t)$.
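The throughput model of Equations 1a and 1b can be sketched numerically as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, powers are assumed to be in linear units (watts), and the interference term assumes that the path loss exponent applies to the whole free-space gain factor $c / (4\pi\nu d_{uk}(t))$.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # c, in m/s


def interference_power(tx_powers, distances, frequency_hz, alpha=2.0):
    """I_u(t): sum over BSs of P_k * (c / (4 * pi * nu * d_uk)) ** alpha.

    Assumption: the exponent alpha applies to the whole free-space gain factor.
    """
    return sum(
        p * (SPEED_OF_LIGHT / (4.0 * math.pi * frequency_hz * d)) ** alpha
        for p, d in zip(tx_powers, distances)
    )


def user_throughput(bandwidth_hz, signal_power, interference, noise):
    """Shannon-Hartley rate T_u(t) = B_u * log2(1 + rho_u / (I_u + N)), in bit/s."""
    return bandwidth_hz * math.log2(1.0 + signal_power / (interference + noise))


# Sanity check: with SINR = 1 the rate equals the bandwidth, since log2(2) = 1.
rate = user_throughput(bandwidth_hz=20e6, signal_power=1e-9,
                       interference=4e-10, noise=6e-10)

# At the reference distance c / (4 * pi * nu) the free-space gain factor is 1.
reference_distance = SPEED_OF_LIGHT / (4.0 * math.pi * 1e9)
unit_gain = interference_power([1.0], [reference_distance], frequency_hz=1e9)
```

Doubling the distance from the reference distance with `alpha=2.0` reduces the received interference to one quarter, which is the expected inverse-square behavior of free-space propagation.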
To quantify the equity of resource (i.e., throughput) distribution within a subset $U' \subseteq U$ of users in the system at time $t$, we use the Jain Fairness Index $J(U', t)$ as in Equation 2.
$$J(U', t) = \frac{\left(\sum_{u \in U'} T_u(t)\right)^2}{|U'| \sum_{u \in U'} T_u(t)^2} \tag{2}$$
The higher the value of the Jain Fairness Index, the more uniform the throughput allocation across the users in the considered set. We define the Global Fairness $J(U, t)$ and the Local Fairness $J(U_c, t)$ as the throughput fairness among all users in the system and among all users managed by controller $c$, respectively, at time $t$. We also define the Average Global Fairness as $\bar{J}(U) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} J(U, t)$ and the Average Local Fairness of controller $c$ as $\bar{J}(U_c) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} J(U_c(t), t)$.
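The Jain Fairness Index and its time average can be computed directly from per-user throughput values. The sketch below is a minimal illustration with hypothetical function names; returning 1 when all users have zero throughput is an assumption of this sketch, since the index is undefined in that case.

```python
def jain_fairness(throughputs):
    """Jain Fairness Index: (sum T_u)^2 / (|U'| * sum T_u^2)."""
    values = list(throughputs)
    if not values:
        raise ValueError("Jain's index is undefined for an empty user set")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0.0:
        return 1.0  # assumption: all-zero throughput counts as equal shares
    return total * total / (len(values) * squares)


def average_fairness(throughputs_per_step):
    """Time average of the Jain index over a sequence of time steps."""
    steps = list(throughputs_per_step)
    return sum(jain_fairness(step) for step in steps) / len(steps)


equal = jain_fairness([5.0, 5.0, 5.0, 5.0])     # every user gets the same rate
starved = jain_fairness([10.0, 0.0, 0.0, 0.0])  # one user gets everything
mean = average_fairness([[5.0, 5.0, 5.0, 5.0], [10.0, 0.0, 0.0, 0.0]])
```

The index ranges from $1/|U'|$, when a single user receives all the throughput, to 1, when all users receive the same throughput; in the example, four users yield 0.25 and 1.0 for the two extreme cases.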
B. Problem Formulation
Let us define the fairness sensitivity coefficient $\beta \in [0,1]$ as a weighting coefficient that represents the importance of user throughput over fairness for the network policymaker. Let us assume the optimizer should exclude orchestrator placement solutions that induce a global user fairness below a minimum threshold $J_{min}$. Let us also assume that there exists a maximum number $m_{max}$ of orchestrators a network node can host due to physical limitations. We now formulate the orchestrator placement problem as a constrained optimization problem to maximize a utility function (Equation 3a), defined as a convex combination of cumulative user throughput and fairness over time, subject to system constraints (Equations 3b and 3c).
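$$\underset{\mathcal{O}(t),\; p_o(t) \in M,\; \forall o \in \mathcal{O}(t),\; \forall t \in \mathcal{T}}{\text{maximize}} \quad \sum_{t \in \mathcal{T}} \left[ \beta \sum_{u \in U} T_u(t) + (1 - \beta)\, J(U, t) \right] \tag{3a}$$
$$\text{subject to} \quad J(U, t) \geq J_{min}, \quad \forall t \in \mathcal{T} \tag{3b}$$
$$\sum_{o \in \mathcal{O}(t)} \mathbb{1}[p_o(t) = m] \leq m_{max}, \quad \forall m \in M,\; \forall t \in \mathcal{T} \tag{3c}$$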
where $\mathbb{1}[q]$ represents the indicator function that returns 1 if predicate $q$ is true. This "offline" problem requires complete knowledge of the network history to be solved, requires such knowledge to be collected at a centralized location, and requires the solver to explore an immense solution space as a combination of all possible orchestrator sets and related deployment locations, making the solution of such a problem impractical. Therefore, we convert such a problem into an online version that does not require complete network information but only historical and current state estimation to make optimal orchestrator deployment decisions with limited information. Simultaneously, we reformulate the presented problem into a decentralized version whose solution can be approximated through the collaboration among multiple agents, leveraging the computational capabilities offered by the set of network devices.
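To make the offline formulation concrete, the following sketch evaluates the utility of Equation 3a and checks the constraints of Equations 3b and 3c for a given candidate placement trajectory. It only evaluates a candidate and does not solve the optimization problem; the function names and the data layout (one dictionary of orchestrator-to-host assignments per time step) are hypothetical.

```python
from collections import Counter


def utility(throughput_sum_per_step, fairness_per_step, beta):
    """Objective (3a): sum over time of beta * sum_u T_u(t) + (1 - beta) * J(U, t)."""
    return sum(
        beta * throughput + (1.0 - beta) * fairness
        for throughput, fairness in zip(throughput_sum_per_step, fairness_per_step)
    )


def is_feasible(placements_per_step, fairness_per_step, j_min, m_max):
    """Check constraints (3b) and (3c) at every time step."""
    for placements, fairness in zip(placements_per_step, fairness_per_step):
        if fairness < j_min:
            return False  # (3b): global fairness below the minimum threshold
        hosted = Counter(placements.values())
        if any(count > m_max for count in hosted.values()):
            return False  # (3c): a host carries more than m_max orchestrators
    return True


# Hypothetical two-step trajectory with two orchestrators.
placements = [{"o1": "m1", "o2": "m2"}, {"o1": "m1", "o2": "m1"}]
fairness = [0.8, 0.7]
value = utility([10.0, 20.0], fairness, beta=0.5)
strict = is_feasible(placements, fairness, j_min=0.5, m_max=1)
relaxed = is_feasible(placements, fairness, j_min=0.5, m_max=2)
```

In the example, the second time step co-locates both orchestrators on host `m1`, so the trajectory is infeasible for `m_max=1` and feasible for `m_max=2`.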
IV. METHODOLOGY
The main objective of DERRIC-SO is to continuously adapt the network orchestrators' deployment based on the observed system state, so that their managed controllers' placement maximizes a tradeoff between user throughput and fairness. We assume each orchestrator executes a decentralized self-organization RL agent, which solves a sequential decision-making problem $(s_o, a_o, r_o)(t)$ by selecting a local action $a_o(t) \in \mathcal{A}_o(t)$ on the environment at each time step $t$, based on the observation $s_o(t) \in \mathcal{S}_o(t)$ of the system at time $t$, and an associated reward function $r_o(t) \in \mathcal{R}_o(t)$, defined as follows.
1) State: Each orchestrator node builds a local estimation $s_o(t)$ of the network state at time $t$ (Equation 4) by gathering environment observations such as the current orchestrators' deployment locations $\{p_o(t)\}_{o \in \mathcal{O}(t)}$, the end-to-end orchestrator-user latency matrix $L(r_{ou}, t) \in \mathbb{R}^{|\mathcal{O}(t)| \times |U_c(t)|}$, the number of users managed by all controllers in the orchestrator domain $\{U_c(t)\}_{c \in C_o(t)}$, and the number of managed controllers $C_o(t)$ at time $t$.
We assume self-organization agents can periodically exchange part of the state information $s_o(t)$ among themselves, such as their position $p_o(t) \in M$ and the number $C_o(t)$ of their currently managed controllers, to improve the true network state estimation.
$$s_o(t) = \left(\{p_o\}, L(r_{ou}), \{U_c\}, C_o\right)(t) \tag{4}$$
2) Action: Each orchestrator adopts a shared policy $\pi$ to optimize its placement through an action $a_o(t) \in \{0, \dots, 2|M|\} = \mathcal{A}_o(t)$ of one of four different types, grouped as (1) Relocation or Stay, (2) Termination, and (3) Duplication (Figure 2), depending on the local state estimation $s_o(t)$ and detailed hereafter. All agents use a shared policy for centralized coordination during training while enabling decentralized decision-making during execution, ensuring that the agents can effectively collaborate in a multi-agent environment. The following paragraphs outline the possible actions that each agent can take.
Relocation or Stay: When the orchestrator selects an action $a_o(t) \in \{1, \dots, |M|\}$, it migrates to the target host location $a_o(t) \in M$. If $a_o(t) = p_o(t)$, i.e., relocation to the current location, the orchestrator "stays" on its current host. Service interruption is minimized through hot migration, i.e., starting the orchestrator at the target host location before terminating the instance at the previous location. This action focuses on reducing end-to-end orchestrator-user latency and maximizing user throughput and fairness in the orchestrator domain.
Duplication: When the orchestrator selects an action $a_o(t) \in \{|M|+1, \dots, 2|M|\}$, it stays on the current host location and creates a new orchestrator instance on the target host location $a_o(t) - |M| \in M$. Once the new orchestrator instance is created, controllers are redistributed between the original and new orchestrators to balance their workload. This action improves user throughput and fairness by distributing controller management responsibilities through load balancing, leading to more responsive control plane operations and reducing resource block allocation delays.
Termination: When the orchestrator selects the action $a_o(t) = 0$, it terminates its operation and frees the resources. An orchestrator opts for termination when it detects that its workload is significantly below capacity, indicating inefficient resource utilization. Before terminating, it coordinates with neighboring orchestrators to ensure its controllers can be redistributed without overloading other domains or significantly increasing latencies.
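The integer action encoding described above can be decoded as in the following sketch. The function name, the returned labels, and the convention that action $k \in \{1, \dots, |M|\}$ refers to the $k$-th host of an ordered host list are illustrative assumptions.

```python
def decode_action(action, hosts, current_host):
    """Map an integer action in {0, ..., 2|M|} to a (kind, target_host) pair.

    Assumption: `hosts` is an ordered list, so action k refers to hosts[k - 1].
    """
    m = len(hosts)
    if not 0 <= action <= 2 * m:
        raise ValueError(f"action must be in [0, {2 * m}]")
    if action == 0:
        return ("terminate", None)
    if action <= m:
        target = hosts[action - 1]  # actions 1..|M|: relocate, or stay if unchanged
        return ("stay", target) if target == current_host else ("relocate", target)
    return ("duplicate", hosts[action - m - 1])  # actions |M|+1..2|M|


# Hypothetical example with three hosts and an orchestrator currently on "m2".
hosts = ["m1", "m2", "m3"]
decoded = [decode_action(a, hosts, current_host="m2") for a in range(7)]
```

With three hosts the action space has seven elements: one termination action, three relocation targets (one of which coincides with the current host and therefore means "stay"), and three duplication targets.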
3) Reward: The reward function guides each orchestrator's decision-making process by evaluating the effectiveness of its chosen action. We define the reward function $r_o(t) \in \mathcal{R}_o(t) = \mathbb{R}^+$ of the self-organization agent on orchestrator $o \in \mathcal{O}(t)$ as a convex combination of the user throughput and Local Fairness among the users managed by all its controllers $c \in C_o(t)$ at time $t$ (Equation 5).
$$r_o(t) = \frac{1}{|C_o(t)|} \sum_{c \in C_o(t)} \left[ \beta \sum_{u \in U_c(t)} T_u(t) + (1 - \beta)\, J(U_c(t), t) \right] \tag{5}$$
This reward structure encourages the orchestrator agents to improve fair user throughput in the network system.
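A minimal sketch of the reward of Equation 5 is given below. It averages, over the controllers of one orchestrator, the weighted sum of the users' total throughput and their Local Fairness. The function names and the dictionary layout (controller identifier mapped to the list of its users' throughputs) are hypothetical, and treating a controller with no users or only zero throughput as perfectly fair is an assumption of this sketch.

```python
def jain(values):
    """Jain index of a list of throughputs (1.0 if empty or all zero: assumption)."""
    squares = sum(v * v for v in values)
    if squares == 0.0:
        return 1.0
    total = sum(values)
    return total * total / (len(values) * squares)


def orchestrator_reward(throughputs_by_controller, beta):
    """Equation 5: mean over controllers of beta * sum_u T_u + (1 - beta) * J(U_c)."""
    if not throughputs_by_controller:
        raise ValueError("the orchestrator manages no controllers")
    total = 0.0
    for user_throughputs in throughputs_by_controller.values():
        total += beta * sum(user_throughputs) + (1.0 - beta) * jain(user_throughputs)
    return total / len(throughputs_by_controller)


# Two controllers with the same total throughput but different fairness.
reward = orchestrator_reward({"c1": [2.0, 2.0], "c2": [4.0, 0.0]}, beta=0.5)
```

In the example both controllers deliver the same total throughput, but controller `c2` serves only one of its two users, so its contribution (2.25) is lower than that of `c1` (2.5).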
4) RL Algorithm: We employ the Proximal Policy Optimization (PPO) algorithm under the Centralized Training, Decentralized Execution (CTDE) paradigm, using a single shared policy among all orchestrator agents to facilitate coordinated yet independent decision-making. All agents interact with the environment and collect experiences. The experience of each orchestrator agent $o$ at time $t$ can be denoted as $X_o(t) = \left(s_o(t), a_o(t), r_o(t), s_o(t+1), \hat{A}_o(t), \omega_o(t)\right)$, where $\hat{A}_o(t)$ is the advantage estimate of $o$ at $t$, and $\omega_o(t)$ is the probability ratio of agent $o$ at $t$. Each agent computes the advantage estimate $\hat{A}_o(t) = \sum_{j=0}^{\infty} (\gamma\lambda)^j \delta_{t+j}$ using the Generalized Advantage Estimator (GAE), and during centralized training, all agents' experiences are aggregated to compute the total loss. The step index $j$ represents how far into the future the advantage estimator looks from the current time step $t$. Here, $\delta_t = r_o(t) + \gamma F(s_o(t+1)) - F(s_o(t))$ is the Temporal Difference (TD) residual, where $F(s_o(t))$ denotes the value function that estimates the expected return from state $s_o(t)$. The parameters $\gamma, \lambda \in [0,1]$ are the discount rate and the GAE parameter, respectively. The term $\omega_o(t) = \frac{\pi(a_o(t) \mid s_o(t))}{\pi_{old}(a_o(t) \mid s_o(t))}$ represents the probability ratio between the current policy $\pi$ and the previous policy $\pi_{old}$. The policy update uses the aggregated advantage estimates from all agents' experiences. For the shared policy $\pi$, the clipped objective of the policy loss is calculated as $L^{\pi} = \mathbb{E}_t\left[\sum_{o \in \mathcal{O}(t)} \min\left\{\omega_o(t)\hat{A}_o(t),\ \mathrm{clip}(\omega_o(t), 1-\varsigma, 1+\varsigma)\hat{A}_o(t)\right\}\right]$, where $\varsigma > 0$ is the clipping parameter. The loss is computed for each agent, but since the policy is shared, the gradients from all agents are aggregated to compute the final gradient update.

Fig. 2. An example of a self-organization approach in which a single orchestrator agent can take one of four possible actions.
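The advantage estimation and the clipped objective can be illustrated with the following sketch, written in plain Python rather than with a deep learning framework. It makes several assumptions that are not stated in the text: the infinite GAE sum is truncated at the end of a finite trajectory, the surrogate is averaged over the collected samples instead of summed over agents, and the values of the GAE parameter and of the clipping parameter are placeholders. All function names are hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.

    `values` holds F(s(0)), ..., F(s(T)), i.e. one more entry than `rewards`,
    where the last entry bootstraps the value of the final state.
    Assumption: the infinite sum is truncated at the end of the trajectory.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running  # A(t) = delta_t + gamma*lam*A(t+1)
        advantages[t] = running
    return advantages


def clipped_surrogate(ratios, advantages, clip=0.2):
    """PPO clipped objective, averaged over samples pooled from all agents."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Toy trajectory with a zero value function, so each TD residual equals the reward.
advantages = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0], clip=0.2)
```

The backward recursion used in `gae_advantages` is equivalent to the discounted sum of TD residuals and avoids recomputing the sum at every time step. In `clipped_surrogate`, a ratio of 1.5 with a positive advantage is clipped to 1.2, and a ratio of 0.5 with a negative advantage is clipped to 0.8, so large policy changes are not rewarded.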
Algorithm 1 describes how each orchestrator agent organizes itself using the RL algorithm. In the first section (lines 1 to 2), the state of the orchestrator agent $o$ is initialized by random $\mathcal{O}(t)$ orchestrator nodes which are placed on $M$ number of orchestrator hosts. The second section (lines 3 to 14) describes how the orchestrator gathers the state information from the previous time step to optimize its number and placement. It also covers reward evaluation and policy updates using the GAE [24] method. The third section (lines 15 to 22) outlines the determination of state parameters such as orchestrator deployment location, orchestrator-user latency, user-count of controllers managed by the orchestrator node, and the number of managed controllers by the orchestrator, which are measured at each time step. Lastly, the fourth section (lines 23 to 26) details the calculation of the orchestrator's reward function.
V. EXPERIMENTAL EVALUATION
A. Experiment Setup
We perform simulations to verify the performance of our method in terms of user throughput, Global Fairness, and Local Fairness. We implemented our method and two baselines, namely a Hierarchical Orchestrator and a Single Orchestrator, in a simulated NetworkX Python environment. In the hierarchical environment, a master orchestrator agent deploys local orchestrators using a single-agent RL system, and the local orchestrators use the MARL system to deploy controllers. In the Single Orchestrator environment, a single agent manages the entire network, including the deployment of controllers, using a single-agent RL system. Table II gives more details of the simulation parameters. We considered the Gigabit European Advanced Network (GEANT) [25] topology with 34 infrastructure nodes that can host orchestrators and controllers, and 20 BSs. UEs are randomly placed within a normalized unit square scenario, $[0,1]^2$, and users move according to a Truncated Levy Walk model [26]. A fixed number of BSs are randomly placed on the hosts in $V$ with the same bandwidth, coverage area, and noise. Figure 3 shows the simulation environments of DERRIC-SO and the state-of-the-art baselines. In these setups, the UEs and BSs are kept fixed across the environments to enable a consistent performance comparison across the baselines.
B. Result Performance
Figure 4 shows the average user throughput for an increasing number of users. DERRIC-SO consistently outperforms both the Hierarchical and Single Orchestrator baselines, by up to 35% and 77%, respectively. This outcome arises from the decentralized orchestrator domains, in which orchestrators and controllers are placed to distribute the workload on the nodes and efficiently manage UEs while accounting for UE mobility and density patterns. Moreover, the MARL system accelerates decision-making processes over time by leveraging a set of agents equal to the number of orchestrators. These agents operate in parallel, each independently taking actions at every time step based on local network conditions. In contrast, a single orchestrator with a single agent is limited to sequential decision-making, substantially reducing the responsiveness and effectiveness of the placement optimization process in complex network environments.

Fig. 3. Example of simulation environments of DERRIC-SO, Hierarchical Orchestrator and Single Orchestrator with the same number of randomly placed UEs and BSs

Fig. 4. Impact of the number of users $|U|$ on the Average User Throughput $\bar{T}$

Fig. 5. Impact of the number of users $|U|$ on the Average Global Fairness $\bar{J}(U)$. D-SO denotes DERRIC-SO, HO denotes Hierarchical Orchestrator, and SO denotes Single Orchestrator.

Fig. 6. Impact of the number of users $|U|$ on the Average Local Fairness $\bar{J}(U_c)$
Figure 5 presents the Global Fairness achieved by our proposed learning algorithm, as well as by the Hierarchical Orchestrator (HO) and Single Orchestrator (SO) baselines. A higher value of Jain's index indicates a more balanced throughput allocation. The figure reports the fairness for an increasing number of users at three bandwidth allocation levels, 200, 400, and 800 MHz, dedicated to the BSs. Even at the lowest bandwidth, $B = 200$ MHz, DERRIC-SO demonstrates superior performance, with fairness values between 0.50 and 0.35, while the Hierarchical Orchestrator achieves values between 0.38 and 0.27 and the Single Orchestrator only between 0.21 and 0.07. These numerical differences remain consistent across all user scales, with DERRIC-SO maintaining higher absolute fairness values, demonstrating better scalability and resource allocation effectiveness compared to the state-of-the-art approaches.
Figure 6 illustrates the average Local Fairness among users connected to BSs in each controller domain. Although a slight decline in fairness is observed as the number of users increases, our method maintains a relatively stable performance. In contrast, both baseline methods exhibit substantial degradation in fairness metrics as the user count escalates. This enhanced fairness can be primarily attributed to optimizing orchestrator domain boundaries and the strategic distribution of users across controllers, which collectively regulate user-centric performance metrics such as throughput.

Algorithm 1: DERRIC-SO Operation
    // All orchestrators execute this process in parallel
    Data: Orchestrator set O, orchestrator hosts M, Controller set C_o, User set U_c, discount rate γ, learning rate η
    // Reward and Policy Initialization
 1  R ← 0, π ← InitializePolicy()
    // Initialize first orchestrator location uniformly at random
 2  O(t) ← {o}, p_o(1) ← sample from U(M)
    // each episode
 3  for ϵ ∈ E do
        // each time step
 4      for t ∈ T do
            // Update state of orchestrator o
 5          s_o(t) ← GetOrchestratorState(t, {p_o}, L(r_ou), C_o, {U_c})
            // Select action according to policy π
 6          a_o(t) ← sample from π(a_o(t) | s_o(t))
            // Optimize location according to action
 7          O(t) ← SelfOrganization(a_o(t))
            // Collect reward
 8          r_o(t) ← GetOrchestratorReward(t, U_c)
            // Update returns
 9          R ← r_o(t) + γR
            // Compute TD residual
10          δ_t ← r_o(t) + γF(s_o(t+1)) − F(s_o(t))
            // Compute advantage estimates using GAE
11          Â_o(t) = Σ_{j=0}^{∞} (γλ)^j δ_{t+j}
            // Compute probability ratio between current and previous policy
12          ω_o(t) = π(a_o(t) | s_o(t)) / π_old(a_o(t) | s_o(t))
            // Update policy with PPO clipped objective
13          L^π ← min(ω_o(t) Â_o(t), clip(ω_o(t), 1−ς, 1+ς) Â_o(t))
            // Update policy using gradient ascent with learning rate η
14          π ← π + η ∇_π L^π
15  Function GetOrchestratorState(t, {p_o}, L(r_ou), C_o, {U_c}):
        // Collect orchestrator-user latency in the orchestrator domain
16      L(r_ou, t) ← MeasureUserLatency(U_c)
        // Collect user-count of controllers in the orchestrator domain
17      U_c(t) ← MeasureUserCount(C_o)
        // Collect deployment location of orchestrator
18      p_o(t) ← MeasureOrchestratorLocation(O)
        // Collect the number of controllers of each orchestrator
19      C_o(t) ← MeasureOrchestratorLoad(O)
        // Broadcast current position and controller-count to all other orchestrators
20      for x ∈ O(t) in parallel do
21          s_x(t) ← transmit (p_o(t), C_o(t))
22      return (s_x, L(r_ou), {U_c})
23  Function GetOrchestratorReward(t, U_c):
24      T_u(t) ← MeasureUserThroughput(U_c)
25      J(U_c, t) ← MeasureIndexFairness(U_c)
26      return (1 / |C_o(t)|) Σ_{c ∈ C_o(t)} [β Σ_{u ∈ U_c(t)} T_u(t) + (1 − β) J(U_c(t), t)]

TABLE II
EXPERIMENT PARAMETERS
Parameter                                        Value
Number of infrastructure nodes V                 34
Number of orchestrator hosts M and BS K          6, 20
Number of users |U|                              {8, 16, ..., 512, 1024}
Number of episodes |E| and time steps |T|        250, 15000
Bandwidth of BS B and noise N                    400 MHz, 7 dB
Learning rate η, Discount rate γ                 0.0001, 0.9
Batch size                                       64
Figure 7 illustrates the progression of the training reward of the orchestrator agents in terms of user performance for DERRIC-SO compared to the two baseline approaches. The results demonstrate that DERRIC-SO achieves up to 18% and 102% higher rewards and faster convergence compared to the state-of-the-art Hierarchical and Single Orchestrator baselines, respectively. This superior performance stems from the MARL system architecture, where multiple agents, equal in number to the orchestrator nodes, simultaneously observe the environment and optimize decisions in parallel for more efficient network management. In contrast, the Hierarchical Orchestrator relies on a sequential process in which the upper-level orchestrator, functioning as a single-agent RL system, must first observe the environment and communicate with the lower-level orchestrators to optimize their deployment, after which the lower-level orchestrators employ an MARL approach to deploy controllers and manage the network. This additional communication overhead results in slower convergence and lower cumulative rewards for the orchestration layer. Similarly, the Single Orchestrator model employs just one agent to manage the entire network and determine the controller deployment, requiring more computational resources to learn an optimal policy for system orchestration, resulting in poorer performance and slower convergence.
Figure 8 shows the mean policy loss over training episodes for the three orchestration models, evaluated using the PPO algorithm with the same hyperparameters. DERRIC-SO consistently achieves 55% and 89% lower policy loss than the Hierarchical and Single Orchestrator models, respectively, indicating more stable and effective learning dynamics. In contrast, the Single Orchestrator shows higher and less stable policy loss, reflecting limitations in scalability and adaptability.
VI. CONCLUSION
This paper addresses the problem of orchestrator organization, where each orchestrator agent autonomously makes decisions, such as relocation or remaining stationary, duplication, and termination, using an MARL algorithm. The improvements over the baselines stem from DERRIC-SO's fully decentralized architecture, which eliminates the need for centralized communication at higher management layers. As a result, it enables faster decision-making and mitigates the risk of a single point of failure inherent in centralized approaches. Moreover, DERRIC-SO achieves a significantly higher and more equitable throughput distribution among users, even under varying user densities and mobility patterns. This performance aligns closely with the scalability, adaptability, and fairness requirements envisioned for next-generation 6G networks.

Fig. 7. Orchestrator training reward $r_o(t)$ over number of episodes containing 100 UEs and a set of controllers and orchestrators deployed over the GEANT network topology

Fig. 8. Policy loss over training episodes using PPO, with 95% confidence band around the average
ACKNOWLEDGMENT
This work was funded by the SNS-JU 6G Cloud project under the European Union's Horizon Europe Research and Innovation Programme under Grant Agreement No. 101139073.
REFERENCES
[1] M. Polese, L. Bonati, S. D'Oro, S. Basagni, and T. Melodia, "Understanding O-RAN: Architecture, Interfaces, Algorithms, Security, and Research Challenges," IEEE Communications Surveys & Tutorials, vol. 25, no. 2, pp. 1376–1411, 2023.
[2] X. Lin, L. Kundu, C. Dick, and S. Velayutham, "Embracing AI in 5G-Advanced toward 6G: A joint 3GPP and O-RAN Perspective," IEEE Communications Standards Magazine, vol. 7, no. 4, pp. 76–83, 2023.
[3] S. Niknam, A. Roy, H. S. Dhillon, S. Singh, R. Banerji, J. H. Reed, N. Saxena, and S. Yoon, "Intelligent O-RAN for beyond 5G and 6G Wireless Networks," in 2022 IEEE Globecom Workshops (GC Wkshps). IEEE, 2022, pp. 215–220.
[4] J. A. Ayala-Romero, A. Garcia-Saavedra, X. Costa-Perez, and G. Iosifidis, "EdgeBOL: A Bayesian Learning Approach for the Joint Orchestration of vRANs and Mobile Edge AI," IEEE/ACM Transactions on Networking, vol. 31, no. 6, pp. 2978–2993, 2023.
[5] G. M. Almeida, G. Z. Bruno, A. Huff, M. Hiltunen, E. P. Duarte, C. B. Both, and K. V. Cardoso, "RIC-O: Efficient Placement of a Disaggregated and Distributed RAN Intelligent Controller with Dynamic Clustering of Radio Nodes," IEEE Journal on Selected Areas in Communications, 2023.
[6] E. H. Bouzidi, A. Outtagarts, R. Langar, and R. Boutaba, "Dynamic clustering of software defined network switches and controller placement using deep reinforcement learning," Computer Networks, vol. 207, p. 108852, 2022.
[7] G. M. Almeida, G. Z. Bruno, A. Huff, M. Hiltunen, E. P. Duarte, C. B. Both, and K. V. Cardoso, "RIC-O: Efficient Placement of a Disaggregated and Distributed RAN Intelligent Controller With Dynamic Clustering of Radio Nodes," IEEE Journal on Selected Areas in Communications, vol. 42, no. 2, pp. 446–459, 2024.
[8] A. Narwaria, K. Soni, and A. P. Mazumdar, "A Position and Energy Aware Multi-Objective Controller Placement and Re-placement Scheme in Distributed SDWSN," The Journal of Supercomputing, pp. 1–29, 2024.
[9] C. N´
u˜
nez-G´
omez, C. Ca i´
on, B. Camine o, and F. M. Delicado, “S-
HIDRA: A Blockchain and SDN domain-based A chi ec u e o O ches-
a e og Compu ing En i onmen s,” Compu e Ne wo ks, ol. 221, p.
109512, 2023.
[10] B. Li, X. Deng, and Y. Deng, “Mobile-edge Compu ing-based Delay
Minimiza ion Con olle Placemen in SDN-IoV,” Compu e Ne wo ks,
ol. 193, p. 108049, 2021.
[11] J. Ba anda and J. Mangues-Ba alluy, “End- o-End Ne wo k Se ice
O ches a ion in He e ogeneous Domains o Nex -Gene a ion Mobile
Ne wo ks,” in NOMS 2022-2022 IEEE/IFIP Ne wo k Ope a ions and
Managemen Symposium. IEEE, 2022, pp. 1–6.
[12] A. Chaoub, A. M¨
ammel¨
a, P. Ma inez-Julia, R. Chapa adza, M. Elko ob,
L. Ong, D. K ishnaswamy, A. An onen, and A. Du a, “Hyb id Sel -
O ganizing Ne wo ks: E olu ion, S anda diza ion T ends, and a 6G
a chi ec u e ision,” IEEE Communica ions S anda ds Magazine, ol. 7,
no. 1, pp. 14–22, 2023.
[13] C. Valen e, P. Valen e, P. Ri o, D. Raposo, M. Lu´
Is, and S. Sa gen o,
“5G RAN and Co e O ches a ion wi h ML-D i en QoS P o iling,” in
IEEE INFOCOM 2024-IEEE Con e ence on Compu e Communica ions
Wo kshops (INFOCOM WKSHPS). IEEE, 2024, pp. 1–6.
[14] G. Z. B uno, V. K. Radhak ishnan, G. M. Almeida, A. Hu , A. P.
da Sil a, K. V. Ca doso, L. A. DaSil a, and C. B. Bo h, “RIC-O:
An O ches a o o he Dynamic Placemen o a Disagg ega ed RAN
In elligen Con olle ,” in IEEE INFOCOM 2023-IEEE Con e ence on
Compu e Communica ions Wo kshops (INFOCOM WKSHPS). IEEE,
2023, pp. 1–2.
[15] S. D’O o, L. Bona i, M. Polese, and T. Melodia, “O ches RAN: Ne wo k
Au oma ion h ough O ches a ed In elligence in he Open RAN,” in
IEEE INFOCOM 2022-IEEE Con e ence on Compu e Communica ions.
IEEE, 2022, pp. 270–279.
[16] S. Kukli´
nski, R. Kołakowski, L. Tomaszewski, L. Sanab ia-Russo,
C. Ve ikoukis, C.-T. Phan, L. Zanzi, F. De o i, A. Ksen ini, C. Tselios
e al., “Monb5g: Ai/ml-capable Dis ibu ed O ches a ion and Man-
agemen F amewo k o Ne wo k Slices,” in 2021 IEEE In e na ional
Medi e anean Con e ence on Communica ions and Ne wo king (Med-
i Com). IEEE, 2021, pp. 29–34.
[17] M. A. Habib, H. Zhou, P. E. I u ia-Ri e a, M. Elsayed, M. Ba and,
R. Gaigalas, Y. Ozcan, and M. E ol-Kan a ci, “In en -d i en In elligen
Con ol and O ches a ion in O-RAN ia Hie a chical Rein o cemen
Lea ning,” in 2023 IEEE 20 h In e na ional Con e ence on Mobile Ad
Hoc and Sma Sys ems (MASS). IEEE, 2023, pp. 55–61.
[18] J. F. San os, W. Liu, X. Jiao, N. V. Ne o, S. Pollin, J. M. Ma quez-Ba ja,
I. Moe man, and L. A. DaSil a, “B eaking down ne wo k slicing: Hie -
a chical o ches a ion o end- o-end ne wo ks,” IEEE Communica ions
Magazine, ol. 58, no. 10, pp. 16–22, 2020.
[19] J. Mou a, “Decen alized Con ol O ches a ion o Dynamic Edge
P og ammable Sys ems,” in 2023 3 d In e na ional Con e ence on
Elec ical, Compu e , Communica ions and Mecha onics Enginee ing
(ICECCME). IEEE, 2023, pp. 1–6.
[20] X. Lyu, C. Ren, W. Ni, H. Tian, R. P. Liu, and Y. J. Guo, “Mul i-
Timescale Decen alized Online O ches a ion o So wa e-De ined Ne -
wo ks,” IEEE Jou nal on Selec ed A eas in Communica ions, ol. 36,
no. 12, pp. 2716–2730, 2018.
[21] W. Qi, Y. Xia, T. Ma, L. Zhu, and J. Zhu, “ELBCFVD: An E icien Low-
Ene gy Balanced Clus e ing Algo i hm Based on Fas Vo onoi Di ision
o Mobile Senso Ne wo ks,” IEEE Senso s Jou nal, 2024.
[22] M. E. Ekpenyong and P. J. Udoh, “Modeling he E ec o Bandwid h
Alloca ion on Ne wo k Pe o mance,” Science Wo ld Jou nal, ol. 9,
no. 4, pp. 12–22, 2014.
[23] M. Gao, S. Raman, Z. Sipus, and A. K. Sk i e ik, “Analy ic App ox-
ima ion o F ee-Space Pa h Loss o Implan ed An ennas,” IEEE Open
Jou nal o An ennas and P opaga ion, 2024.
[24] J. Schulman, P. Mo i z, S. Le ine, M. Jo dan, and P. Abbeel, “High-
dimensional Con inuous Con ol using Gene alized Ad an age Es ima-
ion,” a Xi p ep in a Xi :1506.02438, 2015.
[25] J.-I. Cas illo-Velazquez, I. Mu˜
noz-Ma ´
ınez, J.-A. D´
ıaz-Ram´
ı ez, and
E. F. O do˜
nez-Mo ales, “Managemen Emula ion o GEANT Ad anced
Ne wo k: 2020 Topology unde IP 6,” in 2020 IEEE ANDESCON.
IEEE, 2020, pp. 1–6.
[26] L. Cao and M. G abchak, “Smoo hly unca ed le y walks: Towa d a
ealis ic mobili y model,” in 2014 IEEE 33 d In e na ional Pe o mance
Compu ing and Communica ions Con e ence (IPCCC). IEEE, 2014, pp.
1–8.