
Efficient and Robust Semantic Image Communication via Stable Cascade

Author: Khalid, Rana Ahmad Bilal
Publisher: Zenodo
DOI: 10.5281/zenodo.17281324
Source: https://zenodo.org/records/17281324/files/ICML2025-ML4Wireless.pdf
Bilal Khalid¹, Pedro Freire¹, Sergei K. Turitsyn¹, Jaroslaw E. Prilepsky¹

Abstract
Diffusion Model (DM) based Semantic Image Communication (SIC) systems face significant challenges, such as slow inference speed and generation randomness, that limit their reliability and practicality. To overcome these issues, we propose a novel SIC framework inspired by Stable Cascade, where extremely compact latent image embeddings are used as conditioning to the diffusion process. Our approach drastically reduces the data transmission overhead, compressing the transmitted embedding to just 0.29% of the original image size. It outperforms three benchmark approaches, namely the diffusion SIC model conditioned on segmentation maps (GESCO), the recent Stable Diffusion (SD)-based SIC framework (Img2Img-SC), and the conventional JPEG2000 + LDPC coding, by achieving superior reconstruction quality under noisy channel conditions, as validated across multiple metrics. Notably, it also delivers significant computational efficiency, enabling over 3× faster reconstruction for 512×512 images and more than 16× faster for 1024×1024 images as compared to the approach adopted in Img2Img-SC.

1. Introduction
Semantic communication (SemCom) is a transformative approach that focuses on effectively conveying the meaning of information rather than transmitting raw bit data (Strinati & Barbarossa, 2021). The goal is to communicate the essential information the receiver needs to complete its task successfully. This also makes it bandwidth efficient as significantly less data has to be transmitted across the communication channel (Luo et al., 2022; Qin et al., 2021).

¹ Aston Institute of Photonic Technologies, Aston University, Birmingham, UK. Correspondence to: Bilal Khalid <[email protected]>.

Proceedings of the 42nd International Conference on Machine Learning, Workshop on Machine Learning for Wireless Communication and Networks, Vancouver, Canada, 2025. Copyright 2025 by the author(s).

Figure 1. 1024×1024 image reconstructions using our model under different channel SNR conditions. Even at an SNR of 1 dB, images are faithfully reconstructed and perceptually very similar to the transmitted images.

The advancement of Deep Learning (DL) and generative AI has enabled the emergence of SemCom as a viable alternative to traditional communication. DL and generative AI models are used for extracting the relevant semantic information at the transmitter end as well as for deciphering the meaning behind this information at the receiver end. Deep learning-based Joint Source-Channel Coding (DeepJSCC) (Bourtsoulatze et al., 2019) was one of the first approaches to incorporate DL in wireless system design. Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models (DMs) and Flow-based Generative Models (FGMs) are the major generative AI techniques now commonly used in SemCom systems (Xia et al., 2025). Out of these, DMs have shown great potential at Semantic Image Communication (SIC) tasks because of their exceptional ability to synthesize high-quality images (Dhariwal & Nichol, 2021). However, one drawback of DMs is that they are inherently slower at inference because of their iterative nature. The introduction of Latent Diffusion Models (LDMs) (Rombach et al., 2022) has alleviated this problem by performing the diffusion process in a compressed latent space instead of the original pixel space, enabling fast and high-resolution image generation via diffusion.

Several DM-based SIC systems have been implemented in recent years. In (Grassucci et al., 2023), segmentation maps are used to guide the diffusion process. In (Yilmaz et al., 2024), the primary image structure is transmitted using the DeepJSCC technique, whereas fine details are generated using the diffusion model. (Jiang et al., 2024) also use a diffusion model to refine the reconstruction obtained after image decoding. However, inference using these approaches is time-consuming. Recently, LDMs have been used for SIC to speed up the inference process. In (Nam et al., 2024; Cicchetti et al., 2024), text conditioning is used to guide the generative process of Stable Diffusion's text-to-image model (Rombach et al., 2022). In (Cicchetti et al., 2024), the generation process starts from a noisy version of image embedding instead of pure noise. Although efficient in terms of bandwidth, these models struggle to faithfully reconstruct the intended image and suffer from generation randomness. (Chen & Yang, 2024) denoise a noisy image embedding using an LDM, and the clean embedding is then used to reconstruct the image using a semantic decoder. Instead of predicting the noise in the image, (Yang et al., 2025) use a diffusion model to predict the source image in a few denoising steps directly. Both of these models reduce inference time but operate at a lower compression factor as compared to our proposed method.

Figure 2. Our system model. At the transmitter side, a compact image embedding $Z$ of size [16, 24, 24] is extracted from an image $X$ of size [3, 1024, 1024]. $Z$ is transmitted across the physical channel. The receiver uses the noisy embedding $\hat{Z}$ as conditioning for the LDM. Finally, the VQGAN decoder is used to project the image back into pixel space.

In this paper, we propose a novel SIC model inspired by Stable Cascade (SC) (Pernias et al., 2023), a multistage text-to-image LDM that operates in a much smaller latent space than Stable Diffusion (SD). Our approach achieves the trifecta of high compression efficiency, fast inference, and perceptually aligned image reconstruction, which is missing in existing DM-based SIC systems. In our method, a highly compressed image embedding is extracted using a semantic encoder and transmitted across the physical channel. The noisy embedding is then given as a conditioning signal to the LDM of SC that projects it into a higher dimensional latent space where the semantic decoder operates. Results indicate that we outperform benchmark models and, as shown in Figure 1, generate consistent reconstructions even under extremely poor channel Signal-to-Noise Ratio (SNR) conditions.

2. Proposed Framework
In this section, the proposed system model is explained. The model is built upon the architecture of SC that has three stages, i.e., stages A, B and C. As discussed below, our model is based on stage A and a finetuned stage B that is trained to work with noisy conditioning.

Stage A is a Vector Quantized Generative Adversarial Network (VQGAN) (Esser et al., 2021) with parameters $\Theta$ that compresses the image space by a factor of 4. The relationship between an input image $X \in \mathbb{R}^{3 \times 1024 \times 1024}$ and the output of the VQGAN encoder $X_{VG}$ is given as:

$$X_{VG} = f_\Theta(X). \tag{1}$$

If $f_\Theta^{-1}$ represents the VQGAN decoder, the image can be reconstructed from the compressed latent space using

$$f_\Theta^{-1}(X_{VG}) \approx X. \tag{2}$$

Stage B is an LDM that learns to generate the latent space $X_{VG}$ given a highly compressed latent representation $Z$ of $X$. This compact embedding is obtained via the EfficientNet-V2 encoder (Tan & Le, 2019). During the forward process in training, the latents $X_{VG}$ are noised according to the following relation:

$$X_{VG,t} = \sqrt{\bar{\alpha}_t} \cdot X_{VG} + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon. \tag{3}$$

Here, $\bar{\alpha}_t$ specifies the noise schedule whereas $\epsilon$ is the noise sampled from a standard normal distribution $\mathcal{N}(0, 1)$. At any time-step $t$, with noised latents $X_{VG,t}$ and noisy conditional embedding $\hat{Z}$, the LDM is trained to predict the noise $\bar{\epsilon}(X_{VG,t}, t, \hat{Z})$. The training objective is to minimize the loss function $L$, defined as the Mean-Squared Error (MSE) between the predicted and actual noise:

$$L = \mathbb{E}_{(X_{VG},\, t,\, \hat{Z},\, \epsilon)} \left[ \left\| \epsilon - \bar{\epsilon}(X_{VG,t}, t, \hat{Z}) \right\|_2^2 \right]. \tag{4}$$

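
As a rough illustration of Equations (3) and (4), the following NumPy sketch noises a toy latent batch and evaluates the noise-prediction loss. It is not the authors' training code: the function names, the toy latent shape, and the fixed value of the noise-schedule coefficient are assumptions made for illustration, and the noise predictor is replaced by a placeholder array rather than an LDM.

```python
import numpy as np


def forward_noise(x_vg, alpha_bar_t, rng):
    """Noise clean latents as in Eq. (3): sqrt(a) * x + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x_vg.shape)  # eps ~ N(0, 1)
    x_t = np.sqrt(alpha_bar_t) * x_vg + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Squared L2 norm of the noise error per sample, averaged over the batch (Eq. 4)."""
    diff = (eps_true - eps_pred).reshape(eps_true.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


rng = np.random.default_rng(0)
x_vg = rng.standard_normal((2, 4, 8, 8))  # toy latent batch; the shape is illustrative only
x_t, eps = forward_noise(x_vg, alpha_bar_t=0.7, rng=rng)

# Placeholder "prediction": a real system would call the conditioned LDM here.
eps_pred = np.zeros_like(eps)
loss = noise_prediction_loss(eps, eps_pred)
```

With `alpha_bar_t` close to 1 the noised latents stay close to the clean ones, and with `alpha_bar_t` close to 0 they are almost pure noise, which is the behaviour the noise schedule controls.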
Text embedding is also used as conditioning for Stage B in the original SC paper (Pernias et al., 2023). However, as noted in the paper itself, it has no significant impact on the reconstruction quality of stage B as the conditioning provided by the image embedding is much stronger. Thus, we do not consider text conditioning in our model. The fine-tuning of Stage B conditioned on $\hat{Z}$ makes it robust to channel impairments. Moreover, we do not consider stage C either as it is primarily responsible for text-to-image generation.

3. System Model
Figure 2 shows the three phases of our system model, i.e., semantic information extraction at the transmitter, noisy channel transmission, and image reconstruction at the receiver.

3.1. Semantic Feature Extraction
As in (Pernias et al., 2023), we utilize the pretrained EfficientNet-V2 image encoder to extract a compact image embedding. An input RGB image $X \in \mathbb{R}^{N \times H \times W}$ is encoded into a compressed embedding $Z = E(X)$.¹ Despite its compact size, this embedding contains well-generalized feature representations that provide stronger guidance to the diffusion model as compared to text embeddings. As a result, the reconstructed image is very similar to the original one, with differences in fine details only. Although image generation based solely on text conditioning is highly efficient in terms of bandwidth, it may result in reconstructions that are semantically quite different from the source image (Nam et al., 2024). Furthermore, as compared to segmentation map-based conditioning (Grassucci et al., 2023), image embeddings offer better reconstruction fidelity. Although segmentation maps retain spatial structure, they often lose crucial details such as texture, color, and fine-grained features. Additionally, because they provide only class-level information, the same segmentation map can yield multiple plausible reconstructions, introducing variability. To achieve reliable, predictable, and efficient SIC, we propose using rich image embedding as a more effective conditioning signal, ensuring reduced generation randomness and high-fidelity reconstruction of transmitted images.

¹ The dimensionalities of $X$: $N$ is the number of channels, i.e., 3 for RGB, and $H$ and $W$ stand for the height and width pixel resolution, respectively.

3.2. Communication Channel
To maintain conformity with most previous works (Grassucci et al., 2023; Yilmaz et al., 2024; Chen & Yang, 2024; Yang et al., 2025), we consider the widely adopted additive white Gaussian noise (AWGN) channel in our simulations. The extracted image embedding $Z$ is transmitted across the AWGN channel where the noise $\epsilon$ is sampled from a zero-mean normal distribution $\mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. If $P$ denotes the received signal power, the channel conditions are characterized by the Signal-to-Noise Ratio (SNR):

$$\mathrm{SNR} = 10 \log_{10} \frac{P}{\sigma^2} \ \text{(dB)}. \tag{5}$$

Depending upon the SNR level, noise is added to $Z$ and the distorted embedding $\hat{Z}$ is obtained as

$$\hat{Z} = Z + \epsilon. \tag{6}$$

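
Equations (5) and (6) can be sketched in a few lines of NumPy. The snippet below is a minimal illustration under stated assumptions, not the simulation code used in the paper: the function name `awgn` is hypothetical, the signal power $P$ is estimated as the mean square of the embedding itself, and the embedding is a random stand-in of size [16, 24, 24].

```python
import numpy as np


def awgn(z, snr_db, rng):
    """Return z + eps, with eps ~ N(0, sigma^2) scaled to meet the target SNR in dB."""
    signal_power = np.mean(z ** 2)                       # P, estimated from z (assumption)
    sigma2 = signal_power / (10.0 ** (snr_db / 10.0))    # Eq. (5) solved for sigma^2
    eps = rng.standard_normal(z.shape) * np.sqrt(sigma2)
    return z + eps                                       # Eq. (6)


rng = np.random.default_rng(0)
z = rng.standard_normal((16, 24, 24))   # random stand-in for an image embedding
z_hat = awgn(z, snr_db=10.0, rng=rng)

# Empirical SNR of this realisation, which should be close to the 10 dB target.
measured_snr_db = 10.0 * np.log10(np.mean(z ** 2) / np.mean((z_hat - z) ** 2))
```

Solving Equation (5) for the noise variance gives $\sigma^2 = P / 10^{\mathrm{SNR}/10}$, so every 10 dB drop in SNR multiplies the noise variance by ten.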
3.3. Image Reconstruction
The noisy image embedding $\hat{Z}$ is used as a conditioning signal for the diffusion model at the receiver side. It should be noted that in (Cicchetti et al., 2024), a text-conditioned diffusion model starts sampling from a noisy version of the image embedding, whereas, in our model, a significantly more compressed image embedding is used purely as a conditioning signal. After the conditional denoising process is complete, the output of the LDM is the predicted latent space $\hat{X}_{VG}$ where the VQGAN decoder operates. Finally, in accordance with Equation (2), the generated image $\hat{X}$ is obtained using $f_\Theta^{-1}(\hat{X}_{VG}) = \hat{X}$.

4. Experimental Evaluation
4.1. Model Training
We train our model using the Cityscapes dataset (Cordts et al., 2016). The dataset contains 3000 training, 500 validation, and 1500 test images. All images are resized to 1024×1024 resolution. We finetune the pre-trained stage B checkpoint for 15000 steps using a batch size of 4, learning rate of $1 \times 10^{-4}$, and AdamW optimizer. To improve generalization and robustness, the SNR is randomly selected to be between 1 and 20 dB. At each training step, image embeddings are extracted and transmitted across the AWGN channel. The model is trained to use the noisy embeddings as conditioning to reconstruct images with the objective of minimizing the MSE loss in accordance with Equation (4). In addition to the Cityscapes dataset, we also evaluate our model's performance on the DIV2K dataset (Agustsson & Timofte, 2017), which is composed of highly diverse images. We do not finetune our model again for this dataset to investigate how well it generalizes on completely different and unseen data. All the training and simulations have been performed using a single NVIDIA RTX A6000 (48-GB) GPU. All code scripts and fine-tuned model weights will be accessible at: https://github.com/abilalk02/SC-SIC.

Figure 3. Image reconstructions using our model, GESCO, Img2Img-SC and JPEG2000 + LDPC in low SNR conditions. It can be observed that our model generates the most semantically similar images with the least generation randomness. The red crosses indicate that the JPEG2000+LDPC system was unable to recover the image at the corresponding SNR.

4.2. Simulation Settings
We compare the performance of our model with (i) the diffusion SIC model conditioned on segmentation maps (GESCO) (Grassucci et al., 2023), (ii) the Stable Diffusion-based SIC model that transmits text and image embeddings (Img2Img-SC) (Cicchetti et al., 2024), and (iii) the conventional JPEG2000 compression with Low-Density Parity-Check (LDPC) error correction approach. For evaluation, we generate 100 samples using each model with channel SNR values of 1, 5, 10, 15 and 20 dB, respectively. All samples are of resolution 512×512, except for GESCO, where the resolution is 256×512.² For sampling with GESCO and Img2Img-SC, 1000 and 30 denoising steps are used, respectively, as in the original papers. For JPEG2000 + LDPC, Quadrature Amplitude Modulation (QAM) is used and the LDPC coding rate is set to 1/2 following the method described in (Bourtsoulatze et al., 2019).

² It was not possible to generate 512×512 images using GESCO without altering the model architecture.

Performance Metrics: To evaluate the perceptual and semantic similarity between the original and generated images, we calculate the Learned Perceptual Image Patch Similarity (LPIPS) score (Zhang et al., 2018), Fréchet Inception Distance (FID) score (Seitzer, 2020) and Structural Similarity Index Measure (SSIM) (Wang et al., 2004). We also measure the Peak Signal-to-Noise Ratio (PSNR) to evaluate pixel-level similarity between images. Lower values of LPIPS and FID indicate better performance, whereas higher values of SSIM and PSNR indicate better performance.

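
Of the four metrics, PSNR is the only one with a simple closed form, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, so it can be illustrated directly. The sketch below is a generic NumPy implementation and not the evaluation script used for the paper; the function name and the default peak value of 255 (8-bit images) are assumptions.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of identical shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return float(10.0 * np.log10(max_value ** 2 / mse))


original = np.zeros((3, 8, 8))
slightly_off = original + 1.0              # every pixel off by one grey level
score = psnr(original, slightly_off)       # about 48.13 dB for an 8-bit range
```

Because PSNR depends only on the pixel-wise mean squared error, it rewards reconstructions that are close on average even when fine textures are lost, which is why it is complemented by perceptual metrics such as LPIPS and FID.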
4.3. Results
4.3.1. IMAGE RECONSTRUCTION QUALITY
We first evaluate the reconstruction quality of our model against existing approaches, including GESCO, Img2Img-SC, and the JPEG2000 + LDPC framework. Figure 3 shows the reconstruction of a transmitted image at the receiver end using the four models under low SNR conditions. Our model consistently achieves the most accurate reconstructions of the original image. Even at extremely low SNR levels of 5 dB and 1 dB, it preserves object clarity and recognizability. In contrast, the reconstruction quality of GESCO deteriorates rapidly as SNR decreases, leading to significant visual degradation. Moreover, the output produced by Img2Img-SC is loosely tied to the original image because text conditioning introduces significant randomness in the generation process. Finally, the conventional JPEG2000 + LDPC produces heavily distorted output, and error correction completely fails at low SNR, as was observed earlier (Bourtsoulatze et al., 2019; Jiang et al., 2024). For cases where it fails to reconstruct the images, we set the PSNR and SSIM scores to 0, whereas LPIPS and FID scores are assigned an arbitrary maximum value of 1 and 500, respectively.

Figure 4. Performance comparison between our model, GESCO, Img2Img-SC and JP2+LDPC at different SNRs. [Four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, and PSNR vs SNR, with SNR (dB) on the horizontal axis; curves for Our Model, GESCO, Img2Img-SC, and JP2 + LDPC.]

The comparison across performance metrics on the Cityscapes test data, shown in Figure 4, also reveals that our model achieves the best results. In terms of FID and LPIPS, on average, our model improves on the results of the next-best approach, Img2Img-SC, by 43% and 55%, respectively. Similarly, in terms of SSIM and PSNR, our model gives the best results, maintaining good performance even at low SNR. For SNR greater than 10 dB, JPEG2000 + LDPC achieves comparable PSNR and SSIM to our model even though its reconstructions are heavily distorted, have artifacts, and lack details. This can be attributed to the fact that JPEG2000 compression preserves low-frequency components and structural integrity. PSNR and SSIM primarily assess pixel-level accuracy and structural similarity, respectively. In contrast, LPIPS and FID are more sensitive to perceptually significant distortions, capturing the loss of fine details, reduced realism, and unnatural textures. Thus, high PSNR and SSIM scores can misleadingly overestimate the performance of JPEG2000 + LDPC, failing to reflect the perceptual degradation. Moreover, as discussed, the conventional method fails to reconstruct the images at low SNR. Overall, our model improves SSIM by 56% and PSNR by 23% as compared to Img2Img-SC. The results of our model improve further when generating 1024×1024 images.

Figure 5. Inference time comparison of our model with GESCO and Img2Img-SC. [Bar chart titled "Model Inference Time Comparison", time in seconds: Our-512, 0.78; Our-1024, 1.72; Img2Img-512, 2.53; Img2Img-1024, 29.19. Annotation: GESCO, on average, takes 324 seconds to generate a 256×512 image on the same GPU with 1000 diffusion timesteps.]

4.3.2. INFERENCE SPEED AND BANDWIDTH EFFICIENCY
In terms of computational complexity, we evaluate both inference latency and the dimensionality of the transmitted data. As shown in Figure 5, the model from (Grassucci et al., 2023), which does not utilize an LDM, exhibits significantly higher latency, requiring 5 minutes and 24 seconds for image reconstruction with $T = 1000$ denoising steps. Our method achieves substantially lower inference time, just 0.78 seconds for 512×512 images, making it 3× faster than Img2Img-SC. For 1024×1024 images, our model accelerates reconstruction further, achieving speeds over 16× faster than that of Img2Img-SC.

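
The quoted speed-ups follow directly from the inference times reported in Figure 5. The short Python check below reproduces the arithmetic; the dictionary keys are illustrative names, and the timing values are the ones read from the figure.

```python
# Inference times in seconds, as reported in Figure 5.
times_s = {
    "ours_512": 0.78,
    "ours_1024": 1.72,
    "img2img_512": 2.53,
    "img2img_1024": 29.19,
}

# Speed-up of our model over Img2Img-SC at each resolution.
speedup_512 = times_s["img2img_512"] / times_s["ours_512"]      # roughly 3.2
speedup_1024 = times_s["img2img_1024"] / times_s["ours_1024"]   # roughly 17.0

# GESCO's reported 324 seconds expressed as minutes and seconds.
gesco_minutes, gesco_seconds = divmod(324, 60)
```

The gap widens at the higher resolution because the Img2Img-SC time grows by more than a factor of eleven between 512×512 and 1024×1024, whereas ours grows by a factor of about 2.2.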
Table 1. Dimensionality Comparison
Transmitted Data    Dimensionality    Compression Ratio    % of original
Original Image      [3, 512, 512]     -                    -
Our Model           [16, 12, 12]      341                  0.29%
Img2Img-SC          [4, 64, 64]       48                   2.08%
DIFFSC              [8, 32, 32]       96                   1.04%
CASC                [8, 32, 32]       96                   1.04%

Moreover, in terms of dimensionality, Table 1 shows that we achieve a higher Compression Ratio (CR) as compared to other state-of-the-art DM-based SIC systems. Following the definition in (Jiang et al., 2024), where CR is defined as the ratio of the input image's dimensionality to that of its encoded representation, our approach compresses an RGB image of size [3, 512, 512] into a compact embedding of [16, 12, 12], achieving an exceptional CR of 341, meaning that the transmitted data is only 0.29% of the original image size. This highlights the remarkable bandwidth efficiency of our method.

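
The compression ratios in Table 1 can be reproduced by counting tensor elements. The following Python sketch does so under the assumption that CR compares element counts only (it ignores numeric precision and any entropy coding); the function name is hypothetical.

```python
from math import prod


def compression_ratio(image_shape, embedding_shape):
    """Ratio of the number of image elements to the number of embedding elements."""
    return prod(image_shape) / prod(embedding_shape)


cr = compression_ratio((3, 512, 512), (16, 12, 12))   # 786432 / 2304
percent_of_original = 100.0 / cr                       # share of the original size, in percent

# The same ratio holds at 1024x1024 with the [16, 24, 24] embedding.
cr_1024 = compression_ratio((3, 1024, 1024), (16, 24, 24))
```

Applying the same function to the [4, 64, 64] and [8, 32, 32] embeddings gives the values 48 and 96 listed for the other systems.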
Figure 6. Performance of our model on unseen DIV2K data. [Four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, and PSNR vs SNR, with SNR (dB) on the horizontal axis; curves for Cityscapes and Unseen DIV2K.]

4.3.3. RECONSTRUCTION PREDICTABILITY
We assess reconstruction predictability across varying SNR conditions using the LPIPS metric. For each case, we simulate image transmission 25 times with fixed parameters, computing the mean ($\mu$) and standard deviation ($\sigma$) of LPIPS scores across all pairwise comparisons of generated images. As shown in Table 2, our model achieves the lowest average LPIPS score and standard deviation, $(\mu \pm \sigma) = (0.173 \pm 0.003)$ at SNR = 20 dB, indicating minimal generation randomness. Thus, the proposed model is able to reconstruct images reliably and consistently.

Table 2. Predictability Comparison (LPIPS score, µ ± σ)
SNR (dB)    Our-1024         Our-512          GESCO            Img2Img-SC
20          0.173 ± 0.003    0.205 ± 0.005    0.401 ± 0.014    0.520 ± 0.011
15          0.195 ± 0.003    0.223 ± 0.006    0.433 ± 0.012    0.541 ± 0.017
10          0.229 ± 0.003    0.264 ± 0.008    0.424 ± 0.017    0.522 ± 0.012
5           0.287 ± 0.004    0.314 ± 0.009    0.575 ± 0.021    0.554 ± 0.019
1           0.351 ± 0.006    0.371 ± 0.013    0.613 ± 0.017    0.578 ± 0.019

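
The pairwise procedure behind Table 2 can be sketched as follows. This is an illustrative outline rather than the authors' evaluation code: `mean_abs_diff` is a simple stand-in distance and not LPIPS, the random arrays stand in for 25 reconstructions of one transmitted image, and the use of the population standard deviation is an assumption.

```python
from itertools import combinations

import numpy as np


def pairwise_stats(samples, distance):
    """Mean and standard deviation of a distance over all unordered pairs of samples."""
    scores = [distance(a, b) for a, b in combinations(samples, 2)]
    return float(np.mean(scores)), float(np.std(scores))


def mean_abs_diff(a, b):
    """Stand-in distance (NOT LPIPS): mean absolute pixel difference."""
    return float(np.mean(np.abs(a - b)))


rng = np.random.default_rng(0)
# 25 random arrays standing in for repeated reconstructions of one image.
reconstructions = [rng.standard_normal((3, 8, 8)) for _ in range(25)]
mu, sigma = pairwise_stats(reconstructions, mean_abs_diff)
```

With 25 transmissions there are 300 unordered pairs per case. A small mean indicates that repeated reconstructions resemble one another, and a small standard deviation indicates that this holds uniformly across pairs.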
4.3.4. GENERALIZATION ON UNSEEN DATA
We also analyze the performance of our model, trained on the Cityscapes dataset, on entirely unseen data. For this purpose, we use the DIV2K dataset that contains diverse images, including landscapes, people, architecture, and animals. Figure 6 indicates that there is a significant degradation in performance on this new data across all four metrics. For example, at an SNR of 15 dB, LPIPS increases from 0.17 to 0.4, whereas FID increases from 45 to 83, indicating a substantial loss in perceptual quality. However, a close look at the generated images in Figure 7 reveals that much of this degradation may be attributed to the sharp differences in the colors between the original and generated images. The model does fairly well to reconstruct these unseen images and mitigate the effects of noise, but since it is finetuned on the Cityscapes dataset, the generated images have a color tone that closely resembles that of the images in the said dataset. These results suggest that fine-tuning a Stable Cascade model on a single large and highly diverse dataset may enable it to handle a wide range of image types with strong performance.

Figure 7. Image reconstructions on unseen DIV2K data. It can be seen that the model does well to mitigate the noise and reconstruct semantically similar images considering that it was not finetuned for this dataset.

4.3.5. ABLATION STUDIES
Finally, we perform ablation tests to compare the performance of our fine-tuned model against the original Stable Cascade model in the semantic image communication scenario. Figure 8a shows that without fine-tuning, the original model's performance degrades sharply with decreasing SNR. In particular, at SNR less than 10 dB, the images generated using the original model are heavily corrupted by noise. This is also evident from Figure 9, which shows that the original model is unable to mitigate the channel effects. These findings validate our training approach and demonstrate the substantial performance gains achieved by fine-tuning the model to work with noisy image embedding as a conditioning signal.

We also analyze the impact of increasing the size of the extracted image embedding on the generation quality of 1024×1024 images. It can be seen from Figure 8b that there is a noticeable improvement in performance across all four performance metrics when the embedding size is increased from [16, 24, 24] to [16, 32, 32]. Quantitatively, on average, LPIPS, FID, and SSIM scores improve by greater than 10%. However, these improvements come at a cost to the compression ratio that drops from 341 to 192. Hence, there is an understandable tradeoff between performance and bandwidth efficiency.

Figure 8. Results of ablation experiments highlighting (a) the performance gains obtained via fine-tuning and (b) the impact of increasing the embedding size from [16, 24, 24] to [16, 32, 32] on performance metrics. [Each part has four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, and PSNR vs SNR, with SNR (dB) on the horizontal axis; the curves in (a) are Finetuned and Original SC Model, and the curves in (b) are [16, 24, 24] and [16, 32, 32].]

Figure 9. Images reconstructed by the original Stable Cascade model. It can be seen that without proper fine-tuning, the original model fails to deal with the effects of channel noise.

5. Conclusion
In this paper, we introduce a novel DM-based SIC framework that leverages the Stable Cascade architecture to achieve an exceptional balance of speed, compression, and fidelity under noisy channel conditions. Our method transmits a highly compact image embedding, only 0.29% of the original size, and reconstructs 512×512 images in just 0.78 seconds, 3× faster than Img2Img-SC. Extensive evaluations using perceptual quality metrics, including LPIPS, SSIM, and FID, demonstrate the noise robustness of our approach and its superiority over existing benchmarks. Additionally, our framework minimizes generation randomness by achieving an LPIPS score standard deviation of only 0.003 at SNR greater than 10 dB (Table 2), ensuring faithful and consistent image reconstruction. Future work may explore further optimizations to minimize inference time and extend the framework to high-fidelity semantic video communication.

Acknowledgements
This research has received funding from the European Union's Horizon Europe research and innovation programme MSCA-DN NESTOR (G.A. 101119983). The authors also acknowledge EPSRC project TRANSNET (EP/R035342/1). Experiments were run on Aston EPS Machine Learning Server, funded by the EPSRC Core Equipment Fund, Grant EP/V036106/1.

References
Agustsson, E. and Timofte, R. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
Bourtsoulatze, E., Kurka, D. B., and Gündüz, D. Deep joint source-channel coding for wireless image transmission. IEEE Transactions on Cognitive Communications and Networking, 5(3):567–579, 2019.
Chen, W. and Yang, Q. Casc: Condition-aware semantic communication with latent diffusion models. arXiv preprint arXiv:2411.06552, 2024.
Cicchetti, G., Grassucci, E., Park, J., Choi, J., Barbarossa, S., and Comminiello, D. Language-oriented semantic latent representation for image transmission. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2024.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883, 2021.
Grassucci, E., Barbarossa, S., and Comminiello, D. Generative semantic communication: Diffusion models beyond bit recovery. arXiv preprint arXiv:2306.04321, 2023.
Jiang, Z., Liu, X., Yang, G., Li, W., Li, A., and Wang, G. Diffsc: Semantic communication framework with enhanced denoising through diffusion probabilistic models. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 13071–13075. Institute of Electrical and Electronics Engineers Inc., 2024. ISBN 9798350344851. doi: 10.1109/ICASSP48485.2024.10448094.
Luo, X., Chen, H.-H., and Guo, Q. Semantic communications: Overview, open issues, and future research directions. IEEE Wireless Communications, 29(1):210–219, 2022.
Nam, H., Park, J., Choi, J., Bennis, M., and Kim, S.-L. Language-oriented communication with semantic coding and knowledge distillation for text-to-image generation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13506–13510. IEEE, 2024.
Pernias, P., Rampas, D., Richter, M. L., Pal, C. J., and Aubreville, M. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. arXiv preprint arXiv:2306.00637, 2023.
Qin, Z., Tao, X., Lu, J., Tong, W., and Li, G. Y. Semantic communications: Principles and challenges. arXiv preprint arXiv:2201.01389, 2021.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
Seitzer, M. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.
Strinati, E. C. and Barbarossa, S. 6g networks: Beyond shannon towards semantic and goal-oriented communications. Computer Networks, 190:107930, 2021.
Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR, 2019.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
Xia, L., Sun, Y., Liang, C., Zhang, L., Imran, M. A., and Niyato, D. Generative AI for Semantic Communication: Architecture, Challenges, and Outlook. IEEE Wireless Communications, 32(1):132–140, 2025. doi: 10.1109/MWC.003.2300351.
Yang, P., Zhang, G., and Cai, Y. Rate-adaptive generative semantic communication using conditional diffusion models. IEEE Wireless Communications Letters, 14(2):539–543, 2025.
Yilmaz, S. F., Niu, X., Bai, B., Han, W., Deng, L., and Gündüz, D. High perceptual quality wireless image delivery with denoising diffusion models. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–5. IEEE, 2024.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.