Efficient and Robust Semantic Image Communication via Stable Cascade

Author: Khalid, Rana Ahmad Bilal

Publisher: Zenodo

DOI: 10.5281/zenodo.17281324

Source: https://zenodo.org/records/17281324/files/ICML2025-ML4Wireless.pdf

Efficient and Robust Semantic Image Communication via Stable Cascade
Bilal Khalid 1, Pedro Freire 1, Sergei K. Turitsyn 1, Jaroslaw E. Prilepsky 1
Abstract
Diffusion Model (DM) based Semantic Image Communication (SIC) systems face significant challenges, such as slow inference speed and generation randomness, that limit their reliability and practicality. To overcome these issues, we propose a novel SIC framework inspired by Stable Cascade, where extremely compact latent image embeddings are used as conditioning to the diffusion process. Our approach drastically reduces the data transmission overhead, compressing the transmitted embedding to just 0.29% of the original image size. It outperforms three benchmark approaches (the diffusion SIC model conditioned on segmentation maps (GESCO), the recent Stable Diffusion (SD)-based SIC framework (Img2Img-SC), and the conventional JPEG2000 + LDPC coding) by achieving superior reconstruction quality under noisy channel conditions, as validated across multiple metrics. Notably, it also delivers significant computational efficiency, enabling over 3× faster reconstruction for 512×512 images and more than 16× faster reconstruction for 1024×1024 images as compared to the approach adopted in Img2Img-SC.
1. Introduction
Semantic communication (SemCom) is a transformative approach that focuses on effectively conveying the meaning of information rather than transmitting raw bit data (Strinati & Barbarossa, 2021). The goal is to communicate the essential information the receiver needs to complete its task successfully. This also makes it bandwidth efficient, as significantly less data has to be transmitted across the communication channel (Luo et al., 2022; Qin et al., 2021).
1 Aston Institute of Photonic Technologies, Aston University, Birmingham, UK. Correspondence to: Bilal Khalid <[email protected]>.
Proceedings of the 42nd International Conference on Machine Learning, Workshop on Machine Learning for Wireless Communication and Networks, Vancouver, Canada, 2025. Copyright 2025 by the author(s).
Figure 1. 1024×1024 image reconstructions using our model under different channel SNR conditions. Even at an SNR of 1 dB, images are faithfully reconstructed and perceptually very similar to the transmitted images.
The advancement of Deep Learning (DL) and generative AI has enabled the emergence of SemCom as a viable alternative to traditional communication. DL and generative AI models are used for extracting the relevant semantic information at the transmitter end as well as for deciphering the meaning behind this information at the receiver end. Deep learning-based Joint Source-Channel Coding (DeepJSCC) (Bourtsoulatze et al., 2019) was one of the first approaches to incorporate DL in wireless system design. Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models (DMs) and Flow-based Generative Models (FGMs) are the major generative AI techniques now commonly used in SemCom systems (Xia et al., 2025).
Out of these, DMs have shown great potential at Semantic Image Communication (SIC) tasks because of their exceptional ability to synthesize high-quality images (Dhariwal & Nichol, 2021). However, one drawback of DMs is that they are inherently slower at inference because of their iterative nature. The introduction of Latent Diffusion Models (LDMs) (Rombach et al., 2022) has alleviated this problem by performing the diffusion process in a compressed latent space instead of the original pixel space, enabling fast and high-resolution image generation via diffusion.
Figure 2. Our system model. At the transmitter side, a compact image embedding $Z$ of size [16, 24, 24] is extracted from an image $X$ of size [3, 1024, 1024]. $Z$ is transmitted across the physical channel. The receiver uses the noisy embedding $\hat{Z}$ as conditioning for the LDM. Finally, the VQGAN decoder is used to project the image back into pixel space.
Several DM-based SIC systems have been implemented in
recent years. In (Grassucci et al., 2023), segmentation maps are used to guide the diffusion process. In (Yilmaz et al., 2024), the primary image structure is transmitted using the DeepJSCC technique, whereas fine details are generated using the diffusion model. (Jiang et al., 2024) also use a diffusion model to refine the reconstruction obtained after image decoding. However, inference using these approaches is time-consuming. Recently, LDMs have been used for SIC to speed up the inference process. In (Nam et al., 2024; Cicchetti et al., 2024), text conditioning is used to guide the generative process of Stable Diffusion's text-to-image model (Rombach et al., 2022). In (Cicchetti et al., 2024), the generation process starts from a noisy version of the image embedding instead of pure noise. Although efficient in terms of bandwidth, these models struggle to faithfully reconstruct the intended image and suffer from generation randomness. (Chen & Yang, 2024) denoise a noisy image embedding using an LDM, and the clean embedding is then used to reconstruct the image using a semantic decoder. Instead of predicting the noise in the image, (Yang et al., 2025) use a diffusion model to predict the source image directly in a few denoising steps. Both of these models reduce inference time but operate at a lower compression factor as compared to our proposed method.
In this paper, we propose a novel SIC model inspired by Stable Cascade (SC) (Pernias et al., 2023), a multistage text-to-image LDM that operates in a much smaller latent space than Stable Diffusion (SD). Our approach achieves the trifecta of high compression efficiency, fast inference, and perceptually aligned image reconstruction, which is missing in existing DM-based SIC systems. In our method, a highly compressed image embedding is extracted using a semantic encoder and transmitted across the physical channel. The noisy embedding is then given as a conditioning signal to the LDM of SC that projects it into a higher dimensional latent space where the semantic decoder operates. Results indicate that we outperform benchmark models and, as shown in Figure 1, generate consistent reconstructions even under extremely poor channel Signal-to-Noise Ratio (SNR) conditions.
2. Proposed Framework
In this section, the proposed system model is explained. The model is built upon the architecture of SC, which has three stages, i.e., stages A, B and C. As discussed below, our model is based on stage A and a finetuned stage B that is trained to work with noisy conditioning.
Stage A is a Vector Quantized Generative Adversarial Network (VQGAN) (Esser et al., 2021) with parameters $\Theta$ that compresses the image space by a factor of 4. The relationship between an input image $X \in \mathbb{R}^{3 \times 1024 \times 1024}$ and the output of the VQGAN encoder $X_{VG}$ is given as:
$$X_{VG} = f_\Theta(X). \tag{1}$$
If $f_\Theta^{-1}$ represents the VQGAN decoder, the image can be reconstructed from the compressed latent space using
$$f_\Theta^{-1}(X_{VG}) \approx X. \tag{2}$$
Stage B is an LDM that learns to generate the latent space $X_{VG}$ given a highly compressed latent representation $Z$ of $X$. This compact embedding is obtained via the EfficientNet-V2 encoder (Tan & Le, 2019). During the forward process in training, the latents $X_{VG}$ are noised according to the following relation:
$$X_{VG,t} = \sqrt{\bar{\alpha}_t} \cdot X_{VG} + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon. \tag{3}$$
Here, $\bar{\alpha}_t$ specifies the noise schedule whereas $\epsilon$ is the noise sampled from a standard normal distribution $\mathcal{N}(0, 1)$. At any time-step $t$, with noised latents $X_{VG,t}$ and noisy conditional embedding $\hat{Z}$, the LDM is trained to predict the noise $\bar{\epsilon}(X_{VG,t}, t, \hat{Z})$. The training objective is to minimize the loss function $\mathcal{L}$, defined as the Mean-Squared Error (MSE) between the predicted and actual noise:
$$\mathcal{L} = \mathbb{E}_{(X_{VG,t},\, t,\, \hat{Z},\, \epsilon)} \left[ \left\| \epsilon - \bar{\epsilon}(X_{VG,t}, t, \hat{Z}) \right\|_2^2 \right]. \tag{4}$$
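As an illustration of Equations (3) and (4), the following minimal sketch noises a toy latent vector and evaluates the noise-prediction error. It is not the training code of this work: the flattened 8-element latent, the value of the noise-schedule term, and the function names `forward_noise` and `mse_loss` are assumptions made for the example, and the loss is averaged over elements, which matches Equation (4) only up to a constant factor.

```python
import math
import random

def forward_noise(x0, alpha_bar_t, rng):
    """Noise clean latents x0 following Eq. (3); returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]          # eps ~ N(0, 1)
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def mse_loss(eps, eps_pred):
    """Mean-squared error between actual and predicted noise (cf. Eq. (4))."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]      # toy flattened latents (assumption)
x_t, eps = forward_noise(x0, alpha_bar_t=0.5, rng=rng)

# A perfect noise predictor gives zero loss; predicting all zeros does not.
perfect_loss = mse_loss(eps, eps)
zero_pred_loss = mse_loss(eps, [0.0] * len(eps))
```

In the actual model the predictor would be the stage B network conditioned on the noisy embedding; here the two hand-made predictions only mark the best case and a trivial baseline.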
Text embedding is also used as conditioning for Stage B in the original SC paper (Pernias et al., 2023). However, as noted in the paper itself, it has no significant impact on the reconstruction quality of stage B, as the conditioning provided by the image embedding is much stronger. Thus, we do not consider text conditioning in our model. The fine-tuning of Stage B conditioned on $\hat{Z}$ makes it robust to channel impairments. Moreover, we do not consider stage C either, as it is primarily responsible for text-to-image generation.
3. System Model
Figure 2 shows the three phases of our system model, i.e., semantic information extraction at the transmitter, noisy channel transmission, and image reconstruction at the receiver.
3.1. Semantic Feature Extraction
As in (Pernias et al., 2023), we utilize the pretrained EfficientNet-V2 image encoder to extract a compact image embedding. An input RGB image $X \in \mathbb{R}^{N \times H \times W}$ is encoded into a compressed embedding $Z = E(X)$ (see footnote 1). Despite its compact size, this embedding contains well-generalized feature representations that provide stronger guidance to the diffusion model as compared to text embeddings. As a result, the reconstructed image is very similar to the original one, with differences in fine details only. Although image generation based solely on text conditioning is highly efficient in terms of bandwidth, it may result in reconstructions that are semantically quite different from the source image (Nam et al., 2024). Furthermore, as compared to segmentation map-based conditioning (Grassucci et al., 2023), image embeddings offer better reconstruction fidelity. Although segmentation maps retain spatial structure, they often lose crucial details such as texture, color, and fine-grained features. Additionally, because they provide only class-level information, the same segmentation map can yield multiple plausible reconstructions, introducing variability. To achieve reliable, predictable, and efficient SIC, we propose using a rich image embedding as a more effective conditioning signal, ensuring reduced generation randomness and high-fidelity reconstruction of transmitted images.
1 The dimensionalities of $X$: $N$ is the number of channels, i.e., 3 for RGB, and $H$ and $W$ stand for the height and width pixel resolution, respectively.
3.2. Communication Channel
To maintain conformity with most previous works (Grassucci et al., 2023; Yilmaz et al., 2024; Chen & Yang, 2024; Yang et al., 2025), we consider the widely adopted additive white Gaussian noise (AWGN) channel in our simulations. The extracted image embedding $Z$ is transmitted across the AWGN channel where the noise $\epsilon$ is sampled from a zero-mean normal distribution $\mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. If $P$ denotes the received signal power, the channel conditions are characterized by the Signal-to-Noise Ratio (SNR):
$$\mathrm{SNR} = 10 \log_{10} \frac{P}{\sigma^2} \ \text{(dB)}. \tag{5}$$
Depending upon the SNR level, noise is added to $Z$ and the distorted embedding $\hat{Z}$ is obtained as
$$\hat{Z} = Z + \epsilon. \tag{6}$$
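The channel model of Equations (5) and (6) can be sketched in a few lines. This is a hedged illustration rather than the simulation code used in this work: the embedding is replaced by a random toy vector of 16·24·24 values, the signal power $P$ is assumed to be the mean squared value of that vector, and the names `awgn` and `measured_snr_db` are hypothetical.

```python
import math
import random

def awgn(z, snr_db, rng):
    """Add zero-mean Gaussian noise so that P / sigma^2 matches snr_db."""
    power = sum(v * v for v in z) / len(z)           # signal power P (assumed: mean square)
    sigma2 = power / (10.0 ** (snr_db / 10.0))       # Eq. (5) solved for sigma^2
    sigma = math.sqrt(sigma2)
    return [v + rng.gauss(0.0, sigma) for v in z]    # Eq. (6): Z_hat = Z + eps

def measured_snr_db(z, z_hat):
    """Empirical SNR in dB between a clean vector and its noisy version."""
    power = sum(v * v for v in z) / len(z)
    noise = sum((a - b) ** 2 for a, b in zip(z_hat, z)) / len(z)
    return 10.0 * math.log10(power / noise)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(16 * 24 * 24)]   # toy stand-in for Z
z_hat = awgn(z, snr_db=5.0, rng=rng)
snr_check = measured_snr_db(z, z_hat)                    # close to 5 dB for a vector this long
```

Measuring the SNR back from the noisy vector is a simple sanity check that the noise variance was scaled as intended.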
3.3. Image Reconstruction
The noisy image embedding $\hat{Z}$ is used as a conditioning signal to the diffusion model at the receiver side. It should be noted that in (Cicchetti et al., 2024), a text-conditioned diffusion model starts sampling from a noisy version of the image embedding, whereas, in our model, a significantly more compressed image embedding is used purely as a conditioning signal. After the conditional denoising process is complete, the output of the LDM is the predicted latent space $\hat{X}_{VG}$ where the VQGAN decoder operates. Finally, in accordance with Equation (2), the generated image $\hat{X}$ is obtained using $f_\Theta^{-1}(\hat{X}_{VG}) = \hat{X}$.
4. Experimental Evaluation
4.1. Model Training
We train our model using the Cityscapes dataset (Cordts et al., 2016). The dataset contains 3000 training, 500 validation, and 1500 test images. All images are resized to 1024×1024 resolution. We finetune the pre-trained stage B checkpoint for 15000 steps using a batch size of 4, a learning rate of $1 \times 10^{-4}$, and the AdamW optimizer. To improve generalization and robustness, the SNR is randomly selected to be between 1 and 20 dB. At each training step, image embeddings are extracted and transmitted across the AWGN channel. The model is trained to use the noisy embeddings as conditioning to reconstruct images with the objective of minimizing the MSE loss in accordance with Equation (4). In addition to the Cityscapes dataset, we also evaluate our model's performance on the DIV2K dataset (Agustsson & Timofte, 2017), which is composed of highly diverse images. We do not finetune our model again for this dataset, in order to investigate how well it generalizes to completely different and unseen data. All the training and simulations have been performed using a single NVIDIA RTX A6000 (48-GB) GPU. All code scripts and fine-tuned model weights will be accessible at: https://github.com/abilalk02/SC-SIC.
Figure 3. Image reconstructions using our model, GESCO, Img2Img-SC and JPEG2000 + LDPC in low SNR conditions. It can be observed that our model generates the most semantically similar images with the least generation randomness. The red crosses indicate that the JPEG2000 + LDPC system was unable to recover the image at the corresponding SNR.
4.2. Simulation Settings
We compare the performance of our model with (i) the diffusion SIC model conditioned on segmentation maps (GESCO) (Grassucci et al., 2023), (ii) the Stable Diffusion-based SIC model that transmits text and image embeddings (Img2Img-SC) (Cicchetti et al., 2024), and (iii) the conventional JPEG2000 compression with Low-Density Parity-Check (LDPC) error correction approach. For evaluation, we generate 100 samples using each model with channel SNR values of 1, 5, 10, 15 and 20 dB, respectively. All samples are of resolution 512×512, except for GESCO, where the resolution is 256×512 (see footnote 2). For sampling with GESCO and Img2Img-SC, 1000 and 30 denoising steps are used, respectively, as in the original papers. For JPEG2000 + LDPC, Quadrature Amplitude Modulation (QAM) is used and the LDPC coding rate is set to 1/2 following the method described in (Bourtsoulatze et al., 2019).
2 It was not possible to generate 512×512 images using GESCO without altering the model architecture.
Performance Metrics: To evaluate the perceptual and semantic similarity between the original and generated images, we calculate the Learned Perceptual Image Patch Similarity (LPIPS) score (Zhang et al., 2018), Fréchet Inception Distance (FID) score (Seitzer, 2020) and Structural Similarity Index Measure (SSIM) (Wang et al., 2004). We also measure the Peak Signal-to-Noise Ratio (PSNR) to evaluate pixel-level similarity between images. Lower values of LPIPS and FID indicate better performance, whereas higher values of SSIM and PSNR indicate better performance.
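Of the four metrics, PSNR is the simplest to write down explicitly. The sketch below uses the standard definition, 10·log10(MAX² / MSE), on flattened pixel lists; the peak value of 255 assumes 8-bit images, and the function name `psnr` is chosen for the example. The learned metrics (LPIPS, FID) require pretrained networks and are not reproduced here.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

identical = psnr([0, 255], [0, 255])                         # no error at all
worst_case = psnr([0, 0, 0, 0], [255, 255, 255, 255])        # every pixel off by the full range
small_error = psnr([100, 100], [110, 90])                    # every pixel off by 10 levels
```

Because the score depends only on the mean squared pixel error, two reconstructions with very different perceptual quality can receive similar PSNR values, which is why it is reported alongside LPIPS and FID.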
4.3. Results
4.3.1. IMAGE RECONSTRUCTION QUALITY
We first evaluate the reconstruction quality of our model against existing approaches, including GESCO, Img2Img-SC, and the JPEG2000 + LDPC framework. Figure 3 shows the reconstruction of a transmitted image at the receiver end using the four models under low SNR conditions. Our model consistently achieves the most accurate reconstructions of the original image. Even at extremely low SNR levels of 5 dB and 1 dB, it preserves object clarity and recognizability. In contrast, the reconstruction quality of GESCO deteriorates rapidly as SNR decreases, leading to significant visual degradation. Moreover, the output produced by Img2Img-SC is loosely tied to the original image because text conditioning introduces significant randomness in the generation process. Finally, the conventional JPEG2000 + LDPC produces heavily distorted output, and error correction completely fails at low SNR, as was observed earlier (Bourtsoulatze et al., 2019; Jiang et al., 2024). For cases where it fails to reconstruct the images, we set the PSNR and SSIM scores to 0, whereas LPIPS and FID scores are assigned an arbitrary maximum value of 1 and 500, respectively.
Figure 4. Performance comparison between our model, GESCO, Img2Img-SC and JP2+LDPC at different SNRs. [Four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, PSNR vs SNR; x-axis: SNR (dB).]
The comparison across performance metrics on the Cityscapes test data, shown in Figure 4, also reveals that our model achieves the best results. In terms of FID and LPIPS, on average, our model improves on the results of the next-best approach, Img2Img-SC, by 43% and 55%, respectively. Similarly, in terms of SSIM and PSNR, our model gives the best results, maintaining good performance even at low SNR. For SNR greater than 10 dB, JPEG2000 + LDPC achieves PSNR and SSIM comparable to our model even though its reconstructions are heavily distorted, have artifacts, and lack details. This can be attributed to the fact that JPEG2000 compression preserves low-frequency components and structural integrity. PSNR and SSIM primarily assess pixel-level accuracy and structural similarity, respectively. In contrast, LPIPS and FID are more sensitive to perceptually significant distortions, capturing the loss of fine details, reduced realism, and unnatural textures. Thus, high PSNR and SSIM scores can misleadingly overestimate the performance of JPEG2000 + LDPC, failing to reflect the perceptual degradation. Moreover, as discussed, the conventional method fails to reconstruct the images at low SNR. Overall, our model improves SSIM by 56% and PSNR by 23% as compared to Img2Img-SC. The results of our model improve further when generating 1024×1024 images.
Figure 5. Inference time comparison of our model with GESCO and Img2Img-SC. [Bar chart "Model Inference Time Comparison"; y-axis: Time (s). Our-512: 0.78 s; Our-1024: 1.72 s; Img2Img-512: 2.53 s; Img2Img-1024: 29.19 s. Annotation: GESCO, on average, takes 324 seconds to generate a 256×512 image on the same GPU with 1000 diffusion timesteps.]
4.3.2. INFERENCE SPEED AND BANDWIDTH EFFICIENCY
In terms of computational complexity, we evaluate both inference latency and the dimensionality of the transmitted data. As shown in Figure 5, the model from (Grassucci et al., 2023), which does not utilize an LDM, exhibits significantly higher latency, requiring 5 minutes and 24 seconds for image reconstruction with $T = 1000$ denoising steps. Our method achieves substantially lower inference time, just 0.78 seconds for 512×512 images, making it 3× faster than Img2Img-SC. For 1024×1024 images, our model accelerates reconstruction further, achieving speeds over 16× faster than that of Img2Img-SC.
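The speed-up factors quoted above follow directly from the timings labelled in Figure 5. The short check below assumes that the four bar labels (0.78, 1.72, 2.53, 29.19 s) correspond, in order, to our model and Img2Img-SC at 512×512 and 1024×1024; the dictionary keys are illustrative names.

```python
# Inference times in seconds, as read from the bar labels of Figure 5
# (the assignment of values to bars is an assumption based on their order).
times = {
    "Our-512": 0.78,
    "Our-1024": 1.72,
    "Img2Img-512": 2.53,
    "Img2Img-1024": 29.19,
}

speedup_512 = times["Img2Img-512"] / times["Our-512"]      # a little over 3x
speedup_1024 = times["Img2Img-1024"] / times["Our-1024"]   # just under 17x

# GESCO: 324 seconds expressed as minutes and seconds.
gesco_minutes, gesco_seconds = divmod(324, 60)
```

Both ratios are consistent with the "3×" and "over 16×" figures in the text, and 324 seconds matches the stated 5 minutes and 24 seconds.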
Table 1. Dimensionality Comparison
Transmitted Data    Dimensionality    Compression Ratio    % of original
Original Image      [3, 512, 512]     -                    -
Our Model           [16, 12, 12]      341                  0.29%
Img2Img-SC          [4, 64, 64]       48                   2.08%
DIFFSC              [8, 32, 32]       96                   1.04%
CASC                [8, 32, 32]       96                   1.04%
Moreover, in terms of dimensionality, Table 1 shows that we achieve a higher Compression Ratio (CR) as compared to other state-of-the-art DM-based SIC systems. Following the definition in (Jiang et al., 2024), where CR is defined as the ratio of the input image's dimensionality to that of its encoded representation, our approach compresses an RGB image of size [3, 512, 512] into a compact embedding of [16, 12, 12], achieving an exceptional CR of 341, meaning that the transmitted data is only 0.29% of the original image size. This highlights the remarkable bandwidth efficiency of our method.
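The entries of Table 1 can be reproduced from the tensor shapes alone. The following sketch applies the CR definition above (number of input elements divided by number of transmitted elements); the helper name `compression_ratio` is hypothetical, and the shapes are those listed in Table 1 and in the caption of Figure 2.

```python
from math import prod

def compression_ratio(image_shape, embedding_shape):
    """Ratio of the image's element count to the embedding's element count."""
    return prod(image_shape) / prod(embedding_shape)

image = (3, 512, 512)
embeddings = {
    "Our Model": (16, 12, 12),
    "Img2Img-SC": (4, 64, 64),
    "DIFFSC": (8, 32, 32),
    "CASC": (8, 32, 32),
}

ratios = {name: compression_ratio(image, shape) for name, shape in embeddings.items()}
percent = {name: 100.0 / cr for name, cr in ratios.items()}   # transmitted size as % of original

# Same ratio at 1024x1024 with the [16, 24, 24] embedding of Figure 2.
ratio_1024 = compression_ratio((3, 1024, 1024), (16, 24, 24))
```

The ratio for our model is 341.33 before rounding, and it is unchanged at 1024×1024 because the embedding grows by the same factor of four as the image.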
Figure 6. Performance of our model on unseen DIV2K data. [Four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, PSNR vs SNR; x-axis: SNR (dB); curves: Cityscapes, Unseen DIV2K.]
4.3.3. RECONSTRUCTION PREDICTABILITY
We assess reconstruction predictability across varying SNR conditions using the LPIPS metric. For each case, we simulate image transmission 25 times with fixed parameters, computing the mean ($\mu$) and standard deviation ($\sigma$) of LPIPS scores across all pairwise comparisons of generated images. As shown in Table 2, our model achieves the lowest average LPIPS score and standard deviation, $(\mu \pm \sigma) = (0.173 \pm 0.003)$ at SNR = 20 dB, indicating minimal generation randomness. Thus, the proposed model is able to reconstruct images reliably and consistently.
Table 2. Predictability Comparison: LPIPS Score (µ ± σ)
SNR (dB)    Our-1024         Our-512          GESCO            Img2Img-SC
20          0.173 ± 0.003    0.205 ± 0.005    0.401 ± 0.014    0.520 ± 0.011
15          0.195 ± 0.003    0.223 ± 0.006    0.433 ± 0.012    0.541 ± 0.017
10          0.229 ± 0.003    0.264 ± 0.008    0.424 ± 0.017    0.522 ± 0.012
5           0.287 ± 0.004    0.314 ± 0.009    0.575 ± 0.021    0.554 ± 0.019
1           0.351 ± 0.006    0.371 ± 0.013    0.613 ± 0.017    0.578 ± 0.019
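The pairwise statistic behind Table 2 can be sketched as follows. This is an assumption-laden illustration, not the evaluation script: `toy_distance` is a hypothetical stand-in for LPIPS (which needs a pretrained network), the three two-element "images" are made up, and the sample standard deviation is used because the text does not state which estimator was applied.

```python
import itertools
import statistics

def pairwise_stats(samples, distance):
    """Mean and standard deviation of a distance over all unordered pairs."""
    scores = [distance(a, b) for a, b in itertools.combinations(samples, 2)]
    return statistics.mean(scores), statistics.stdev(scores)

def toy_distance(a, b):
    # Hypothetical stand-in for LPIPS: mean absolute difference of two vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Three toy "reconstructions" of the same transmitted image.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mu, sigma = pairwise_stats(samples, toy_distance)

# Number of unordered pairs among 25 repeated transmissions.
n_pairs_for_25 = len(list(itertools.combinations(range(25), 2)))
```

With 25 repetitions per setting, each mean and standard deviation is taken over 300 pairwise scores; a small standard deviation means that repeated transmissions of the same image differ from one another by a nearly constant amount.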
4.3.4. GENERALIZATION ON UNSEEN DATA
We also analyze the performance of our model, trained on the Cityscapes dataset, on entirely unseen data. For this purpose, we use the DIV2K dataset that contains diverse images, including landscapes, people, architecture, and animals. Figure 6 indicates that there is a significant degradation in performance on this new data across all four metrics. For example, at an SNR of 15 dB, LPIPS increases from 0.17 to 0.4, whereas FID increases from 45 to 83, indicating a substantial loss in perceptual quality. However, a close look at the generated images in Figure 7 reveals that much of this degradation may be attributed to the sharp differences in the colors between the original and generated images. The model does fairly well to reconstruct these unseen images and mitigate the effects of noise, but since it is finetuned on the Cityscapes dataset, the generated images have a color tone that very closely resembles that of the images in the said dataset. These results suggest that fine-tuning a Stable Cascade model on a single large and highly diverse dataset may enable it to handle a wide range of image types with strong performance.
Figure 7. Image reconstructions on unseen DIV2K data. It can be seen that the model does well to mitigate the noise and reconstruct semantically similar images considering that it was not finetuned for this dataset.
4.3.5. ABLATION STUDIES
Finally, we perform ablation tests to compare the performance of our fine-tuned model against the original Stable Cascade model in the semantic image communication scenario. Figure 8a shows that without fine-tuning, the original model's performance degrades sharply with decreasing SNR. In particular, at SNR less than 10 dB, the images generated using the original model are heavily corrupted by noise. This is also evident from Figure 9, which shows that the original model is unable to mitigate the channel effects. These findings validate our training approach and demonstrate the substantial performance gains achieved by fine-tuning the model to work with a noisy image embedding as a conditioning signal.
We also analyze the impact of increasing the size of the extracted image embedding on the generation quality of 1024×1024 images. It can be seen from Figure 8b that there is a noticeable improvement in performance across all four performance metrics when the embedding size is increased from [16, 24, 24] to [16, 32, 32]. Quantitatively, on average, LPIPS, FID, and SSIM scores improve by greater than 10%. However, these improvements come at a cost to the compression ratio, which drops from 341 to 192. Hence, there is an understandable tradeoff between performance and bandwidth efficiency.
Figure 8. Results of ablation experiments highlighting (a) the performance gains obtained via fine-tuning and (b) the impact of increasing the embedding size from [16, 24, 24] to [16, 32, 32] on performance metrics. [Each part has four panels: FID vs SNR, LPIPS vs SNR, SSIM vs SNR, PSNR vs SNR; x-axis: SNR (dB); curves in (a): Finetuned, Original SC Model; curves in (b): [16, 24, 24], [16, 32, 32].]
Figure 9. Images reconstructed by the original Stable Cascade model. It can be seen that without proper fine-tuning, the original model fails to deal with the effects of channel noise.
5. Conclusion
In this paper, we introduce a novel DM-based SIC framework that leverages the Stable Cascade architecture to achieve an exceptional balance of speed, compression, and fidelity under noisy channel conditions. Our method transmits a highly compact image embedding, only 0.29% of the original size, and reconstructs 512×512 images in just 0.78 seconds, 3× faster than Img2Img-SC. Extensive evaluations using perceptual quality metrics, including LPIPS, SSIM, and FID, demonstrate the noise robustness of our approach and its superiority over existing benchmarks. Additionally, our framework minimizes generation randomness, achieving an LPIPS score standard deviation of only 0.003 at SNRs of 10 dB and above (Table 2), ensuring faithful and consistent image reconstruction. Future work may explore further optimizations to minimize inference time and extend the framework to high-fidelity semantic video communication.
Acknowledgements
This research has received funding from the European Union's Horizon Europe research and innovation programme MSCA-DN NESTOR (G.A. 101119983). The authors also acknowledge EPSRC project TRANSNET (EP/R035342/1). Experiments were run on Aston EPS Machine Learning Server, funded by the EPSRC Core Equipment Fund, Grant EP/V036106/1.
References
Agustsson, E. and Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
Bourtsoulatze, E., Kurka, D. B., and Gündüz, D. Deep joint source-channel coding for wireless image transmission. IEEE Transactions on Cognitive Communications and Networking, 5(3):567–579, 2019.
Chen, W. and Yang, Q. Casc: Condition-aware semantic communication with latent diffusion models. arXiv preprint arXiv:2411.06552, 2024.
Cicchetti, G., Grassucci, E., Park, J., Choi, J., Barbarossa, S., and Comminiello, D. Language-oriented semantic latent representation for image transmission. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2024.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883, 2021.
Grassucci, E., Barbarossa, S., and Comminiello, D. Generative semantic communication: Diffusion models beyond bit recovery. arXiv preprint arXiv:2306.04321, 2023.
Jiang, Z., Liu, X., Yang, G., Li, W., Li, A., and Wang, G. Diffsc: Semantic communication framework with enhanced denoising through diffusion probabilistic models. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 13071–13075. Institute of Electrical and Electronics Engineers Inc., 2024. ISBN 9798350344851. doi: 10.1109/ICASSP48485.2024.10448094.
Luo, X., Chen, H.-H., and Guo, Q. Semantic communications: Overview, open issues, and future research directions. IEEE Wireless Communications, 29(1):210–219, 2022.
Nam, H., Park, J., Choi, J., Bennis, M., and Kim, S.-L. Language-oriented communication with semantic coding and knowledge distillation for text-to-image generation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13506–13510. IEEE, 2024.
Pernias, P., Rampas, D., Richter, M. L., Pal, C. J., and Aubreville, M. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. arXiv preprint arXiv:2306.00637, 2023.
Qin, Z., Tao, X., Lu, J., Tong, W., and Li, G. Y. Semantic communications: Principles and challenges. arXiv preprint arXiv:2201.01389, 2021.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
Seitzer, M. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.
Strinati, E. C. and Barbarossa, S. 6g networks: Beyond shannon towards semantic and goal-oriented communications. Computer Networks, 190:107930, 2021.
Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR, 2019.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
Xia, L., Sun, Y., Liang, C., Zhang, L., Imran, M. A., and Niyato, D. Generative AI for Semantic Communication: Architecture, Challenges, and Outlook. IEEE Wireless Communications, 32(1):132–140, 2025. doi: 10.1109/MWC.003.2300351.
Yang, P., Zhang, G., and Cai, Y. Rate-adaptive generative semantic communication using conditional diffusion models. IEEE Wireless Communications Letters, 14(2):539–543, 2025.
Yilmaz, S. F., Niu, X., Bai, B., Han, W., Deng, L., and Gündüz, D. High perceptual quality wireless image delivery with denoising diffusion models. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–5. IEEE, 2024.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.