A Novel Vision Transformer for Camera-LiDAR Fusion Based Traffic Object Segmentation

Author: Tahves, Toomas; Gu, Junyi; Bellone, Mauro; Sell, Raivo

Publisher: Zenodo

DOI: 10.5220/0013239000003890

Source: https://zenodo.org/records/17660856/files/132390.pdf

Toomas Tahves1 a, Junyi Gu1,2 b, Mauro Bellone3 c and Raivo Sell1 d
1 Department of Mechanical and Industrial Engineering, Tallinn University of Technology, Estonia
2 Dept. of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Sweden
3 FinEst Centre for Smart Cities, Tallinn University of Technology, Estonia
Keywords: Dense Vision Transformers, Semantic Segmentation, Sensor Fusion, Residual Neural Network.
Abstract: This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects, including cyclists, traffic signs, and pedestrians, across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions, which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state of the art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.
1 INTRODUCTION
This work extends our previous work on the camera-LiDAR fusion transformer (CLFT) (Gu et al., 2024), which utilizes the encoder-decoder structure of a transformer network but uses a novel progressive-assemble strategy of vision transformers. We elaborate on the CLFT methodology and extend segmentation with additional classification options. Our goal is to outperform existing CNN and visual transformer models by leveraging camera and LiDAR data fusion.
Transformers (Vaswani et al., 2023), initially introduced for language models, rely on a mechanism called self-attention to process input data patches. This allows models to globally weigh the importance of different parts of input data simultaneously, thus improving computation efficiency. Since transformers do not contain information about the order of input tokens, positional encodings are added to input embeddings to retain information which is crucial to remember in tasks such as language translation and image recognition.
a https://orcid.org/0009-0008-0050-2146
b https://orcid.org/0000-0002-5976-6698
c https://orcid.org/0000-0003-3692-0688
d https://orcid.org/0000-0003-1409-0206

Vision transformers (ViT) (Dosovitskiy et al., 2021) apply the transformer architecture to image data by dividing images into patches and treating each patch as a token, which allows models to capture global context and relationships between different parts of an image. Dense prediction transformers (DPT) (Ranftl et al., 2021) process image patches similarly to ViTs but focus on generating pixel-level predictions by leveraging the strengths of transformers in capturing long-range dependencies and contextual information. Our hypothesis is that the combination of ViT and DPT can capture dependencies in the data and thereby improve the interpretation of less-represented classes, considering that autonomous driving datasets are strongly unbalanced towards vehicles.
Following this line of research, our work provides the following main contributions:
• We enhanced the CLFT model to handle a broader spectrum of traffic objects, including cyclists, signs, and pedestrians.
• Through extensive testing, we demonstrated that our model achieves superior accuracy and performance metrics compared to other visual transformer models.
• By leveraging the strengths of multi-modal sensor fusion and the multi-attention mechanism, the CLFT model proves to be a solution for diverse environmental conditions, including challenging weather scenarios.
Tahves, T., Gu, J., Bellone, M. and Sell, R.
A Novel Vision Transformer for Camera-LiDAR Fusion Based Traffic Object Segmentation.
DOI: 10.5220/0013239000003890
In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 2, pages 566-573
ISBN: 978-989-758-737-5; ISSN: 2184-433X
Copyright ©2025 by Paper published under CC license (CC BY-NC-ND 4.0)
2 RELATED WORK
The fusion of camera and LiDAR data is a widely researched topic in multimodal fusion with applications in object detection and segmentation. Various techniques have been proposed over the years to solve these problems; (Cui et al., 2022) proposed the following categorization options: signal-level, feature-level, result-level, and multi-level fusion. Signal-level fusion depends on raw sensor data. While it is suitable for depth completion (Cheng et al., 2019) (Lin et al., 2022) and landmark detection (Lee and Park, 2021) (Caltagirone et al., 2018), it still suffers from loss of texture information. Voxel grids or 2D projections are used to represent LiDAR data as feature maps; for instance, the implementation of VoxelNet (Zhou and Tuzel, 2017) uses raw point clouds as voxels before fusing LiDAR data with camera pixels. Result-level fusion increases accuracy by merging prediction results from different model outputs (Jaritz et al., 2020) (Gu et al., 2018). Through reviewing the literature, it is possible to observe that the recent trend is to shift towards multi-level fusion, which represents a combination of all other fusion strategies. The computational complexity resulting from LiDAR 3D data is tackled by reducing the dimensionality to a two-dimensional image to exploit the existing image processing methods. Our work uses a transformer-based network for integrating camera and LiDAR data in a cross-fusion strategy in the decoder layers.
The attention mechanism introduced in the transformer architecture in (Vaswani et al., 2023) has a tremendous impact in various fields, especially in natural language processing (Xiao and Zhu, 2023) and computer vision. One notable variant is the vision transformer (ViT) (Dosovitskiy et al., 2021), which excels in autonomous driving tasks by handling global contexts and long-range dependencies. Perceiving the surrounding area in a two-dimensional plane primarily involves extracting information from camera images, with notable works like bird's-eye-view transformers for road surface segmentation presented in (Zhu et al., 2024). Other recent approaches include lightweight transformers for lane shape prediction and combined semantic and instance segmentation (Lai-Dang, 2024). Three-dimensional autonomous driving perception is an extensively researched topic focusing on object detection and segmentation. In DETR3D (Wang et al., 2021), the authors present a multi-camera object detection method. Unlike others that rely on monocular images, it extracts 2D features from images and uses 3D object queries to link features to 3D positions via camera transformation matrices. FUTR3D (Chen et al., 2023) employs a query-based Modality-Agnostic Feature Sampler (MAFS), together with a transformer decoder with a set-to-set loss for 3D detection, thus avoiding late fusion heuristics and post-processing tricks. BEVFormer (Li et al., 2022) improves object detection and map segmentation with spatial and temporal attention layers via spatiotemporal transformers.
Recent works emphasize the fusion of camera and LiDAR data for enhanced perception. CLFT models, for instance, process LiDAR point clouds as image views to achieve 2D semantic segmentation, bridging gaps in multi-modal semantic object segmentation.
3 METHODOLOGY
In this section, we elaborate on the detailed structure of the CLFT network in the sequential order of data processing, aiming to provide an exclusive insight into how the sensory data flows in the network, which benefits the understanding and reproducibility of our work.
The CLFT network achieves the camera-LiDAR fusion by progressively assembling features from each modality first and then conducting the cross-fusion at the end. Figuratively, the CLFT network has two directions to process the input camera and LiDAR data in parallel; the integration of the two modalities happens at the ‘fusion’ stage in the network’s decoder block. In general, there are three steps in the entire process. The first step is pre-processing the input, which embeds the image-like data into the learnable transformer tokens; the second step closely follows the protocols of ViT (Dosovitskiy et al., 2021) encoders to encode the embedded tokens; the last step is the post-processing of the data, which progressively assembles and fuses the feature representations to acquire segmentation predictions. The details of the three steps are described in the following three subsections.
3.1 Embedding
The camera and LiDAR input data pre-processing is independent and in parallel. As mentioned in Section 1, we select the LiDAR processing strategy to project the point cloud data onto the camera plane, thus attaining the LiDAR projection images. For deep multi-modal sensor fusion, the transition from different inputs to a unified modality simplifies the network structure and minimizes the fusion errors.
Figure 1: Embedding process of camera and LiDAR data. (a) The original image is resized to a resolution of 384 × 384 to standardize the input dimensions. (b) The input image is segmented into non-overlapping fixed-size patches of 16 × 16 pixels. (c) Patches are flattened into one-dimensional embedded vectors, with an additional positional embedding (colored in orange) added to provide spatial information. (d) The combined patch embeddings are processed through Multilayer Perceptrons (MLPs) with dimensions $E = \bar{D} \times D$, resulting in a matrix that serves as the input to the transformer encoder. The whole figure is based on the CLFT-Base variant.

As shown in Fig. 1, there are a total of four steps in the embedding module. The first step is resizing the camera and LiDAR matrices to $r = 384$ and $c = 384$, where $r$ is the number of rows and $c$ is the number of columns. The second step segments the input image into non-overlapping fixed-size patches. The size of each patch $p$ in pixels is 16 × 16. Therefore, the dimension $\bar{D}$ of the token representing one patch is $16 \times 16 \times 3 = 768$. In the third step, patches are flattened into one-dimensional embedded vectors $X$ of length $\frac{r \cdot c}{p \cdot p} = 576$ to serve as input tokens for the transformer model. Since transformers inherently lack the capacity to comprehend spatial and two-dimensional neighborhood structure relationships between patches, we incorporate an extra positional embedding into each patch (Dosovitskiy et al., 2021). The additional embedding provides the network with essential information regarding the relative spatial positions of the patches within the original image. Sequentially, in the last step, we pass the combined patch embeddings through the Multilayer Perceptrons (MLPs) with dimensions of $E = \bar{D} \times D$. $D$ indicates the network’s various feature dimensions for different network parameter configurations. The resulting matrix $X \times E$ is the input of the transformer encoder for further learning and processing.
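The shape arithmetic of the embedding steps can be checked with a short sketch. This is only an illustration under stated assumptions: the function names `embedding_shapes` and `patchify` are hypothetical and not taken from the CLFT code base, and images are represented as plain nested lists rather than tensors.

```python
def embedding_shapes(rows=384, cols=384, patch=16, channels=3, feature_dim=768):
    """Return (number of tokens, per-token length D-bar, MLP shape E)."""
    if rows % patch or cols % patch:
        raise ValueError("image size must be divisible by the patch size")
    num_tokens = (rows * cols) // (patch * patch)  # r*c / (p*p)
    token_dim = patch * patch * channels           # D-bar = 16*16*3
    projection = (token_dim, feature_dim)          # E = D-bar x D
    return num_tokens, token_dim, projection


def patchify(image, patch):
    """Split a rows x cols x channels nested list into flattened patches.

    Patches are non-overlapping and visited row by row, so the token
    index also encodes the patch position in the original image.
    """
    rows, cols = len(image), len(image[0])
    tokens = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # append all channel values
            tokens.append(token)
    return tokens


# CLFT-Base settings from the text: 576 tokens of length 768.
print(embedding_shapes())  # (576, 768, (768, 768))
```

With the CLFT-Base values, the sketch reproduces the 576 tokens of length 768 stated above; the positional embedding and the learned MLP weights are deliberately left out.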
3.2 Encoder
The essence of the transformer encoder is the Multi-Head Self-Attention (MHSA) mechanism (Vaswani et al., 2023), which allows the network to weigh the importance of each patch relative to each other. With the assistance of MHSA, the neural networks effectively capture global dependencies and information by computing attention scores between all pairs of patches. Moreover, these scores are used to generate weighted sums of the patch embeddings. The encoder output consists of embedding matrices, each corresponding to a patch in the original image.
Figure 2 illustrates the detailed process of our CLFT encoder. The input of the encoder is the resulting matrix $X' = X \times E$ from the previous embedding step (see Fig. 2(a)). The matrix $X'$ contains the image’s patch and position embeddings, as well as the learnable class tokens. The dimension of $X'$ is $(576 + 1) \times 768$, which means there are 576 patch embeddings and one extra position embedding. This approach is inspired by BERT’s tokenization method, which uses similar embeddings to capture contextual information within text (Devlin et al., 2019). The multi-head $X'$ matrix is then reshaped into $577 \times 3 \times 768$, which represents the Query, Key, and Value (QKV) matrices, respectively. Equation 1 shows the multi-head attention $H$ calculation in this step.
$$H(Q, K, V) = \left( \bigoplus_{i=1}^{N} h_i \right) W^O \tag{1}$$
where $\bigoplus$ means concatenation of head vectors side by side with each other, and $W^O$ is the weight matrix used to linearly transform the concatenated outputs. Each head $h_i$ is calculated individually using its own set of projection matrices as follows:
$$h_i = A(Q W_i^Q, K W_i^K, V W_i^V) \tag{2}$$
where $A$ denotes the attention mechanism of the queries ($Q$), keys ($K$), and values ($V$). Projection matrices $W_i^Q$, $W_i^K$, and $W_i^V$ for the $i$-th head are defined as follows:
$$W_i^Q \in \mathbb{R}^{d_m \times d_k}, \qquad W_i^K \in \mathbb{R}^{d_m \times d_k}, \qquad W_i^V \in \mathbb{R}^{d_m \times d_v} \tag{3}$$
Figure 2: Encoder process. (a) The output from embedding is normalized and passed through linear layers into the multi-head attention block. (b) The matrix is split into KQV matrices, upon which SoftMax and attention operations are performed. The KQV matrices are then reshaped into a single matrix. (c) Finally, linear operations are executed, and the result is processed through the MLP block.

The Softmax attention mechanism follows Equation 4:
$$A(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \tag{4}$$
where the term $Q K^T$ represents the dot product of the queries and the transposed keys, generating a similarity score between each query-key pair. The square root of the key dimension $d_k$ prevents the dot product from becoming too large, which stabilizes the gradients during training. The Softmax function is applied to the scaled similarity scores, converting them into attention weights, which determine the importance of each key-value pair for the given query. Finally, the attention weights are used to compute a weighted sum of the values $V$, producing the final output of the attention mechanism for each head.
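As a concrete reading of Equation 4, the following is a minimal single-head sketch in plain Python. It is not the CLFT implementation: matrices are nested lists, there are no learned projections, and the helper names `softmax` and `attention` are chosen here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention A(Q, K, V) for one head.

    Q, K and V are lists of row vectors; K and V must have the same
    number of rows, and Q and K the same row length d_k.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return outputs


# A zero query scores every key equally, so the output is the mean of V.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]]))
```

In a multi-head setting, this function would be applied once per head to the projected inputs of Equation 2, and the results concatenated as in Equation 1.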
The QKV matrices are then reshaped into $N \times 577 \times 64$, where $N$ stands for the number of layers defined in the CLFT configuration (as shown in Table 1). At last, the matrices go through the normalization and MLP layers to become the input of the CLFT decoder (Fig. 2(c)).
Table 1 outlines our potential configuration options for the CLFT encoder. The names follow the ViT conventions. Each configuration features predefined transformer layers and a feature dimension $D$ with fixed-size tokens. The CLFT-Hybrid configuration distinguishes itself from the others by using a ResNet50 residual network (He et al., 2015) to convert 768 × 768 images into 14 × 14 patches, then flattened into one-dimensional vectors of size 196.

Table 1: CLFT configuration variants.
Type          Layers  Feature dimension D
CLFT-Base     12      768
CLFT-Large    24      1024
CLFT-Huge     32      1280
CLFT-Hybrid   12      768
3.3 Decoder
The decoder module processes the tokens from encoder layers to progressively assemble the feature representations into a 3D matrix. This matrix can be visualized as an image to make predictions. We extend the three-stage reassembly operation initially proposed in (Ranftl et al., 2021), including data reading, concatenating, and resampling, with the extra stage to execute the cross-fusion of camera and LiDAR data.
In the first stage of reassembly, shown in Fig. 3(a), we append a special classification token to a set of $N$ tokens, potentially capturing global information. (Ranftl et al., 2021) have evaluated three different variants of this mapping:
• One that ignores the special class token and processes only the individual tokens.
• One that propagates information from the class token to all other tokens.
• One that concatenates the class token to all other tokens, then projects the combined representation through a linear layer followed by the GELU activation function to introduce non-linearity.
Figure 3(b) shows the second stage of the decoder. A total amount of $N$ tokens are shaped into an image-like feature map with the aid of position tokens. The feature map with $D$ channels is concatenated into a result $R = \frac{r}{p} \times \frac{c}{p} \times D$.
Figure 3(c) illustrates the third and last stage. The feature map is first scaled to size $R = \frac{r}{s} \times \frac{c}{s} \times \hat{D}$, where $\hat{D}$ is set as 256 in all experiments. Features from early layers are resampled at higher resolutions, while features from deeper layers of the transformer are resampled at lower resolutions. The CLFT-Base variant uses layers $l = \{3, 6, 9, 12\}$, and the CLFT-Large variant utilizes layers $l = \{5, 12, 18, 24\}$ to extract features. The CLFT-Hybrid variant employs ResNet layers for initial feature extraction and incorporates transformer layers $l = \{9, 12\}$ for deeper feature representation. The scaling coefficients are $s = \{4, 8, 16, 32\}$.
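To make the reassembly sizes concrete, the sketch below evaluates both shape formulas for the CLFT-Base input size. It only does the arithmetic and performs no resampling; the function name `reassemble_shapes` and its defaults are illustrative assumptions based on the values quoted in the text.

```python
def reassemble_shapes(rows=384, cols=384, patch=16, feature_dim=768,
                      out_dim=256, scales=(4, 8, 16, 32)):
    """Return the token feature-map shape and the resampled map shapes.

    The first result is R = r/p x c/p x D (second decoder stage); the
    second lists R = r/s x c/s x D-hat for every scaling coefficient s.
    """
    token_map = (rows // patch, cols // patch, feature_dim)
    resampled = [(rows // s, cols // s, out_dim) for s in scales]
    return token_map, resampled


token_map, resampled = reassemble_shapes()
print(token_map)   # (24, 24, 768)
print(resampled)   # [(96, 96, 256), (48, 48, 256), (24, 24, 256), (12, 12, 256)]
```

The smallest coefficient yields the highest-resolution map, which matches the statement that early-layer features are resampled at higher resolutions than deeper ones.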
In the last cross-fusion stage, camera and LiDAR features are combined from feature maps in parallel. Extracted feature maps are combined using the RefineNet-based feature fusion method, which employs two residual convolution units (RCUs) in a sequence. Results from camera and LiDAR representations are summed from the previous fusion stage and passed through another RCU. The output of the last RCU is passed to a de-convolutional layer and up-sampled to compute the predicted segmentation.

Figure 3: Decoder process. (a) The input tensor, representing data, is concatenated with classification tokens. (b) These tokens are then concatenated based on their positional information, yielding an image-like representation. Two convolution operations, along with up-sampling and down-sampling, are applied. (c) Cross-fusion is applied to combine camera and LiDAR data, progressively integrating outputs from residual computation units from previous steps. The final predicted segmentation is computed through deconvolution and up-sampling blocks.
4 DATASET CONFIGURATION
The Waymo Open Dataset (WOD) is designed to aid researchers in autonomous driving. It includes data from camera and LiDAR sensors, collected in urban and suburban environments under diverse driving conditions. It contains labels for four object classes: vehicles, pedestrians, cyclists, and signs. We have manually partitioned the dataset into four subsets: dry day, rainy day, dry night, and rainy night. The number of frames per subset is shown in Table 2.
We use intersection over union (IoU) to evaluate the performance of the model, along with values of precision and recall. The IoU computation is extended to validate multi-class semantic segmentation by assigning ambiguous pixels to void and excluding them from the final validation. We compare the ground truth (Waymo label values) to the output of the CLFT model to measure the performance of our work.
Table 2: Frame count per subset in WOD.
Dry day  Rainy day  Dry night  Rainy night
14940    4520       1640       900
4.1 Metrics
We use the intersection over union (IoU) as the primary indicator to evaluate the performance of our networks. In addition, we provide the results of precision and recall. The IoU is primarily used in object detection applications, in which the output is the bounding box around the object. We modify the ordinary IoU algorithm to fit multi-class pixel-wise semantic object segmentation. Consider a set of predefined semantic classes $\mathcal{L} = \{0, 1, \ldots, L-1\}$. Each pixel in the image can be represented as a pair $(p_L, g_L)$, where $p_L$ and $g_L$ indicate the prediction and ground-truth class, respectively. The performance of the networks is measured by the statistics of the number of pixels that have identical classes indicated in prediction and ground truth. Not all pixels have a valid label; therefore, ambiguous pixels that fall out of the class list are assigned as void and not counted in the evaluation. The IoU of each class is given by Equation 5, where $\bar{L}$ means the non-identical class.
$$\mathrm{IoU}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_L g_{\bar{L}}) + \sum (p_{\bar{L}} g_L)} \tag{5}$$
Correspondingly, the precision and recall are obtained by Equations 6 and 7.
$$\mathrm{Precision}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_L g_{\bar{L}})} \tag{6}$$
$$\mathrm{Recall}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_{\bar{L}} g_L)} \tag{7}$$
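The per-class metrics of Equations 5 to 7 can be sketched for flattened label maps as follows. This is an illustrative reading rather than the evaluation code used in the experiments: the `VOID` value of -1 and the function name `class_metrics` are assumptions, and void pixels are identified here by their ground-truth label.

```python
VOID = -1  # hypothetical label for ambiguous pixels outside the class list


def class_metrics(pred, truth, cls, void=VOID):
    """Return IoU, precision and recall of one class over paired pixel labels."""
    tp = fp = fn = 0
    for p, g in zip(pred, truth):
        if g == void:
            continue  # void pixels are not counted in the evaluation
        if p == cls and g == cls:
            tp += 1   # prediction and ground truth agree on the class
        elif p == cls:
            fp += 1   # predicted as the class, labelled as another class
        elif g == cls:
            fn += 1   # labelled as the class, predicted as another class

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


pred = [1, 1, 2, 0, 1, 2]
truth = [1, 2, 2, 0, VOID, 1]
print(class_metrics(pred, truth, cls=1))
```

In this toy example, class 1 has one true positive, one false positive and one false negative, so the IoU is 1/3 and both precision and recall are 0.5; the fifth pixel is void and ignored.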
5 EXPERIMENTAL RESULTS
5.1 Experimental Setup
The transformer-based networks were trained on servers equipped with Nvidia A100 80GB graphics cards. Each training session utilized a batch size of 24, running for up to 400 epochs. Early stopping criteria were implemented to prevent over-fitting and to ensure efficient use of computational resources.

Table 3: Performance comparison of the CLFT-Hybrid method during various weather conditions (values in %).
                IoU                                 Precision                           Recall
                Cyclist     Pedestrian  Sign        Cyclist     Pedestrian  Sign        Cyclist     Pedestrian  Sign
Dry day
Camera          64.17       67.88       45.48       83.79       79.99       65.41       73.27       81.76       59.88
LiDAR           64.06       68.21       45.22       83.41       79.84       64.45       73.41       82.41       60.24
Camera+LiDAR    60.96       67.75       45.09       82.73       79.42       61.97       69.86       82.17       62.34
Rainy day
Camera          70.75       61.98       35.49       86.19       80.19       68.98       79.80       73.18       42.23
LiDAR           73.76       62.84       37.05       89.53       80.79       68.02       80.73       73.89       44.86
Camera+LiDAR    72.63       62.50       37.82       87.27       79.84       62.30       81.24       74.22       49.03
Dry night
Camera          66.11       66.11       32.82       83.60       81.48       56.74       75.96       77.80       43.77
LiDAR           66.95       66.87       32.70       87.13       80.69       57.23       74.30       79.61       43.27
Camera+LiDAR    61.55       65.68       31.87       79.06       79.80       50.52       73.53       78.78       46.33
Rainy night
Camera          16.38       43.57       40.45       42.30       66.13       64.81       21.10       56.09       51.83
LiDAR           50.11       49.54       39.04       71.10       64.22       59.07       62.92       68.42       53.53
Camera+LiDAR    63.41       48.13       37.42       79.94       70.40       55.28       75.41       60.33       53.67
The dataset was divided into three parts: 60% for training, 20% for validation, and 20% for testing. This distribution ensures a balanced approach, allowing the model to learn effectively, validate its performance during training, and be evaluated on unseen data to assess its generalization capabilities.
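A 60/20/20 partition of this kind can be sketched as below. The text does not state how frames were assigned to the three parts, so the seeded shuffle, the function name `split_frames`, and the use of the 900 rainy-night frames from Table 2 as the example are all assumptions made for illustration.

```python
import random


def split_frames(frame_ids, percents=(60, 20, 20), seed=0):
    """Shuffle frame ids reproducibly and cut them into train/val/test lists."""
    if sum(percents) != 100:
        raise ValueError("percentages must sum to 100")
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    n_train = len(ids) * percents[0] // 100
    n_val = len(ids) * percents[1] // 100
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # remainder goes to the test part
    return train, val, test


train, val, test = split_frames(range(900))
print(len(train), len(val), len(test))  # 540 180 180
```

Integer percentages are used instead of floating-point fractions so that the part sizes are exact; any rounding remainder falls into the test part.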
A total of nine training sessions were conducted, covering three network configurations (CLFT-Base, CLFT-Large, and CLFT-Hybrid). For each configuration, separate training sessions were performed for LiDAR-only, camera-only, and cross-fusion of camera+LiDAR data to comprehensively evaluate the performance across different sensor configurations.
5.2 Varying Weather Conditions
We conducted an analysis of the network performance across four distinct weather conditions: dry day, rainy day, dry night, and rainy night. The results of the CLFT-Hybrid method under these various conditions are summarized in Table 3.
In dry day conditions, the performance of the CLFT-Hybrid model using LiDAR alone (IoU: 64% for cyclists, 68% for pedestrians) was comparable to using camera data alone (IoU: 64% for cyclists, 68% for pedestrians) and slightly better than the combined data.
During rainy day conditions, LiDAR data outperformed camera data (IoU: 74% for cyclists, 63% for pedestrians vs. 71% for cyclists, 62% for pedestrians). This is an expected result, as the camera is blurred by rain, while LiDARs are typically less affected. Combined data was competitive, with IoU of 73% for cyclists and 63% for pedestrians, showing LiDAR’s resilience against visual noise and low-light environments.
Under dry night conditions, LiDAR data performed better than both combined and camera data alone (IoU: 67% for cyclists, 67% for pedestrians vs. 66% for cyclists and pedestrians with camera), showing LiDAR’s advantage in low-light conditions.
Under rainy night conditions, the combined LiDAR+Camera data yielded the highest cyclist IoU (63% vs. 50% with LiDAR alone), while its pedestrian IoU was slightly lower than that of LiDAR alone (48% vs. 50%). Cross-fusion effectively leveraged complementary information, providing depth and texture details.
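The comparison across conditions can be reproduced directly from the cyclist IoU columns of Table 3. The sketch below copies those values into a dictionary and selects the best input per condition; the names `CYCLIST_IOU` and `best_modality` are illustrative only.

```python
# Cyclist IoU (%) of CLFT-Hybrid, copied from Table 3.
CYCLIST_IOU = {
    "dry day": {"Camera": 64.17, "LiDAR": 64.06, "Camera+LiDAR": 60.96},
    "rainy day": {"Camera": 70.75, "LiDAR": 73.76, "Camera+LiDAR": 72.63},
    "dry night": {"Camera": 66.11, "LiDAR": 66.95, "Camera+LiDAR": 61.55},
    "rainy night": {"Camera": 16.38, "LiDAR": 50.11, "Camera+LiDAR": 63.41},
}


def best_modality(scores):
    """Return the input type with the highest score."""
    return max(scores, key=scores.get)


for condition, scores in CYCLIST_IOU.items():
    winner = best_modality(scores)
    print(f"{condition}: {winner} ({scores[winner]:.2f})")
```

For cyclists, fusion is the best input only in the rainy night subset, where the camera-only IoU drops to 16.38%.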
5.3 Varying Network Configurations
The performance metrics of different CLFT configurations under dry day conditions are summarized in Table 4. The CLFT-Base configuration showed that using either camera or LiDAR alone provides comparable results, but combining them did not yield significant improvements. The CLFT-Large configuration benefited from higher precision, especially when combining data sources, suggesting better accuracy in identifying objects, though IoU did not significantly improve. The CLFT-Hybrid configuration performed the best overall, particularly using either camera data alone or LiDAR data alone. This model effectively leverages the strengths of both data types, with the fusion of both data sources yielding high recall for signs.
Table 4: Performance metrics under dry day conditions for different CLFT configurations (values in %; C: camera, L: LiDAR, C+L: camera+LiDAR).
                  Cyclist                          Pedestrian                       Sign
                  IoU        Precision  Recall     IoU        Precision  Recall     IoU        Precision  Recall
CLFT-Base C       50.07      84.72      55.04      65.71      80.56      78.09      41.27      66.46      52.13
CLFT-Base L       47.01      84.27      51.53      64.06      78.60      77.59      39.76      63.15      51.78
CLFT-Base C+L     48.31      80.48      54.73      65.11      77.85      79.92      41.33      61.35      55.88
CLFT-Large C      53.50      83.61      59.77      66.03      82.11      77.12      41.17      68.81      50.61
CLFT-Large L      53.91      84.53      59.81      66.31      80.06      79.43      41.44      64.49      53.70
CLFT-Large C+L    53.58      85.11      59.12      66.10      82.28      77.07      41.90      70.07      51.03
CLFT-Hybrid C     64.17      83.79      73.27      67.88      79.99      81.76      45.48      65.41      59.88
CLFT-Hybrid L     64.06      83.41      73.41      68.21      79.84      82.41      45.22      64.45      60.24
CLFT-Hybrid C+L   60.96      82.73      69.86      67.75      79.42      82.17      45.09      61.97      62.34
5.4 Comparison to Other Networks
We compared our results to those of traditional Fully Convolutional Networks (FCN) (Gu et al., 2022) and panoptic networks as presented in (Gu et al., 2024). The CLFT-Hybrid achieved higher IoU scores (e.g., 64% for cyclists and 68% for pedestrians in dry day conditions) compared to typical FCN and panoptic networks, which often struggle with complex scenes and poor visibility. Unlike FCNs and panoptic networks that rely on single modalities, the CLFT effectively combines LiDAR and camera data, enhancing performance, especially in challenging scenarios like rainy nights (IoU: 63% for cyclists).
6 CONCLUSION
In this paper, we demonstrated the effectiveness of Camera-LiDAR Fusion Transformer (CLFT) models in achieving successful object segmentation by leveraging sensor cross-fusion and the transformer’s multi-attention mechanism. The CLFT-Hybrid model showed remarkable improvements in segmentation accuracy for cyclists, pedestrians, and traffic signs.
The CLFT models maintained high performance across a variety of weather conditions, including day, rain, and night scenarios. By combining the strengths of both LiDAR and camera data, the CLFT model effectively utilized cross-fusion to enhance overall performance. The transformer’s multi-attention mechanism enabled the CLFT models to focus on relevant features and improve object detection and segmentation accuracy.
Despite these promising results, several challenges remain. The CLFT models exhibited variability in performance under adverse weather conditions. For instance, while LiDAR alone performed well in fair conditions, the fusion of LiDAR and camera data sometimes led to suboptimal results. The models showed decreased performance in night and rainy conditions. The CLFT models, especially large configurations, require significant computational resources, which poses challenges for real-time implementation in resource-constrained environments.
Future work should focus on improving the accuracy of CLFT models in challenging environments, exploring more data fusion techniques, and integrating additional sensor modalities to further enhance overall performance.
ACKNOWLEDGMENT
Part of the research has received funding from the following grants: the European Union’s Horizon 2020 Research and Innovation Programme project Finest Twins (grant No. 856602) and AI-Enabled Data Lifecycles Optimization and Data Spaces Integration for Increased Efficiency and Interoperability PLIADES, grant agreement No. 101135988.
REFERENCES
Caltagirone, L., Bellone, M., Svensson, L., and Wahde, M. (2018). Lidar-camera fusion for road detection using fully convolutional neural networks.
Chen, X., Zhang, T., Wang, Y., Wang, Y., and Zhao, H. (2023). Futr3d: A unified sensor fusion framework for 3d detection.
Cheng, X., Wang, P., Guan, C., and Yang, R. (2019). Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion.
Cui, Y., Chen, R., Chu, W., Chen, L., Tian, D., Li, Y., and Cao, D. (2022). Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Transactions on Intelligent Transportation Systems, 23(2):722–739.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale.
Gu, J., Bellone, M., Pivoňka, T., and Sell, R. (2024). Clft: Camera-lidar fusion transformer for semantic segmentation in autonomous driving. IEEE Transactions on Intelligent Vehicles, pages 1–12.
Gu, J., Bellone, M., Sell, R., and Lind, A. (2022). Object segmentation for autonomous driving using iseauto data. Electronics, 11(7).
Gu, S., Lu, T., Zhang, Y., Alvarez, J. M., Yang, J., and Kong, H. (2018). 3-d lidar + monocular camera: An inverse-depth-induced fusion framework for urban road detection. IEEE Transactions on Intelligent Vehicles, 3(3):351–360.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.
Jaritz, M., Vu, T.-H., de Charette, R., Wirbel, É., and Pérez, P. (2020). xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation.
Lai-Dang, Q.-V. (2024). A survey of vision transformers in autonomous driving: Current trends and future directions.
Lee, J.-S. and Park, T.-H. (2021). Fast road detection by cnn-based camera–lidar fusion and spherical coordinate transformation. IEEE Transactions on Intelligent Transportation Systems, 22(9):5802–5810.
Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., and Dai, J. (2022). Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers.
Lin, Y., Cheng, T., Zhong, Q., Zhou, W., and Yang, H. (2022). Dynamic spatial propagation network for depth completion.
Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vision transformers for dense prediction.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need.
Wang, Y., Guizilini, V., Zhang, T., Wang, Y., Zhao, H., and Solomon, J. (2021). Detr3d: 3d object detection from multi-view images via 3d-to-2d queries.
Xiao, T. and Zhu, J. (2023). Introduction to transformers: an nlp perspective.
Zhou, Y. and Tuzel, O. (2017). Voxelnet: End-to-end learning for point cloud based 3d object detection.
Zhu, Y., Jia, X., Yang, X., and Yan, J. (2024). Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving.