A Novel Vision Transformer for Camera-LiDAR Fusion Based Traffic Object Segmentation
Toomas Tahves1,a, Junyi Gu1,2,b, Mauro Bellone3,c and Raivo Sell1,d
1 Department of Mechanical and Industrial Engineering, Tallinn University of Technology, Estonia
2 Dept. of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Sweden
3 FinEst Centre for Smart Cities, Tallinn University of Technology, Estonia
Keywords: Dense Vision Transformers, Semantic Segmentation, Sensor Fusion, Residual Neural Network.
Abstract: This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects, including cyclists, traffic signs, and pedestrians, across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions, which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state of the art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.
1 INTRODUCTION
This work extends our previous work on camera-LiDAR fusion transformer (CLFT) (Gu et al., 2024), which utilizes the encoder-decoder structure of a transformer network but uses a novel progressive-assemble strategy of vision transformers. We elaborate on the CLFT methodology and extend segmentation with additional classification options. Our goal is to outperform existing CNN and visual transformer models by leveraging camera and LiDAR data fusion.
Transformers (Vaswani et al., 2023), initially introduced for language models, rely on a mechanism called self-attention to process input data patches. This allows models to globally weigh the importance of different parts of the input data simultaneously, thus improving computation efficiency. Since transformers do not contain information about the order of input tokens, positional encodings are added to the input embeddings to retain this information, which is crucial in tasks such as language translation and image recognition.
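The order-blindness mentioned above can be checked numerically. The following is a minimal sketch, not the CLFT implementation: it uses a single attention head with random weights (all names and sizes are illustrative assumptions) and shows that shuffling the input tokens only shuffles the outputs, whereas adding a position-dependent vector makes the result depend on token order.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over tokens x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 5, 8                      # illustrative sizes
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
perm = np.array([1, 2, 3, 4, 0])          # a fixed reordering of the tokens

# Without positional information, permuting inputs just permutes outputs.
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)
order_blind = np.allclose(out[perm], out_shuffled)

# With a position-dependent vector added, the same shuffle changes the result.
pos = rng.normal(size=(n_tokens, dim))    # stand-in for a positional encoding
out_pos = self_attention(x + pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(x[perm] + pos, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)
```

Here `pos` is random only for brevity; in practice it would be a fixed sinusoidal or learned embedding.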
Vision transformers (ViT) (Dosovitskiy et al., 2021) apply the transformer architecture to image data by dividing images into patches and treating each patch as a token, which allows models to capture global context and relationships between different parts of an image. Dense prediction transformers (DPT) (Ranftl et al., 2021) process image patches similarly to ViTs but focus on generating pixel-level predictions by leveraging the strengths of transformers in capturing long-range dependencies and contextual information. Our hypothesis is that the combination of ViT and DPT can capture dependencies in the data and thereby improve the interpretation of less-represented classes, considering that autonomous driving datasets are strongly unbalanced towards vehicles.
a https://orcid.org/0009-0008-0050-2146
b https://orcid.org/0000-0002-5976-6698
c https://orcid.org/0000-0003-3692-0688
d https://orcid.org/0000-0003-1409-0206
Following this line of research, our work provides the following main contributions:
• We enhanced the CLFT model to handle a broader spectrum of traffic objects, including cyclists, signs, and pedestrians.
• Through extensive testing, we demonstrated that our model achieves superior accuracy and performance metrics compared to other visual transformer models.
• By leveraging the strengths of multi-modal sensor fusion and the multi-attention mechanism, the CLFT model proves to be a solution for diverse environmental conditions, including challenging weather scenarios.
Tahves, T., Gu, J., Bellone, M. and Sell, R.
A Novel Vision Transformer for Camera-LiDAR Fusion Based Traffic Object Segmentation.
DOI: 10.5220/0013239000003890
In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 2, pages 566-573
ISBN: 978-989-758-737-5; ISSN: 2184-433X
Copyright ©2025 by Paper published under CC license (CC BY-NC-ND 4.0)
2 RELATED WORK
The fusion of camera and LiDAR data is a widely researched topic in multimodal fusion, with applications in object detection and segmentation. Various techniques have been proposed over the years to solve these problems. Cui et al. (2022) proposed the following categorization: signal-level, feature-level, result-level, and multi-level fusion. Signal-level fusion depends on raw sensor data; while it is suitable for depth completion (Cheng et al., 2019) (Lin et al., 2022) and landmark detection (Lee and Park, 2021) (Caltagirone et al., 2018), it still suffers from loss of texture information. Voxel grids or 2D projections are used to represent LiDAR data as feature maps; for instance, the implementation of VoxelNet (Zhou and Tuzel, 2017) uses raw point clouds as voxels before fusing LiDAR data with camera pixels. Result-level fusion increases accuracy by merging prediction results from different model outputs (Jaritz et al., 2020) (Gu et al., 2018). Reviewing the literature shows that the recent trend is to shift towards multi-level fusion, which represents a combination of all other fusion strategies. The computational complexity resulting from LiDAR 3D data is tackled by reducing the dimensionality to a two-dimensional image in order to exploit existing image processing methods. Our work uses a transformer-based network for integrating camera and LiDAR data in a cross-fusion strategy in the decoder layers.
The attention mechanism introduced in the transformer architecture in (Vaswani et al., 2023) has had a tremendous impact in various fields, especially in natural language processing (Xiao and Zhu, 2023) and computer vision. One notable variant is the vision transformer (ViT) (Dosovitskiy et al., 2021), which excels in autonomous driving tasks by handling global contexts and long-range dependencies. Perceiving the surrounding area in a two-dimensional plane primarily involves extracting information from camera images, with notable works like bird's-eye-view transformers for road surface segmentation presented in (Zhu et al., 2024). Other recent approaches include lightweight transformers for lane shape prediction and combined semantic and instance segmentation (Lai-Dang, 2024). Three-dimensional autonomous driving perception is an extensively researched topic focusing on object detection and segmentation. In DETR3D (Wang et al., 2021), the authors present a multi-camera object detection method. Unlike others that rely on monocular images, it extracts 2D features from images and uses 3D object queries to link features to 3D positions via camera transformation matrices. FUTR3D (Chen et al., 2023) employs a query-based Modality-Agnostic Feature Sampler (MAFS), together with a transformer decoder with a set-to-set loss for 3D detection, thus avoiding late fusion heuristics and post-processing tricks. BEVFormer (Li et al., 2022) improves object detection and map segmentation with spatial and temporal attention layers via spatiotemporal transformers.
Recent works emphasize the fusion of camera and LiDAR data for enhanced perception. CLFT models, for instance, process LiDAR point clouds as image views to achieve 2D semantic segmentation, bridging gaps in multi-modal semantic object segmentation.
3 METHODOLOGY
In this section, we elaborate on the detailed structure of the CLFT network in the sequential order of data processing, aiming to provide an exclusive insight into how the sensory data flows in the network, which benefits the understanding and reproducibility of our work.
The CLFT network achieves the camera-LiDAR fusion by progressively assembling features from each modality first and then conducting the cross-fusion at the end. Figuratively, the CLFT network has two directions to process the input camera and LiDAR data in parallel; the integration of the two modalities happens at the 'fusion' stage in the network's decoder block. In general, there are three steps in the entire process. The first step is pre-processing the input, which embeds the image-like data to the learnable transformer tokens; the second step closely follows the protocols of ViT (Dosovitskiy et al., 2021) encoders to encode the embedded tokens; the last step is the post-processing of the data, which progressively assembles and fuses the feature representations to acquire segmentation predictions. The details of the three steps are described in the following three sub-sections.
3.1 Embedding
The camera and LiDAR input data pre-processing is independent and in parallel. As mentioned in Section 1, we select the LiDAR processing strategy to project the point cloud data onto the camera plane, thus attaining the LiDAR projection images. For deep multi-modal sensor fusion, the transition from different inputs to a unified modality simplifies the network structure and minimizes the fusion errors.
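The idea of projecting a point cloud onto the camera plane can be sketched with a plain pinhole model. This is an illustrative sketch only: it assumes the points are already expressed in the camera frame (z pointing forward), uses hypothetical intrinsics `fx, fy, cx, cy`, and writes a single-channel sparse depth image; the text above does not specify the calibration or the channel layout of the actual LiDAR projection images.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy, height, width):
    """Project 3D points (camera frame, z forward) into a sparse depth image."""
    pts = points_cam[points_cam[:, 2] > 0]            # drop points behind the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)   # pixel column
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)   # pixel row
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], pts[inside, 2]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)                   # keep the nearest point per pixel
    depth[np.isinf(depth)] = 0.0                      # 0 marks pixels without a LiDAR return
    return depth

# Toy example with hypothetical intrinsics and a 6 x 8 image.
points = np.array([
    [0.0, 0.0, 10.0],    # lands on the principal point
    [0.0, 0.0, 5.0],     # same pixel, but closer
    [0.2, -0.1, 10.0],   # slightly right and up
    [1.0, 0.0, -2.0],    # behind the camera, discarded
    [100.0, 0.0, 1.0],   # outside the image, discarded
])
depth_image = project_points(points, fx=100.0, fy=100.0, cx=4.0, cy=3.0, height=6, width=8)
```

The resulting array has the same row and column layout as a camera image, which is what allows both modalities to share the same embedding pipeline.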
Figure 1: Embedding process of camera and LiDAR data. (a) The original image is resized to a resolution of 384 × 384 to standardize the input dimensions. (b) The input image is segmented into non-overlapping fixed-size patches of 16 × 16 pixels. (c) Patches are flattened into one-dimensional embedded vectors, with an additional positional embedding (colored in orange) added to provide spatial information. (d) The combined patch embeddings are processed through Multilayer Perceptrons (MLPs) with dimensions $E = \bar{D} \times D$, resulting in a matrix that serves as the input to the transformer encoder. The whole figure is based on the CLFT-Base variant.
As shown in Fig. 1, there are a total of four steps in the embedding module. The first step is resizing the camera and LiDAR matrices to $r = 384$ and $c = 384$, where $r$ is the number of rows and $c$ is the number of columns. The second step segments the input image into non-overlapping fixed-size patches. The size of each patch $p$ in pixels is $16 \times 16$. Therefore, the dimension $\bar{D}$ of the token representing one patch is $16 \times 16 \times 3 = 768$. In the third step, patches are flattened into one-dimensional embedded vectors $X$ of length $\frac{r \cdot c}{p \cdot p} = 576$ to serve as input tokens to the transformer model. Since transformers inherently lack the capacity to comprehend spatial and two-dimensional neighborhood structure relationships between patches, we incorporate an extra positional embedding into each patch (Dosovitskiy et al., 2021). The additional embedding provides the network with essential information regarding the relative spatial positions of the patches within the original image. Finally, in the last step, we pass the combined patch embeddings through the Multilayer Perceptrons (MLPs) with dimensions of $E = \bar{D} \times D$, where $D$ indicates the network's feature dimension, which varies across network parameter configurations. The resulting matrix $X \times E$ is the input to the transformer encoder for further learning and processing.
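The patch arithmetic above can be reproduced in a few lines. The sketch below is a hedged illustration rather than the CLFT code: `split_into_patches` and `embed` are hypothetical helper names, the projection `E` and the positional embedding are random stand-ins for learnable parameters, and the positional term is added before the projection simply to mirror the order in which the steps are described.

```python
import numpy as np

def split_into_patches(image, patch=16):
    """Cut an (r, c, channels) image into flattened, non-overlapping patches."""
    r, c, ch = image.shape
    assert r % patch == 0 and c % patch == 0, "image must be divisible by patch size"
    # (r, c, ch) -> (r/p, p, c/p, p, ch) -> (r/p, c/p, p, p, ch)
    grid = image.reshape(r // patch, patch, c // patch, patch, ch)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * ch)       # X: one row per patch

def embed(image, feat_dim=768, patch=16, seed=0):
    """Flatten patches, add a positional term, and project to feat_dim (D)."""
    tokens = split_into_patches(image, patch)         # (576, 768) for 384 x 384 x 3
    rng = np.random.default_rng(seed)
    pos = rng.normal(scale=0.02, size=tokens.shape)   # stand-in positional embedding
    E = rng.normal(scale=0.02, size=(tokens.shape[1], feat_dim))  # stand-in for D_bar x D
    return (tokens + pos) @ E

image = np.random.default_rng(1).random((384, 384, 3))
tokens = split_into_patches(image)
embedded = embed(image, feat_dim=768)
```

With a 384 × 384 × 3 input this yields 576 tokens of length 768, and changing `feat_dim` to 1024 gives the token width listed for the larger configuration.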
3.2 Encoder
The essence of the transformer encoder is the Multi-Head Self-Attention (MHSA) mechanism (Vaswani et al., 2023), which allows the network to weigh the importance of each patch relative to each other. With the assistance of MHSA, the neural networks effectively capture global dependencies and information by computing attention scores between all pairs of patches. Moreover, these scores are used to generate weighted sums of the patch embeddings. The encoder output consists of embedding matrices, each corresponding to a patch in the original image.
Figure 2 illustrates the detailed process of our CLFT encoder. The input to the encoder is the resulting matrix $X' = X \times E$ from the previous embedding step (see Fig. 2(a)). The matrix $X'$ contains the image's patch and position embeddings, as well as the learnable class tokens. The dimension of $X'$ is $(576 + 1) \times 768$, which means there are 576 patch embeddings and one extra embedding for the learnable class token. This approach is inspired by BERT's tokenization method, which uses similar embeddings to capture contextual information within text (Devlin et al., 2019). The multi-head $X'$ matrix is then reshaped into $577 \times 3 \times 768$, which represents the Query, Key, and Value (QKV) matrices, respectively. Equation 1 shows the multi-head attention $H$ calculation in this step.
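$$H(Q, K, V) = \left( \bigoplus_{i=1}^{N} h_i \right) W^{O} \tag{1}$$
where $\bigoplus$ means concatenation of the head vectors side by side with each other, and $W^{O}$ is the weight matrix used to linearly transform the concatenated outputs. Each head $h_i$ is calculated individually using its own set of projection matrices as follows:
$$h_i = A(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{2}$$
where $A$ denotes the attention mechanism of the queries ($Q$), keys ($K$), and values ($V$). The projection matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ of the $i$-th head have the following dimensions, where $d_m$ is the embedding dimension and $d_k$ and $d_v$ are the per-head key and value dimensions:
$$W_i^{Q} \in \mathbb{R}^{d_m \times d_k}, \qquad W_i^{K} \in \mathbb{R}^{d_m \times d_k}, \qquad W_i^{V} \in \mathbb{R}^{d_m \times d_v} \tag{3}$$
The Softmax attention mechanism follows Equation 4:
$$A(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{4}$$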
where the term $Q K^{T}$ represents the dot product of the queries and the transposed keys, generating a similarity score between each query-key pair. The square root of the key dimension $d_k$ prevents the dot product from becoming too large, which stabilizes the gradients during training. The Softmax function is applied to the scaled similarity scores, converting them into attention weights, which determine the importance of each key-value pair for the given query. Finally, the attention weights are used to compute a weighted sum of the values $V$, producing the final output of the attention mechanism for each head.
Figure 2: Encoder process. (a) The output from embedding is normalized and passed through linear layers into the multi-head attention block. (b) The matrix is split into KQV matrices, upon which SoftMax and attention operations are performed. The KQV matrices are then reshaped into a single matrix. (c) Finally, linear operations are executed, and the result is processed through the MLP block.
The QKV matrices are then reshaped into $N \times 577 \times 64$, where $N$ stands for the number of layers defined in the CLFT configuration (as shown in Table 1). Finally, the matrices go through the normalization and MLP layers to become the input of the CLFT decoder (Fig. 2(c)).
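The attention computation described in this subsection can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the weights are random, the per-head projections are realized as column slices of one stacked matrix `w_qkv`, and 12 heads of width 64 are assumed so that 12 × 64 = 768 matches the CLFT-Base token width.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_qkv, w_o, n_heads):
    """x: (tokens, d_m); w_qkv: (d_m, 3 * d_m); w_o: (d_m, d_m)."""
    n_tok, d_m = x.shape
    d_k = d_m // n_heads
    # One linear map yields Q, K and V stacked: (tokens, 3, heads, d_k).
    qkv = (x @ w_qkv).reshape(n_tok, 3, n_heads, d_k)
    # Split and move the head axis first: each is (heads, tokens, d_k).
    q, k, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # scaled dot products
    weights = softmax(scores, axis=-1)                 # (heads, tokens, tokens)
    heads = weights @ v                                # weighted sums of the values
    concat = heads.transpose(1, 0, 2).reshape(n_tok, d_m)  # heads side by side
    return concat @ w_o, weights

rng = np.random.default_rng(0)
d_m, n_heads, n_tok = 768, 12, 577             # assumed CLFT-Base-like sizes
x = rng.normal(size=(n_tok, d_m))
w_qkv = rng.normal(scale=0.02, size=(d_m, 3 * d_m))
w_o = rng.normal(scale=0.02, size=(d_m, d_m))
attended, attn_weights = multi_head_self_attention(x, w_qkv, w_o, n_heads)
```

Each row of `attn_weights[h]` sums to one, so every output token is a convex combination of the value vectors within head `h`.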
Table 1 outlines four potential configuration options of the CLFT encoder. The names follow the ViT conventions. Each configuration features predefined transformer layers and a feature dimension $D$ with fixed-size tokens. The CLFT-Hybrid configuration distinguishes itself from the others by using a ResNet50 residual network (He et al., 2015) to convert 768 × 768 images into 14 × 14 patches, which are then flattened into one-dimensional vectors of size 196.
Table 1: CLFT configuration variants.
Type          Layers   Feature dimension D
CLFT-Base     12       768
CLFT-Large    24       1024
CLFT-Huge     32       1280
CLFT-Hybrid   12       768
3.3 Decoder
The decoder module processes the tokens from the encoder layers to progressively assemble the feature representations into a 3D matrix. This matrix can be visualized as an image to make predictions. We extend the three-stage reassembly operation initially proposed in (Ranftl et al., 2021), including data reading, concatenating, and resampling, with an extra stage to execute the cross-fusion of camera and LiDAR data.
In the first stage of reassembly, shown in Fig. 3(a), we append a special classification token to a set of $N$ tokens, potentially capturing global information. Ranftl et al. (2021) have evaluated three different variants of this mapping:
• One that ignores the special class token and processes only the individual tokens.
• One that propagates information from the class token to all other tokens.
• One that concatenates the class token to all other tokens, then projects the combined representation through a linear layer followed by the GELU activation function to introduce non-linearity.
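The three read variants listed above can be written compactly. The sketch below is illustrative only: it assumes the class token is stored at index 0, realizes "propagation" as element-wise addition (as in DPT), uses the tanh approximation of GELU, and takes random weights `w` and `b` as stand-ins for the learnable linear layer.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def read_ignore(tokens):
    """Drop the class token (assumed at index 0) and keep the patch tokens."""
    return tokens[1:]

def read_add(tokens):
    """Propagate the class token by adding it to every patch token."""
    return tokens[1:] + tokens[0]

def read_project(tokens, w, b):
    """Concatenate the class token to every patch token, then linear + GELU."""
    cls = np.broadcast_to(tokens[0], tokens[1:].shape)
    merged = np.concatenate([tokens[1:], cls], axis=-1)   # (N, 2 * D)
    return gelu(merged @ w + b)                           # back to (N, D)

rng = np.random.default_rng(0)
dim = 8                                    # small width for illustration
tokens = rng.normal(size=(577, dim))       # 1 class token + 576 patch tokens
w = rng.normal(scale=0.1, size=(2 * dim, dim))
b = np.zeros(dim)

ignored = read_ignore(tokens)
added = read_add(tokens)
projected = read_project(tokens, w, b)
```

All three variants return 576 tokens, so the later stages can reshape them into a 24 × 24 grid regardless of which variant is chosen.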
Figure 3(b) shows the second stage of the decoder. A total of $N$ tokens are shaped into an image-like feature map with the aid of position tokens. The feature map with $D$ channels is concatenated into a result $R = \frac{r}{p} \times \frac{c}{p} \times D$.
Figure 3(c) illustrates the third and last stage. The feature maps are first scaled to size $R = \frac{r}{s} \times \frac{c}{s} \times \hat{D}$, where $\hat{D}$ is set to 256 in all experiments. Features from early layers are resampled at higher resolutions, while features from deeper layers of the transformer are resampled at lower resolutions. The CLFT-Base variant uses layers $l = \{3, 6, 9, 12\}$, and the CLFT-Large variant utilizes layers $l = \{5, 12, 18, 24\}$ to extract features. The CLFT-Hybrid variant employs ResNet layers for initial feature extraction and incorporates transformer layers $l = \{9, 12\}$ for deeper feature representation. The scaling coefficients $s$ are $\{4, 8, 16, 32\}$.
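The shapes produced by the concatenation and resampling stages can be traced with a short sketch. It is a simplified illustration under assumptions: tokens are taken to be in row-major patch order with the class token already removed, nearest-neighbour indexing stands in for the learned up- and down-sampling layers, and the channel projection to $\hat{D} = 256$ is omitted. The function names are hypothetical.

```python
import numpy as np

def tokens_to_map(tokens, r=384, c=384, p=16):
    """(N, D) patch tokens -> (r/p, c/p, D) image-like feature map."""
    return tokens.reshape(r // p, c // p, tokens.shape[-1])

def resample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize; a stand-in for learned resampling layers."""
    rows = np.arange(out_h) * fmap.shape[0] // out_h
    cols = np.arange(out_w) * fmap.shape[1] // out_w
    return fmap[rows][:, cols]

def reassemble(tokens_per_layer, scales=(4, 8, 16, 32), r=384, c=384, p=16):
    """Early layers get large maps (small s); deep layers get small maps."""
    maps = []
    for tokens, s in zip(tokens_per_layer, scales):
        fmap = tokens_to_map(tokens, r, c, p)
        maps.append(resample_nearest(fmap, r // s, c // s))
    return maps

rng = np.random.default_rng(0)
dim = 8                                          # small width for illustration
layer_tokens = [rng.normal(size=(576, dim)) for _ in range(4)]  # e.g. four tapped layers
feature_maps = reassemble(layer_tokens)
shapes = [m.shape for m in feature_maps]
```

For a 384 × 384 input the four maps have spatial sizes 96, 48, 24, and 12, matching $r/s$ for $s \in \{4, 8, 16, 32\}$.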
In the last cross-fusion stage, camera and LiDAR features are combined from feature maps in parallel. Extracted feature maps are combined using the RefineNet-based feature fusion method, which employs two residual convolution units (RCUs) in a sequence. The camera and LiDAR representations are summed with the result of the previous fusion stage and passed through another RCU. The output of the last RCU is passed to a de-convolutional layer and up-sampled to compute the predicted segmentation.
Figure 3: Decoder process. (a) The input tensor, representing data, is concatenated with classification tokens. (b) These tokens are then concatenated based on their positional information, yielding an image-like representation. Two convolution operations, along with up-sampling and down-sampling, are applied. (c) Cross-fusion is applied to combine camera and LiDAR data, progressively integrating outputs from residual computation units from previous steps. The final predicted segmentation is computed through deconvolution and up-sampling blocks.
4 DATASET CONFIGURATION
The Waymo Open Dataset (WOD) is designed to aid researchers in autonomous driving. It includes data from camera and LiDAR sensors, which are collected in urban and suburban environments under diverse driving conditions. It contains labels for 4 object classes: vehicles, pedestrians, cyclists, and signs. We have manually partitioned the dataset into four subsets: dry day, rainy day, dry night, and rainy night. The number of frames per subset is shown in Table 2.
We use intersection over union (IoU) to evaluate the performance of the model, along with values of precision and recall. The IoU computation is extended to validate multi-class semantic segmentation by assigning pixels without a valid class to void and excluding them from the final validation. We compare the ground truth (Waymo label values) to the output of the CLFT model to measure the performance of our work.
Table 2: Frame count per subset in WOD.
Dry day   Rainy day   Dry night   Rainy night
14940     4520        1640        900
4.1 Metrics
We use the intersection over union (IoU) as the primary indicator to evaluate the performance of our networks. In addition, we provide the results of precision and recall. The IoU is primarily used in object detection applications, in which the output is the bounding box around the object. We modify the ordinary IoU algorithm to fit multi-class pixel-wise semantic object segmentation. Given a set of predefined semantic classes $\mathcal{L} = \{0, 1, \ldots, L-1\}$, each pixel in the image can be represented as a pair $(p_L, g_L)$, where $p_L$ and $g_L$ indicate the prediction and ground-truth class, respectively. The performance of the networks is measured by the statistics of the number of pixels that have identical classes indicated in prediction and ground truth. Not all pixels have a valid label; therefore, ambiguous pixels that fall outside the class list are assigned as void and not counted in the evaluation. The IoU of each class is given by Equation 5, where $\bar{L}$ means the non-identical class.
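$$\mathrm{IoU}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_L g_{\bar{L}}) + \sum (p_{\bar{L}} g_L)} \tag{5}$$
Correspondingly, the precision and recall are obtained by Equations 6 and 7.
$$\mathrm{Precision}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_L g_{\bar{L}})} \tag{6}$$
$$\mathrm{Recall}_L = \frac{\sum (p_L g_L)}{\sum (p_L g_L) + \sum (p_{\bar{L}} g_L)} \tag{7}$$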
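The per-class metrics can be computed directly from label maps. The following is a minimal sketch, assuming integer class maps of equal shape and a hypothetical void value of 255 for pixels without a valid label (the text does not state which value is used); the function name `per_class_metrics` is illustrative.

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes, void_label=255):
    """Return {class: (IoU, precision, recall)}, ignoring void ground-truth pixels."""
    valid = gt != void_label                 # void pixels are not counted
    pred, gt = pred[valid], gt[valid]
    results = {}
    for cls in range(num_classes):
        tp = np.sum((pred == cls) & (gt == cls))   # predicted cls, labelled cls
        fp = np.sum((pred == cls) & (gt != cls))   # predicted cls, labelled other
        fn = np.sum((pred != cls) & (gt == cls))   # predicted other, labelled cls
        iou = tp / (tp + fp + fn) if tp + fp + fn else float("nan")
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        results[cls] = (float(iou), float(precision), float(recall))
    return results

# Toy 2 x 3 example with three classes and one void pixel.
pred = np.array([[0, 1, 1],
                 [2, 2, 0]])
gt = np.array([[0, 1, 2],
               [2, 255, 0]])
metrics = per_class_metrics(pred, gt, num_classes=3)
```

In the toy example, class 1 has one correct pixel and one false positive, so its IoU and precision are 0.5 while its recall is 1.0.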
5 EXPERIMENTAL RESULTS
5.1 Experimental Setup
The transformer-based networks were trained on servers equipped with Nvidia A100 80GB graphics cards. Each training session utilized a batch size of 24, running for up to 400 epochs. Early stopping criteria were implemented to prevent over-fitting and to ensure efficient use of computational resources.
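A common way to realize such a stopping rule is a patience counter on the validation loss. The sketch below is only an assumption about how this could look: the monitored quantity, the `patience` value, and `min_delta` are hypothetical, since the text does not state the actual criteria; only the 400-epoch cap comes from the setup above.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience          # hypothetical value
        self.min_delta = min_delta        # minimum change that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, max_epochs=400, patience=10):
    """Replay a sequence of validation losses; return (last_epoch, best_loss)."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = 0
    for last_epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            break
    return last_epoch, stopper.best


stopped_at, best_loss = run_training([1.0, 0.9, 0.95, 0.93, 0.92, 0.5], patience=3)
```

In a real training loop the list of losses would be replaced by the validation loss computed after each epoch, and the best checkpoint would typically be restored after stopping.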
Table 3: Performance comparison of the CLFT-Hybrid method during various weather conditions.
                            | IoU                       | Precision                 | Recall
Condition    Modality       | Cyclist Pedestrian Sign   | Cyclist Pedestrian Sign   | Cyclist Pedestrian Sign
Dry day      Camera         | 64.17   67.88      45.48  | 83.79   79.99      65.41  | 73.27   81.76      59.88
Dry day      LiDAR          | 64.06   68.21      45.22  | 83.41   79.84      64.45  | 73.41   82.41      60.24
Dry day      Camera+LiDAR   | 60.96   67.75      45.09  | 82.73   79.42      61.97  | 69.86   82.17      62.34
Rainy day    Camera         | 70.75   61.98      35.49  | 86.19   80.19      68.98  | 79.80   73.18      42.23
Rainy day    LiDAR          | 73.76   62.84      37.05  | 89.53   80.79      68.02  | 80.73   73.89      44.86
Rainy day    Camera+LiDAR   | 72.63   62.50      37.82  | 87.27   79.84      62.30  | 81.24   74.22      49.03
Dry night    Camera         | 66.11   66.11      32.82  | 83.60   81.48      56.74  | 75.96   77.80      43.77
Dry night    LiDAR          | 66.95   66.87      32.70  | 87.13   80.69      57.23  | 74.30   79.61      43.27
Dry night    Camera+LiDAR   | 61.55   65.68      31.87  | 79.06   79.80      50.52  | 73.53   78.78      46.33
Rainy night  Camera         | 16.38   43.57      40.45  | 42.30   66.13      64.81  | 21.10   56.09      51.83
Rainy night  LiDAR          | 50.11   49.54      39.04  | 71.10   64.22      59.07  | 62.92   68.42      53.53
Rainy night  Camera+LiDAR   | 63.41   48.13      37.42  | 79.94   70.40      55.28  | 75.41   60.33      53.67
The dataset was divided into three parts: 60% for training, 20% for validation, and 20% for testing. This distribution ensures a balanced approach, allowing the model to learn effectively, validate its performance during training, and be evaluated on unseen data to assess its generalization capabilities.
A total of nine training sessions were conducted, covering three network configurations: CLFT-Base, CLFT-Large, and CLFT-Hybrid. For each configuration, separate training sessions were performed for LiDAR-only, camera-only, and cross-fusion of camera+LiDAR data to comprehensively evaluate the performance across different sensor configurations.
5.2 Varying Weather Conditions
We conducted an analysis of the network performance across four distinct weather conditions: dry day, rainy day, dry night, and rainy night. The results of the CLFT-Hybrid method under these various conditions are summarized in Table 3.
In dry day conditions, the performance of the CLFT-Hybrid model using LiDAR alone (IoU: 64% for cyclists, 68% for pedestrians) was comparable to using camera data alone (IoU: 64% for cyclists, 68% for pedestrians) and slightly better than the combined data.
During rainy day conditions, LiDAR data outperformed camera data (IoU: 74% for cyclists, 63% for pedestrians vs. 71% for cyclists, 62% for pedestrians). This is an expected result, as the camera is blurred by rain, while LiDARs are typically less affected. Combined data was competitive, with an IoU of 73% for cyclists and 63% for pedestrians, showing LiDAR's resilience against visual noise and low-light environments.
Under dry night conditions, LiDAR data performed better than both combined and camera data alone (IoU: 67% for cyclists, 67% for pedestrians vs. 66% for cyclists and pedestrians with camera), demonstrating LiDAR's advantage in low-light conditions.
Under rainy night conditions, the combined LiDAR+camera data yielded the highest cyclist IoU (63% vs. 50% with LiDAR alone and 16% with camera alone), while the pedestrian IoU was slightly lower than with LiDAR alone (48% vs. 50%). For cyclists, cross-fusion effectively leveraged complementary information, providing depth and texture details.
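The per-condition comparison can be read off Table 3 mechanically. The short sketch below copies the cyclist and pedestrian IoU values of CLFT-Hybrid from Table 3 and reports which input modality scores highest in each condition; the dictionary layout and the helper name `best_modality` are illustrative choices, not part of the original evaluation code.

```python
# IoU (%) of CLFT-Hybrid from Table 3: condition -> modality -> (cyclist, pedestrian)
IOU = {
    "dry day": {
        "camera": (64.17, 67.88), "lidar": (64.06, 68.21), "fusion": (60.96, 67.75)},
    "rainy day": {
        "camera": (70.75, 61.98), "lidar": (73.76, 62.84), "fusion": (72.63, 62.50)},
    "dry night": {
        "camera": (66.11, 66.11), "lidar": (66.95, 66.87), "fusion": (61.55, 65.68)},
    "rainy night": {
        "camera": (16.38, 43.57), "lidar": (50.11, 49.54), "fusion": (63.41, 48.13)},
}
CYCLIST, PEDESTRIAN = 0, 1

def best_modality(condition, class_index):
    """Return the modality with the highest IoU for one class in one condition."""
    scores = IOU[condition]
    return max(scores, key=lambda modality: scores[modality][class_index])

summary = {
    condition: (best_modality(condition, CYCLIST), best_modality(condition, PEDESTRIAN))
    for condition in IOU
}
for condition, (cyclist_best, pedestrian_best) in summary.items():
    print(f"{condition}: cyclist -> {cyclist_best}, pedestrian -> {pedestrian_best}")
```

Many of the margins are below one percentage point, so the ranking should be read together with the absolute values in Table 3.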
5.3 Varying Network Configurations
The performance metrics of different CLFT configurations under dry day conditions are summarized in Table 4. The CLFT-Base configuration showed that using either camera or LiDAR alone provides comparable results, but combining them did not yield significant improvements. The CLFT-Large configuration benefited from higher precision, especially when combining data sources, suggesting better accuracy in identifying objects, though IoU did not significantly improve. The CLFT-Hybrid configuration performed the best overall, particularly using either camera data alone or LiDAR data alone. This model effectively leverages the strengths of both data types, with the fusion of both data sources yielding high recall for signs.
Table 4: Performance metrics under dry day conditions of different CLFT configurations (C: camera, L: LiDAR, C+L: camera+LiDAR).
                     | Cyclist                  | Pedestrian               | Sign
Configuration  Input | IoU    Precision Recall  | IoU    Precision Recall  | IoU    Precision Recall
CLFT-Base      C     | 50.07  84.72     55.04   | 65.71  80.56     78.09   | 41.27  66.46     52.13
CLFT-Base      L     | 47.01  84.27     51.53   | 64.06  78.60     77.59   | 39.76  63.15     51.78
CLFT-Base      C+L   | 48.31  80.48     54.73   | 65.11  77.85     79.92   | 41.33  61.35     55.88
CLFT-Large     C     | 53.50  83.61     59.77   | 66.03  82.11     77.12   | 41.17  68.81     50.61
CLFT-Large     L     | 53.91  84.53     59.81   | 66.31  80.06     79.43   | 41.44  64.49     53.70
CLFT-Large     C+L   | 53.58  85.11     59.12   | 66.10  82.28     77.07   | 41.90  70.07     51.03
CLFT-Hybrid    C     | 64.17  83.79     73.27   | 67.88  79.99     81.76   | 45.48  65.41     59.88
CLFT-Hybrid    L     | 64.06  83.41     73.41   | 68.21  79.84     82.41   | 45.22  64.45     60.24
CLFT-Hybrid    C+L   | 60.96  82.73     69.86   | 67.75  79.42     82.17   | 45.09  61.97     62.34
5.4 Comparison to Other Networks
We compared our results to those of traditional Fully Convolutional Networks (FCN) (Gu et al., 2022) and panoptic networks as presented in (Gu et al., 2024). The CLFT-Hybrid achieved higher IoU scores (e.g., 64% for cyclists and 68% for pedestrians in dry day conditions) compared to typical FCN and panoptic networks, which often struggle with complex scenes and poor visibility. Unlike FCNs and panoptic networks that rely on single modalities, the CLFT effectively combines LiDAR and camera data, enhancing performance, especially in challenging scenarios like rainy nights (IoU: 63% for cyclists).
6 CONCLUSION
In this paper, we demonstrated the effectiveness of Camera-LiDAR Fusion Transformer (CLFT) models in achieving successful object segmentation by leveraging sensor cross-fusion and the transformer's multi-attention mechanism. The CLFT-Hybrid model showed remarkable improvements in segmentation accuracy for cyclists, pedestrians, and traffic signs.
The CLFT models maintained high performance across a variety of weather conditions, including day, rain, and night scenarios. By combining the strengths of both LiDAR and camera data, the CLFT model effectively utilized cross-fusion to enhance overall performance. The transformer's multi-attention mechanism enabled the CLFT models to focus on relevant features and improve object detection and segmentation accuracy.
Despite these promising results, several challenges remain. The CLFT models exhibited variability in performance under adverse weather conditions. For instance, while LiDAR alone performed well in fair conditions, the fusion of LiDAR and camera data sometimes led to suboptimal results. The models showed decreased performance in night and rainy conditions. The CLFT models, especially large configurations, require significant computational resources, which poses challenges for real-time implementation in resource-constrained environments.
Future work should focus on improving the accuracy of CLFT models in challenging environments, exploring more data fusion techniques, and integrating additional sensor modalities to further enhance overall performance.
ACKNOWLEDGMENT
Part of the research has received funding from the following grants: the European Union's Horizon 2020 Research and Innovation Programme project Finest Twins (grant No. 856602) and AI-Enabled Data Lifecycles Optimization and Data Spaces Integration for Increased Efficiency and Interoperability PLIADES, grant agreement No. 101135988.
REFERENCES
Caltagirone, L., Bellone, M., Svensson, L., and Wahde, M. (2018). LiDAR-camera fusion for road detection using fully convolutional neural networks.
Chen, X., Zhang, T., Wang, Y., Wang, Y., and Zhao, H. (2023). FUTR3D: A unified sensor fusion framework for 3D detection.
Cheng, X., Wang, P., Guan, C., and Yang, R. (2019). CSPN++: Learning context and resource aware convolutional spatial propagation networks for depth completion.
Cui, Y., Chen, R., Chu, W., Chen, L., Tian, D., Li, Y., and Cao, D. (2022). Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Transactions on Intelligent Transportation Systems, 23(2):722–739.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale.
Gu, J., Bellone, M., Pivoňka, T., and Sell, R. (2024). CLFT: Camera-LiDAR fusion transformer for semantic segmentation in autonomous driving. IEEE Transactions on Intelligent Vehicles, pages 1–12.
Gu, J., Bellone, M., Sell, R., and Lind, A. (2022). Object segmentation for autonomous driving using iseAuto data. Electronics, 11(7).
Gu, S., Lu, T., Zhang, Y., Alvarez, J. M., Yang, J., and Kong, H. (2018). 3-D LiDAR + monocular camera: An inverse-depth-induced fusion framework for urban road detection. IEEE Transactions on Intelligent Vehicles, 3(3):351–360.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.
Jaritz, M., Vu, T.-H., de Charette, R., Wirbel, É., and Pérez, P. (2020). xMUDA: Cross-modal unsupervised domain adaptation for 3D semantic segmentation.
Lai-Dang, Q.-V. (2024). A survey of vision transformers in autonomous driving: Current trends and future directions.
Lee, J.-S. and Park, T.-H. (2021). Fast road detection by CNN-based camera–LiDAR fusion and spherical coordinate transformation. IEEE Transactions on Intelligent Transportation Systems, 22(9):5802–5810.
Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., and Dai, J. (2022). BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers.
Lin, Y., Cheng, T., Zhong, Q., Zhou, W., and Yang, H. (2022). Dynamic spatial propagation network for depth completion.
Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vision transformers for dense prediction.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need.
Wang, Y., Guizilini, V., Zhang, T., Wang, Y., Zhao, H., and Solomon, J. (2021). DETR3D: 3D object detection from multi-view images via 3D-to-2D queries.
Xiao, T. and Zhu, J. (2023). Introduction to transformers: an NLP perspective.
Zhou, Y. and Tuzel, O. (2017). VoxelNet: End-to-end learning for point cloud based 3D object detection.
Zhu, Y., Jia, X., Yang, X., and Yan, J. (2024). FlatFusion: Delving into details of sparse transformer-based camera-LiDAR fusion for autonomous driving.