
SimForest: RGBD Instance Segmentation Dataset

Author: Avula, Ramana Reddy; Narkilahti, Aleksi; Wołk, Krzysztof
Publisher: Zenodo
DOI: 10.5281/zenodo.17299343
Source: https://zenodo.org/records/17299343/files/VCIP_2025_RISE.pdf
SimForest: RGBD Instance Segmentation Dataset
Ramana Reddy Avula
Dependable Transport Systems
RISE Research Institutes of Sweden
Borås, Sweden
ramana.reddy[email protected]
0000-0001-9672-2689
Aleksi Narkilahti
FrostBit Software Lab
Lapland University of Applied Sciences
Rovaniemi, Finland
[email protected]
0009-0008-3937-1139
Krzysztof Wołk
DAC.digital, SA
Al. Grunwaldzka 472
80-309 Gdańsk, Poland
[email protected]
0000-0001-5030-334X
Abstract: Autonomous perception in forest environments requires accurate detection and segmentation of complex natural objects such as trees, rocks, and terrain features. However, the scarcity of large-scale, annotated forest datasets, especially those with depth and instance segmentation labels, hinders progress in deploying robust deep learning models for forestry applications. In this paper, we present SimForest, a 4K-resolution synthetic RGBD dataset generated using a photorealistic forestry simulator built on Unreal Engine 5. SimForest comprises 5,000 images, each annotated with aligned RGB data, depth maps, instance segmentation masks, and detailed metadata including object poses, terrain depth, camera parameters, and environmental conditions such as season, time, and cloudiness. The virtual scenes are geo-located and seasonally matched to a real forest near Umeå, Sweden. To demonstrate the utility of SimForest, we conduct an experimental study involving the detection and segmentation of tree trunks using YOLO 11-based models trained on SimForest data. The evaluation shows strong detection accuracy (mAP@50 of 0.92) and solid segmentation performance (mAP@50 of 0.74). These findings highlight the potential of SimForest as a valuable resource for near-field RGBD perception in forestry and related outdoor robotics applications.
Index Terms: Instance segmentation, Synthetic dataset, Forest perception, RGBD dataset, Tree trunk detection
I. INTRODUCTION
Autonomous perception in forest environments is a critical capability for a wide range of applications, including autonomous navigation, precision forestry, ecological monitoring, and disaster response. These tasks require the accurate detection, localization, and understanding of complex natural objects, such as trees, rocks, and terrain features, in unstructured outdoor scenes. Training effective deep learning models for these applications requires access to large-scale, high-quality annotated datasets. Unlike urban or indoor environments, where datasets such as COCO [1] and ImageNet [2] have driven significant advancements in object detection and segmentation, forest environments lack comparable resources. Existing real-world forest datasets [3] are typically limited in resolution, lack aligned depth information, or fail to capture the seasonal and environmental variability that characterizes natural forest ecosystems.
Collecting large-scale real-world datasets in forest environments presents substantial challenges. Field campaigns are expensive, time-consuming, and often constrained by weather conditions, accessibility, and safety concerns. Obtaining precise ground truth annotations for 3D object poses, spatial relationships, and depth information requires specialized equipment and expertise that may not be readily available. Moreover, the manual annotation of forest imagery is particularly challenging due to the irregular shapes of natural objects, occlusions caused by dense vegetation, and the difficulty in precisely delineating boundaries between overlapping foliage and branches, often requiring several minutes per image and introducing unavoidable errors in boundary and occlusion handling [4]. Addressing these challenges, synthetic data generation has emerged as a promising solution. Simulation environments can generate unlimited amounts of data with perfect ground truth annotations, including precise object poses, depth maps, and instance segmentation masks. Additionally, synthetic datasets can be generated rapidly and cost-effectively, allowing for the exploration of different scenarios and edge cases that might be rare or hazardous to capture in real-world settings.
In this paper, we introduce SimForest¹, a comprehensive synthetic RGBD dataset specifically designed for forest perception tasks. SimForest leverages a high-fidelity forestry simulator [5] built on Unreal Engine 5 to generate high-resolution images captured across all four seasons under varying lighting conditions, cloudiness levels, and camera viewpoints. Each image is accompanied by an aligned depth map, instance segmentation masks, and detailed scene metadata including object-level 3D transforms, terrain depth map, camera intrinsics, camera pose, and environmental parameters such as season, time of day, and cloudiness. Instance segmentation annotations are provided for all visible objects within a 15-meter radius from the camera, focusing on near-field perception tasks such as obstacle avoidance and selective logging. The virtual scenes are geo-located and rendered to match the appearance and structure of a real-world forest near Umeå, Sweden.
A recent related work, SPREAD [6], presents a large-scale synthetic forest dataset that is also built on Unreal Engine 5, but covers multiple forest biomes and provides RGB, depth, point clouds, segmentation labels, and tree-level metadata such as trunk and canopy diameter and height. However, unlike SPREAD's extensive yet generic synthetic dataset, SimForest is geo-located to a specific real forest and includes fine-grained near-field RGBD segmentation data and rich terrain and environmental metadata. Furthermore, the SPREAD dataset itself provides only RGB images at 960×540 resolution, although the provided scripts support rendering at 4K resolution. In contrast, SimForest offers full 4K-resolution RGBD frames, making it a plug-and-play resource for high-fidelity perception tasks.
¹ https://doi.org/10.5281/zenodo.15911876
To evaluate the usefulness of SimForest, we conduct an experimental study focused on the detection of tree logs suitable for harvesting, a key task in precision forestry. We train and benchmark YOLO 11 [7] based object detection and instance segmentation models exclusively on the SimForest dataset. The results provide empirical validation of the dataset's quality and its suitability for developing and benchmarking RGBD perception models in challenging forestry and other outdoor robotic domains. In summary, the key contributions of this paper are as follows:
• We introduce SimForest, a novel synthetic RGBD dataset for forest perception, comprising 5,000 high-resolution images with aligned depth maps, instance segmentation masks, and extensive metadata.
• We benchmark SimForest on a practical tree trunk harvesting task using YOLO 11x-based object detection and instance segmentation models.
• We release SimForest as an open-source benchmarking resource, complete with its rich metadata, to support research in RGBD perception for forestry and outdoor robotics.
II. SIMFOREST DATASET
A. Dataset Overview
The SimForest dataset comprises a total of 5,000 annotated frames, each containing aligned RGB images, depth maps, and instance segmentation masks. These frames are distributed across a diverse set of environmental conditions, covering all four seasons, varying times of day, cloudiness levels, and intra-seasonal changes. Each data sample includes the following components:
• RGB images: High-resolution JPEG images rendered with photorealistic lighting and textures.
• Scene depth maps: 32-bit RGB-encoded PNG images where each pixel encodes a depth value in meters using three color channels. The depth $D$ can be decoded using:
$$D = \frac{R + 256 \cdot G + 256^{2} \cdot B}{256^{3} - 1} \cdot 1000$$
where $R$, $G$, and $B$ are the red, green, and blue channel values, respectively.
• Instance annotations: Instance segmentation masks in COCO format for each visible object within 15 meters.
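
The depth decoding formula given for the scene depth maps can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' tooling: the function name `decode_depth` is hypothetical, the input is assumed to be an array in R, G, B channel order, and any fourth (alpha) channel is assumed to carry no depth information.

```python
import numpy as np


def decode_depth(rgb: np.ndarray) -> np.ndarray:
    """Decode an (H, W, 3+) uint8 RGB-encoded depth image into meters.

    Implements D = (R + 256*G + 256^2*B) / (256^3 - 1) * 1000.
    """
    # Keep only the first three channels (assumption: a 4th channel, if
    # present, is alpha) and widen the dtype to avoid uint8 overflow.
    rgb = rgb[..., :3].astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    normalized = (r + 256.0 * g + 256.0**2 * b) / (256.0**3 - 1.0)
    return normalized * 1000.0


# Tiny synthetic 1x3 "image": minimum code, maximum code, and B = 128 only.
pixels = np.array([[[0, 0, 0], [255, 255, 255], [0, 0, 128]]], dtype=np.uint8)
depth_m = decode_depth(pixels)
print(depth_m)
```

The all-zero pixel decodes to 0 m and the all-255 pixel to the 1000 m maximum of the encoding. When loading the PNG files with a library that returns channels in B, G, R order, the channels would need to be reordered before calling the function.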
In addition, the dataset includes the following metadata to enable detailed scene understanding:
• Camera intrinsics and pose: Includes the camera intrinsics and full 6-DoF camera pose (position and orientation) relative to the world coordinate frame.
• Instance metadata: For each annotated instance, metadata includes the instance segmentation ID, category ID, spatial location, orientation, and physical size in 3D space.
• Environmental conditions: The simulated environment is characterized with season, hour of day, month, and a cloudiness score (0 to 1), enabling controlled experiments involving lighting and weather variability.
• Terrain depth map: A separate depth map is included for each frame, providing pixel-wise terrain depth values that are useful for elevation-aware perception.
The SimForest dataset is publicly available on Zenodo [8] under the CC BY 4.0 license. Figure 1 illustrates representative samples from the dataset, showing aligned RGB images, depth maps, and instance segmentation masks under varied environmental conditions. The dataset contains a total of 40,554 annotations across 11 distinct categories, with pine trees and their trunks comprising the majority of annotations (64%), as shown in Figure 2.
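
The annotation total and the pine share quoted above can be checked against the per-category counts reported in Figure 2. The short sketch below assumes that the categories and counts in the figure pair up in the order in which they are listed; the dictionary itself is only an illustration.

```python
# Per-category annotation counts as read from Figure 2
# (assumption: categories and counts are paired in listed order).
counts = {
    "Terrain": 3711,
    "Foliage": 1565,
    "Birch": 793,
    "Pine": 14012,
    "Spruce": 879,
    "Sky": 5000,
    "Rock": 202,
    "Snow": 1632,
    "Birch_Trunk": 594,
    "Pine_Trunk": 11816,
    "Spruce_Trunk": 350,
}

total = sum(counts.values())
pine_share = (counts["Pine"] + counts["Pine_Trunk"]) / total

print(f"categories: {len(counts)}, total annotations: {total}")
print(f"pine trees and trunks: {pine_share:.1%}")
```

Under this pairing the counts sum to the stated 40,554 annotations over 11 categories, and pine trees plus pine trunks account for roughly 64% of them.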
B. Data Generation Process
The SimForest dataset was created using a high-fidelity forestry simulation environment developed as part of the AGRARSENSE project [9]. Built on Unreal Engine 5, the simulator utilizes real geospatial data to replicate the structural and visual characteristics of boreal forests accurately. The virtual scenes are based on a terrain map near Umeå in northern Sweden, with vegetation assets representing key Nordic species, including birch, pine, and spruce trees. The environment supports photorealistic rendering of seasonal variation, dynamic lighting, and atmospheric conditions.
Data generation involved capturing aligned images from three virtual cameras: RGB, depth, and instance segmentation. The cameras were configured with a resolution of 3840×2160 pixels and a 90° horizontal field of view. A 200 m × 200 m geo-referenced forest area was selected as the simulation region, and this area was divided into a regular grid to guide the sampling of camera positions. At each sampled grid cell, a camera pose was defined at a fixed height of 2 meters above the terrain surface, emulating the perspective of a low-flying drone navigating through the forest. The yaw angle of the camera was randomly selected from 16 discrete bins spaced uniformly over 360°. This approach ensured broad coverage of the scene with diverse viewpoints while maintaining structured sampling density across the terrain.
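
The sampling scheme described above can be illustrated with a small sketch. Several details are assumptions made only for this example and are not stated in the paper: the 10 m grid cell size, placing the camera at the cell center, the flat placeholder terrain function, and an ideal pinhole camera with square pixels when deriving the focal length from the field of view.

```python
import math
import random

AREA_SIZE_M = 200.0      # side length of the simulated forest area
CAMERA_HEIGHT_M = 2.0    # fixed height above the terrain surface
NUM_YAW_BINS = 16        # discrete yaw bins over 360 degrees
IMAGE_WIDTH_PX = 3840
HFOV_DEG = 90.0


def focal_length_px(width_px: float, hfov_deg: float) -> float:
    """Pinhole focal length in pixels from the horizontal field of view."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def sample_camera_poses(cell_size_m, terrain_height, rng):
    """Return one (x, y, z, yaw_deg) pose per cell of a regular grid."""
    n_cells = int(AREA_SIZE_M // cell_size_m)
    yaw_step = 360.0 / NUM_YAW_BINS
    poses = []
    for i in range(n_cells):
        for j in range(n_cells):
            # Assumption: camera placed at the center of each grid cell.
            x = (i + 0.5) * cell_size_m
            y = (j + 0.5) * cell_size_m
            z = terrain_height(x, y) + CAMERA_HEIGHT_M
            yaw = rng.randrange(NUM_YAW_BINS) * yaw_step
            poses.append((x, y, z, yaw))
    return poses


# Hypothetical 10 m cells and flat terrain, purely for illustration.
poses = sample_camera_poses(10.0, lambda x, y: 0.0, random.Random(0))
fx = focal_length_px(IMAGE_WIDTH_PX, HFOV_DEG)
print(len(poses), "poses; focal length of about", round(fx), "px")
```

With a 90° horizontal field of view the pinhole focal length equals half the image width, about 1920 pixels, and the 16 yaw bins are spaced 22.5° apart.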
To further enhance diversity and realism, each image was rendered with randomized environmental parameters such as season (winter, spring, summer, autumn), time of day, month, and cloudiness (ranging from clear to overcast). Sun position and lighting conditions were simulated based on the real-world solar angles corresponding to the geo-location of the forest in Umeå, Sweden, and selected time/month. This diversity enables systematic evaluation of perception models under a wide range of conditions, including challenging scenarios such as low sunlight and solar glare.
[Figure 1: image grid with rows RGB, Scene depth, Annotations, Terrain depth]
Fig. 1: Samples from the SimForest dataset showing aligned RGB images, scene depth maps, instance segmentation masks, and terrain depth maps under diverse environmental conditions. Each column represents a different data sample.
[Figure 2: horizontal bar chart, x-axis: Number of Annotations, y-axis: Category. Terrain 3,711; Foliage 1,565; Birch 793; Pine 14,012; Spruce 879; Sky 5,000; Rock 202; Snow 1,632; Birch_Trunk 594; Pine_Trunk 11,816; Spruce_Trunk 350]
Fig. 2: Distribution of annotations across categories in the SimForest dataset.
The captured instance segmentation images were post-processed to extract masks of objects whose nearest point to the camera falls within a 15-meter radius. This distance-based filtering ensures that the annotations remain relevant to near-field perception tasks, such as obstacle avoidance and selective logging. For each retained instance, a 2D bounding box was derived from the segmentation mask, while a 3D bounding box was estimated using the corresponding pixel-wise depth map. A quality control procedure was applied to filter out frames with excessive occlusion or poor visibility, ensuring the dataset only includes scenes with meaningful and usable annotations. The final annotations, including object categories, segmentation masks, and bounding boxes, were saved in the COCO format, ensuring compatibility with widely used training and evaluation pipelines.
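
The distance-based filtering and 2D bounding box derivation described above might look roughly as follows. This is a sketch under stated assumptions rather than the authors' pipeline: the instance image is assumed to be an integer ID map with 0 as background, the decoded depth map is assumed to approximate the distance to the camera, and `filter_and_box` is a hypothetical helper name. Boxes use the COCO convention `[x, y, width, height]`.

```python
import numpy as np

MAX_RANGE_M = 15.0  # near-field radius used for the annotations


def filter_and_box(instance_ids, depth_m, max_range_m=MAX_RANGE_M):
    """Return {instance_id: [x, y, w, h]} for instances within range."""
    boxes = {}
    for inst_id in np.unique(instance_ids):
        if inst_id == 0:  # assumption: ID 0 marks background pixels
            continue
        mask = instance_ids == inst_id
        # Keep the instance only if its nearest pixel is within range.
        if depth_m[mask].min() > max_range_m:
            continue
        ys, xs = np.nonzero(mask)
        x0, y0 = int(xs.min()), int(ys.min())
        width = int(xs.max()) - x0 + 1
        height = int(ys.max()) - y0 + 1
        boxes[int(inst_id)] = [x0, y0, width, height]
    return boxes


# Toy 4x6 frame: instance 1 is 5 m away, instance 2 is 20 m away.
ids = np.zeros((4, 6), dtype=np.int32)
ids[1:3, 1:3] = 1
ids[0:4, 4:6] = 2
depth = np.full((4, 6), 20.0)
depth[1:3, 1:3] = 5.0

boxes = filter_and_box(ids, depth)
print(boxes)
```

In the toy frame only instance 1 survives the 15 m filter, and its box covers the 2×2 pixel block starting at pixel (1, 1).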
III. EXPERIMENTAL EVALUATION
We benchmark the SimForest dataset using state-of-the-art object detection and instance segmentation models to assess its suitability for perception tasks in forest environments. Specifically, we focus on the detection of tree trunks suitable for precision harvesting. For this purpose, we employ the YOLO 11x and YOLO 11x-seg models for object detection and instance segmentation, respectively. All training and evaluation data were drawn exclusively from the SimForest dataset, which spans all four seasons and a diverse range of environmental conditions. This controlled setup isolates the impact of synthetic data quality on model performance, providing a baseline for future sim-to-real transfer studies.
For this study, we derived a dedicated Tree Trunk dataset from SimForest containing 3,086 images and 11,872 annotated tree trunks, retaining only those with a minimum diameter of 10 cm and a minimum height of 2 m. The data were split randomly in an 80–20 ratio, resulting in 2,468 images with 9,567 annotations for training and 618 images with 2,305 annotations for validation.
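
A random 80–20 split at the image level can be sketched as follows. The seed and the helper name `split_indices` are illustrative assumptions, so the sketch reproduces the reported subset sizes but not the authors' actual assignment of images to subsets.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into train and validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train_fraction)  # truncates toward zero
    return indices[:n_train], indices[n_train:]


train_ids, val_ids = split_indices(3086)
print(len(train_ids), len(val_ids))
```

Truncating 0.8 × 3,086 = 2,468.8 gives the 2,468 training images, leaving 618 images for validation.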
Fig. 3: Training loss curves for object detection and instance segmentation models, showing stable convergence. (a) Object detection (YOLO 11x) training loss. (b) Instance segmentation (YOLO 11x-seg) training loss.
Training was initialized from the default YOLO 11x and YOLO 11x-seg weights, leveraging transfer learning from pretrained 640×640 pixel models. The models were trained at an image size of 2,560×2,560 pixels to fit efficiently within the 24 GB VRAM of an NVIDIA RTX 4090 GPU. Standard YOLO augmentations such as random scaling, horizontal flipping, translation, and color jitter were retained, while more complex augmentations including mosaic, mixup, cutmix, and copy-paste were disabled. Training was configured for up to 300 epochs with early stopping enabled with a patience of 100, which triggered at epoch 277 for object detection, while instance segmentation completed all 300 epochs. The training process showed a consistent decrease in training losses over epochs for both object detection and instance segmentation, indicating stable convergence. The individual loss components for object detection (box, classification, and distribution focal losses) and for instance segmentation (including an additional segmentation loss) are illustrated in Fig. 3a and Fig. 3b. To maintain consistency with the dataset's annotation policy, only objects within 15 meters of the camera were considered during post-processing, excluding predicted instances beyond this range. While this is less precise for bounding boxes, it provides a reasonable approximation of the annotation constraints. All files required for creating the derived dataset in YOLO format, as well as for training and validation of the models, are available in a supplementary GitHub repository².
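
The training setup described above could be expressed with the Ultralytics Python package roughly as shown below. This is a sketch and not the configuration from the supplementary repository: the dataset file name `tree_trunk.yaml` is hypothetical, the argument names follow the Ultralytics training settings as commonly documented, and the `cutmix` argument is assumed to be available, which depends on the installed release.

```python
# Settings taken from the text: 2560 px images, up to 300 epochs, early
# stopping patience of 100, and the complex augmentations switched off.
TRAIN_OVERRIDES = {
    "imgsz": 2560,
    "epochs": 300,
    "patience": 100,
    "mosaic": 0.0,
    "mixup": 0.0,
    "cutmix": 0.0,      # assumption: supported by the installed release
    "copy_paste": 0.0,
}


def train(weights: str, data_yaml: str):
    """Fine-tune a pretrained YOLO 11 model with the settings above."""
    # Imported lazily so the settings can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_OVERRIDES)


if __name__ == "__main__":
    print(TRAIN_OVERRIDES)
    # Hypothetical invocations (dataset YAML name is an assumption):
    # train("yolo11x.pt", "tree_trunk.yaml")      # object detection
    # train("yolo11x-seg.pt", "tree_trunk.yaml")  # instance segmentation
```

Leaving the remaining augmentation arguments at their defaults keeps the standard scaling, flipping, translation, and color jitter active, which matches the description in the text.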
² https://github.com/RISE-Dependable-Transport-Systems/SimForest-YOLO-Toolkit
TABLE I: Performance of YOLO 11x and YOLO 11x-seg models for tree trunk detection and segmentation.
Task                    Precision   Recall   mAP@50   mAP@50–95
Object detection        0.8616      0.8685   0.9208   0.7466
Instance segmentation   0.6854      0.7514   0.7379   0.5753
As shown in Table I, the object detection model achieved strong performance, with a precision of 0.86 and an mAP@50 of 0.92. Although performance decreased under the stricter mAP@50–95 metric to 0.75, the results remain robust, highlighting the dataset's suitability for near-field detection tasks. Instance segmentation performance was lower, with a precision of 0.69, an mAP@50 of 0.74, and an mAP@50–95 of 0.58, reflecting the increased challenge of fine-grained pixel-level localization in dense forest scenes. These results demonstrate the value of SimForest as a high-quality benchmark for RGBD perception in unstructured outdoor environments, particularly for object detection.
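
The gap between mAP@50 and mAP@50–95 comes from the intersection-over-union (IoU) criterion: mAP@50 counts a prediction as correct at IoU ≥ 0.5, while mAP@50–95 averages over ten thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below only illustrates this criterion for a single pair of masks. It is not the evaluation code behind Table I, and `mask_iou` is a hypothetical helper.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the mAP@50-95 metric.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)

# Toy example: the prediction covers two of the three ground-truth rows.
gt = np.zeros((4, 4), dtype=bool)
gt[0:3, :] = True
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True

iou = mask_iou(pred, gt)
passed = iou >= IOU_THRESHOLDS
print(f"IoU = {iou:.3f}, thresholds passed: {int(passed.sum())} of 10")
```

A prediction with an IoU of about 0.67 counts as a match at the 0.50 threshold but fails the stricter thresholds from 0.70 upwards, which is why imprecise mask boundaries lower mAP@50–95 more than mAP@50.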
IV. CONCLUSION
This paper introduces SimForest, a high-resolution synthetic RGBD dataset developed for near-field forest perception tasks such as object detection and instance segmentation. Built using a high-fidelity Unreal Engine 5-based simulator and geo-located to a real forest in northern Sweden, SimForest provides 5,000 images with aligned RGB, depth, and instance segmentation data, along with rich environmental and geometric metadata. The experimental results demonstrate strong detection performance (mAP@50 of 0.92) and solid segmentation results (mAP@50 of 0.74), despite the complexity of natural forest environments. While performance decreases under the stricter mAP@50–95 metric (dropping to 0.75 for detection and 0.58 for segmentation), this is expected given the fine-grained localization and mask precision it requires.
The dataset's photorealistic seasonal rendering makes it particularly well-suited for perception systems operating in boreal forests, while its synthetic nature enables scalable data generation without the logistical challenges of real-world collection. By releasing SimForest as an open-access resource, we aim to support research in forestry robotics, ecological monitoring, and other outdoor applications where high-fidelity perception is essential. Its 4K resolution and rich multimodal annotations offer a valuable foundation for advancing RGBD perception in natural environments, complementing existing datasets that lack depth information or high-resolution imagery.
In future work, we will explore how SimForest can support transfer learning and domain adaptation in forestry applications. A key objective is to benchmark sim-to-real generalization by testing models trained on synthetic data against real-world forest imagery. This evaluation will provide actionable insights into narrowing the simulation-to-reality gap and enhancing the practical effectiveness of synthetic training for real-world forestry perception systems.
ACKNOWLEDGMENT
This work was carried out within the AGRARSENSE project (Grant Agreement No. 101095835), supported by the Chips JU and its members, including top-up funding from Sweden, Czechia, Finland, Ireland, Italy, Latvia, Netherlands, Norway, Spain, and the National Centre for Research and Development of Poland.
REFERENCES
[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conf. on Computer Vision, Springer, 2014, pp. 740–755.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[3] J. Lagos, U. Lempiö, and E. Rahtu, “Finnwoodlands dataset,” in Scandinavian Conference on Image Analysis, Springer, 2023, pp. 95–110.
[4] Y. Lu, Y. Huang, S. Sun, S. Fei, and V. Chen, “Putree: A photorealistic large-scale virtual benchmark for forest training,” in 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2024, pp. 687–688. DOI: 10.1109/VRW62533.2024.00140.
[5] FrostBit Software Lab (Lapland UAS), Agrarsense simulator, Accessed: 2025-07-21. [Online]. Available: https://dev.azure.com/AMKFrostBit/AGRARSENSE.
[6] Z. Feng, Y. She, and S. Keshav, “Spread: A large-scale, high-fidelity synthetic dataset for multiple forest vision tasks,” Ecological Informatics, vol. 87, p. 103085, 2025, ISSN: 1574-9541. DOI: https://doi.org/10.1016/j.ecoinf.2025.103085.
[7] G. Jocher and J. Qiu, Ultralytics YOLO11, version 11.0.0, 2024. [Online]. Available: https://github.com/ultralytics/ultralytics.
[8] R. R. Avula and A. Narkilahti, SimForest: RGBD instance segmentation dataset, Zenodo, Jul. 2025. DOI: 10.5281/zenodo.15911876.
[9] CORDIS, AGRARSENSE - Smart, digitalized components and systems for data-based Agriculture and Forestry. [Online]. Available: https://cordis.europa.eu/project/id/101095835.