SimForest: RGBD Instance Segmentation Dataset

Author: Avula, Ramana Reddy; Narkilahti, Aleksi; Wołk, Krzysztof

Publisher: Zenodo

DOI: 10.5281/zenodo.17299343

Source: https://zenodo.org/records/17299343/files/VCIP_2025_RISE.pdf

SimForest: RGBD Instance Segmentation Dataset
Ramana Reddy Avula
Dependable Transport Systems
RISE Research Institutes of Sweden
Borås, Sweden
ramana.reddy[email protected]
0000-0001-9672-2689
Aleksi Narkilahti
FrostBit Software Lab
Lapland University of Applied Sciences
Rovaniemi, Finland
[email protected]
0009-0008-3937-1139
Krzysztof Wołk
DAC.digital, SA
Al. Grunwaldzka 472
80-309 Gdańsk, Poland
[email protected]
0000-0001-5030-334X
Abstract: Autonomous perception in forest environments requires accurate detection and segmentation of complex natural objects such as trees, rocks, and terrain features. However, the scarcity of large-scale, annotated forest datasets, especially those with depth and instance segmentation labels, hinders progress in deploying robust deep learning models for forestry applications. In this paper, we present SimForest, a 4K-resolution synthetic RGBD dataset generated using a photorealistic forestry simulator built on Unreal Engine 5. SimForest comprises 5,000 images, each annotated with aligned RGB data, depth maps, instance segmentation masks, and detailed metadata including object poses, terrain depth, camera parameters, and environmental conditions such as season, time, and cloudiness. The virtual scenes are geo-located and seasonally matched to a real forest near Umeå, Sweden. To demonstrate the utility of SimForest, we conduct an experimental study involving the detection and segmentation of tree trunks using YOLO 11-based models trained on SimForest data. The evaluation shows strong detection accuracy (mAP@50 of 0.92) and solid segmentation performance (mAP@50 of 0.74). These findings highlight the potential of SimForest as a valuable resource for near-field RGBD perception in forestry and related outdoor robotics applications.
Index Terms: Instance segmentation, Synthetic dataset, Forest perception, RGBD dataset, Tree trunk detection
I. INTRODUCTION
Autonomous perception in forest environments is a critical capability for a wide range of applications, including autonomous navigation, precision forestry, ecological monitoring, and disaster response. These tasks require the accurate detection, localization, and understanding of complex natural objects, such as trees, rocks, and terrain features, in unstructured outdoor scenes. Training effective deep learning models for these applications requires access to large-scale, high-quality annotated datasets. Unlike urban or indoor environments, where datasets such as COCO [1] and ImageNet [2] have driven significant advancements in object detection and segmentation, forest environments lack comparable resources. Existing real-world forest datasets [3] are typically limited in resolution, lack aligned depth information, or fail to capture the seasonal and environmental variability that characterizes natural forest ecosystems.
Collecting large-scale real-world datasets in forest environments presents substantial challenges. Field campaigns are expensive, time-consuming, and often constrained by weather conditions, accessibility, and safety concerns. Obtaining precise ground truth annotations for 3D object poses, spatial relationships, and depth information requires specialized equipment and expertise that may not be readily available. Moreover, the manual annotation of forest imagery is particularly challenging due to the irregular shapes of natural objects, occlusions caused by dense vegetation, and the difficulty in precisely delineating boundaries between overlapping foliage and branches, often requiring several minutes per image and introducing unavoidable errors in boundary and occlusion handling [4]. To address these challenges, synthetic data generation has emerged as a promising solution. Simulation environments can generate unlimited amounts of data with perfect ground truth annotations, including precise object poses, depth maps, and instance segmentation masks. Additionally, synthetic datasets can be generated rapidly and cost-effectively, allowing for the exploration of different scenarios and edge cases that might be rare or hazardous to capture in real-world settings.
In this paper, we introduce SimForest¹, a comprehensive synthetic RGBD dataset specifically designed for forest perception tasks. SimForest leverages a high-fidelity forestry simulator [5] built on Unreal Engine 5 to generate high-resolution images captured across all four seasons under varying lighting conditions, cloudiness levels, and camera viewpoints. Each image is accompanied by an aligned depth map, instance segmentation masks, and detailed scene metadata including object-level 3D transforms, terrain depth map, camera intrinsics, camera pose, and environmental parameters such as season, time of day, and cloudiness. Instance segmentation annotations are provided for all visible objects within a 15-meter radius from the camera, focusing on near-field perception tasks such as obstacle avoidance and selective logging. The virtual scenes are geo-located and rendered to match the appearance and structure of a real-world forest near Umeå, Sweden.
A recent related work, SPREAD [6], presents a large-scale synthetic forest dataset that is also built on Unreal Engine 5, but covers multiple forest biomes and provides RGB, depth, point clouds, segmentation labels, and tree-level metadata such as trunk and canopy diameter and height. However, unlike SPREAD's extensive yet generic synthetic dataset, SimForest is geo-located to a specific real forest and includes fine-grained near-field RGBD segmentation data and rich terrain and environmental metadata. Furthermore, the SPREAD dataset itself provides only RGB images at 960×540 resolution, although the provided scripts support rendering at 4K-resolution. In contrast, SimForest offers full 4K-resolution RGBD frames, making it a plug-and-play resource for high-fidelity perception tasks.
¹ https://doi.org/10.5281/zenodo.15911876
To evaluate the usefulness of SimForest, we conduct an experimental study focused on the detection of tree logs suitable for harvesting, a key task in precision forestry. We train and benchmark YOLO 11 [7] based object detection and instance segmentation models exclusively on the SimForest dataset. The results provide empirical validation of the dataset's quality and its suitability for developing and benchmarking RGBD perception models in challenging forestry and other outdoor robotic domains. In summary, the key contributions of this paper are as follows:
• We introduce SimForest, a novel synthetic RGBD dataset for forest perception, comprising 5,000 high-resolution images with aligned depth maps, instance segmentation masks, and extensive metadata.
• We benchmark SimForest on a practical tree trunk harvesting task using YOLO11x-based object detection and instance segmentation models.
• We release SimForest as an open-source benchmarking resource, complete with its rich metadata, to support research in RGBD perception for forestry and outdoor robotics.
II. SIMFOREST DATASET
A. Dataset Overview
The SimForest dataset comprises a total of 5,000 annotated frames, each containing aligned RGB images, depth maps, and instance segmentation masks. These frames are distributed across a diverse set of environmental conditions, covering all four seasons, varying times of day, cloudiness levels, and intra-seasonal changes. Each data sample includes the following components:
• RGB images: High-resolution JPEG images rendered with photorealistic lighting and textures.
• Scene depth maps: 32-bit RGB-encoded PNG images where each pixel encodes a depth value in meters using three color channels. The depth D can be decoded using:
$$D = \frac{R + 256 \cdot G + 256^2 \cdot B}{256^3 - 1} \cdot 1000$$
where R, G, and B are the red, green, and blue channel values, respectively.
• Instance annotations: Instance segmentation masks in COCO format for each visible object within 15 meters.
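The decoding formula above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' tooling: the function name `decode_depth` is hypothetical, and it assumes the three channels are read as 8-bit values in R, G, B order (some image loaders, such as OpenCV, return B, G, R instead).

```python
def decode_depth(r, g, b):
    """Decode RGB-encoded depth into meters.

    Implements D = (R + 256*G + 256^2*B) / (256^3 - 1) * 1000.
    Inputs are channel values in [0, 255]; plain ints are used here.
    """
    normalized = (r + 256 * g + 256**2 * b) / (256**3 - 1)
    return normalized * 1000.0


# Largest encodable value maps to the 1000 m end of the range,
# and an all-zero pixel maps to 0 m.
far = decode_depth(255, 255, 255)
near = decode_depth(0, 0, 0)
```

The same expression can be applied element-wise to whole image arrays, provided the channels are first cast to a type wider than 8 bits so that the multiplications do not overflow.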
In addition, the dataset includes the following metadata to enable detailed scene understanding:
• Camera intrinsics and pose: Includes the camera intrinsics and full 6-DoF camera pose (position and orientation) relative to the world coordinate frame.
• Instance metadata: For each annotated instance, metadata includes the instance segmentation ID, category ID, spatial location, orientation, and physical size in 3D space.
• Environmental conditions: The simulated environment is characterized with season, hour of day, month, and a cloudiness score (0 to 1), enabling controlled experiments involving lighting and weather variability.
• Terrain depth map: A separate depth map is included for each frame, providing pixel-wise terrain depth values that are useful for elevation-aware perception.
The SimForest dataset is publicly available on Zenodo [8] under the CC BY 4.0 license. Figure 1 illustrates representative samples from the dataset, showing aligned RGB images, depth maps, and instance segmentation masks under varied environmental conditions. The dataset contains a total of 40,554 annotations across 11 distinct categories, with pine trees and their trunks comprising the majority of annotations (64%), as shown in Figure 2.
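Because the annotations follow the COCO layout, per-category counts such as those in Figure 2 can be recomputed from the annotation file. The sketch below is illustrative only: the helper name is hypothetical, and the tiny in-memory dictionary stands in for the real annotation file, whose name is not given here. It relies only on the standard COCO keys `categories` and `annotations`.

```python
from collections import Counter


def count_annotations_per_category(coco):
    """Return {category name: number of annotations} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


# Toy stand-in for a parsed annotation file (e.g. the result of json.load).
toy_coco = {
    "categories": [{"id": 1, "name": "Pine"}, {"id": 2, "name": "Pine_Trunk"}],
    "annotations": [
        {"id": 10, "image_id": 0, "category_id": 1},
        {"id": 11, "image_id": 0, "category_id": 2},
        {"id": 12, "image_id": 1, "category_id": 2},
    ],
}
toy_counts = count_annotations_per_category(toy_coco)

# Share of pine-related annotations, using the counts reported in Fig. 2
# (Pine: 14,012; Pine_Trunk: 11,816; total: 40,554). This is roughly 0.64.
pine_share = (14012 + 11816) / 40554
```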
B. Data Generation Process
The SimForest dataset was created using a high-fidelity forestry simulation environment developed as part of the AGRARSENSE project [9]. Built on Unreal Engine 5, the simulator utilizes real geospatial data to replicate the structural and visual characteristics of boreal forests accurately. The virtual scenes are based on a terrain map near Umeå in northern Sweden, with vegetation assets representing key Nordic species, including birch, pine, and spruce trees. The environment supports photorealistic rendering of seasonal variation, dynamic lighting, and atmospheric conditions.
Data generation involved capturing aligned images from three virtual cameras: RGB, depth, and instance segmentation. The cameras were configured with a resolution of 3840×2160 pixels and a 90° horizontal field of view. A 200 m × 200 m geo-referenced forest area was selected as the simulation region, and this area was divided into a regular grid to guide the sampling of camera positions. At each sampled grid cell, a camera pose was defined at a fixed height of 2 meters above the terrain surface, emulating the perspective of a low-flying drone navigating through the forest. The yaw angle of the camera was randomly selected from 16 discrete bins spaced uniformly over 360°. This approach ensured broad coverage of the scene with diverse viewpoints while maintaining structured sampling density across the terrain.
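The pose sampling scheme described above can be sketched as follows. This is an assumption-laden illustration, not the simulator's code: the grid spacing is not stated in the paper (10 m is used here as a placeholder), every cell is sampled at its center, and `terrain_height` is a hypothetical callback standing in for a terrain lookup. Only the area size, camera height, and number of yaw bins come from the text.

```python
import random

AREA_SIZE_M = 200.0    # from the text: 200 m x 200 m region
CAMERA_HEIGHT_M = 2.0  # from the text: fixed height above the terrain
NUM_YAW_BINS = 16      # from the text: 16 bins spaced uniformly over 360 degrees
CELL_SIZE_M = 10.0     # ASSUMPTION: grid spacing is not given in the paper


def sample_camera_poses(terrain_height, rng, cell_size=CELL_SIZE_M):
    """Return one (x, y, z, yaw_deg) pose per grid cell.

    terrain_height: hypothetical callable (x, y) -> ground elevation in meters.
    rng: a random.Random instance, so the sampling is reproducible.
    """
    yaw_step = 360.0 / NUM_YAW_BINS
    cells_per_side = int(AREA_SIZE_M // cell_size)
    poses = []
    for i in range(cells_per_side):
        for j in range(cells_per_side):
            # ASSUMPTION: the camera sits at the cell center.
            x = (i + 0.5) * cell_size
            y = (j + 0.5) * cell_size
            z = terrain_height(x, y) + CAMERA_HEIGHT_M
            yaw_deg = rng.randrange(NUM_YAW_BINS) * yaw_step
            poses.append((x, y, z, yaw_deg))
    return poses


# Flat terrain is used here purely as a placeholder.
poses = sample_camera_poses(lambda x, y: 0.0, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the yaw draws reproducible, which is convenient when a subset of frames has to be re-rendered.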
To further enhance diversity and realism, each image was rendered with randomized environmental parameters such as season (winter, spring, summer, autumn), time of day, month, and cloudiness (ranging from clear to overcast). Sun position and lighting conditions were simulated based on the real-world solar angles corresponding to the geo-location of the forest in Umeå, Sweden, and selected time/month. This diversity enables systematic evaluation of perception models under a wide range of conditions, including challenging scenarios such as low sunlight and solar glare.
The captured instance segmentation images were post-processed to extract masks of objects whose nearest point to the camera falls within a 15-meter radius. This distance-based filtering ensures that the annotations remain relevant to near-field perception tasks, such as obstacle avoidance and selective logging. For each retained instance, a 2D bounding box was derived from the segmentation mask, while a 3D bounding box was estimated using the corresponding pixel-wise depth map. A quality control procedure was applied to filter out frames with excessive occlusion or poor visibility, ensuring the dataset only includes scenes with meaningful and usable annotations. The final annotations, including object categories, segmentation masks, and bounding boxes, were saved in the COCO format, ensuring compatibility with widely used training and evaluation pipelines.
Fig. 1: Samples from the SimForest dataset showing aligned RGB images, scene depth maps, instance segmentation masks, and terrain depth maps under diverse environmental conditions. Each column represents a different data sample. (Rows: RGB, Scene depth, Annotations, Terrain depth.)
Fig. 2: Distribution of annotations across categories in the SimForest dataset. (Bar chart of Number of Annotations per Category: Terrain 3,711; Foliage 1,565; Birch 793; Pine 14,012; Spruce 879; Sky 5,000; Rock 202; Snow 1,632; Birch_Trunk 594; Pine_Trunk 11,816; Spruce_Trunk 350.)
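The mask post-processing described in this subsection (distance filtering, 2D boxes from masks, 3D boxes from depth) can be sketched in plain Python. The sketch makes several assumptions that the paper does not confirm: the decoded depth value is treated as the distance to the camera, the camera is an ideal pinhole with square pixels, and the 3D box is axis-aligned in the camera frame. All function names are hypothetical.

```python
import math


def focal_length_px(width_px, hfov_deg):
    """Pinhole focal length in pixels from image width and horizontal FOV."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def mask_to_bbox(mask):
    """COCO-style [x, y, width, height] box of a binary mask (rows of 0/1)."""
    rows = [v for v, row in enumerate(mask) if any(row)]
    cols = [u for row in mask for u, m in enumerate(row) if m]
    if not rows:
        return None
    x0, y0 = min(cols), min(rows)
    return [x0, y0, max(cols) - x0 + 1, max(rows) - y0 + 1]


def keep_instance(mask, depth, max_range_m=15.0):
    """Keep an instance if its nearest masked pixel lies within max_range_m."""
    values = [d for mrow, drow in zip(mask, depth) for m, d in zip(mrow, drow) if m]
    return bool(values) and min(values) <= max_range_m


def mask_to_box3d(mask, depth, fx, cx, cy):
    """Axis-aligned (min_xyz, max_xyz) box in the camera frame.

    ASSUMPTION: depth is planar z-depth and pixels are square (fy == fx).
    """
    points = []
    for v, (mrow, drow) in enumerate(zip(mask, depth)):
        for u, (m, z) in enumerate(zip(mrow, drow)):
            if m:
                points.append(((u - cx) * z / fx, (v - cy) * z / fx, z))
    if not points:
        return None
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs


# With the stated 3840-pixel width and 90 degree horizontal FOV, the focal
# length is about half the image width, since tan(45 degrees) = 1.
fx = focal_length_px(3840, 90.0)
```

The same range check can also be applied to model predictions when they need to be restricted to the annotated 15-meter range.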
III. EXPERIMENTAL EVALUATION
We benchmark the SimForest dataset using state-of-the-art object detection and instance segmentation models to assess its suitability for perception tasks in forest environments. Specifically, we focus on the detection of tree trunks suitable for precision harvesting. For this purpose, we employ the YOLO 11x and YOLO 11x-seg models for object detection and instance segmentation, respectively. All training and evaluation data were drawn exclusively from the SimForest dataset, which spans all four seasons and a diverse range of environmental conditions. This controlled setup isolates the impact of synthetic data quality on model performance, providing a baseline for future sim-to-real transfer studies.
For this study, we derived a dedicated Tree Trunk dataset from SimForest containing 3,086 images and 11,872 annotated tree trunks, retaining only those with a minimum diameter of 10 cm and a minimum height of 2 m. The data were split randomly in an 80–20 ratio, resulting in 2,468 images with 9,567 annotations for training and 618 images with 2,305 annotations for validation.
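As a quick check on the split sizes, a random 80–20 split at the image level can be sketched as below. The seed and the truncating rounding rule are assumptions; truncation happens to reproduce the reported 2,468 and 618 images, but the actual procedure used for the derived dataset may differ.

```python
import random


def split_images(image_ids, train_fraction=0.8, seed=0):
    """Shuffle image ids and split them into (train, validation) lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seed value is an arbitrary placeholder
    n_train = int(len(ids) * train_fraction)  # ASSUMPTION: truncate, not round
    return ids[:n_train], ids[n_train:]


# 3,086 images in the derived Tree Trunk dataset.
train_ids, val_ids = split_images(range(3086))
```

Splitting by image rather than by annotation keeps all trunks from one frame on the same side of the split, which is why the annotation counts (9,567 and 2,305) do not follow the 80–20 ratio exactly.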
Training was initialized from the default YOLO 11x and YOLO 11x-seg weights, leveraging transfer learning from pretrained 640×640 pixel models. The models were trained at an image size of 2,560×2,560 pixels to fit efficiently within the 24 GB VRAM of an NVIDIA RTX 4090 GPU. Standard YOLO augmentations such as random scaling, horizontal flipping, translation, and color jitter were retained, while more complex augmentations including mosaic, mixup, cutmix, and copy-paste were disabled. Training was configured for up to 300 epochs with early stopping enabled with a patience of 100, which triggered at epoch 277 for object detection, while instance segmentation completed all 300 epochs. The training process showed a consistent decrease in training losses over epochs for both object detection and instance segmentation, indicating stable convergence. The individual loss components for object detection (box, classification, and distribution focal losses) and for instance segmentation (including an additional segmentation loss) are illustrated in Fig. 3a and Fig. 3b. To maintain consistency with the dataset's annotation policy, only objects within 15 meters of the camera were considered during post-processing, excluding predicted instances beyond this range. While this is less precise for bounding boxes, it provides a reasonable approximation of the annotation constraints. All files required for creating the derived dataset in YOLO format, as well as for training and validation of the models, are available in a supplementary GitHub repository².
Fig. 3: Training loss curves for object detection and instance segmentation models, showing stable convergence. (a) Object detection (YOLO 11x) training loss. (b) Instance segmentation (YOLO 11x-seg) training loss.
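The training setup described above might be expressed with the Ultralytics Python package roughly as follows. This is a hedged sketch, not the configuration from the supplementary repository: the dataset file name `tree_trunk.yaml` is hypothetical, batch size and other unstated settings are left at library defaults, and the `cutmix` key is assumed to be accepted by the installed Ultralytics release (older releases may not recognize it).

```python
def build_train_overrides(task):
    """Collect the training settings stated in the text for one task.

    task: "detect" for YOLO 11x or "segment" for YOLO 11x-seg.
    """
    weights = {"detect": "yolo11x.pt", "segment": "yolo11x-seg.pt"}[task]
    return {
        "model": weights,           # pretrained weights used for initialization
        "data": "tree_trunk.yaml",  # HYPOTHETICAL dataset config file name
        "imgsz": 2560,              # training image size from the text
        "epochs": 300,              # maximum number of epochs
        "patience": 100,            # early-stopping patience
        "mosaic": 0.0,              # complex augmentations disabled
        "mixup": 0.0,
        "cutmix": 0.0,              # ASSUMPTION: supported by the installed version
        "copy_paste": 0.0,
    }


def train_model(task):
    """Launch training; requires the third-party `ultralytics` package."""
    from ultralytics import YOLO  # imported lazily so the sketch loads without it

    overrides = build_train_overrides(task)
    model = YOLO(overrides.pop("model"))
    return model.train(**overrides)


detect_cfg = build_train_overrides("detect")
segment_cfg = build_train_overrides("segment")
```

Keeping the settings in a plain dictionary makes it straightforward to log them alongside the results or to compare the detection and segmentation runs.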
As shown in Table I, the object detection model achieved strong performance, with a precision of 0.86 and an mAP@50 of 0.92. Although performance decreased to 0.75 under the stricter mAP@50–95 metric, the results remain robust, highlighting the dataset's suitability for near-field detection tasks. Instance segmentation performance was lower, with a precision of 0.69, an mAP@50 of 0.74, and an mAP@50–95 of 0.58, reflecting the increased challenge of fine-grained pixel-level localization in dense forest scenes. These results demonstrate the value of SimForest as a high-quality benchmark for RGBD perception in unstructured outdoor environments, particularly for object detection.
TABLE I: Performance of YOLO 11x and YOLO 11x-seg models for tree trunk detection and segmentation.
Task                    Precision  Recall  mAP@50  mAP@50–95
Object detection        0.8616     0.8685  0.9208  0.7466
Instance segmentation   0.6854     0.7514  0.7379  0.5753
² https://github.com/RISE-Dependable-Transport-Systems/SimForest-YOLO-Toolkit
IV. CONCLUSION
This paper introduces SimForest, a high-resolution synthetic RGBD dataset developed for near-field forest perception tasks such as object detection and instance segmentation. Built using a high-fidelity Unreal Engine 5-based simulator and geo-located to a real forest in northern Sweden, SimForest provides 5,000 images with aligned RGB, depth, and instance segmentation data, along with rich environmental and geometric metadata. The experimental results demonstrate strong detection performance (mAP@50 of 0.92) and solid segmentation results (mAP@50 of 0.74), despite the complexity of natural forest environments. While performance decreases under the stricter mAP@50–95 metric (dropping to 0.75 for detection and 0.58 for segmentation), this is expected given the fine-grained localization and mask precision it requires.
The dataset's photorealistic seasonal rendering makes it particularly well-suited for perception systems operating in boreal forests, while its synthetic nature enables scalable data generation without the logistical challenges of real-world collection. By releasing SimForest as an open-access resource, we aim to support research in forestry robotics, ecological monitoring, and other outdoor applications where high-fidelity perception is essential. Its 4K-resolution and rich multimodal annotations offer a valuable foundation for advancing RGBD perception in natural environments, complementing existing datasets that lack depth information or high-resolution imagery.
In future work, we will explore how SimForest can support transfer learning and domain adaptation in forestry applications. A key objective is to benchmark sim-to-real generalization by testing models trained on synthetic data against real-world forest imagery. This evaluation will provide actionable insights into narrowing the simulation-to-reality gap and enhancing the practical effectiveness of synthetic training for real-world forestry perception systems.
ACKNOWLEDGMENT
This work was carried out within the AGRARSENSE project (Grant Agreement No. 101095835), supported by the Chips JU and its members, including top-up funding from Sweden, Czechia, Finland, Ireland, Italy, Latvia, Netherlands, Norway, Spain, and the National Centre for Research and Development of Poland.
REFERENCES
[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conf. on computer vision, Springer, 2014, pp. 740–755.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[3] J. Lagos, U. Lempiö, and E. Rahtu, “Finnwoodlands dataset,” in Scandinavian Conference on Image Analysis, Springer, 2023, pp. 95–110.
[4] Y. Lu, Y. Huang, S. Sun, S. Fei, and V. Chen, “Putree: A photorealistic large-scale virtual benchmark for forest training,” in 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2024, pp. 687–688. DOI: 10.1109/VRW62533.2024.00140.
[5] FrostBit Software Lab (Lapland UAS), Agrarsense simulator, Accessed: 2025-07-21. [Online]. Available: https://dev.azure.com/AMKFrostBit/AGRARSENSE.
[6] Z. Feng, Y. She, and S. Keshav, “Spread: A large-scale, high-fidelity synthetic dataset for multiple forest vision tasks,” Ecological Informatics, vol. 87, p. 103085, 2025, ISSN: 1574-9541. DOI: https://doi.org/10.1016/j.ecoinf.2025.103085.
[7] G. Jocher and J. Qiu, Ultralytics YOLO11, version 11.0.0, 2024. [Online]. Available: https://github.com/ultralytics/ultralytics.
[8] R. R. Avula and A. Narkilahti, Simforest: Rgbd instance segmentation dataset, Zenodo, Jul. 2025. DOI: 10.5281/zenodo.15911876.
[9] CORDIS, AGRARSENSE - Smart, digitalized components and systems for data-based Agriculture and Forestry. [Online]. Available: https://cordis.europa.eu/project/id/101095835.
