
Gencer Sumbul, Arne de Wall, Tristan Kreuziger, Filipe Marcelino,
Hugo Costa, Pedro Benevides, Mário Caetano, Begüm Demir,
Volker Markl
BigEarthNet-MM: A Large-Scale, Multimodal,
Multilabel Benchmark Archive for Remote Sensing
Image Classification and Retrieval
Open Access via institutional repository of Technische Universität Berlin
Document type
Journal article | Accepted version
(i. e. final author-created version that incorporates referee comments and is the version accepted for
publication; also known as: Author’s Accepted Manuscript (AAM), Final Draft, Postprint)
This version is available at
https://doi.org/10.14279/depositonce-14945
Citation details
Sumbul, Gencer; de Wall, Arne; Kreuziger, Tristan; Marcelino, Filipe; Costa, Hugo; Benevides, Pedro; Caetano,
Mário; Demir, Begüm; Markl, Volker (2021): BigEarthNet-MM: A Large-Scale, Multimodal, Multilabel
Benchmark Archive for Remote Sensing Image Classification and Retrieval [Software and Data Sets]. In: IEEE
Geoscience and Remote Sensing Magazine, vol. 9, no. 3, pp. 174–180, Sept. 2021,
https://doi.org/10.1109/MGRS.2021.3089174.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other
uses, in any current or future media, including reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of
any copyrighted component of this work in other works.
Terms of use
This work is protected by copyright and/or related rights. You are free to use this work in any way permitted by
the copyright and related rights legislation that applies to your usage. For other uses, you must obtain
permission from the rights-holder(s).

BigEarthNet-MM: A Large Scale Multi-Modal
Multi-Label Benchmark Archive for Remote
Sensing Image Classification and Retrieval
Gencer Sumbul, Graduate Student Member, IEEE, Arne de Wall, Student Member, IEEE,
Tristan Kreuziger, Student Member, IEEE, Filipe Marcelino, Hugo Costa, Pedro Benevides, Mário Caetano,
Begüm Demir, Senior Member, IEEE, Volker Markl
Abstract—This paper presents the multi-modal BigEarthNet
(BigEarthNet-MM) benchmark archive made up of 590,326 pairs
of Sentinel-1 and Sentinel-2 image patches to support deep
learning (DL) studies in multi-modal multi-label remote sensing
(RS) image retrieval and classification. Each pair of patches
in BigEarthNet-MM is annotated with multi-labels provided by
the CORINE Land Cover (CLC) map of 2018 based on its
thematically most detailed Level-3 class nomenclature. Our initial
research demonstrates that some CLC classes are difficult
to describe accurately by considering only (single-date)
BigEarthNet-MM images. In this paper, we also introduce an
alternative class-nomenclature as an evolution of the original
CLC labels to address this problem. This is achieved by interpret-
ing and arranging the CLC Level-3 nomenclature based on the
properties of BigEarthNet-MM images in a new nomenclature
of 19 classes. In our experiments, we show the potential of
BigEarthNet-MM for multi-modal multi-label image retrieval
and classification problems by considering several state-of-the-
art DL models. We also demonstrate that the DL models
trained from scratch on BigEarthNet-MM outperform those pre-
trained on ImageNet, especially in relation to some complex
classes, including agriculture and other vegetated and natural
environments. We make all the data and the DL models publicly
available at https://bigearth.net, offering an important resource
to support studies on multi-modal image scene classification and
retrieval problems in RS.
Index Terms—Multi-modal learning, multi-label image re-
trieval, image classification, deep learning, remote sensing.
I. INTRODUCTION
As a result of advancements in satellite technology, recent
years have witnessed a significant increase in the volume
of remote sensing (RS) image archives. Accordingly, the
development of accurate scene classification and content based
image retrieval (CBIR) systems in massive image archives
has attracted great attention in RS. CBIR systems aim to
achieve an efficient and precise retrieval of RS images from
large archives that are similar to a query image [1], [2].
RS image scene classification systems aim at automatically
assigning class labels to each RS image scene in a large
archive [3], [4]. Deep learning (DL) based methods have
Gencer Sumbul, Arne de Wall, Tristan Kreuziger, Begüm Demir and Volker
Markl are with Technische Universität Berlin, Berlin, Germany.
Filipe Marcelino, Hugo Costa, Pedro Benevides, and Mário Caetano are
with Direção-Geral do Território (DGT), Lisbon, Portugal. Hugo Costa and
Mário Caetano are also with NOVA Information Management School (NOVA
IMS), Universidade Nova Lisboa, Campus de Campolide, 1070-312 Lisbon,
Portugal.
recently seen a rise in popularity in the context of RS image
scene classification and retrieval problems. Most DL models
require a high amount of annotated images during training
to optimize all parameters and reach a high performance. The
availability and quality of such data determine the feasibility of
many DL models. There are several benchmark archives made
publicly available for different RS applications (e.g., pixel-
based image classification). For a comprehensive list, we refer
the reader to [5]. To the best of our knowledge, most of the
existing publicly available benchmark archives for image scene
classification and retrieval problems contain: 1) single-modal
RS images (e.g., multispectral or SAR); and 2) single-label
image annotations (i.e., each image is annotated by a single
label that is associated with the most significant content of the
considered image). However, multi-modal images associated
with the same geographical area allow for rich characterization
of RS images and thus improve image retrieval performance
when jointly considered [6]. In addition, RS images usually
contain areas with a high variety of semantically complex con-
tent that must be reflected by more than one class annotation
through multiple class labels (multi-labels).
Thus, a benchmark archive consisting of multi-modal images
annotated with multi-labels is needed. However, annotating
RS images with multi-labels at a large scale to drive DL
studies is time consuming, complex, and costly in operational
scenarios. To overcome this problem, a common approach
is to exploit DL models with proven architectures (such as
ResNet [7] or VGG [8]), which are pre-trained on publicly
available general purpose datasets in the computer vision (CV)
community (e.g., ImageNet [9]). The existing model is then
fine-tuned on a small set of RS images annotated with multi-
labels to calibrate the final layers. This strategy is also known
as a transfer learning strategy. There are several versions of
the above-mentioned models that have been pre-trained on
large-scale datasets in CV. However, we argue that this is
not a proper approach in RS, because of the differences in
image characteristics in CV and RS. For example, Sentinel-2
multispectral images have 13 spectral bands associated with
varying and lower spatial resolutions compared to the CV
images. In addition, the semantic content present in CV and
RS images is significantly different, and thus the respective
semantic classes differ from each other. To address this issue,
we have recently introduced BigEarthNet [10] as a large-scale
single-modal benchmark archive for RS image search, retrieval

and classification. BigEarthNet contains 590,326 Sentinel-
2 image patches annotated with multi-labels provided by
the CORINE Land Cover (CLC) map of 2018 (CLC 2018)
[11]. The CLC nomenclature includes land cover and land
use classes grouped in a three-level hierarchy, and for the
BigEarthNet image patches, the most thematically detailed
Level-3 class nomenclature is considered. However, some CLC
classes are difficult to identify by exploiting only
(single-date) Sentinel-2 images, because: i) land
use concepts associated with some classes (e.g., Dump sites,
Sport and leisure facilities) may not be visible from space
or fully recognizable with the spatial resolution of Sentinel-2
images, and ii) RS time series, which BigEarthNet does not
include, may be required to describe and discriminate some
classes (e.g., Non-irrigated arable land, Permanently irrigated
land). In addition, BigEarthNet is not suitable for developing
and validating multi-modal learning-based algorithms,
since it only contains Sentinel-2 image patches.
To overcome these issues, in this paper we introduce the
multi-modal BigEarthNet (BigEarthNet-MM) that contains
590,326 pairs of Sentinel-2 and Sentinel-1 image patches.
We also introduce an alternative nomenclature for images in
BigEarthNet-MM as an evolution of the original CLC labels.
Fig. 1 shows an example of the BigEarthNet-MM image pairs
and their multi-labels from the new nomenclature.
II. DESCRIPTION OF BIGEARTHNET-MM
BigEarthNet-MM contains 590,326 pairs of Sentinel-1 and
Sentinel-2 image patches acquired over 10 different European
countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithua-
nia, Luxembourg, Portugal, Serbia, Switzerland). Sentinel-
2 patches of BigEarthNet-MM are taken from the original
BigEarthNet [10]. To construct these patches, 125 Sentinel-2
tiles associated with less than 1% of cloud cover and acquired
between June 2017 and May 2018 were considered. All tiles
were atmospherically corrected by employing Sentinel-2’s
Level 2A product generation and formatting tool (sen2cor)
provided by the European Space Agency due to its proven
success in the literature. After the atmospheric correction, the
10th band of each image patch is not available anymore, as it
is the cirrus band (which is omitted in the Level 2A output for
its lack of surface information). Then, the tiles were divided
into 590,326 non-overlapping image patches, each of which
is a section of: 1) 120 × 120 pixels for 10m bands; 2) 60 × 60
pixels for 20m bands; and 3) 20 × 20 pixels for 60m bands.
One important goal during the tile selection process was to
represent all chosen geographic locations with images acquired
in different seasons. Due to the restrictions of finding tiles with
a low cloud cover percentage in the relatively narrow time
period, this has not been possible at each considered location.
Accordingly, the following respective numbers of patches for
autumn, winter, spring, and summer have been considered:
143,557, 72,877, 175,937, and 126,913. For the quality check
of patches, visual inspection was also employed, which led to
the identification of 70,987 Sentinel-2 image patches that are
fully covered by seasonal snow, cloud, and cloud shadow (see footnote 1).
Footnote 1: The lists are available at http://bigearth.net/#downloads.
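As a quick consistency check on the patch geometry and seasonal counts given above, the following minimal Python sketch computes the ground extent of one patch side for each band resolution and the share of patches per season. The names PATCH_PIXELS, SEASON_COUNTS, and patch_extent_m are introduced here for illustration only and are not part of the archive's tooling.

```python
# Patch side length in pixels for each Sentinel-2 band resolution (in metres),
# as stated in the text.
PATCH_PIXELS = {10: 120, 20: 60, 60: 20}


def patch_extent_m(resolution_m: int) -> int:
    """Ground extent of one patch side in metres."""
    return PATCH_PIXELS[resolution_m] * resolution_m


extents = {res: patch_extent_m(res) for res in PATCH_PIXELS}

# Number of patches per season, as listed in the text.
SEASON_COUNTS = {
    "autumn": 143_557,
    "winter": 72_877,
    "spring": 175_937,
    "summer": 126_913,
}
total = sum(SEASON_COUNTS.values())
shares = {s: round(100 * n / total, 1) for s, n in SEASON_COUNTS.items()}

print(extents)  # every resolution yields the same 1200 m side length
print(total, shares)
```

Under these numbers, each patch covers 1200 m × 1200 m on the ground regardless of band resolution, and spring contributes the largest share of patches.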
To construct the Sentinel-1 patches of BigEarthNet-MM,
325 Sentinel-1 Ground Range Detected (GRD) products ac-
quired between June 2017 and May 2018 that jointly cover
the area of all original 125 Sentinel-2 tiles with close temporal
proximity were selected and processed. The selected scenes
provide dual-polarized information channels (VV and VH)
and are based on the interferometric wide swath (IW) mode,
which is the main acquisition mode over land. All scenes
were pre-processed by using the Sentinel-1 toolbox (S1TBX)
and the graph processing framework (GPF) of ESA’s Sentinel
Application Platform (SNAP). This includes the application
of precise orbit files, border and thermal noise removal,
radiometric calibration, and geometric correction (i.e., Range
Doppler terrain correction). Depending on the spatial extent
of the scene, either the SRTM 30 (for scenes below 60°
latitude) or the ASTER DEM (for scenes above 60° latitude,
where no SRTM 30 exists) was employed in the geometric
correction to project images from slant range to ground range.
Finally, the backscatter coefficient was converted to a decibel
(dB) scale. It is worth noting that, since the selection of the
speckle filter is considered to be application dependent, no
speckle filtering was applied in our pre-processing workflow
in order to preserve the full resolution. This approach is
also recommended by the Product Family Specification for
SAR of the CEOS Analysis Ready Data for Land (CARD4L)
framework (see footnote 2). Based on the pre-processed Sentinel-1 scenes, for
each Sentinel-2 patch, a corresponding Sentinel-1 patch with
a close timestamp was extracted. In addition, each Sentinel-1
patch inherited the annotations of the corresponding Sentinel-2
patch. The resulting Sentinel-1 image patches have a spatial
resolution of 10m.
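The pre-processing itself was carried out with SNAP, and the sketch below does not call SNAP. It only illustrates, in plain Python, two steps described above: the choice of DEM by latitude and the conversion of the backscatter coefficient to decibels using the standard relation 10 · log10(x). The helper names to_decibel and choose_dem are hypothetical, and the handling of a scene at exactly 60° latitude is an assumption, since the text does not specify it.

```python
import math


def to_decibel(sigma0_linear: float) -> float:
    """Convert a linear backscatter coefficient to decibels: 10 * log10(x)."""
    if sigma0_linear <= 0:
        raise ValueError("backscatter must be positive to take a logarithm")
    return 10.0 * math.log10(sigma0_linear)


def choose_dem(latitude_deg: float) -> str:
    """Pick the DEM by scene latitude, following the rule described in the text.

    Assumption: a scene at exactly 60 degrees falls back to ASTER DEM.
    """
    return "SRTM 30" if latitude_deg < 60.0 else "ASTER DEM"


print(to_decibel(1.0))   # a linear value of 1 corresponds to 0 dB
print(choose_dem(47.0))  # e.g. a scene over Austria
print(choose_dem(65.0))  # e.g. a scene over northern Finland
```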
A. Class-Nomenclature of BigEarthNet-MM
Each pair (which is made up of Sentinel-1 and Sentinel-
2 image patches acquired in the same geographical area)
in BigEarthNet-MM is associated with one or more class
labels (i.e. multi-labels) extracted from the CORINE land
cover map of 2018. CORINE land cover (CLC) is a pioneering
initiative launched in the 1980s to produce
harmonized land cover land use (LCLU) maps in vector format
for the member states of the European Union. According
to the validation report of the CLC, the accuracy is around
85% [12]. Nowadays, CLC covers 39 countries from Europe
and was produced for five reference years, 1990, 2000, 2006,
2012, and 2018. The latter was produced with data of 2017-
2018, which matches the time frame of the images included in
BigEarthNet. The motivation for undertaking such a large-scale mapping
endeavor was to meet the demand for spatially explicit
and harmonized information on land for a variety of purposes,
such as environmental management and decision making. The
limited state of 1980s technology and the large
spectrum of potential uses of the maps led to the definition of a
coarse spatial resolution and a nomenclature with some broad
class definitions, mixing land cover and land use concepts.
These definitions are implemented for map production by
visual interpretation of RS images and additional data in most
Footnote 2: https://ceos.org/ard/

countries. Additional data may include very high spatial reso-
lution imagery and official spatial data sets like land registers,
often to infer the land use. The same technical specifications
were preserved in map updating for historical consistency.
Thus, the five CLC maps produced have a minimum mapping
unit of 25 ha and a minimum mapping width of 100 m,
and provide information on land according to a hierarchical
nomenclature of 44 classes at the most detailed level (Level-3).
The image patches in BigEarthNet-MM are representative of
43 CLC classes. When CLC maps are used as
labeling sources for training machine learning methods
to automatically analyse RS images, modified versions
of the CLC nomenclature (which better fit the purpose of
the considered application) are commonly preferred. One of
the main reasons is that RS systems directly observe the land
cover rather than the land use. The CLC land-use based labels
may not be fully recognizable in RS images unless
the images have a very high spatial resolution. As an
example, in [13] CLC is used as a basis to collect training
data for supervised RS image classification, but classes such
as Discontinuous urban fabric and Sport and leisure facilities
that depend mainly on land use were removed. A deep revision
of the CLC program is currently under consideration following
the concept of the EIONET Action Group on Land monitoring
in Europe (EAGLE) [14].
To do more justice to the properties of BigEarthNet-MM
image pairs, we introduce a new class-nomenclature by
modifying the multi-labels extracted from the CLC 2018.
To this end, the CLC Level-3 nomenclature is interpreted
and arranged in a new nomenclature of 19 classes (see footnote 3). Ten
classes of the original CLC nomenclature are maintained in
the new nomenclature, 22 classes are grouped into 9 new
classes, and 11 classes are removed. The classes maintained
are semantically homogeneous and largely related to land
cover, such as Broad-leaved forest and Beaches, dunes, sands.
Furthermore, CLC classes that cannot feasibly be identified
by using only single-date BigEarthNet-MM images are removed,
such as Burnt areas. Complex classes (which are often removed
when undertaking image classification) are maintained,
such as Complex cultivation patterns and Land principally
occupied by agriculture, with significant areas of natural
vegetation. The goal is to investigate the ability of DL models
to learn from spatial patterns that express semantic classes.
Classes are grouped when sharing similar land cover types
and spectral patterns. For example, Moors and heathland and
Sclerophyllous vegetation are grouped in a single class, and a
new class, Arable land, groups similar crops that require dense
time series (which are not available in BigEarthNet-MM) for their
discrimination (e.g., irrigated and non-irrigated crops). Classes
that strongly depend on land use or need additional data for
their discrimination are removed. For example, the class Airports
essentially relates to land use, and Intertidal flats appear in RS
images either with or without water depending on the image
acquisition time and hence require appropriate time series for
their classification. The number of labels associated with each
image pair varies between 1 and 12, while 96.80% of image
Footnote 3: https://bigearth.eu/BigEarthNetListofClasses.pdf
pairs are not associated with more than 5 labels. Only 23 image
pairs are annotated with more than 9 labels.
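The three operations described above (maintain, group, remove) can be pictured as a lookup from CLC Level-3 names to the new class names. The following Python sketch is illustrative only: CLC_TO_NEW and convert_labels are hypothetical names, the mapping is partial and lists only classes mentioned in the text, and the assignment of Non-irrigated arable land and Permanently irrigated land to Arable land is an assumption inferred from the examples given. The complete mapping is the one in the class list referenced by the paper.

```python
from typing import Dict, List, Optional, Set

# Partial, illustrative mapping from CLC Level-3 names to the 19-class
# nomenclature. None marks a removed class.
CLC_TO_NEW: Dict[str, Optional[str]] = {
    # maintained classes
    "Broad-leaved forest": "Broad-leaved forest",
    "Beaches, dunes, sands": "Beaches, dunes, sands",
    "Complex cultivation patterns": "Complex cultivation patterns",
    # grouped classes
    "Moors and heathland": "Moors, heathland and sclerophyllous vegetation",
    "Sclerophyllous vegetation": "Moors, heathland and sclerophyllous vegetation",
    "Non-irrigated arable land": "Arable land",    # assumed grouping
    "Permanently irrigated land": "Arable land",   # assumed grouping
    # removed classes
    "Burnt areas": None,
    "Airports": None,
    "Intertidal flats": None,
}


def convert_labels(clc_labels: List[str]) -> List[str]:
    """Map CLC labels to the new nomenclature, dropping removed classes
    and merging duplicates that arise from grouping."""
    new: Set[str] = set()
    for label in clc_labels:
        target = CLC_TO_NEW[label]  # raises KeyError for names not listed here
        if target is not None:
            new.add(target)
    return sorted(new)


# Bookkeeping from the text: 10 maintained + 9 grouped = 19 new classes,
# drawn from 10 + 22 + 11 = 43 CLC classes present in the archive.
assert 10 + 9 == 19 and 10 + 22 + 11 == 43

print(convert_labels(["Non-irrigated arable land", "Permanently irrigated land"]))
print(convert_labels(["Airports"]))  # all labels removed, so the list is empty
```

The second call shows how a pair whose CLC labels are all removed ends up without any label in the new nomenclature.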
III. EXPERIMENTS
A. Experimental Design
The experiments were carried out in the context of content
based multi-modal multi-label RS image retrieval and classi-
fication. To achieve multi-modal learning, we stacked the VV
and VH bands of Sentinel-1 image patches, and the Sentinel-
2 bands associated with 10m and 20m spatial resolution into
one volume for each pair in BigEarthNet-MM. To this end, we
initially applied cubic interpolation to 20m bands of Sentinel-2
image patches. In the experiments, we did not use the Sentinel-
2 image bands associated with 60m spatial resolution (bands
1 and 9). This is due to the fact that these bands are mainly
used for cloud screening, atmospheric correction, and cirrus
detection in RS applications and do not embody a significant
amount of information for the characterization of semantic
content of RS images. In the experiments, we considered
the VGG model [8] and the ResNet model [7] with various
numbers of layers (VGG16, VGG19, ResNet50, ResNet101,
ResNet152). To fairly compare all models, we utilized the
Adam optimizer [15] with an initial learning rate of 10^-3 to
minimize the sigmoid cross-entropy loss. Except for the learning
rate, we employed the same parameter values given in [3], [7],
[8]. The batch size is set to 256 for ResNet152 and to 500 for
all other models used in the experiments. We applied training
from scratch for 100 epochs, while the final layers of the pre-
trained models were fine-tuned separately on each modality
for 10 epochs. For all the models, we added a fully connected
layer that includes 19 neurons at the end of the network
for the classification. For image retrieval, we extracted image
features from the considered models and applied similarity
matching of the features based on the 2-distance measure. We
performed various experiments to analyze the effectiveness of:
i) learning from BigEarthNet-MM directly (through training
from scratch) instead of using the pre-trained models on
ImageNet; and ii) state-of-the-art CNN models trained and
evaluated on BigEarthNet-MM. To use the pre-trained models
on ImageNet, we used the late fusion of separately fine-
tuned models on Sentinel-1 and Sentinel-2 patches. In the
experiments, we did not use the Sentinel-2 patches that are
fully covered by seasonal snow, cloud, and cloud shadow.
After the arrangement of the new class nomenclature, 57 pairs
among the 590,326 pairs are not associated with any LCLU
labels. These pairs are not used in the experiments. We divided
the remaining dataset into: i) the training set of 269,695 pairs
of patches, ii) validation set of 123,723 pairs of patches, and
iii) the test set of 125,866 pairs of patches.
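To make the input and output shapes of this setup concrete, the following Python sketch counts the stacked input channels and evaluates a sigmoid cross-entropy loss on a multi-hot target of 19 classes. It is a minimal illustration in plain Python, not the implementation used in the experiments: the function names are hypothetical, and averaging the loss over classes (rather than summing) is an assumption, since the text does not state the reduction.

```python
import math
from typing import List, Sequence

NUM_CLASSES = 19  # size of the final fully connected layer, per the text

# Input channels after stacking: 2 Sentinel-1 bands (VV, VH), the four
# Sentinel-2 10m bands (B02, B03, B04, B08), and the six 20m bands
# (B05, B06, B07, B8A, B11, B12). The 60m bands are left out.
NUM_CHANNELS = 2 + 4 + 6


def multi_hot(label_ids: Sequence[int], num_classes: int = NUM_CLASSES) -> List[float]:
    """Encode a set of class indices as a 0/1 target vector."""
    target = [0.0] * num_classes
    for i in label_ids:
        target[i] = 1.0
    return target


def sigmoid_cross_entropy(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean per-class binary cross-entropy computed on raw logits.

    Uses the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, z in zip(logits, targets):
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


target = multi_hot([0, 2])  # a pair annotated with two of the 19 classes
uninformed = sigmoid_cross_entropy([0.0] * NUM_CLASSES, target)
confident = sigmoid_cross_entropy([10.0 if t else -10.0 for t in target], target)
print(NUM_CHANNELS, uninformed, confident)
```

With all logits at zero the loss equals ln 2 per class, and it approaches zero as the logits move confidently toward the correct side for every class.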
We performed our experiments on a cluster of 4 NVIDIA
Tesla V100 GPUs. The results of multi-modal multi-label im-
age classification were provided in terms of four performance
metrics: 1) Hamming loss (HL); 2) one-error (OE); 3) recall
(R); and 4) F2-Score (F2). For a detailed description of the
considered metrics, the reader is referred to [3].
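For readers who want a concrete picture of these metrics, the following Python sketch implements Hamming loss, one-error, and a sample-averaged F2 score on 0/1 label matrices. The definitions follow common usage in the multi-label literature and may differ in detail from those in [3]; all function names are illustrative.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # one 0/1 row per sample


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of (sample, class) entries that are predicted wrongly."""
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (len(y_true) * len(y_true[0]))


def one_error(y_true: Labels, scores: Sequence[Sequence[float]]) -> float:
    """Fraction of samples whose top-scored class is not a true label."""
    misses = 0
    for yt, s in zip(y_true, scores):
        top = max(range(len(s)), key=lambda i: s[i])
        misses += 1 if yt[top] == 0 else 0
    return misses / len(y_true)


def f_beta(yt: Sequence[int], yp: Sequence[int], beta: float = 2.0) -> float:
    """F-beta for one sample; beta = 2 weights recall more than precision."""
    tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def sample_f2(y_true: Labels, y_pred: Labels) -> float:
    """F2 score averaged over samples."""
    return sum(f_beta(yt, yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Toy example with two samples and three classes.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
scores = [[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]]
print(hamming_loss(y_true, y_pred))  # 1 wrong entry out of 6
print(one_error(y_true, scores))     # the second sample's top class is wrong
print(sample_f2(y_true, y_pred))     # mean of 5/9 and 1
```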

TABLE I
CLASS-BASED F2 SCORES (%) OBTAINED WHEN: I) TRANSFER LEARNING
FROM IMAGENET AND II) DIRECT LEARNING FROM BIGEARTHNET-MM
ARE USED FOR MULTI-MODAL MULTI-LABEL IMAGE CLASSIFICATION.
Class | Transfer Learning From ImageNet | Learning From BigEarthNet-MM
Urban fabric | 56.27 | 71.99
Industrial or commercial units | 30.98 | 43.21
Arable land | 80.05 | 83.62
Permanent crops | 4.32 | 55.52
Pastures | 50.98 | 74.77
Complex cultivation patterns | 36.29 | 62.03
Land principally occupied by agriculture, with significant areas of natural vegetation | 30.36 | 60.63
Agro-forestry areas | 2.13 | 71.87
Broad-leaved forest | 42.83 | 75.39
Coniferous forest | 75.47 | 86.32
Mixed forest | 72.19 | 81.31
Natural grassland and sparsely vegetated areas | 14.11 | 43.88
Moors, heathland and sclerophyllous vegetation | 5.29 | 59.91
Transitional woodland-shrub | 41.23 | 64.21
Beaches, dunes, sands | 43.67 | 63.39
Inland wetlands | 8.20 | 57.81
Coastal wetlands | 4.79 | 42.23
Inland waters | 63.23 | 82.10
Marine waters | 93.99 | 97.20
Average | 39.81 | 67.23
B. Experimental Results
1) Comparison among the Strategies of Learning directly
from BigEarthNet-MM and Transfer Learning from the Im-
ageNet: In the first set of experiments, we compare the
effectiveness of learning directly from BigEarthNet-MM with
respect to transfer learning from ImageNet. To this end, the
transfer learning strategy is applied by using the
ResNet50 model pre-trained on ImageNet, while the direct learning
strategy is employed by using the ResNet50 trained from
scratch on BigEarthNet-MM. Table I shows the class-based
F2 classification scores (known also as macro-averaged F2
scores [3]). By analyzing the table, one can see that learning
directly from BigEarthNet-MM achieves the higher score for
each class compared to the transfer learning strategy. As an
example, learning directly from BigEarthNet-MM provides
scores that are more than 12 and 25 percentage points higher for
the classes Industrial or commercial units and Complex cultivation patterns,
respectively, compared to the transfer learning strategy. The
difference in performance between these learning strategies
is more evident for more complex LCLU classes. As an
example, learning directly from BigEarthNet-MM improves
the F2 scores by more than 54 and 69 percentage points for the classes Moors,
heathland and sclerophyllous vegetation and Agro-forestry
areas, respectively.
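The averages and per-class gains quoted above can be re-derived from the Table I values. The following Python sketch is only a bookkeeping aid over the published numbers; TABLE_I, macro_average, and gains are illustrative names.

```python
# (class, F2 with transfer learning from ImageNet, F2 with learning from BigEarthNet-MM)
TABLE_I = [
    ("Urban fabric", 56.27, 71.99),
    ("Industrial or commercial units", 30.98, 43.21),
    ("Arable land", 80.05, 83.62),
    ("Permanent crops", 4.32, 55.52),
    ("Pastures", 50.98, 74.77),
    ("Complex cultivation patterns", 36.29, 62.03),
    ("Land principally occupied by agriculture, with significant "
     "areas of natural vegetation", 30.36, 60.63),
    ("Agro-forestry areas", 2.13, 71.87),
    ("Broad-leaved forest", 42.83, 75.39),
    ("Coniferous forest", 75.47, 86.32),
    ("Mixed forest", 72.19, 81.31),
    ("Natural grassland and sparsely vegetated areas", 14.11, 43.88),
    ("Moors, heathland and sclerophyllous vegetation", 5.29, 59.91),
    ("Transitional woodland-shrub", 41.23, 64.21),
    ("Beaches, dunes, sands", 43.67, 63.39),
    ("Inland wetlands", 8.20, 57.81),
    ("Coastal wetlands", 4.79, 42.23),
    ("Inland waters", 63.23, 82.10),
    ("Marine waters", 93.99, 97.20),
]


def macro_average(column: int) -> float:
    """Unweighted mean of one score column over all classes."""
    return sum(row[column] for row in TABLE_I) / len(TABLE_I)


# Gain of direct learning over transfer learning, in percentage points.
gains = {name: round(direct - transfer, 2) for name, transfer, direct in TABLE_I}

print(round(macro_average(1), 2), round(macro_average(2), 2))
print(max(gains, key=gains.get), min(gains, key=gains.get))
```

The largest gain is obtained for Agro-forestry areas and the smallest for Marine waters, which transfer learning already classifies well.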
In the context of image retrieval, Fig. 1 shows an example
of a query pair and the pairs of images retrieved by these
strategies. By assessing the figure, one can observe that
when learning is achieved directly from BigEarthNet-MM,
semantically more similar pairs of images are retrieved,
containing the Urban fabric and Arable land classes present
in the query. Learning directly from BigEarthNet-MM leads to
retrieval of a pair similar to the query even at the 100th retrieval
order. However, using the transfer learning strategy results in
retrieval of pairs that contain Urban fabric and Arable land
classes which are not present in the query pair. One can
observe this behavior even at the 5th retrieved pair.
TABLE II
OVERALL MULTI-LABEL CLASSIFICATION RESULTS UNDER DIFFERENT
METRICS AND DL MODELS FOR BIGEARTHNET-MM.
Model | HL | OE (%) | R (%) | F2 (%)
VGG16 | 0.078 | 7.35 | 76.97 | 76.18
VGG19 | 0.080 | 8.12 | 76.17 | 75.35
ResNet50 | 0.074 | 5.93 | 80.05 | 78.73
ResNet101 | 0.074 | 6.46 | 78.85 | 77.88
ResNet152 | 0.073 | 6.42 | 78.13 | 77.46
The success of learning directly from
BigEarthNet-MM is mainly due to the facts that: 1) transfer learning
from ImageNet limits the accurate characterization of the
spectral content of RS images; 2) fine-tuning the model pre-trained
on ImageNet by using RS images may not be sufficient
to eliminate the semantic gap, since the category labels present
in ImageNet are different from the land-cover class labels
present in BigEarthNet-MM; and 3) the pre-trained model was
trained for a single-label image classification scenario, and
thus limits the accurate characterization of the multiple land
cover classes present in BigEarthNet-MM.
2) Comparison of State-of-the-Art CNN Models: In the
second set of experiments, we compare the effectiveness of
the VGG and the ResNet models in the framework of multi-
modal multi-label classification. Table II shows the overall
classification results under different metrics (which are the
sample-averaged scores [3]). By analyzing the table, one can
observe that the ResNet model provides the best scores
in all metrics. As an example, ResNet50 achieves more than
2% higher recall and F2 scores compared to the VGG models.
This improvement is due to the residual connections of the
ResNet model and its increased depth in terms of the number
of layers compared to the VGG model. Increasing the depth
of the considered models does not significantly affect the
performance, i.e., similar scores are obtained in all the metrics
under different depth values of the same model.
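The comparison above can be checked directly against the Table II values, keeping in mind that lower is better for Hamming loss and one-error while higher is better for recall and F2. The Python sketch below is a small bookkeeping aid over the published numbers, with illustrative names.

```python
# model: (Hamming loss, one-error %, recall %, F2 %), copied from Table II
TABLE_II = {
    "VGG16": (0.078, 7.35, 76.97, 76.18),
    "VGG19": (0.080, 8.12, 76.17, 75.35),
    "ResNet50": (0.074, 5.93, 80.05, 78.73),
    "ResNet101": (0.074, 6.46, 78.85, 77.88),
    "ResNet152": (0.073, 6.42, 78.13, 77.46),
}
HL, OE, R, F2 = range(4)

# Best model per metric: minimum for HL and OE, maximum for R and F2.
best = {
    "HL": min(TABLE_II, key=lambda m: TABLE_II[m][HL]),
    "OE": min(TABLE_II, key=lambda m: TABLE_II[m][OE]),
    "R": max(TABLE_II, key=lambda m: TABLE_II[m][R]),
    "F2": max(TABLE_II, key=lambda m: TABLE_II[m][F2]),
}

# Margin of ResNet50 over the better VGG variant, in percentage points.
vgg = [m for m in TABLE_II if m.startswith("VGG")]
recall_margin = TABLE_II["ResNet50"][R] - max(TABLE_II[m][R] for m in vgg)
f2_margin = TABLE_II["ResNet50"][F2] - max(TABLE_II[m][F2] for m in vgg)

# Spread of F2 across the three ResNet depths.
resnet_f2 = [TABLE_II[m][F2] for m in TABLE_II if m.startswith("ResNet")]
f2_spread = max(resnet_f2) - min(resnet_f2)

print(best, round(recall_margin, 2), round(f2_margin, 2), round(f2_spread, 2))
```

A ResNet variant is best on every metric, and the F2 spread across ResNet depths is small compared with the margin over the VGG models.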
IV. DISCUSSION AND CONCLUSION
In this paper, we have presented the BigEarthNet-MM
benchmark archive that contains 590,326 pairs of Sentinel-1
and Sentinel-2 image patches with a new CLC-based class-
nomenclature to do more justice to the properties of the
considered images. BigEarthNet-MM makes a significant ad-
vancement for the use of DL in RS, opening up promising
directions to support research studies in the framework of
multi-modal multi-label RS image scene classification and
retrieval. BigEarthNet-MM is suitable to assess DL based
methods for: i) learning from class-imbalanced multi-modal