Deep Image Representation Learning
for Knowledge Discovery from Earth
Observation Data Archives
submitted by
M. Sc.
GENCER SÜMBÜL
ORCID: 0000-0003-3690-3052
to Faculty IV - Electrical Engineering and Computer Science
of the Technische Universität Berlin
for the award of the academic degree
Doktor der Ingenieurwissenschaften
- Dr.-Ing. -
approved dissertation
Doctoral committee:
Chair: Prof. Dr. Matthias Boehm
Reviewer: Prof. Dr. Begüm Demir
Reviewer: Prof. Dr. Farid Melgani
Reviewer: Prof. Dr. Claudio Persello
Date of the scientific defence: 9 May 2023
Berlin 2023
Abstract
Advances in remote sensing (RS) technology have increased the availability of images
regularly acquired by satellite-borne and airborne sensors, while free data policies
give researchers access to massive Earth observation data archives. To
automatically extract knowledge from these archives at large scale, deep learning
(DL) based RS image representation learning (IRL) has attracted great attention.
However, existing methods are limited in: i) accurate characterization of the high-level
semantic content and spectral information present in RS images; ii) modelling
RS image similarities by exploiting multi-label training images; iii) time-efficient
and scalable information extraction; iv) effective IRL under noisy training labels;
and v) joint use of multiple learning tasks for describing the complex content of RS
images. This thesis aims to develop advanced DL-based IRL methods to tackle these
limitations, with particular attention devoted to image scene classification and
content-based image retrieval (CBIR) problems due to their importance for large-scale
knowledge discovery. In detail, we propose five DL-based IRL methods throughout
the thesis. First, a multi-label classification approach is introduced to accurately
describe complex spatial and spectral content of high-spatial resolution RS images,
where several spectral bands are associated with varying spatial resolutions. Second,
we propose an image triplet sampling method for IRL through the characterization
of RS image similarities, which forms the foundation for CBIR. Among multi-label
training images, this method selects a small set of the most representative and
informative image triplets that lead to a decrease in computational complexity and
an increase in learning speed without a significant loss in performance. Third, an
approach devoted to simultaneous RS image compression and indexing is introduced
for scalable CBIR. This approach characterizes hash codes of RS images in a learning-based
compression domain and thus removes the need to decode images
prior to CBIR, which can save a significant amount of time. Fourth, we propose an
approach for IRL when training data includes noisy labels. By integrating generative
reasoning into discriminative reasoning, our approach models the complementary
characteristics of discriminative and generative reasoning, and thus prevents the
interference of noisy labels during training. Fifth, a multitask learning approach is
introduced to achieve IRL when multiple learning tasks are jointly utilized. Owing to
its loss functions and sequential optimization algorithm, this approach preserves the
plasticity of each task and the stability between consecutively learned tasks. For
benchmarking the proposed methods, we introduce a large-scale multi-modal multi-
label benchmark RS image archive (denoted as BigEarthNet). It includes 590,326
pairs of Sentinel-1 and Sentinel-2 image patches acquired over 10 European countries.
We make BigEarthNet, its pre-trained DL models and the code of all the methods
publicly available as open-source contributions of the thesis.
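As a small illustration of the triplet-based similarity learning mentioned in the abstract, the sketch below computes a standard margin-based triplet loss on toy embeddings. It is not the triplet sampling method proposed in the thesis; the function names, the Euclidean distance, the margin value and the toy vectors are all assumptions chosen only for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

# Toy 2-D embeddings (hypothetical values, not taken from the thesis).
anchor = [0.0, 0.0]
# Positive is close (distance 1), negative is far (distance 5).
easy = triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0])
# Positive is far (distance 5), negative is close (distance 1).
hard = triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 1.0])

print(easy)  # 0.0: the margin is already satisfied
print(hard)  # 5.0: the margin is violated
```

A triplet with zero loss does not change the embedding, which is one reason why restricting training to a small set of informative triplets can lower the computational cost.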
Zusammenfassung
Advances in remote sensing (RS) technologies have increased the availability of
imagery acquired by satellite-borne and airborne sensors; at the same time, the free
release of datasets gives researchers access to extensive archives of Earth observation
data. This creates potential for deep learning (DL) based representation learning
(RL) studies for automatic knowledge discovery from these archives. However,
existing methods have limitations with respect to: i) the accurate characterization
of the semantic content and spectral information of RS images; ii) the correct use
of RS images with multiple labels during training; iii) time-efficient and scalable
information extraction; iv) effective RL under erroneous training labels; and v) the
combined use of multiple learning tasks to describe image content. This thesis aims
to develop DL-based RL methods to address these shortcomings, with particular
attention to image scene classification and content-based image retrieval (CBIR).
The first contribution of this thesis is the development of a multi-label classification
approach for accurately describing the complex spatial and spectral content of
high-resolution RS images. As the second contribution, we propose an image triplet
sampling method for RL. It is based on the characterization of image similarities,
which are fundamental to CBIR. Among the training images, the method selects a
small number of diverse anchors together with relevant, hard and diversified positive
and negative images, which lead to lower computational complexity without a
significant loss in performance. In the third contribution, an approach for simultaneous
RS image compression and indexing for scalable CBIR is presented. Our approach
characterizes hash codes of RS images in a learning-based compression domain and
thus avoids decoding images before CBIR, which can save a considerable amount of
time. As the fourth contribution, we propose an approach for RL for the case in which
the training data contain erroneous labels. By combining generative and discriminative
modelling, our approach exploits their complementary properties to prevent
interference from erroneous labels during training. In the fifth contribution, a
multi-task learning approach is introduced in which multiple learning tasks are used
in combination. Owing to its loss functions and its sequential optimization algorithm,
this approach preserves the plasticity for each individual learning task and the
stability between consecutive learning tasks. For benchmarking the proposed methods,
the final contribution of this thesis is the creation of BigEarthNet, the first
large-scale multi-modal multi-label benchmark archive in RS. We make BigEarthNet,
its pre-trained DL models and the code of all methods publicly available as
open-source contributions of the dissertation.
Acknowledgements
This thesis has been made possible through the support of numerous people whom I
have had the privilege of meeting. First, I am deeply grateful to my supervisor, Prof.
Dr. Begüm Demir, for her time and encouragement from the first moment of my
PhD studies, and for being an exceptional mentor who was always available to share
her knowledge. Without her priceless guidance, this thesis would not have been
possible.
I would like to thank the members of my doctoral committee, Prof. Dr. Matthias
Boehm, Prof. Dr. Begüm Demir, Prof. Dr. Farid Melgani and Prof. Dr. Claudio
Persello, for their interest in my studies and for their helpful feedback and thoughtful
comments, which have helped me to improve my thesis significantly.
I also want to thank my colleagues from the Remote Sensing Image Analysis (RSiM)
Group of TU Berlin, Genc Hoxha, Minh Tai Le, Mahdyar Ravanbakhsh, Barış Büyüktaş,
Yeti Gürbüz, Bernhard Föllmer, Lars Möllenbrok, Akshara Preethy Byju, Tristan
Kreuziger, Kai Norman Clasen, Martin Hermann Paul Fuchs, Tom Burgert, Huma
Ghani Zada, Ahmet Kerem Aksoy, Georgii Mikriukov, Steve Ahlswede, Sayantan Sen-
gupta, Adina Zell, Leonard Hackel, David Mickisch, Jakob Hackstein, Julia Henkel,
Adela Westedt, Martha Domhöfer, Tim Siebert, Theresa Follath and Kiril Murschel
for their help and all the enjoyable moments in Berlin.
Last but most important, I would like to express my sincere and deepest gratitude to
my wife, Kimya, for her continuous support and encouragement despite innumerable
sacrifices, for believing in my success all the time, for being my best friend and
companion in my life, and most significantly for her inestimable love. I would not be
where I am today without her.
This thesis is supported by the European Research Council (ERC) through the ERC-
2017-STG BigEarth Project under Grant 759764.
Contents
Abstract
Zusammenfassung
Acknowledgements
Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Objectives and Novel Contributions of the Thesis
1.2 List of Publications
1.2.1 Contributions of the Thesis
1.2.2 Additional Contributions
1.3 Structure of the Thesis
2 BigEarthNet: A Large Scale Benchmark Archive for Remote Sensing Image Representation Learning
2.1 Introduction
2.2 Limitations of Existing Archives
2.3 BigEarthNet: A Large-Scale Benchmark Archive
2.4 Experimental Design
2.5 Experimental Results
2.5.1 Comparison with Transfer Learning from ImageNet
2.5.2 Comparison of State-of-the-Art CNN Models
2.6 Conclusion
3 A Deep Multi-Attention Driven Approach for Multi-Label Remote Sensing Image Classification
3.1 Introduction
3.2 Proposed Approach
3.2.1 Spatial and Spectral Characterization of Local Areas
3.2.2 Definition of a Multi-Attention Driven Global Descriptor
3.2.3 Classification of RS Image Scenes with Multi-Labels
3.3 Dataset Description and Experimental Design
3.3.1 Dataset Description
3.3.2 Experimental Design
3.4 Experimental Results
3.4.1 Sensitivity Analysis of the Proposed Approach
3.4.2 Comparison Among the Existing Approaches
3.5 Conclusion
4 Remote Sensing Image Similarity Learning Through Informative and Representative Triplets for Multi-Label Image Retrieval
4.1 Introduction
4.2 Related Works
4.3 Proposed Method
4.3.1 Problem Formulation
4.3.2 Diverse Anchor Selection
4.3.3 Relevant, Hard and Diverse Positive-Negative Image Selection
4.4 Dataset Description and Experimental Design
4.4.1 Dataset Description
4.4.2 Experimental Design
4.5 Experimental Results
4.5.1 Sensitivity Analysis of the Proposed Method
4.5.2 Ablation Study
4.5.3 Comparison with Different Triplet Sampling Methods
4.5.4 Comparison with the State-of-the-Art DML Approaches
4.6 Conclusion
5 Towards Simultaneous Image Compression and Indexing for Scalable Content-Based Retrieval in Remote Sensing
5.1 Introduction
5.2 Related Works
5.3 Proposed SCI-CBIR Approach
5.3.1 First Step: DL-Based Compression
5.3.2 Second Step: Deep Hashing-Based Indexing
5.3.3 Multi-Stage Learning Procedure
5.4 Dataset Description and Experimental Design
5.4.1 Dataset Description
5.4.2 Experimental Design
5.5 Experimental Results
5.5.1 Sensitivity Analysis of the Proposed SCI-CBIR Approach
5.5.2 Comparison with Standard Approaches
5.6 Conclusion
6 Generative Reasoning Integrated Label Noise Robust Deep Image Representation Learning in Remote Sensing
6.1 Introduction
6.2 Related Works
6.3 Proposed Approach
6.3.1 Basics on Discriminative Reasoning
6.3.2 Integration of Generative Reasoning
6.3.3 Label Noise Robust Hybrid Representation Learning
6.4 Dataset Description and Experimental Design
6.4.1 Dataset Description
6.4.2 Experimental Design
6.5 Experimental Results
6.5.1 Sensitivity Analysis of the Proposed Approach
6.5.2 Ablation Study of the Proposed Approach
6.5.3 Comparison Among the State-of-the-Art Methods
6.6 Conclusion
7 Plasticity-Stability Preserving Multi-Task Image Representation Learning in Remote Sensing
7.1 Introduction
7.2 Related Works
7.2.1 Single-Task Driven Methods
7.2.2 Multi-Task Driven Methods
7.3 Proposed Approach
7.3.1 Plasticity Preservation
7.3.2 Stability Preservation
7.3.3 Sequential Optimization Algorithm
7.4 Dataset Description and Experimental Design
7.4.1 Dataset Description
7.4.2 Experimental Design
7.5 Experimental Results
7.5.1 Sensitivity Analysis of the Proposed Approach
7.5.2 Comparison with Existing Methods
7.6 Conclusion
8 Conclusion and Outlook
8.1 Conclusion
8.2 Future Research Directions
Bibliography
List of Figures
2.1 An example of BigEarthNet image pairs and their multi-labels.
2.2 An example of the Sentinel-2 image patches of BigEarthNet that are fully covered by seasonal snow, cloud and cloud shadow.
2.3 An example of a query pair from the BigEarthNet archive and retrieved image pairs obtained by using: 1) direct learning from BigEarthNet; and 2) transfer learning from ImageNet in the framework of content-based multi-modal multi-label image retrieval.
3.1 Block diagram of the proposed approach for multi-label RS image scene classification.
3.2 The proposed $K$-Branch CNN introduced in the first step of the proposed approach. One local area is highlighted as an example to feed into the corresponding CNN.
3.3 Single LSTM cell with its inputs, gates and cell state followed by two LSTM cells in a sequence. Without loss of generality, a particular sequence of the LSTM network (which starts with the first local area and ends with the last local area) is chosen in the figure.
3.4 Proposed multi-attention strategy with bidirectional LSTM networks for the second step of the proposed approach.
3.5 Detailed illustration of the three main steps of the proposed approach: (a) spatial and spectral characterization of local areas; (b) definition of a multi-attention driven global descriptor; (c) RS image scene classification with multi-labels.
3.6 An example of the BigEarthNet-S2 images with the true multi-labels and the multi-labels assigned by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the proposed approach.
4.1 An example of three triplets consisting of images from BigEarthNet-S2. Each triplet given in different rows consists of an anchor (in blue frame), a positive image (in green frame), and a negative image (in red frame). The associated multi-labels are given below the respective images.
4.2 An abstract representation of triplet selection and the progress of the feature space update. Blue arrows indicate reducing distances for updating the embedding, while red arrows indicate increasing the distances. $X_a$ marks a chosen anchor, $P_1$, $P_2$, and $P_3$ are positive images, and $N_1$, $N_2$, and $N_3$ are negative images in different triplets. The triplet $(X_a, P_1, N_1)$ is trivial because it already satisfies the margins, and thus the corresponding distances are not updated. The triplet $(X_a, P_2, N_2)$ leads to a relatively small error and the images are pushed and pulled a little. The triplet $(X_a, P_3, N_3)$ violates the margin greatly and causes a significant error. $P_3$ is a positive image, but very far from the anchor, so it is considered a hard positive image. $N_3$ is, correspondingly, a hard negative image.
4.3 A block scheme of the proposed triplet sampling method to drive the training phase of a DNN for multi-label CBIR problems.
4.4 An example of images from the UCMerced Land Use archive and the multi-labels associated with them: (a) sand, sea (b) airplane, cars, grass, pavement (c) bare-soil, buildings, grass (d) buildings, cars, pavement, trees.
4.5 An image retrieval example: (a) query image; (b) images retrieved by TNDML; (c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method (IRS-BigEarthNet archive).
4.6 An image retrieval example: (a) query image; (b) images retrieved by TNDML; (c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method (UCMerced archive).
4.7 $F_1$ scores obtained by different triplet sampling strategies and the number of accumulated triplets during the training (UCMerced archive).
5.1 Illustration of the proposed SCI-CBIR approach.
5.2 Multi-scale structural similarity index (MS-SSIM) in dB versus bpp obtained by the proposed SCI-CBIR approach, IC-RNN and JPEG2000 for (a) BigEarthNet-S2 and (b) MLRSNet archives.
5.3 An RS image compression example: (a) original image; reconstructed image at 0.7 bits per pixel (bpp) by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach (BigEarthNet-S2 archive).
5.4 An RS image compression example: (a) original image; reconstructed image at 0.3 bpp by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach (MLRSNet archive).
5.5 MAP versus bpp obtained by the proposed SCI-CBIR approach and SI-CBIR for (a) BigEarthNet-S2 and (b) MLRSNet archives.
5.6 (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.62 bpp; (d) the proposed SCI-CBIR at 0.78 bpp; (e) the proposed SCI-CBIR at 1.05 bpp; and (f) the proposed SCI-CBIR at 1.56 bpp (BigEarthNet-S2 archive).
5.7 (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.33 bpp; (d) the proposed SCI-CBIR at 0.56 bpp; (e) the proposed SCI-CBIR at 0.69 bpp; and (f) the proposed SCI-CBIR at 0.85 bpp (MLRSNet archive).
6.1 An illustration of the training of our GRID approach that jointly leverages the robustness of generative reasoning towards noisy labels and the effectiveness of discriminative reasoning on image representation learning. During the forward pass on a mini-batch $B$, the loss values $O_d(B)$, $O_g(B)$ and the predicted labels $\hat{Y}_d$, $\hat{Y}_g$ are obtained through discriminative and generative reasoning for a given learning task. Then, the set $W$ of training samples with noisy labels (i.e., noisy samples) and the set $C$ of training samples with correct labels (i.e., clean samples) are constructed through our automatic noisy sample detection procedure based on the values of the loss function $L$ associated with the learning task. During the backward pass, the model parameters except the CNN backbone parameters are updated with all samples based on $\gamma O_d(B)$ and $\beta O_g(B)$. The parameters of the CNN backbone are updated through: i) the generative task head for the noisy samples based on $\theta O_g(W)$; and ii) the discriminative task head for the clean samples based on $\theta O_d(C)$.
6.2 Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive).
6.3 Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive).
6.4 Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive).
6.5 Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive).
6.6 Results obtained by using: 1) discriminative reasoning; 2) generative reasoning; 3) their standard joint learning; and 4) our label noise robust hybrid representation learning strategy for different values of SLNIR when RS IRL is achieved by: i) multi-label classification on (a) DLRSD and (b) BigEarthNet-S2; ii) semantic segmentation on (c) DLRSD and (d) BigEarthNet-S2; and iii) multi-label co-occurrence prediction on (e) DLRSD and (f) BigEarthNet-S2.
7.1 An illustration of the training of the proposed plasticity-stability preserving multi-task learning (PLASTA-MTL) approach when two tasks $T_1$ and $T_2$ are considered. Standard and plasticity preservation backward passes for (a) $T_1$, and (c) $T_2$ are shown, while the changes over the gradient vectors (b) $GL_{T_1}$ and (d) $GL_{T_2}$ during the plasticity preservation of these tasks are visualized. (e) The backward pass for stability preservation of all the tasks is given with (f) the illustration of changes over their gradient vectors.
7.2 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when the tasks $T_1$, $T_2$ and $T_3$ are utilized in different orders for the PLASTA-MTL approach.
7.3 Mean Average Precision (mAP) versus the minimum number of training epochs for the DLRSD archive when the tasks: (a) $T_2$ and $T_3$; (b) $T_1$, $T_2$ and $T_3$; and (c) $T_1$, $T_2$, $T_3$ and $T_4$ are utilized for the PLASTA-MTL approach and the equal weighting method.
7.4 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when the tasks: (a) $T_1$ and $T_4$; (b) $T_2$ and $T_3$; (c) $T_1$, $T_2$ and $T_4$; (d) $T_2$, $T_3$ and $T_4$; and (e) $T_1$, $T_2$, $T_3$ and $T_4$ are used in the context of multi-task learning.
7.5 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the BigEarthNet-S2 archive when the tasks: (a) $T_2$ and $T_3$; (b) $T_2$ and $T_4$; (c) $T_1$, $T_3$ and $T_4$; (d) $T_2$, $T_3$ and $T_4$; and (e) $T_1$, $T_2$, $T_3$ and $T_4$ are used in the context of multi-task learning.
7.6 (a) Query image; and images retrieved by using (b) equal weighting; (c) uncertainty weighting; (d) PCGrad; (e) GradNorm; (f) DWA; (g) the proposed PLASTA-MTL approach when the tasks $T_1$, $T_2$, $T_3$ and $T_4$ are utilized for the DLRSD archive.
List of Tables
2.1 A List of Existing RS Image Archives
2.2 The list of classes within CLC and proposed class nomenclatures and their associated numbers of image pairs. These numbers are obtained after eliminating Sentinel-2 image patches that are fully covered by seasonal snow, cloud, and cloud shadow.
2.3 Class-based $F_2$ Scores (%) obtained when: i) transfer learning from ImageNet and ii) direct learning from BigEarthNet are used for multi-modal multi-label image classification.
2.4 Overall Multi-Modal Multi-Label Classification Results Under Different Metrics and DL Models for BigEarthNet.
3.1 Multi-Label Classification Accuracies and the Number of Required Model Parameters (NP) When Using Local Areas With Different Sizes for the Proposed Approach.
3.2 Results Obtained by the SiB-CNN$_{\text{RGB}}$, the SiB-CNN, the L-SiB-CNN and the Proposed K-Branch CNN.
3.3 Multi-Label Classification Accuracies Obtained by Using Different Steps of the Proposed Approach.
3.4 Results Obtained by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the Proposed Approach Together With the Number of Required Model Parameters (NP).
4.1 The Performance of Different DL Model Architectures for the UCMerced Archive.
4.2 The Effect of Varying Embedding Sizes on the Retrieval Performance for the UCMerced Archive.
4.3 Results obtained by the different anchor selection strategies (RAS, BAS and proposed DAS) under different metrics for the UCMerced archive when proposed RHDIS is used for positive and negative image selection.
4.4 Results obtained by the different positive and negative image selection strategies (RIS, BIS and proposed RHDIS) under different metrics for the UCMerced archive when proposed DAS is used for anchor selection.
4.5 The performance of different triplet selection methods for the IRS-BigEarthNet and UCMerced archives.
4.6 The performance of different deep metric learning methods for the IRS-BigEarthNet and UCMerced archives.
5.1 Results Obtained by Proposed SCI-CBIR For Different Values of $\eta_C^3$ When the First Two Stages of Our Learning Procedure are Achieved at Different Bit-rates (BigEarthNet-S2 Archive)
5.2 Results Obtained by Proposed SCI-CBIR with and without the Attention Layer When the First Two Stages of Our Learning Procedure are Achieved at Different Bit-rates (BigEarthNet-S2 Archive)
5.3 Results Obtained by Proposed SCI-CBIR under Different Activation Functions (The BigEarthNet-S2 Archive)
5.4 Results Obtained by Proposed SCI-CBIR For Different Automatic Loss Weighting Techniques (BigEarthNet-S2 Archive)
5.5 Results Obtained by Proposed SCI-CBIR for Different Values of $q$.
5.6 Retrieval Time per Image (in milliseconds) Obtained by SI-CBIR and the Proposed SCI-CBIR Approach
5.7 Results Obtained by Proposed SCI-CBIR Trained with Our Multi-Stage Learning Procedure and Standard Learning Procedure Associated to Similar Bit-Rates (The BigEarthNet-S2 Archive)
6.1 Results (%) Obtained by the Proposed GRID (BCE) Approach for Different Values of $\lambda$ and SLNIR (%) (DLRSD archive)
6.2 Results (%) Obtained by the Proposed GRID (BCE) Approach for Different Values of $\lambda$ and SLNIR (%) (BigEarthNet-S2 archive)
6.3 Results (%) Obtained by the Proposed GRID (PCE) Approach for Different Values of $\lambda$ and SLNIR (%) (DLRSD archive)
6.4 Results (%) Obtained by the Proposed GRID (PCE) Approach for Different Values of $\lambda$ and SLNIR (%) (BigEarthNet-S2 archive)
6.5 Results (%) Obtained by BCE, ELR [196], FL [195], ASL [198], JoCoR [197] and the Proposed GRID (BCE) Approach Under Different Values of SLNIR (%) (DLRSD archive)
6.6 Results (%) Obtained by BCE, ELR [196], FL [195], ASL [198], JoCoR [197] and the Proposed GRID (BCE) Approach Under Different Values of SLNIR (%) (BigEarthNet-S2 archive)
6.7 Results (%) Obtained by PCE, LNC [31], RRL [79] and the Proposed GRID (PCE) and GRID (RRL) Approaches Under Different Values of SLNIR (%) (DLRSD archive)
6.8 Results (%) Obtained by PCE, LNC [31], RRL [79] and the Proposed GRID (PCE) and GRID (RRL) Approaches Under Different Values of SLNIR (%) (BigEarthNet-S2 archive)
7.1 Mean Average Precision (mAP) Scores Associated With the Different Combinations of Tasks When Different Capabilities of the PLASTA-MTL Approach are Utilized (The DLRSD Archive)
7.2 Mean Average Precision (mAP) Scores When the Tasks $T_1$, $T_2$ and $T_3$ are Utilized in Different Orders for the PLASTA-MTL Approach (The DLRSD Archive)
7.3 Training Times per Epoch on the DLRSD Archive When the Different Combinations of Tasks are Utilized for the Proposed PLASTA-MTL Approach and Equal Weighting.
7.4 Mean Average Precision (mAP) Scores When the Different Combinations of Tasks are Utilized in the PLASTA-MTL Approach Compared to Single Task Learning (The DLRSD Archive)
7.5 Mean Average Precision (mAP) Scores Associated With the Different Combinations of Tasks (The DLRSD Archive)
7.6 Mean Average Precision (mAP) Scores Associated With the Different Combinations of Tasks (The BigEarthNet-S2 Archive)
List of Abbreviations
AHCL Asymmetric Hash Code Learning
ASL Asymmetric Loss
BAS Batch Anchor Selection
BCE Binary Cross Entropy
BIS Batch Positive and Negative Image Selection
BigEarthNet-S1 Sentinel-1 Image Patches of BigEarthNet
BigEarthNet-S2 Sentinel-2 Image Patches of BigEarthNet
CA-LSTM Class-Wise Attention-Based Convolutional and Bidirectional LSTM Network
CBIR Content Based Image Retrieval
CLC CORINE Land Cover
CNN Convolutional Neural Network
CV Computer Vision
DAS Diverse Anchor Selection
DATL Dual Anchor Triplet Loss
DHCNN Deep Hashing Convolutional Neural Network
DHNN Deep Hashing Neural Network
DL Deep Learning
DML Deep Metric Learning
DNN Deep Neural Network
DWA Dynamic Weight Average
DenseNet Densely Connected Convolutional Network
ELBO Evidence Lower Bound
ELR Early-Learning Regularization
EO Earth Observation
FL Focal Loss
GRID Generative Reasoning Integrated Label Noise Robust Deep
Representation Learning
GradNorm Gradient Normalization for Adaptive Loss Balancing
IRL Image Representation Learning
JoCoR Joint Training with Co-Regularization
LNC High-Resolution Land Cover Mapping through Learning with
Noise Correction
LSTM Long Short-Term Memory
LULC Land Use Land Cover
mAP Mean Average Precision
MS-SSIM Multi-Scale Structural Similarity Index Metric
MSE Mean Squared Error
MSL Multi Similarity Loss
MTL Multi Task Learning
MiLaN Metric Learning-Based Deep Hashing Network
NDCG Normalized Discounted Cumulative Gains
NPL N-Pair Loss
PCGrad Projecting Conflicting Gradients
PLASTA-MTL Plasticity-Stability preserving Multi-Task Learning
PPL Plasticity Preserving Loss
RAS Random Anchor Selection
RHDIS Relevant, Hard and Diverse Positive and Negative Image
Selection
RIS Random Positive and Negative Image Selection
RNN Recurrent Neural Network
RRL Region Representation Learning Loss
RSDML Enhancing Remote Sensing Image Retrieval using a Triplet Deep
Metric Learning Network
RS Remote Sensing
ResNet Residual Network
SAR Synthetic Aperture Radar
SCI-CBIR Simultaneous Remote Sensing Image Compression and Indexing
for Scalable Content Based Image Retrieval
SLNIR Synthetic Label Noise Injection Ratio
SPL Stability Preserving Loss
SSHAAE Semi-Supervised Hashing Adversarial Autoencoder
STL Single-Task Learning
TNDML Deep Metric Learning Using Triplet Network
VAE Variational Auto-Encoder
VGG Very Deep Convolutional Networks
VGI Volunteered Geographic Information
To the love of my life, Kimya...
Chapter 1
Introduction
Unprecedented advances in satellite technology have resulted in regular, frequent,
and high-resolution monitoring of the Earth's surface, producing fast-growing Earth
observation (EO) data archives. As an example of the exceptionally fast growth rate
of these archives, the volume of data published through the Copernicus programme
(the European flagship satellite initiative with its Sentinel missions) during
2021 alone reached more than 7 PiB [1]. The rising operational capability of such
monitoring provides abundant information on the status of our planet. Accordingly,
EO data, especially from recent passive multispectral and active synthetic aperture
radar (SAR) instruments, plays a crucial role in overcoming the most pressing
global societal challenges, e.g., those defined by the Sustainable Development Goals
[2]. The Sentinel-2 mission, for instance, has been acquiring high-resolution multispectral
images characterized by 10 to 60 m spatial resolution, 13 spectral bands and a revisit
time of 10 days since 2015, while the Sentinel-1 mission has been providing C-band
SAR images with up to 5 m spatial resolution and a revisit time of 6 days since 2014.
Due to the open EO data access policies of recent satellite missions, most of the
remote sensing (RS) image archives are publicly available to researchers. This carries
a huge potential for climate change analysis, urban area studies, forestry applications,
emergency management for disaster relief efforts, water quality assessment, crop
monitoring, etc. To extract relevant information from such huge and ever-growing
RS image archives on a large scale that can have a substantial impact on societal
challenges, data-driven approaches are a crucial prerequisite.
From the virtuous circle between the tremendous growth of available data and the
investigations in computer science over the last decades, machine learning, notably
deep learning (DL), has emerged as the most promising breakthrough among data-driven
approaches. These advances have also enabled huge leaps in modeling and analyzing RS
images due to several advantages of DL-based methods compared to their conventional
counterparts. DL-based methods make it possible to automatically learn RS image
representations with multiple levels of abstraction by training deep neural networks
(DNNs) exclusively on data [3]. By relying on a huge amount of EO data, DL-based
image representation learning (IRL) becomes capable of modeling higher-level RS
image semantics and their complex patterns beyond regional borders. Today, by
near-universal consensus, DL-based approaches are revolutionizing the way
we address challenges for IRL in RS. Thus, they carry a huge potential for automatic
knowledge discovery from massive EO image archives on a large scale.
DL-based IRL is generally achieved in a supervised way by optimizing
a loss function on a training set according to the characteristics of a learning task
(e.g., single/multi-label classification, semantic segmentation, etc.). To this end, the
considered DNN typically includes an image encoder (i.e., a CNN backbone) and
a task head consisting of fully connected or convolutional layers (which branches
out from the image encoder). The loss function is selected on the basis of the
characteristics of the considered learning task, and the model parameters of the
considered DNN are automatically learned during the optimization of this function.
Most of the DL-based IRL methods in RS utilize the following learning tasks: 1) scene
classification [4]–[14]; 2) similarity learning [15]–[27]; 3) image reconstruction [28],
[29]; 4) semantic segmentation [30], [31]; and 5) image captioning [32], [33]. Each
learning task has different objectives that lead to different optimization procedures
throughout the training of the considered DNN. Accordingly, learned image
representations have different characteristics for different learning tasks, and thus carry
different information to be utilized in the final EO application. As an example, when
the learning task is scene classification, RS image representations can be learned
by optimizing entropy-based loss functions. In this way, image representations are
encoded to separate pre-defined classes, which maximizes inter-class distances in the
image representation space. For the similarity learning task, on the other hand,
image representations are learned to discriminate dissimilar RS images, which minimizes
intra-class distances in the image representation space [34]. This can be achieved by
employing siamese CNNs on tuples of RS images to optimize triplet or contrastive
loss functions. If the task is image reconstruction, auto-encoder neural
networks can be used first to construct the representations and then to recover RS
images with a reconstruction loss. Once the model parameters of the DNN are learned
on a training set, they are utilized to obtain either image features or the predictions
of the task head from large-scale RS image archives for a final EO application. As
an example, for an EO application that requires assigning land-use/land-cover class
labels to RS images, class probabilities of a given RS image obtained by the task head
can be directly used to associate it with class labels. If an EO application performs
content-based image retrieval (CBIR), which aims to search for RS images similar
to a query image based on their semantic content, image representations of an RS
image archive obtained by the image encoder can be compared to that of the query
image to find similar images.
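The retrieval step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the representations are hypothetical 3-dimensional vectors standing in for the output of an image encoder, the image names are invented, and cosine similarity is one common choice for comparing representations.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, archive, k=2):
    """Return the names of the k archive images most similar to the query."""
    ranked = sorted(
        archive,
        key=lambda name: cosine_similarity(query, archive[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical representations (assumed to come from an image encoder).
archive = {
    "forest_a": [0.9, 0.1, 0.0],
    "forest_b": [0.8, 0.2, 0.1],
    "urban_a": [0.1, 0.9, 0.2],
    "water_a": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
top = retrieve(query, archive, k=2)
print(top)
```

Such an exhaustive comparison scales linearly with the archive size, which is one reason why compact indexing becomes relevant for large archives.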
We would like to note that automatic knowledge discovery from massive EO image
archives requires employing DL-based IRL methods on a large scale. RS
image scene classification and CBIR have been among the most prominent solutions
in this regard. Accordingly, the development of IRL methods devoted to image scene
classification and CBIR problems has attracted great attention in the RS community. Most
of these methods assume that each training image is annotated by a single (broad
category) label, which is associated with the most significant content of the image.
However, RS images typically contain areas with a high variety of semantically
complex content that must be described by more than one class annotation through
multiple class labels (multi-labels). Thus, DL-based IRL methods that properly
exploit training images annotated by multi-labels have recently been found very promising
for RS images in the framework of image scene classification and CBIR.
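To make the notion of multi-labels concrete, the following sketch encodes a set of class labels as a multi-hot target vector and evaluates a binary cross entropy loss over per-class probabilities. It is only an illustration: the class list and the probability values are hypothetical, and the loss is the generic per-class formulation rather than a specific loss configuration from this thesis.

```python
import math

# Hypothetical class nomenclature, for illustration only.
CLASSES = ["forest", "water", "urban", "cropland"]


def multi_hot(labels):
    """Encode a set of class labels as a multi-hot target vector."""
    return [1.0 if c in labels else 0.0 for c in CLASSES]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over classes (one sigmoid output per class)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


# An image annotated with two labels at once.
target = multi_hot({"forest", "water"})
good = binary_cross_entropy([0.9, 0.8, 0.1, 0.2], target)
bad = binary_cross_entropy([0.1, 0.2, 0.9, 0.8], target)
print(target, round(good, 3), round(bad, 3))
```

Each class is treated as an independent binary decision, so an image can be associated with any number of labels.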
To employ DL-based IRL methods for scene classification problems based on images
annotated by multi-labels, attention-based DNNs have attracted great attention
in RS, e.g., the class-wise attention-based recurrent neural network [35], the attention-aware
label relational reasoning network [36], and the encoder-decoder based deep attention
neural network [37]. The attention strategies proposed in [35], [36] and [37] identify
informative areas of images through an attention map based on the feature maps
of convolutional layers. These strategies are effective for very high resolution aerial
images; however, they can be insufficient for accurately describing the complex
content of satellite RS images with high spatial resolution (e.g., Sentinel-2 and Landsat
multi-spectral images). Results obtained on very high resolution aerial images
with only RGB bands show the success of these strategies for the description of the
spatial image content. A direct adaptation of these methods to high-dimensional RS
images may lead to an incomplete representation of the spectral information content.
These issues are critical particularly for images with several spectral bands with
varying spatial resolutions acquired by the new generation of satellites (e.g., Sentinel-2).
Thus, methods that can efficiently and effectively describe the spatial and spectral
information content of high-dimensional RS images are needed in the framework of
multi-label RS image scene classification.
In the context of DL-based IRL for CBIR problems, recent years have witnessed
increasing attention to deep metric learning (DML) based methods that aim at
learning a representation space in which similar images are located close to each
other. Such methods are mostly trained using a triplet loss function defined over three
images: i) an anchor image; ii) a positive image that is similar to the anchor; and
iii) a negative image that is dissimilar to the anchor [38]. A difficult task in DML is to
construct the set of triplets. A simple strategy is to define triplets from an existing
training set of labeled images. In [27], triplets are constructed as follows: i) an anchor
is randomly chosen from a mini-batch of training images; and then ii) one positive
image that has the same class label as the anchor and one negative image that has
a different class label are randomly chosen. For each anchor image, there can
be several positive and negative images. Thus, random selection does not guarantee
the selection of the most representative and informative images with respect to the anchor and
can result in the construction of so-called trivial triplets. It is worth noting that one
can also exploit all the images in the mini-batch to construct triplets, as suggested
in [15]. However, this choice significantly increases the total number of triplets and
thus the computational complexity of the training phase of the CBIR system [39], [40].
To overcome the limitation of random selection, DML methods that evaluate the
hardness of images during the sampling process have been introduced in the computer vision (CV) literature.
Most of the triplet sampling methods in CV rely on single-label image annotations
to decide which images are positive or negative for a given anchor image. From
the DML perspective, the selection of triplets from training images annotated by
multi-labels is more complex than that from training images annotated by single labels.
To achieve accurate DML in multi-label RS CBIR, methods that accurately select a set
of triplets from multi-label training images are needed.
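The role of triplet selection can be illustrated with the standard margin-based triplet loss. The sketch below is a generic formulation with hypothetical 2-dimensional representations and an assumed margin of 1.0; it is not the sampling method proposed in this thesis. It shows why a triplet with a distant negative is trivial: its loss is already zero, so it contributes nothing to training.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical representations of an anchor, a positive and two negatives.
anchor = [0.0, 0.0]
positive = [0.5, 0.0]
easy_negative = [3.0, 0.0]  # already far from the anchor
hard_negative = [1.0, 0.0]  # closer than the margin allows

trivial = triplet_loss(anchor, positive, easy_negative)
informative = triplet_loss(anchor, positive, hard_negative)
print(trivial, informative)
```

Only the second triplet yields a non-zero loss, which is why sampling strategies that account for the hardness of images tend to be more effective than purely random selection.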
For large-scale CBIR, fast and accurate indexing methods that allow approximate
nearest neighbor search are fundamental. In this perspective, hashing-based indexing
has recently attracted attention to solve the large-scale approximate nearest neighbor
search problems for CBIR due to its high efficiency (in terms of both storage and
speed) and accurate search capability within huge image archives. DL-based hashing
methods map high-dimensional image representations into compact binary hash
codes while simultaneously optimizing IRL and hash code learning. Then, CBIR
can be achieved by calculating the Hamming distances with simple bit-wise XOR
operations [41]. Several DL-based hashing methods have been presented in RS [20], [21], [27],
[42]–[46], which are potentially effective for CBIR in RS. It should be noted that in massive EO
archives, RS images are usually stored in a compressed format to reduce their storage
sizes [47]. Thus, image decoding (i.e., decompression) is required before applying any
DL-based hashing method. This is computationally demanding and impractical in
the case of large-scale CBIR problems. However, there is no hashing-based indexing
method in RS that can be applied in the compressed domain efficiently and effectively.
Accordingly, to achieve scalable CBIR in massive RS image archives, DL-based IRL
methods that can jointly characterize RS image representations and hash codes while
effectively compressing them are needed.
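The Hamming-distance search mentioned above can be sketched as follows. The 8-bit hash codes and image names are hypothetical, and the codes are assumed to have been produced beforehand by some hashing method; the sketch only shows how a bit-wise XOR followed by a bit count gives the Hamming distance.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two hash codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def hamming_search(query_code, archive_codes, radius):
    """Return the images whose hash codes lie within the given Hamming radius."""
    return [
        name
        for name, code in archive_codes.items()
        if hamming_distance(query_code, code) <= radius
    ]


# Hypothetical 8-bit hash codes of archive images.
archive_codes = {
    "img_1": 0b10110010,
    "img_2": 0b10110011,
    "img_3": 0b01001101,
}
query_code = 0b10110000
hits = hamming_search(query_code, archive_codes, radius=2)
print(hits)
```

Because each comparison reduces to an XOR and a bit count on a compact code, both the storage cost per image and the cost per comparison remain small.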
Most of the DL-based IRL methods require a huge amount of annotated RS images
during training to adjust the model parameters of the considered DNN and reach
high performance. The availability and quality of such data determine the feasibility
of these methods. The process of collecting, preparing, and annotating RS images
on a large scale to create sufficiently large high-quality archives to drive DL-based
studies is time-consuming, complex, and costly in operational scenarios. Therefore,
most researchers rely on existing benchmark archives to employ and develop DL-based
methods. However, there are only a few publicly available benchmark archives
in RS. Most of the existing archives feature a relatively small volume of images,
which is a limitation for DL-based studies due to the above-mentioned reasons. To
overcome this problem, a common approach is to exploit DNN models that are
pre-trained on publicly available general-purpose CV datasets. However, this is not
a viable approach in RS due to the differences in image characteristics between CV and
RS. As an example, Sentinel-2 images have 13 spectral bands associated with varying
and lower spatial resolutions with respect to CV images. Moreover, RS benchmark
archives mostly contain single-label image annotations, i.e., each image is annotated
by a single high-level land-use category label. However, as discussed above, the content of RS
images must be described by more than one class annotation through low-level class
labels (i.e., multi-labels). Thus, a benchmark archive consisting of images annotated
with multi-labels is required. The lack of large-scale publicly available benchmark
archives of RS images with multi-labels prevents the widespread adoption of DL
models in RS applications, even though raw data and potential applications do
exist. In addition, most of the existing publicly available benchmark archives contain
single-modal RS images (e.g., multispectral or SAR). However, multi-modal images
associated with the same geographical area allow for a rich characterization of RS
images and thus improve image representation learning when jointly considered [48].
Thus, a large-scale benchmark archive consisting of multi-modal RS images annotated
with multi-labels is needed for DL-based IRL methods in RS.
In addition to the above-mentioned alternatives for obtaining a high quantity of
annotated training images, publicly available thematic maps (e.g., the CORINE Land
Cover inventory [49]), automatic labeling procedures, or volunteered geographic
information (VGI) as crowdsourced data can also be used in RS. These strategies
provide RS image annotations at zero cost. However, the considered thematic map or
VGI source can be outdated with respect to RS images due to possible changes on the
ground, or there can be annotation errors. Thus, these strategies increase the risk of
including noisy labels in training data. Learning RS image representations with noisy
labels may result in overfitting of the considered DNN to noisy labels and a lack of
generalization capability, and thus inaccurate characterization of RS images during
both training and inference [50], [51]. To address this problem, several methods,
mostly in the CV community, have been presented to improve the robustness of IRL when
training data includes noisy labels. All these methods are potentially effective for
DL-based IRL under noisy labels in RS. However, most of them depend on the
type of: i) label noise present in training data; ii) image annotation; iii) loss function
(e.g., cross-entropy, focal loss, etc.); iv) DNN architecture; or v) learning task. Some
methods also require the availability of a subset of the training set with
clean labels, or require computationally demanding noise correction strategies
prior to training. Thus, they may not be directly integrated into different scenarios
associated with IRL in RS. Accordingly, DL-based IRL approaches that allow learning
RS image representations under noisy labels independently of the IRL scenario being
considered are required.
In RS, it is common to employ a single learning task for DL-based IRL. However, using
a single learning task may not be sufficient to describe the complex content of RS
images. To address this issue, multiple learning tasks can be jointly utilized for IRL.
When IRL is achieved based on multiple tasks, the resulting representation space can
better characterize the complex semantic content of RS images. Accordingly, a few
DL-based multi-task learning (MTL) methods have been recently introduced in RS to
learn image representations through the joint optimization of multiple loss functions,
each of which is associated with a learning task [52]–[57]. Due to the complexity of
the MTL problem, it is common that: i) tasks may compete with or even distract each other
during training; ii) one of the tasks may dominate the whole learning procedure; or
iii) each task may be characterized less accurately than in single-task
learning [58]. These problems undermine the effectiveness of the whole representation
learning procedure [59]. These issues occur due to the stability-plasticity constraint
of MTL [60]. MTL methods need to be sensitive to new information learned from
each task, which allows each task to contribute to further improving the
image characterization. This condition is known as plasticity. During the learning
process of a new task, new information encoded in the considered DNN should
not radically disrupt what is already characterized based on the other tasks. This
condition is known as stability. The MTL formulation of the existing methods (which
is based on joint optimization) is limited in its ability to control the learning of each task. Thus, it
does not allow the plasticity and stability of the whole learning procedure to be controlled. It
is also worth noting that, with this formulation, the whole learning procedure is sensitive
to the proper selection of the loss function weight for each task, which generally requires a
grid search (which is computationally demanding) [61]. Thus, MTL methods that
can effectively combine multiple learning tasks without the need to select
loss weights while considering the stability-plasticity problem are needed in RS to
accurately apply IRL.
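The cost of selecting loss weights by grid search can be made concrete with a small sketch. It assumes the common joint formulation in which the overall loss is a weighted sum of the task losses; the candidate weight values and loss values are hypothetical. With c candidate values per task and T tasks, the grid contains c^T weight combinations, each of which would require a separate training run.

```python
from itertools import product


def joint_loss(task_losses, weights):
    """Weighted sum of task losses, as in joint multi-task optimization."""
    return sum(w * l for w, l in zip(weights, task_losses))


def weight_grid(num_tasks, candidates):
    """All weight combinations that a grid search would have to evaluate."""
    return list(product(candidates, repeat=num_tasks))


# Hypothetical candidate weights per task.
candidates = [0.1, 0.5, 1.0]
grid = weight_grid(num_tasks=3, candidates=candidates)
print(len(grid))  # 3 ** 3 = 27 training runs for only three tasks

# Example joint loss for two tasks with hypothetical loss values.
example = joint_loss([2.0, 4.0], [0.5, 0.25])
print(example)
```

The exponential growth in the number of combinations is what makes grid search over loss weights computationally demanding as tasks are added.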
1.1 Objectives and Novel Contributions of the Thesis
The overall aim of this thesis is to develop advanced DL-based IRL methods for
information discovery from massive RS image archives and to construct a large-
scale RS image archive for benchmarking DL-based IRL methods in RS. For the
development of novel methodologies, particular attention is devoted to scene
classification and CBIR problems in RS due to their importance for information
discovery from massive RS image archives, while advanced methods for label-noise
robust and multi-task IRL are proposed independently from the learning task being
considered. To address the main challenges highlighted in the previous section, the
rest of this thesis is divided into six main chapters. In the following, the objectives
and contributions of these chapters are briefly explained.
In Chapter 2, we present a large-scale RS image benchmark archive, aiming to address
the limitations of existing benchmark archives, and thus to provide a high quantity of
annotated training RS images suitable for DL-based IRL methods in RS. To this end,
we introduce BigEarthNet as the first large-scale multi-modal multi-label benchmark
archive in RS that contains 590,326 pairs of Sentinel-2 and Sentinel-1 image patches
acquired over 10 European countries. Each pair in BigEarthNet is annotated with
multi-labels from the CORINE Land Cover (CLC) database of the year 2018. In
this chapter, we also introduce an alternative class nomenclature since some CLC
classes are challenging to describe accurately by considering only (single-date)
BigEarthNet image patches. An experimental analysis on: i) the comparison between
the strategies of IRL directly from RS images of BigEarthNet and transfer learning
from DNNs trained on computer vision images; and ii) several well-known CNN
architectures shows the effectiveness of BigEarthNet for scene classification and CBIR
problems in RS. It is worth noting that we make all the data and the well-known DL
models trained on BigEarthNet publicly available at https://bigearth.net, offering
an important resource to support studies on DL-based IRL in RS.
In Chapter 3, we introduce a novel DL-based IRL approach that aims at accurately
describing complex spatial and spectral content of RS images in the framework of
the multi-label classification of high-dimensional high-spatial resolution RS images.
The capability of the proposed approach is investigated in three consecutive steps:
1) spatial and spectral characterization of image local areas; 2) definition of a multi-
attention driven global descriptor; and 3) classification of RS image scenes with
multi-labels. In the first step, we present a novel branch-wise CNN architecture
(denoted as K-Branch CNN) that efficiently describes the complex content of local
areas of each image by different CNN branches specialized according to the spatial
resolutions of image bands. In the second step, we present a novel multi-attention
strategy in the framework of RNNs that: i) accurately identifies importance levels
(i.e., scores) for different local areas; and then ii) defines a global descriptor for each
image based on these scores. In this chapter, extensive experiments are performed
to analyze the effectiveness of the proposed approach in terms of a sensitivity
analysis and a comparison with existing approaches. We make the code of
the proposed approach publicly available at
https://gitlab.tubit.tu-berlin.de/rsim/MAML-RSIC.
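The idea of forming a global descriptor from importance scores of local areas can be illustrated with a generic attention-weighted pooling sketch. This is not the multi-attention strategy of Chapter 3: here the scores are simply given as inputs (in the thesis they are identified within an RNN framework), the softmax normalization is an assumption made for illustration, and the local descriptors are hypothetical 2-dimensional vectors.

```python
import math


def softmax(scores):
    """Normalize raw importance scores into weights that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_descriptor(local_descriptors, scores):
    """Weighted sum of local-area descriptors using normalized scores."""
    weights = softmax(scores)
    dim = len(local_descriptors[0])
    return [
        sum(w * d[i] for w, d in zip(weights, local_descriptors))
        for i in range(dim)
    ]


# Hypothetical descriptors of two local areas of one image.
local_descriptors = [[1.0, 0.0], [0.0, 1.0]]
uniform = global_descriptor(local_descriptors, scores=[0.0, 0.0])
focused = global_descriptor(local_descriptors, scores=[10.0, 0.0])
print(uniform, focused)
```

With equal scores the global descriptor reduces to the mean of the local descriptors, whereas a dominant score makes it follow the corresponding local area.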
In Chapter 4, we present a novel image triplet sampling method for DL-based IRL
of RS images through the characterization of image similarities, which forms the
foundation for CBIR in RS. The proposed method aims at selecting a small set of the
most representative and informative triplets of multi-label training images based
on two consecutive steps. In the first step, a small number of diverse anchors is
selected based on a simple but efficient iterative algorithm. In the second step,
relevant, hard and diverse positive and negative images with respect to each anchor
are chosen based on a novel strategy. The effectiveness of the proposed method
is theoretically and experimentally investigated in terms of: i) the computational
complexity of the training phase with respect to the CBIR performance; and ii)
the learning efficiency via the convergence speed of the considered DNNs. In addition,
an overview of the existing triplet selection methods and a detailed literature
review on CBIR in RS are provided. It is worth mentioning that we make the code
of the proposed method publicly available at
https://git.tu-berlin.de/rsim/image-retrieval-from-triplets.
In Chapter 5, we propose a novel approach devoted to simultaneous RS image
compression and indexing for scalable CBIR. The proposed approach aims to: i)
jointly characterize representations and hash codes of RS images in the learning-based
compression domain; and thus ii) avoid the need to decode RS
images prior to CBIR. This is achieved by two main steps: i) deep learning-based
compression; and ii) deep hashing-based indexing. The first step applies image
feature extraction and image reconstruction based on a pair of encoder and decoder
DNNs, while a probabilistic entropy model is employed to optimize the length of
the compressed bitstreams. The second step employs pairwise, bit-balancing and
classification loss functions for the generation of hash codes based on image features
characterized by the first step. To effectively characterize image features for both
image indexing and compression, we propose a novel multi-stage learning procedure
for the training of the proposed approach, which allows the different loss functions
considered in both steps to be weighted automatically. For the first time in RS, the proposed approach
simultaneously applies RS image compression and indexing, and thus does not
require RS image decoding prior to CBIR, which can save a significant amount of time
in operational applications. Through extensive experiments for the sensitivity
analysis of the proposed approach and its comparison with standard approaches,
the effectiveness of simultaneous image compression and indexing for large-scale
knowledge discovery on RS image archives is investigated. We make the code of the
proposed approach available at https://git.tu-berlin.de/rsim/SCI-CBIR.
In Chapter 6, we introduce a novel generative reasoning integrated label noise robust
deep representation learning approach. The proposed approach aims at modeling the
complementary characteristics of discriminative and generative reasoning for IRL on
training RS images associated with noisy labels. To this end, for the first time in RS,
we integrate generative reasoning into discriminative reasoning through a variational
autoencoder for supervised IRL under noisy labels, which leads to an accurate
characterization of RS image representations while preventing interference from noisy labels during
training. Unlike the existing label noise robust methods, the proposed approach does
not depend on the type of annotation, label noise, DNN architecture, loss function
or learning task. It also does not require a clean subset (training samples with clean
labels) of a training set or require a computationally demanding noise correction
strategy prior to training. Thus, our approach can be directly utilized for various
scenarios for IRL in RS. In this chapter, extensive experimental analysis is given on
two IRL scenarios, where training RS images are annotated with: 1) scene-level noisy
multi-labels; and 2) pixel-level noisy labels. Under these scenarios, we consider three
learning tasks with the corresponding loss functions and DNN architectures. We
would like to note that we will make the code of the proposed approach publicly
available at https://git.tu-berlin.de/rsim/GRID.
In Chapter 7, we explore DL-based IRL when multiple learning tasks are jointly
utilized and introduce a novel plasticity-stability preserving multi-task learning
approach that aims to preserve: 1) the plasticity for each task; and 2) the stability
in between learning consecutive tasks independently of the number of tasks
and the type of tasks. To this end, we introduce novel plasticity preserving and
stability preserving loss functions. The plasticity preserving loss (PPL) function
enforces sensitivity of the image representation space to new information learned
with each task during training. The stability preserving loss (SPL) function protects
the image representation space from being radically disrupted by each task during training. To
effectively apply these two loss functions, we also propose a sequential optimization
algorithm that adaptively adjusts the interactions between task-specific learning
procedures, and thus ensures the plasticity and stability conditions for all the tasks. In
this chapter, extensive experiments are provided for the sensitivity analysis of
the proposed approach and its comparison with state-of-the-art methods
when different combinations of four learning tasks are utilized for IRL. As an open
source contribution, we make the code of the proposed approach publicly available
at https://git.tu-berlin.de/rsim/PLASTA-MTL.
1.2 List of Publications
During the PhD period of the author, the contributions of this thesis have been
published as journal articles or presented at scientific conferences, as is common
practice in computer science. Subsection 1.2.1 lists the publications that contain the
contributions of this thesis. The work described in this thesis has also inspired
additional studies (listed in subsection 1.2.2) that are not discussed in this
thesis.
1.2.1 Contributions of the Thesis
Journal Articles
G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-
label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946,
2020. DOI:10.1109/ACCESS.2020.2995805.
G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M.
Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-modal
multi-label benchmark archive for remote sensing image classification and
retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3, pp. 174–
180, 2021. DOI:10.1109/MGRS.2021.3089174.
G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representative
triplet selection for multilabel remote sensing image retrieval,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI:10.1109/TGRS.2021.3124326.
G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning
for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.
G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression
and indexing for scalable content-based retrieval in remote sensing,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022. DOI:10.1109/TGRS.2022.3204914.
G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust
deep image representation learning,” IEEE Transactions on Image Processing,
2023. DOI:10.1109/TIP.2023.3293776.
Book Chapters
G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and retrieval
in large remote sensing archives,” in Deep Learning for the Earth Sciences: A
comprehensive approach to remote sensing, climate science and geosciences, Hoboken,
NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.
Conference Papers
G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale
benchmark archive for remote sensing image understanding,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2019,
pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.
G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-label
remote sensing image classification,” in Proceedings of the IEEE International
Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI:10.1109/IGARSS.2019.8898188.
G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse
triplet sampling method for multi-label remote sensing image retrieval,” in
Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote
Sensing Symposium, 2022, pp. 5–8. DOI:10.1109/M2GARSS52314.2022.9839759.
G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to
jointly compress and index remote sensing images for efficient content-based
retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing
Symposium, 2022, pp. 251–254. DOI:10.1109/IGARSS46834.2022.9884146.
G. Sumbul and B. Demir, “Label noise robust image representation learning
based on supervised variational autoencoders in remote sensing,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2023.
1.2.2 Additional Contributions
Journal Articles
A. Preethy Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Remote-sensing
image scene classification with deep neural networks in JPEG 2000 compressed
domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4,
pp. 3458–3472, 2021. DOI:10.1109/TGRS.2020.3007523.
G. Sumbul, S. Nayak, and B. Demir, “SD-RSIC: Summarization-driven deep
remote sensing image captioning,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 59, no. 8, pp. 6922–6934, 2021. DOI:10.1109/TGRS.2020.3031111.
Conference Papers
A. P. Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Approximating JPEG 2000
wavelet representation through deep neural networks for remote sensing image
scene classification,” in Proceedings of the Image and Signal Processing for Remote
Sensing Conference, vol. 11155, 2019, 111550S. DOI:10.1117/12.2534643.
K. Zhang, G. Sumbul, and B. Demir, “An approach to super-resolution of
sentinel-2 images based on generative adversarial networks,” in Proceedings of
the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium,
2020, pp. 69–72. DOI:10.1109/M2GARSS47143.2020.9105165.
H. Yessou, G. Sumbul, and B. Demir, “A comparative study of deep learning
loss functions for multi-label remote sensing image classification,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2020, pp. 1349–
1352. DOI:10.1109/IGARSS39084.2020.9323583.
G. Sumbul and B. Demir, “A novel graph-theoretic deep representation learning
method for multi-label remote sensing image retrieval,” in Proceedings of the
IEEE International Geoscience and Remote Sensing Symposium, 2021, pp. 266–269.
DOI:10.1109/IGARSS47720.2021.9554466.
G. Sumbul, M. Müller, and B. Demir, “A novel self-supervised cross-modal image
retrieval method in remote sensing,” in Proceedings of the IEEE International
Conference on Image Processing, 2022, pp. 2426–2430. DOI:10.1109/ICIP46576.2022.9897475.
A. Zell, G. Sumbul, and B. Demir, “Deep metric learning-based semi-supervised
regression with alternate learning,” in Proceedings of the IEEE International
Conference on Image Processing, 2022, pp. 2411–2415. DOI:10.1109/ICIP46576.2022.9897939.
B. Büyüktas, G. Sumbul, and B. Demir, “Learning across decentralized multi-
modal remote sensing archives with federated learning,” in Proceedings of the
IEEE International Geoscience and Remote Sensing Symposium, 2023.
J. Henkel, G. Hoxha, G. Sumbul, L. Möllenbrok, and B. Demir, “Annotation
cost efficient active learning for remote sensing image retrieval,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2023.
1.3 Structure of the Thesis
The rest of this thesis is structured as follows:
Chapter 2 introduces BigEarthNet, which is a large-scale multi-modal multi-label RS
image archive, for benchmarking DL-based IRL methods in RS.
Chapter 3 presents our DL-based IRL approach for multi-label classification of high-
dimensional high-spatial resolution RS images.
Chapter 4 introduces our image triplet sampling method for DL-based IRL of RS
images through the characterization of image similarities in a metric space.
Chapter 5 presents our approach devoted to simultaneous RS image compression
and indexing for scalable CBIR.
Chapter 6 presents our generative reasoning integrated label noise robust deep
representation learning approach for IRL on training images with noisy labels.
Chapter 7 introduces our plasticity-stability preserving multi-task learning approach
for DL-based IRL when multiple learning tasks are jointly utilized.
Chapter 8 concludes this thesis with a summary and a discussion of future
research directions.
Chapter 2
BigEarthNet: A Large Scale Benchmark
Archive for Remote Sensing Image
Representation Learning
DL-based IRL methods in RS generally require a large quantity of annotated training
RS images to accurately learn the model parameters of the considered DNN. To fulfill
this requirement, this chapter presents the multi-modal multi-label BigEarthNet
benchmark archive made up of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches
acquired over 10 different European countries (Austria, Belgium, Finland, Ireland,
Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland). Each pair of patches in
BigEarthNet is annotated with multi-labels provided by the CORINE Land Cover (CLC)
map of 2018 based on its thematically most detailed Level-3 class nomenclature. Some
CLC classes are difficult to describe accurately when considering only (single-date)
BigEarthNet image patches. To address this problem, in this chapter we also introduce
an alternative class nomenclature as an evolution of the original CLC labels. This is
achieved by interpreting and arranging the CLC Level-3 nomenclature, based on the
properties of BigEarthNet images, in a new nomenclature of 19 classes. In our
experiments, we show the potential of BigEarthNet for multi-modal multi-label CBIR
and scene classification problems by considering several state-of-the-art DL models.
We also demonstrate that the DL models trained from scratch on BigEarthNet outperform
those pre-trained on ImageNet, especially in relation to some complex classes,
including agriculture and other vegetated and natural environments. We make all the
data and the DL models publicly available at https://bigearth.net, offering an
important resource to support studies on DL-based IRL in RS. This chapter is mainly
based on the following publications:
G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M.
Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-modal
multi-label benchmark archive for remote sensing image classification and
retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3, pp. 174–
180, 2021. DOI:10.1109/MGRS.2021.3089174.
G. Sumbul, M. Charfuelan, B. Demir, and M. Volker, “BigEarthNet: A large-scale
benchmark archive for remote sensing image understanding,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2019,
pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.
2.1 Introduction
Most of the DL-based RS image representation learning methods require a large
number of annotated images during training to accurately optimize all parameters
and reach high performance. The availability and quality of such data determine
the feasibility of many DL models. There are several benchmark archives made
publicly available for different RS applications. To the best of our knowledge, most
of the existing publicly available benchmark archives for image scene classification
and retrieval problems contain: 1) single-modal RS images (e.g., multispectral or
SAR); and 2) single-label image annotations (i.e., each image is annotated by a single
label that is associated with the most significant content of the considered image)
with a small number of annotated images. However, multi-modal images associated
with the same geographical area allow for rich characterization of RS images when
jointly considered [48]. In addition, RS images usually contain areas with a high
variety of semantically complex content that must be reflected by more than one class
annotation through multiple class labels (multi-labels).
Thus, a benchmark archive consisting of multi-modal RS images annotated with
multi-labels is needed. However, annotating RS images with multi-labels at a large-
scale to drive DL studies is time consuming, complex, and costly in operational
scenarios. To overcome this problem, a common approach is to exploit DL models
with proven architectures, which are pre-trained on publicly available general purpose
datasets in the computer vision (CV) community. However, we argue that this
is not a proper approach in RS, because of the differences in image characteristics
in CV and RS. For example, Sentinel-2 multispectral images have 13 spectral bands
associated with varying and lower spatial resolutions compared to the CV images.
To overcome these issues, in this chapter, we introduce BigEarthNet as a large-scale
multi-modal multi-label benchmark RS image archive that contains 590,326 pairs
of Sentinel-2 and Sentinel-1 image patches. Each pair of patches in BigEarthNet
is annotated with multi-labels provided by the CORINE Land Cover (CLC) map
of 2018 (CLC 2018) [49]. The CLC nomenclature includes land cover and land use
classes grouped in a three-level hierarchy, and for the BigEarthNet image patches,
the most thematically detailed Level-3 class nomenclature is considered. We would
like to note that some CLC classes are difficult to identify by exploiting only
(single-date) images, because: i) land use concepts associated with some classes
(e.g., Dump sites, Sport and leisure facilities) may not be visible from space or
fully recognizable with the spatial resolution of Sentinel images; and ii) RS time
series, which BigEarthNet does not include, may be required to describe and
discriminate some classes (e.g., Non-irrigated arable land, Permanently irrigated land).
To this end, we also introduce an alternative nomenclature for images in BigEarthNet
as an evolution of the original CLC labels. The rest of the chapter is organized as
follows. We first review the existing benchmark RS image archives in Section 2.2,
and then introduce BigEarthNet and the alternative class nomenclature in Section
2.3. Section 2.4 describes the experimental design, while Section 2.5 provides the
experimental results. Section 2.6 draws the conclusion of this chapter.
TABLE 2.1: A LIST OF EXISTING RS IMAGE ARCHIVES
Dataset Name | Image Type | Annotation Type | Number of Images | Publication Year
UC Merced [84] | Aerial RGB | Single Label | 2,100 | 2010
UC Merced [93] | Aerial RGB | Multi Label | 2,100 | 2018
WHU-RS19 [85] | Aerial RGB | Single Label | 1,005 | 2013
RSSCN7 [86] | Aerial RGB | Single Label | 2,800 | 2015
SIRI-WHU [87] | Aerial RGB | Single Label | 2,400 | 2016
RSC11 [94] | Aerial RGB | Single Label | 1,232 | 2016
AID [88] | Aerial RGB | Single Label | 10,000 | 2017
NWPU-RESISC45 [89] | Aerial RGB | Single Label | 31,500 | 2017
RSI-CB [90] | Aerial RGB | Single Label | 36,707 | 2017
PatternNet [92] | Aerial RGB | Single Label | 30,400 | 2018
EuroSat [91] | Satellite Multispectral | Single Label | 27,000 | 2019
DFC15 [35] | Aerial RGB | Multi Label | 3,342 | 2019
2.2 Limitations of Existing Archives
Most of the existing benchmark archives in RS (UC Merced Land Use Dataset [84],
WHU-RS19 [85], RSSCN7 [86], SIRI-WHU [87], AID [88], NWPU-RESISC45 [89],
RSI-CB [90], EuroSat [91] and PatternNet [92]) contain a small number of single-modal
RS images (e.g., multispectral or SAR) annotated with single category labels. Table
2.1 presents the list of the existing archives. These archives have become popular for
the implementation, evaluation and validation of algorithms in the context of image
classification, search and retrieval tasks. However, the RS community encounters
critical limitations when using these archives to apply DL models. One of the
most critical limitations is that the number of annotated images included in the
existing archives is very small. In this respect, they are insufficient for training
modern deep neural networks to reach a high generalization ability, as the models
may overfit dramatically when using small training sets. In detail, training such
networks on the existing archive images requires learning a large number of
parameters from few samples, which prevents the accurate characterization of
high-level features in RS images.
It is worth mentioning that annotating RS images at a large-scale to drive DL studies
is time consuming, complex, and costly in operational scenarios. To overcome this
problem, a common approach is to exploit DL models with proven architectures
(such as ResNet [95] or VGG [96]), which are pre-trained on publicly available general
purpose datasets in the CV community (e.g., ImageNet [97]). The existing model is
then fine-tuned on a small set of RS images to calibrate the final layers. This strategy
is also known as a transfer learning strategy. There are several versions of the above-
mentioned models that have been pre-trained on large-scale datasets in CV. However,
we argue that this is not a proper approach in RS, because of the differences in image
characteristics in CV and RS. For example, Sentinel-2 multispectral images have 13
spectral bands associated with varying and lower spatial resolutions compared to the
CV images. The high spectral resolution of the data can allow accurate characterization
of the complex semantic content of Sentinel-2 images if it is efficiently exploited. In
addition, the semantic content present in CV and RS images is significantly different,
and thus the respective semantic classes differ from each other. Accordingly,
fine-tuning pre-trained models for RS images may lead to weak discrimination ability
for land-cover classes in RS. Thus, fine-tuning may not be generally applicable to
close this semantic gap.
FIGURE 2.1: An example of BigEarthNet image pairs and their multi-labels.
Multi-labels of the four example pairs shown in the figure:
1) Urban fabric, Arable land, Pastures, Complex cultivation patterns
2) Urban fabric, Industrial or commercial units, Land principally occupied by agriculture, Mixed forest, Marine waters
3) Urban fabric, Arable land, Land principally occupied by agriculture
4) Urban fabric, Permanent crops, Complex cultivation patterns, Land principally occupied by agriculture, Broad-leaved forest, Moors, heathland and sclerophyllous vegetation
Another limitation of existing archives is that they contain images annotated by
single high-level category labels, which are related to the most significant content of
the image. However, RS images generally contain multiple classes, so they can
simultaneously be associated with different land-cover class labels (i.e., multi-labels).
Hence, a benchmark archive in RS consisting of images annotated with multi-labels
is required. Although the benchmark archives presented in [35], [93] contain images
with multi-labels, their sample sizes are too small to be efficiently utilized
for DL models.
The last limitation of existing RS image archives is that, since researchers generally
do not have free access to satellite data together with their annotations, most of the
benchmark archives contain only aerial images with RGB image bands as single-modal
data. Although the benchmark archive proposed in [91] includes annotated satellite
images, it suffers from the limited number of images explained above, and it only
includes multispectral Sentinel-2 images. It is worth noting that multi-modal images
associated with the same geographical area allow for rich characterization of RS
images and thus improve image representation learning when jointly considered [48].
The lack of sufficient multi-modal satellite images with annotations prevents the
convenient use of DL-based methods for the complete understanding of the huge amount
of freely accessible satellite data (e.g., Sentinel-1, Sentinel-2).
To overcome these issues, as the first large-scale multi-modal multi-label benchmark
archive in RS, we introduce BigEarthNet that contains 590,326 pairs of Sentinel-2 and
Sentinel-1 image patches. Fig. 2.1 shows an example of the BigEarthNet image pairs
and their multi-labels; the archive is described in detail in the following sections.
2.3 BigEarthNet: A Large-Scale Benchmark Archive
To overcome the limitations of existing archives, we introduce BigEarthNet (also
called BigEarthNet-MM), which is the first large-scale multi-modal benchmark archive
in RS. BigEarthNet contains 590,326 pairs of Sentinel-1 and Sentinel-2 image patches
acquired over 10 different European countries (Austria, Belgium, Finland, Ireland,
Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland). To construct the
Sentinel-2 patches of BigEarthNet, 125 Sentinel-2 tiles associated with less than 1%
of cloud cover and acquired between June 2017 and May 2018 were considered. All tiles
were atmospherically corrected by employing Sentinel-2’s Level 2A product generation
and formatting tool (sen2cor) provided by the European Space Agency due to its
proven success in the literature. After the atmospheric correction, the 10th band of
each image patch is no longer available, as it is the cirrus band (which is omitted
in the Level 2A output for its lack of surface information). Then, the tiles were
divided into 590,326 non-overlapping image patches, each of which is a section of: 1)
120 × 120 pixels for 10m bands; 2) 60 × 60 pixels for 20m bands; and 3) 20 × 20 pixels
for 60m bands. One important goal during the tile selection process was to represent
all chosen geographic locations with images acquired in different seasons. Due to the
difficulty of finding tiles with a low cloud cover percentage in the relatively narrow
time period, this has not been possible at each considered location. Accordingly, the
following respective numbers of patches for autumn, winter, spring, and summer
have been considered: 143,557, 72,877, 175,937, and 126,913. For the quality check
of patches, visual inspection was also employed, which led to the identification of
70,987 Sentinel-2 image patches that are fully covered by seasonal snow, cloud, and
cloud shadow (see footnote 1). An example for those cases is shown in Fig. 2.2.
FIGURE 2.2: An example of the Sentinel-2 image patches of BigEarthNet that are fully covered
by seasonal snow, cloud and cloud shadow.
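The three patch sizes quoted above are consistent with a single ground footprint per patch. The following minimal sketch assumes a square footprint of 1200 m per side, which is inferred here from 120 pixels at 10 m resolution and is not stated explicitly in the text; the names `PATCH_EXTENT_M` and `patch_size_px` are illustrative only.

```python
# Assumed ground extent of one patch side, inferred from 120 px at 10 m.
PATCH_EXTENT_M = 120 * 10  # 1200 m


def patch_size_px(resolution_m: int) -> int:
    """Return the patch side length in pixels for a band of the given resolution."""
    if resolution_m <= 0 or PATCH_EXTENT_M % resolution_m != 0:
        raise ValueError(f"resolution {resolution_m} m does not tile {PATCH_EXTENT_M} m")
    return PATCH_EXTENT_M // resolution_m


# Side lengths for the three Sentinel-2 band groups mentioned in the text.
sizes = {res: patch_size_px(res) for res in (10, 20, 60)}
print(sizes)  # {10: 120, 20: 60, 60: 20}
```

Under this assumption, every band group of a patch covers the same area on the ground, which is what allows the bands to be resampled onto a common grid later on.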
Footnote 1: The lists are available at http://bigearth.net/#downloads.
To construct the Sentinel-1 patches of BigEarthNet, 325 Sentinel-1 Ground Range
Detected (GRD) products acquired between June 2017 and May 2018 that jointly
cover the area of all original 125 Sentinel-2 tiles with close temporal proximity were
selected and processed. The selected scenes provide dual-polarized information
channels (VV and VH) and are based on the interferometric wide swath (IW) mode,
which is the main acquisition mode over land. All scenes were pre-processed by
using the Sentinel-1 toolbox (S1TBX) and the graph processing framework (GPF) of
ESA’s Sentinel Application Platform (SNAP). This includes the application of precise
orbit files, border and thermal noise removal, radiometric calibration, and geometric
correction (i.e., Range Doppler terrain correction). Depending on the spatial extent of
the scene, either the SRTM 30 (for scenes below 60° latitude) or the ASTER DEM (for
scenes above 60° latitude, where no SRTM 30 exists) was employed in the geometric
correction to project images from slant range to ground range. Finally, the backscatter
coefficient was converted to a decibel (dB) scale. It is worth noting that, since the
selection of the speckle filter is considered to be application dependent, no speckle
filtering was applied in our pre-processing workflow in order to preserve the full
resolution. This approach is also recommended by the Product Family Specification
for SAR of the CEOS Analysis Ready Data for Land (CARD4L) framework (see footnote 2).
Based on the pre-processed Sentinel-1 scenes, for each Sentinel-2 patch, a corresponding
Sentinel-1 patch with a close timestamp was extracted. In addition, each Sentinel-1
patch inherited the annotations of the corresponding Sentinel-2 patch. The resulting
Sentinel-1 image patches have a spatial resolution of 10m.
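The last pre-processing step, the conversion of the backscatter coefficient to a decibel scale, can be sketched as follows. This is only an illustration of the standard 10·log10 conversion, not the SNAP implementation used for the archive; the function name `to_decibel` and the `floor` parameter (which avoids taking the logarithm of zero) are assumptions introduced here.

```python
import numpy as np


def to_decibel(sigma0_linear, floor=1e-10):
    """Convert linear backscatter values to decibels: 10 * log10(sigma0).

    `floor` is a hypothetical lower bound that keeps zero-valued pixels
    from producing -inf; it is not part of the described workflow.
    """
    sigma0 = np.asarray(sigma0_linear, dtype=np.float64)
    return 10.0 * np.log10(np.maximum(sigma0, floor))


# A toy two-channel (VV, VH) patch with linear backscatter values.
patch = np.array([[1.0, 0.1], [0.01, 0.0]])
print(to_decibel(patch))
```

In this sketch a linear value of 1.0 maps to 0 dB and each factor of ten corresponds to a step of 10 dB.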
Each pair (which is made up of Sentinel-1 and Sentinel-2 image patches acquired over
the same geographical area) in BigEarthNet is associated with one or more class labels
(i.e., multi-labels) extracted from the CORINE Land Cover map of 2018. CORINE Land
Cover (CLC) is a pioneering initiative launched in the 1980s to produce
harmonized land use land cover (LULC) maps in vector format for the member
states of the European Union. According to the validation report of the CLC, the
accuracy is around 85% [98]. Nowadays, CLC covers 39 European countries and
has been produced for five reference years: 1990, 2000, 2006, 2012, and 2018. The 2018
map was produced with data of 2017-2018, which matches the time frame of the images
included in BigEarthNet. The motivation for embracing such a large-scale mapping
endeavor was to meet the demand for spatially explicit and harmonized information on
land for a variety of purposes, such as environmental management and decision
making. The limited technology of the 1980s and the large spectrum
of potential uses of the maps led to the definition of a coarse spatial resolution
and a nomenclature with some broad class definitions, mixing land cover and land
use concepts. These definitions are implemented for map production by visual
interpretation of RS images and additional data in most countries. Additional data
may include very high spatial resolution imagery and official spatial data sets like
land registers, often to infer the land use. The same technical specifications were
preserved in map updating for historical consistency. Thus, the five produced CLC
maps have a minimum mapping unit of 25 ha and a minimum mapping width of
100 m, and provide information on land according to a hierarchical nomenclature
of 44 classes at the most detailed level (Level-3). The image patches in BigEarthNet
are representative of 43 CLC classes. When CLC maps are considered as
labeling sources for training machine learning methods to automatically analyse
RS images, modified versions of the CLC nomenclature (which better fit the
purpose of the considered application) are commonly preferred. One of the main
reasons is that RS systems directly observe the land cover rather than the land use.
The CLC land-use based labels may not be fully recognizable in RS images
unless the images have a very high spatial resolution. As an example, in [99]
CLC is used as a basis to collect training data for supervised RS image classification,
but classes such as Discontinuous urban fabric and Sport and leisure facilities that depend
mainly on land use were removed. A deep revision of the CLC program is currently
under consideration following the concept of the EIONET Action Group on Land
monitoring in Europe (EAGLE) [100].
Footnote 2: https://ceos.org/ard/
TABLE 2.2: THE LIST OF CLASSES WITHIN CLC AND PROPOSED CLASS NOMENCLATURES
AND THEIR ASSOCIATED NUMBERS OF IMAGE PAIRS. THESE NUMBERS ARE OBTAINED AFTER
ELIMINATING SENTINEL-2 IMAGE PATCHES THAT ARE FULLY COVERED BY SEASONAL SNOW,
CLOUD, AND CLOUD SHADOW.
CLC Class-Nomenclature (Number of Image Pairs) | 19 Classes Nomenclature | Total | Training | Validation | Test
Continuous urban fabric (10,766); Discontinuous urban fabric (65,894) | Urban fabric | 74,891 | 38,783 | 18,180 | 17,928
Industrial or commercial units (11,865) | Industrial or commercial units | 11,865 | 6,182 | 2,875 | 2,808
Road and rail networks and associated land (3,269); Port areas (453); Airports (820); Mineral extraction sites (4,225); Dump sites (822); Construction sites (1,081); Green urban areas (1,651); Sport and leisure facilities (4,983) | removed | - | - | - | -
Non-irrigated arable land (183,987); Permanently irrigated land (13,571); Rice fields (3,793) | Arable land | 194,148 | 100,394 | 46,604 | 47,150
Vineyards (9,524); Fruit trees and berry plantations (4,672); Olive groves (12,503); Annual crops associated with permanent crops (7,019) | Permanent crops | 29,350 | 15,862 | 6,676 | 6,812
Pastures (98,997) | Pastures | 98,997 | 50,981 | 23,846 | 24,170
Complex cultivation patterns (104,203) | Complex cultivation patterns | 104,203 | 53,534 | 25,031 | 25,638
Land principally occupied by agriculture, with significant areas of natural vegetation (130,637) | Land principally occupied by agriculture, with significant areas of natural vegetation | 130,637 | 67,260 | 31,325 | 32,052
Agro-forestry areas (30,649) | Agro-forestry areas | 30,649 | 15,790 | 7,598 | 7,261
Broad-leaved forest (141,300) | Broad-leaved forest | 141,300 | 73,411 | 33,759 | 34,130
Coniferous forest (164,775) | Coniferous forest | 164,775 | 86,569 | 38,674 | 39,532
Mixed forest (176,567) | Mixed forest | 176,567 | 91,930 | 41,996 | 42,641
Natural grassland (11,141); Sparsely vegetated areas (1,202) | Natural grassland and sparsely vegetated areas | 12,022 | 6,663 | 2,560 | 2,799
Moors and heathland (5,073); Sclerophyllous vegetation (11,241) | Moors, heathland and sclerophyllous vegetation | 16,267 | 8,438 | 3,970 | 3,859
Transitional woodland-shrub (148,950) | Transitional woodland-shrub | 148,950 | 77,593 | 35,146 | 36,211
Beaches, dunes, sands (1,536) | Beaches, dunes, sands | 1,536 | 1,197 | 118 | 221
Bare rock (2,894) | removed | - | - | - | -
Burnt areas (304) | removed | - | - | - | -
Inland marshes (5,516); Peatbogs (16,667) | Inland wetlands | 22,100 | 11,620 | 5,131 | 5,349
Salt marshes (1,339); Salines (424) | Coastal wetlands | 1,566 | 1,037 | 219 | 310
Intertidal flats (962) | removed | - | - | - | -
Water courses (9,792); Water bodies (58,009) | Inland waters | 67,277 | 35,349 | 15,751 | 16,177
Coastal lagoons (1,495); Estuaries (1,064); Sea and ocean (72,522) | Marine waters | 74,877 | 39,114 | 17,740 | 18,023
To do more justice to the properties of BigEarthNet image pairs, we introduce
a new class nomenclature by modifying the multi-labels extracted from the CLC
2018. To this end, the CLC Level-3 nomenclature is interpreted and arranged in
a new nomenclature of 19 classes (see Table 2.2). Ten classes of the original CLC
nomenclature are maintained in the new nomenclature, 22 classes are grouped into 9
new classes, and 11 classes are removed. The classes maintained are semantically
homogeneous and largely related to land cover, such as Broad-leaved forest and Beaches,
dunes, sands. Furthermore, CLC classes that cannot feasibly be identified by using
only single-date BigEarthNet images are removed, such as Burnt areas. Complex classes
(which are often removed when undertaking image classification) are maintained,
such as Complex cultivation patterns and Land principally occupied by agriculture, with
significant areas of natural vegetation. The goal is to investigate the ability of DL models
to learn from spatial patterns that express semantic classes. Classes are grouped
when sharing similar land cover types and spectral patterns. For example, Moors and
heathland and Sclerophyllous vegetation are grouped in a single class, and a new class,
Arable land, groups similar crops that require dense time series (which are not available
in BigEarthNet) for their discrimination (e.g., irrigated and non-irrigated crops). Classes
that strongly depend on land use or need additional data for their discrimination are
removed. For example, the class Airports essentially relates to land use, and Intertidal
flats appear in RS images either with or without water depending on the image
acquisition time and hence require appropriate time series for their classification. The
number of labels associated with each image pair varies between 1 and 12, while
96.80% of image pairs are associated with at most 5 labels. Only 23 image
pairs are annotated with more than 9 labels.
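The regrouping described above can be expressed as a lookup from CLC Level-3 labels to the 19-class nomenclature. The sketch below is an illustration built from the class lists in Table 2.2, not the authors' released conversion code; the names `GROUPS`, `MAINTAINED`, `REMOVED`, `CLC_TO_19` and `to_19_class` are hypothetical, and removed classes are mapped to `None`.

```python
# CLC classes that are merged into a new class (22 classes into 9).
GROUPS = {
    "Urban fabric": ["Continuous urban fabric", "Discontinuous urban fabric"],
    "Arable land": ["Non-irrigated arable land", "Permanently irrigated land",
                    "Rice fields"],
    "Permanent crops": ["Vineyards", "Fruit trees and berry plantations",
                        "Olive groves",
                        "Annual crops associated with permanent crops"],
    "Natural grassland and sparsely vegetated areas": [
        "Natural grassland", "Sparsely vegetated areas"],
    "Moors, heathland and sclerophyllous vegetation": [
        "Moors and heathland", "Sclerophyllous vegetation"],
    "Inland wetlands": ["Inland marshes", "Peatbogs"],
    "Coastal wetlands": ["Salt marshes", "Salines"],
    "Inland waters": ["Water courses", "Water bodies"],
    "Marine waters": ["Coastal lagoons", "Estuaries", "Sea and ocean"],
}

# CLC classes kept under their original name (10 classes).
MAINTAINED = [
    "Industrial or commercial units", "Pastures", "Complex cultivation patterns",
    "Land principally occupied by agriculture, with significant areas of "
    "natural vegetation",
    "Agro-forestry areas", "Broad-leaved forest", "Coniferous forest",
    "Mixed forest", "Transitional woodland-shrub", "Beaches, dunes, sands",
]

# CLC classes dropped from the new nomenclature (11 classes).
REMOVED = [
    "Road and rail networks and associated land", "Port areas", "Airports",
    "Mineral extraction sites", "Dump sites", "Construction sites",
    "Green urban areas", "Sport and leisure facilities", "Bare rock",
    "Burnt areas", "Intertidal flats",
]

CLC_TO_19 = {clc: new for new, members in GROUPS.items() for clc in members}
CLC_TO_19.update({name: name for name in MAINTAINED})
CLC_TO_19.update({name: None for name in REMOVED})


def to_19_class(clc_labels):
    """Map the CLC multi-labels of one image pair to sorted 19-class labels."""
    mapped = {CLC_TO_19[label] for label in clc_labels}
    mapped.discard(None)  # labels of removed classes vanish
    return sorted(mapped)


print(to_19_class(["Continuous urban fabric", "Airports", "Peatbogs"]))
```

An image pair whose CLC labels all belong to removed classes ends up with an empty label list under this mapping, which matches the observation that a few pairs carry no label in the new nomenclature.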
2.4 Experimental Design
The experiments were carried out in the context of content-based multi-modal
multi-label RS image retrieval and classification. To achieve multi-modal learning, we
stacked the VV and VH bands of Sentinel-1 image patches, and the Sentinel-2 bands
associated with 10m and 20m spatial resolution into one volume for each pair in
BigEarthNet. To this end, we initially applied cubic interpolation to the 20m bands of
Sentinel-2 image patches. In the experiments, we did not use the Sentinel-2 image
bands associated with 60m spatial resolution (bands 1 and 9). This is because
these bands are mainly used for cloud screening, atmospheric correction,
and cirrus detection in RS applications and do not embody a significant amount
of information for the characterization of the semantic content of RS images. In the
experiments, we considered the VGG model [96] and the ResNet model [95] with
various numbers of layers (VGG16, VGG19, ResNet50, ResNet101, ResNet152). To
fairly compare all models, we utilized the Adam optimizer [101] with an initial
learning rate of 10⁻³ to decrease the sigmoid cross-entropy loss. Except for the learning
rate, we employed the same parameter values given in [95], [96]. The batch size is
set to 256 for ResNet152 and to 500 for all other models used in the experiments.
We applied training from scratch for 100 epochs, while the final layers of the
pre-trained models were fine-tuned separately on each modality for 10 epochs. For all the
models, we added a fully connected layer that includes 19 neurons at the end of the
network for the classification. For image retrieval, we extracted image features from
the considered models and applied similarity matching of the features based on the
χ²-distance measure. We performed various experiments to analyze the effectiveness
of: i) learning from BigEarthNet directly (through training from scratch) instead
of using the pre-trained models on ImageNet; and ii) state-of-the-art CNN models
trained and evaluated on BigEarthNet. To use the pre-trained models on ImageNet,
we used the late fusion of separately fine-tuned models on Sentinel-1 and Sentinel-2
patches. In the experiments, we did not use the Sentinel-2 patches that are fully
covered by seasonal snow, cloud, and cloud shadow. After the arrangements of the
new class nomenclature, 57 pairs among the 590,326 pairs are not associated with any
LULC labels. These pairs are not used in the experiments. We divided the remaining
dataset into: i) the training set of 269,695 pairs of patches; ii) the validation set of
123,723 pairs of patches; and iii) the test set of 125,866 pairs of patches.
TABLE 2.3: CLASS-BASED F2 SCORES (%) OBTAINED WHEN: I) TRANSFER LEARNING FROM
IMAGENET AND II) DIRECT LEARNING FROM BIGEARTHNET ARE USED FOR MULTI-MODAL
MULTI-LABEL IMAGE CLASSIFICATION.
Class | Transfer Learning From ImageNet | Learning From BigEarthNet
Urban fabric | 56.27 | 71.99
Industrial or commercial units | 30.98 | 43.21
Arable land | 80.05 | 83.62
Permanent crops | 4.32 | 55.52
Pastures | 50.98 | 74.77
Complex cultivation patterns | 36.29 | 62.03
Land principally occupied by agriculture, with significant areas of natural vegetation | 30.36 | 60.63
Agro-forestry areas | 2.13 | 71.87
Broad-leaved forest | 42.83 | 75.39
Coniferous forest | 75.47 | 86.32
Mixed forest | 72.19 | 81.31
Natural grassland and sparsely vegetated areas | 14.11 | 43.88
Moors, heathland and sclerophyllous vegetation | 5.29 | 59.91
Transitional woodland-shrub | 41.23 | 64.21
Beaches, dunes, sands | 43.67 | 63.39
Inland wetlands | 8.20 | 57.81
Coastal wetlands | 4.79 | 42.23
Inland waters | 63.23 | 82.10
Marine waters | 93.99 | 97.20
Average | 39.81 | 67.23
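The retrieval step described in this section, similarity matching of extracted features with the χ²-distance, can be sketched as follows. The text does not give the exact formula, so the symmetric form d(x, y) = ½ Σᵢ (xᵢ − yᵢ)² / (xᵢ + yᵢ) is an assumption, as is the requirement that features are non-negative; `chi2_distance`, `retrieve` and `eps` are names introduced only for this illustration.

```python
import numpy as np


def chi2_distance(query, archive, eps=1e-12):
    """Chi-square distance between one query vector and each archive row.

    Assumes non-negative features; `eps` guards against 0/0 terms.
    """
    query = np.asarray(query, dtype=np.float64)
    archive = np.asarray(archive, dtype=np.float64)
    num = (archive - query) ** 2
    den = archive + query + eps
    return 0.5 * np.sum(num / den, axis=1)


def retrieve(query, archive, k):
    """Return the indices of the k archive items closest to the query."""
    distances = chi2_distance(query, archive)
    return np.argsort(distances, kind="stable")[:k]


# Toy archive of three 2-D feature vectors and a query equal to the first one.
archive = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])
print(retrieve(query, archive, k=2))  # [0 2]
```

In practice the feature vectors would come from a trained network rather than a toy array, and the ranking would be computed over the whole archive.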
We performed our experiments on a cluster of 4 NVIDIA Tesla V100 GPUs. The
results of multi-modal multi-label image classification were provided in terms of four
performance metrics: 1) Hamming loss (HL); 2) one-error (OE); 3) recall (R); and 4)
$F_2$-Score ($F_2$).
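As a minimal sketch of how these four metrics could be computed in a sample-averaged form, the NumPy function below assumes the commonly used definitions: Hamming loss as the fraction of wrong label decisions, one-error as the share of samples whose top-ranked label is not a true label, and $F_2$ as the F-beta score with beta = 2. The function name `multilabel_metrics` and the decision threshold of 0.5 are illustrative assumptions, not details taken from the experiments.

```python
import numpy as np

def multilabel_metrics(y_true, scores, threshold=0.5):
    """Sample-averaged multi-label metrics (assumed standard definitions).

    y_true: (N, S) binary label matrix; scores: (N, S) class probabilities.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    y_pred = scores >= threshold  # hypothetical threshold

    # Hamming loss: fraction of label decisions that are wrong.
    hamming = np.mean(y_true != y_pred)

    # One-error: share of samples whose top-ranked label is not a true label.
    top = np.argmax(scores, axis=1)
    one_error = np.mean(~y_true[np.arange(len(y_true)), top])

    # Per-sample precision and recall (set to 0 when the denominator is 0).
    tp = np.sum(y_true & y_pred, axis=1).astype(float)
    n_pred = y_pred.sum(axis=1)
    n_true = y_true.sum(axis=1)
    precision = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)

    # F-beta with beta = 2 weights recall more heavily than precision.
    denom = 4 * precision + recall
    f2 = np.divide(5 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)

    return {"HL": hamming, "OE": one_error, "R": recall.mean(), "F2": f2.mean()}
```

Calling `multilabel_metrics(y_true, scores)` on the test-set labels and predicted probabilities returns the four values as fractions; multiplying OE, R and $F_2$ by 100 gives percentages.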
2.5 Experimental Results
2.5.1 Comparison among the Strategies of Learning directly from BigEarthNet and Transfer Learning from the ImageNet
In the first set of experiments, we compare the effectiveness of learning directly from
BigEarthNet with respect to transfer learning from ImageNet. To this end, transfer
[Figure 2.3: a query image pair with its multi-labels from the 19-class nomenclature, followed by the 1st, 5th and 100th retrieved image pairs (with their multi-labels) for transfer learning from ImageNet and for learning directly from BigEarthNet-MM.]
FIGURE 2.3: An example of a query pair from the BigEarthNet archive and retrieved image
pairs obtained by using: 1) direct learning from BigEarthNet; and 2) transfer learning from
ImageNet in the framework of content-based multi-modal multi-label image retrieval.
learning strategy is applied by using the ResNet50 model pre-trained on
ImageNet, while the direct learning strategy is employed by using the ResNet50 trained
from scratch on BigEarthNet. Table 2.3 shows the class-based $F_2$ classification scores
(known also as macro-averaged $F_2$ scores). By analyzing the table, one can see that
learning directly from BigEarthNet achieves a higher score than the transfer learning
strategy for every class. As an example, learning directly from BigEarthNet
provides more than 12% and 25% higher scores for the classes Industrial or commercial
units and Complex cultivation patterns, respectively, compared to the transfer learning
strategy. The difference in performance between these learning strategies is more
evident for more complex LULC classes. As an example, learning directly from
BigEarthNet improves the $F_2$ scores by more than 54% and 69% for the classes Moors,
heathland and sclerophyllous vegetation and Agro-forestry areas, respectively.
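The per-class gains quoted above can be checked directly against Table 2.3. The short Python sketch below copies the four relevant rows of the table and computes the differences in percentage points; the dictionary names are illustrative only.

```python
# F2 scores (%) copied from Table 2.3: (transfer learning, direct learning).
f2 = {
    "Industrial or commercial units": (30.98, 43.21),
    "Complex cultivation patterns": (36.29, 62.03),
    "Moors, heathland and sclerophyllous vegetation": (5.29, 59.91),
    "Agro-forestry areas": (2.13, 71.87),
}

# Gain of direct learning over transfer learning, in percentage points.
gains = {name: round(direct - transfer, 2)
         for name, (transfer, direct) in f2.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```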
In the context of image retrieval, Fig. 2.3 shows an example of a query pair and
the pairs of images retrieved by these strategies. By assessing the figure, one can
observe that when learning is achieved directly from BigEarthNet, semantically
more similar pairs of images are retrieved, containing the Urban fabric and Arable land
classes present in the query. Learning directly from BigEarthNet leads to retrieval of
a pair similar to the query even at the 100th retrieval order. However, using the transfer
learning strategy results in retrieval of pairs that contain Urban fabric and Arable land
classes which are not present in the query pair. One can observe this behavior even
at the 5th retrieved pair.
The success of learning directly from BigEarthNet is mainly due to the
fact that: 1) transfer learning from ImageNet limits the accurate characterization of
the spectral content of RS images; 2) fine-tuning the model pre-trained on ImageNet
by using RS images may not be sufficient to eliminate the semantic gap, since the
category labels present in ImageNet are different from the land-cover class labels
present in BigEarthNet; and 3) the pre-trained model was trained for a single-label
image classification scenario, which limits the accurate characterization of the
multiple land-cover classes present in BigEarthNet.
TABLE 2.4: OVERALL MULTI-MODAL MULTI-LABEL CLASSIFICATION RESULTS UNDER DIFFERENT METRICS AND DL MODELS FOR BIGEARTHNET.
Model | HL | OE (%) | R (%) | $F_2$ (%)
VGG16 | 0.078 | 7.35 | 76.97 | 76.18
VGG19 | 0.080 | 8.12 | 76.17 | 75.35
ResNet50 | 0.074 | 5.93 | 80.05 | 78.73
ResNet101 | 0.074 | 6.46 | 78.85 | 77.88
ResNet152 | 0.073 | 6.42 | 78.13 | 77.46
2.5.2 Comparison of State-of-the-Art CNN Models
In the second set of experiments, we compare the effectiveness of the VGG and the
ResNet models in the framework of multi-modal multi-label classification. Table 2.4
shows the overall classification results under different metrics (which are the
sample-averaged scores). By analyzing the table, one can observe that the ResNet models
provide the highest scores in all metrics. As an example, ResNet50 achieves more
than 2% higher recall and $F_2$ scores compared to the VGG models. This improvement
is due to the residual connections of the ResNet models and their increased depth in
terms of the number of layers compared to the VGG models. Increasing the depth of
the considered models does not significantly affect the performance, i.e., similar
scores are obtained in all the metrics under different depth values of the same model.
2.6 Conclusion
In this chapter, we have presented the BigEarthNet benchmark archive that contains
590,326 pairs of Sentinel-1 and Sentinel-2 image patches with a new CLC-based
class nomenclature that does more justice to the properties of the considered images.
BigEarthNet makes a significant advancement for the use of DL in RS, opening up
promising directions to support research studies in the framework of multi-modal
multi-label RS image scene classification and retrieval. BigEarthNet is suitable to
assess DL-based methods for: i) learning from class-imbalanced multi-modal data
(since the LULC classes are not equally represented in BigEarthNet); ii) transfer
learning (since BigEarthNet currently contains only pairs of images from a small number
of European countries); and iii) unsupervised, self-supervised and semi-supervised
multi-modal learning for information discovery from big data archives.
We would like to note that Sentinel-1 image patches of BigEarthNet (denoted as
BigEarthNet-S1 hereafter) and Sentinel-2 image patches of BigEarthNet (denoted
as BigEarthNet-S2 hereafter) can also be employed separately for single-modal RS
image understanding problems.
It is worth noting that BigEarthNet has limitations for the RS applications that
require time-series data to accurately describe LULC classes, such as Non-irrigated
arable land and Permanently irrigated land. We would also like to note that some Sentinel-1
image patches can be contaminated by artefacts caused by either the well-known
Radio-Frequency Interference [102] or other dataset-related issues, which are independent
of the pre-processing steps applied while constructing BigEarthNet. As a final
remark, we would like to point out that due to the use of labels from the CLC
map, the BigEarthNet archive can be extended to a larger scale within Europe with
zero annotation cost. As a future development of this work, we plan to enrich the
BigEarthNet archive by extending it to the whole of Europe.
Chapter 3
A Deep Multi-Attention Driven Approach for Multi-Label Remote Sensing Image Classification
DL-based IRL methods have become popular in the framework of RS image
scene classification problems. Most of the existing methods assume that training
images are annotated by single-labels; however, RS images typically contain multiple
classes and thus can simultaneously be associated with multi-labels. Despite the
success of existing methods in describing the information content of very high
resolution aerial images with RGB bands, any direct adaptation for high-dimensional
high-spatial resolution RS images falls short of accurately modeling the spectral and
spatial information content. To address this problem, this chapter presents a novel
approach in the framework of multi-label classification of high-dimensional RS
images. The proposed approach is based on three main steps. The first step describes
the complex spatial and spectral content of image local areas by a novel $K$-Branch
CNN that includes spatial resolution specific CNN branches. The second step initially
characterizes the importance scores of different local areas of each image and then
defines a global descriptor for each image based on these scores. This is achieved
by a novel multi-attention strategy that utilizes bidirectional long short-term
memory networks. The final step achieves the classification of RS image scenes with
multi-labels. Experiments carried out on BigEarthNet show the effectiveness of the
proposed approach in terms of multi-label classification accuracy compared to the
state-of-the-art approaches. The code of the proposed approach is publicly available
at https://gitlab.tubit.tu-berlin.de/rsim/MAML-RSIC. This chapter is mainly
based on the following publications:
G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946, 2020. DOI: 10.1109/ACCESS.2020.2995805.
G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-label remote sensing image classification,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI: 10.1109/IGARSS.2019.8898188.
Chapter 3. A Deep Multi-Attention Driven Approach 25
3.1 Introduction
In recent years, DL-based IRL has attracted the attention of RS researchers for the
development of RS image scene classification methods, which aim at automatically
assigning class labels to each image scene in an RS archive. As an example, in [103] a
gradient boosting random convolutional network is proposed as an ensemble framework
to combine several deep neural networks for RS image scene classification
problems. In [104] feature learning strategies defined based on different training
procedures for convolutional neural networks (CNNs) are analyzed. In [105] a region
attention network, which assigns attention scores to candidate regions for the expected
object locations, is introduced to learn the alignment of RS image scenes. To this
end, different image sources are used together for the identification of fine-grained
categories. In [106] a semi-supervised approach based on a generative adversarial
network is proposed for the cases in which the amount of annotated training data is
insufficient. In [107] an intermediate feature aggregation method that progressively
combines the different level features of CNNs is proposed. In [108] a scale-free CNN
that transfers the fully connected layers in a pre-trained CNN model to convolutional
layers and then uses a general average pooling layer after the final convolutional
layer is introduced. The above-mentioned DL-based approaches in RS assume that
each training image is annotated by a single (broad category) label, which is associated
with the most significant content of the image. However, this assumption may not
be appropriate for complex scene classification applications where RS image scenes
contain multiple land-cover classes and thus are simultaneously associated with different
class labels (i.e., multi-labels).
To train DL models with training images annotated by multi-labels, a few DL-based
multi-label scene classification methods have been recently introduced in RS. In [109]
a radial basis function neural network is applied on the CNN features of aerial
images as a multi-label classifier. In [110] a structured support vector machine that
models the spatial contiguity is utilized based on the CNN features of the aerial
images in the framework of multi-label classification. In these approaches, CNNs are
used in a conventional transfer learning setting, in which models pre-trained on
publicly available general-purpose computer vision (CV) datasets (e.g., ImageNet)
act as fixed feature extractors without changing the model parameters. However,
this approach can reduce the multi-label scene classification accuracy because of
the differences in image characteristics in CV and RS. In [111] a data augmentation
strategy is introduced to avoid using a pre-trained network for an end-to-end training
of a shallow CNN. In this approach, to adapt the standard CNN architecture to
multi-label learning, the softmax function of the classification layer is changed into a
sigmoid function. The direct use of standard CNNs that are actually designed for
images annotated by single-labels is a common approach in multi-label classification
problems. However, it may lead to inaccurate identification of the multiple classes
present in images. To overcome this limitation, integration of sequential neural
network approaches into CNN architectures is introduced in RS. In [35] a class-wise
attention-based recurrent neural network (RNN) is introduced to sequentially
model the co-occurrence relationship of multiple classes. In this approach, class
predictions are obtained one after another in the RNN sequence and each prediction
is based on the decisions made until the corresponding class is reached. In [36]
an attention-aware label relational reasoning network is proposed to: i) localize
discriminative regions of aerial images; and ii) characterize the label relations present
in the images based on the localized feature maps. In [37] an encoder-decoder
neural network is introduced to characterize the aerial image features. In detail, a
squeeze excitation layer is used for modeling the channel-wise interdependencies of
the feature maps in the encoder, whereas an RNN-based decoder is exploited as an
adaptive spatial attention mechanism. The attention strategies proposed in [35], [36]
and [37] identify informative areas of images through an attention map based on
the feature maps of convolutional layers. These strategies are effective for very high
resolution aerial images; however, they can be insufficient for accurately describing
the complex content of satellite RS images with high spatial resolution (e.g., Sentinel-2
and Landsat multispectral images). Results obtained on very high resolution
aerial images with only RGB bands show the success of these strategies for the
description of the spatial image content. A direct adaptation of these methods for
high-dimensional RS images may lead to an incomplete representation of the spectral
information content. These issues are critical particularly for images with several
spectral bands with varying spatial resolutions acquired by the new generation
satellites (e.g., Sentinel-2). Thus, methods that can efficiently and effectively describe
the spatial and spectral information content of high-dimensional RS images are
needed in the framework of multi-label RS image scene classification.
To address this problem, we propose a DL-based approach that aims at accurately
describing complex spatial and spectral content of RS images in the framework of
multi-label RS image scene classification. To this end, the proposed approach is
based on three main steps: 1) spatial and spectral characterization of image local
areas; 2) definition of a multi-attention driven global descriptor; and 3) classification
of RS image scenes with multi-labels. The proposed approach assumes that RS
image bands can be associated with varying spatial resolutions and a set of training
images annotated with multi-labels (based on land-cover land-use classes present
in the images) is available. In the first step, we introduce a novel branch-wise CNN
architecture (called the $K$-Branch CNN) that efficiently describes the complex
content of local areas of each image by different CNN branches specialized according
to the spatial resolutions of image bands. In the second step, we present a novel
multi-attention strategy in the framework of RNNs that: i) accurately identifies
importance levels (i.e., scores) for different local areas; and then ii) defines a global
descriptor for each image based on these scores. In the third step, multi-labels are
automatically assigned to each RS image represented by the global descriptors. The
main novelty of the proposed approach consists in the design and development of:
i) the $K$-Branch CNN to efficiently model the complex information content of RS
images for which the spectral bands can be associated with varying spatial resolutions;
and ii) the multi-attention strategy that defines a global image descriptor based on
the extraction and exploitation of importance scores of image local areas. In order
to evaluate the performance of the proposed approach, several experiments are
carried out on BigEarthNet-S2. The experimental results show the success of the
proposed approach compared to the conventional DL-based methods in RS, which
consider all the image bands as a single volume (after applying an interpolation
method to the lower spatial resolution bands) and define a global descriptor by
neglecting the importance scores of different local areas.
The rest of the chapter is organized as
[Figure 3.1 blocks, in order: Spatial and Spectral Content Characterization of Local Areas; Definition of a Multi-Attention Driven Global Descriptor; Classification of RS Image Scenes with Multi-Labels.]
FIGURE 3.1: Block diagram of the proposed approach for multi-label RS image scene classification.
follows. Section 3.2 introduces the proposed approach, while Section 3.3 explains
the design of experiments. Section 3.4 provides the experimental results. Section 3.5
draws the conclusion of this chapter.
3.2 Proposed Approach
Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an archive that consists of $M$ images, where $x_i$ is the $i$th image.
We assume that a set $\mathcal{T} \subset \mathcal{X}$ of labeled images is initially available. Each image in $\mathcal{T}$
is associated with multi-labels from a label set $\mathcal{L} = \{l_1, \ldots, l_S\}$, where $|\mathcal{L}| = S$.
Label information of $x_i \in \mathcal{T}$ is defined by a binary vector $y_i \in \{0,1\}^S$, where each
element of $y_i$ indicates the presence or absence of label $l_s \in \mathcal{L}$ in a sequence. We also
assume that spectral bands of each image $x_i$ can be associated with $K$ different spatial
resolutions, resulting in different pixel sizes. We aim to learn $F(x; \theta) = g(f(x; \theta))$
that maps a new image $x$ to multi-labels, where $f(\cdot)$ generates classification scores
for each label $l_s$, $g(\cdot)$ produces $y$ as a predicted label set and $\theta$ is the given
set of model parameters. We propose a multi-label RS image scene classification
approach made up of three main steps: 1) spatial and spectral characterization of
image local areas by a novel $K$-Branch CNN; 2) definition of a multi-attention driven
global descriptor with a novel multi-attention strategy; and 3) classification of RS
image scenes with multi-labels. Fig. 3.1 presents the block diagram of the proposed
approach and each step is explained in the following sub-sections.
3.2.1 Spatial and Spectral Characterization of Local Areas
To efficiently characterize the spatial and spectral content of image local areas, each
RS image is initially divided into $R$ non-overlapping $w \times w$ sized local areas. Let $\rho_i^r$
be the $r$th local area of $x_i$. Then, for each local area, we define different sets of image
bands based on their spatial resolutions. Let $\rho_{i,k}^r$ be the $k$th subset of the $r$th local area
for the corresponding spatial resolution, where $k \in \{1, 2, \ldots, K\}$ and $r \in \{1, 2, \ldots, R\}$.
To accurately describe the local areas with varying spatial resolutions, we introduce
a $K$-Branch CNN that utilizes separate CNNs, each of which is designed to describe
the local areas of image bands with different spatial resolutions. Thus, the number
[Figure 3.2 blocks: Division of Non-overlapping Local Areas; Group Bands for Each Spatial Resolution; one CNN per band group.]
FIGURE 3.2: The proposed $K$-Branch CNN introduced in the first step of the proposed approach. One local area is highlighted as an example to feed into the corresponding CNN.
$K$ of CNN branches is selected as the total number of different spatial resolutions.
If all spectral bands are associated with the same spatial resolution, the proposed
$K$-Branch CNN turns into a single-branch CNN (i.e., $K = 1$). Each subset $\rho_{i,k}^r$ is fed into
a different branch of the $K$-Branch CNN. Let $\phi_k$ be the $k$th branch that provides local
descriptors associated with the $k$th spatial resolution by applying convolutional layers
and a fully connected (FC) layer. Different local descriptors for all sets of image
bands are first characterized and then concatenated into one vector for one local area.
To effectively combine information from different branches, all concatenated feature
vectors are fed into a new FC layer to produce the local descriptors $\psi_{i,r}$. This step is
illustrated in Fig. 3.2.
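The division into local areas and the grouping of bands by spatial resolution can be sketched with NumPy as follows. This is only an illustration of the data layout that feeds the $K$ branches, not of the CNN branches themselves. The helper name `split_local_areas`, the random input data and the choice $w = 30$ are assumptions made for the example; the band grouping and patch sizes follow the Sentinel-2 setting described in Section 3.3.2.

```python
import numpy as np

def split_local_areas(band_stack, w):
    """Split a (C, H, W) band stack into non-overlapping w x w local areas.

    Returns an array of shape (R, C, w, w), ordered row by row.
    """
    c, h, wd = band_stack.shape
    if h % w or wd % w:
        raise ValueError("image size must be divisible by w (pad beforehand)")
    rows, cols = h // w, wd // w
    areas = band_stack.reshape(c, rows, w, cols, w)
    return areas.transpose(1, 3, 0, 2, 4).reshape(rows * cols, c, w, w)

# Hypothetical Sentinel-2 patch, grouped by spatial resolution (K = 3).
rng = np.random.default_rng(0)
groups = {
    "10m": rng.random((4, 120, 120)),  # bands 2, 3, 4, 8
    "20m": rng.random((6, 60, 60)),    # bands 5, 6, 7, 8A, 11, 12
    "60m": rng.random((2, 20, 20)),    # bands 1, 9
}

w = 30  # local area size on the 10 m grid (assumed value)
scale = {"10m": 1, "20m": 2, "60m": 6}  # pixel size relative to 10 m

# The same R local areas are cut from every band group, so that the r-th
# entry of each group covers the same ground area.
local_areas = {k: split_local_areas(v, w // scale[k]) for k, v in groups.items()}

for k, v in local_areas.items():
    print(k, v.shape)
```

Each entry of `local_areas` would then be passed to its own CNN branch, and the branch outputs for the same local area would be concatenated and fed to the shared FC layer.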
The proposed $K$-Branch CNN describes the complex information content of image
local areas through specific branches associated with different spatial resolutions. In
this way, a unique CNN is used for the image bands with the same spatial resolution,
unlike the traditional CNN-based methods in RS (which consider all the image
bands as a single volume after applying interpolation to the low spatial resolution
bands). On the one hand, this approach leads to an accurate characterization of the
content of high-dimensional RS images. On the other hand, since it models the local
areas, it requires a smaller number of model parameters to be estimated. Thus, the
computational complexity of the training phase is reduced, while the risk of over-fitting
on training data with low generalization capability is lowered (since smaller neural
networks have less tendency for over-fitting).
3.2.2 Definition of a Multi-Attention Driven Global Descriptor
After obtaining the local descriptors $\{\psi_{i,r}\}_{r=1}^{R}$ in the first step, a global descriptor
can be defined by simply stacking all local descriptors. In this way, local descriptors
equally contribute to the definition of a global descriptor. However, local areas of an
RS image can be subject to different levels (i.e., scores) of importance to represent the
FIGURE 3.3: Single LSTM cell with its inputs, gates and cell state followed by two LSTM cells in a sequence. Without loss of generality, a particular sequence of the LSTM network (which starts with the first local area and ends with the last local area) is chosen in the figure.
semantic content of the image. Accordingly, this step aims at accurately extracting
and exploiting importance levels of local areas of each image, while defining a global
image descriptor. To this end, we introduce a novel multi-attention strategy that is
defined based on long short-term memory (LSTM) networks [112].
An LSTM network contains sequentially ordered LSTM nodes (i.e., cells). Each
cell includes an input gate ($i$), a forget gate ($f$), an output gate ($o$) and a cell state ($c$). The cell
state characterizes the knowledge of observed inputs until the corresponding cell.
Different gates control how the cell state should behave according to different aims.
The forget gate decides which portion of the current cell state value should be forgotten.
The input gate controls which portion of the input should be read by the cell state. The output
gate decides which portion of the cell state should be produced as the output of the
new cell state. The reader is referred to [113] for the detailed explanation. In the
proposed approach, each LSTM cell takes the descriptor of the $r$th local area ($\psi_{i,r}$) from
the $K$-Branch CNN as input and employs the aforementioned operations as follows:
$$
\begin{aligned}
f_r &= \delta(W_{f,r}\psi_{i,r} + U_{f,r}h_{\tau} + b_{f,r}) \\
i_r &= \delta(W_{i,r}\psi_{i,r} + U_{i,r}h_{\tau} + b_{i,r}) \\
o_r &= \delta(W_{o,r}\psi_{i,r} + U_{o,r}h_{\tau} + b_{o,r}) \\
c_r &= f_r\,c_{\tau} + i_r\tanh(W_{c,r}\psi_{i,r} + U_{c,r}h_{\tau} + b_{c,r})
\end{aligned}
\qquad (3.1)
$$
where $\tanh$ and $\delta$ are the hyperbolic tangent and sigmoid functions, $W_{\cdot,r}$ and $b_{\cdot,r}$ are
the weight and bias parameters; and the subscript $r$ refers to the parameters of
the LSTM cell associated with the $r$th local area. All operations of one LSTM cell are
illustrated in Fig. 3.3. Each LSTM cell produces one preliminary attention score given
the sequence, $h_{r|\tau}$, based on the cell state and the gates as follows:
$$h_r = h_{r|\tau} = o_r\tanh(c_r). \qquad (3.2)$$
We utilize two LSTM networks in a bidirectional manner to consider the different
orders of local areas and thus all LSTM cells are placed in two different sequences
FIGURE 3.4: Proposed multi-attention strategy with bidirectional LSTM networks for the second step of the proposed approach.
with different parameters. Each cell of the first LSTM network produces the
preliminary attention score of one local area concerning the knowledge acquired from
the attention scores of previous local areas (i.e., previous cells). Thus, $\tau$ becomes
$r-1$ in (3.1). The second LSTM network employs the same idea by considering
the subsequent local areas and thus $\tau$ becomes $r+1$ in (3.1). In the context of
bidirectional LSTM networks, forward and backward sequences can be combined by
using the concatenation, the summation or the multiplication operations [114], [115].
The concatenation operation is widely used in the literature. However,
it requires a fully connected layer for the reduction of a vector into a single value,
which can significantly increase the computational complexity of the whole approach.
When the multiplication operation is used, the resulting value can be dominated by one
of the sequences, if the preliminary attention score is a negative value. Accordingly,
we select the summation operation for combining the sequences. To this end, after
obtaining two preliminary attention scores from the different orders, we compute the
final attention score $\alpha_{i,r}$ of the $r$th local area as follows:
$$\alpha_{i,r} = \delta\left(\frac{h_{r|r-1} + h_{r|r+1}}{2}\right). \qquad (3.3)$$
This produces an attention score for the $r$th local area within the range of $[0,1]$. For
the beginning of passes ($r = 1$ or $r = R$), $\tau$ refers to an initial state of the nodes.
Each attention score shows the importance level of the considered local area for the
complete characterization of the whole image content. Accordingly, multi-attention
scores $\{\alpha_{i,r}\}_{r=1}^{R}$ for the $i$th image $x_i$ show the different importance levels of the image
local areas. The proposed multi-attention strategy is illustrated in Fig. 3.4.
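To make the forward and backward passes concrete, the following NumPy sketch computes attention scores along the lines of (3.1)-(3.3) with a hidden size of one, so that each cell outputs a scalar preliminary score. It is a simplified illustration under stated assumptions: the parameters are shared across the cells of a pass (rather than indexed by $r$), the initial states are set to zero, and the weights are random. The names `lstm_pass`, `init_params` and `attention_scores` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(psi, params, reverse=False):
    """Run one LSTM (hidden size 1) over local descriptors psi of shape (R, d).

    Returns the preliminary attention score h_r of every local area.
    """
    order = range(len(psi) - 1, -1, -1) if reverse else range(len(psi))
    h_prev, c_prev = 0.0, 0.0  # assumed initial state of the pass
    scores = np.zeros(len(psi))
    for r in order:
        pre = {}
        for g in ("f", "i", "o", "c"):
            W, U, b = params[g]
            pre[g] = W @ psi[r] + U * h_prev + b
        f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
        c_prev = f * c_prev + i * np.tanh(pre["c"])  # cell state, Eq. (3.1)
        h_prev = o * np.tanh(c_prev)                 # preliminary score, Eq. (3.2)
        scores[r] = h_prev
    return scores

def init_params(d, rng):
    # One (W, U, b) triple per gate; shared across cells for brevity.
    return {g: (rng.normal(size=d), rng.normal(), rng.normal()) for g in "fioc"}

def attention_scores(psi, fwd_params, bwd_params):
    h_fwd = lstm_pass(psi, fwd_params)                # tau = r - 1
    h_bwd = lstm_pass(psi, bwd_params, reverse=True)  # tau = r + 1
    return sigmoid((h_fwd + h_bwd) / 2.0)             # final score, Eq. (3.3)

rng = np.random.default_rng(0)
psi = rng.normal(size=(16, 8))  # R = 16 local descriptors of size 8 (made up)
alpha = attention_scores(psi, init_params(8, rng), init_params(8, rng))
```

Because each preliminary score lies in $(-1, 1)$, averaging the two passes before the sigmoid keeps both directions on the same scale, which matches the motivation given for the summation operation.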
After obtaining the multi-attention scores, the multi-attention driven global descriptor of $x_i$ is defined by the concatenation of
[Figure 3.5 panels: (a) $K$-Branch CNN; (b) bidirectional LSTM cells followed by the concatenation of weighted local descriptors; (c) predicted multi-labels, e.g., Coniferous forest, Mixed forest, Transitional woodland.]
FIGURE 3.5: Detailed illustration of the three main steps of the proposed approach: (a) spatial and spectral characterization of local areas; (b) definition of a multi-attention driven global descriptor; (c) RS image scene classification with multi-labels.
local descriptors weighted by attention scores as follows:
$$[\alpha_{i,1}\psi_{i,1}, \ldots, \alpha_{i,R}\psi_{i,R}]. \qquad (3.4)$$
With this step, the proposed approach extracts and exploits the importance scores
of local areas of each image instead of considering them equally.
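In array form, the weighted concatenation of (3.4) amounts to scaling each local descriptor by its attention score and flattening the result. The following NumPy lines are a minimal sketch; the function name `global_descriptor` is an assumption made for the example.

```python
import numpy as np

def global_descriptor(psi, alpha):
    """Concatenate local descriptors psi (R, d) weighted by scores alpha (R,)."""
    psi = np.asarray(psi, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Broadcasting multiplies the r-th descriptor by the r-th score;
    # reshape(-1) concatenates the R weighted descriptors into one vector.
    return (alpha[:, None] * psi).reshape(-1)
```

A local area with a score close to 0 therefore contributes almost nothing to the global descriptor, while the descriptor length stays fixed at $R$ times the local descriptor size.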
3.2.3 Classification of RS Image Scenes with Multi-Labels
This step aims to classify RS images into multi-labels by using the multi-attention
driven global descriptor obtained in the second step of the proposed approach.
To this end, we employ an FC layer $f(\cdot)$ as a classifier that generates class scores $z_{l_j}$
for each class label $l_j$ in the sequence based on the global descriptor. Then, we
obtain the class posterior probability of $l_j$ for the image $x_i$ with the sigmoid function
as: $P(l_j|x_i) = 1/(1 + e^{-z_{l_j}})$. After characterizing the class posterior probabilities, we
define the overall loss of the approach as the cross-entropy loss throughout all labels
and images as follows:
$$-\sum_{x_i \in \mathcal{T}} \sum_{j=1}^{S} \Big( [l_j \in y_i]\log(P(l_j|x_i)) + (1 - [l_j \in y_i])\log(1 - P(l_j|x_i)) \Big) \qquad (3.5)$$
where $[l_j \in y_i]$ is the Iverson bracket, which equals 1 if $l_j$ is one of the true
multi-labels of $x_i$, and 0 otherwise. After end-to-end training of the entire neural
network by minimizing the cross-entropy loss, the parameters $\theta$ of the function $F$ (i.e.,
model parameters of the approach) can be learned. Accordingly, our model becomes
capable of producing the posterior probabilities of multi-labels to be assigned to a
new RS image scene $x$. Then, the proposed approach predicts the multi-labels by
thresholding the probability values. Each step of the proposed approach is illustrated
in Fig. 3.5.
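The sigmoid posterior, the cross-entropy loss and the thresholding step can be written compactly in NumPy. The sketch below assumes that the class scores of the FC layer are already available as a matrix; the function names, the clipping constant `eps` and the threshold value of 0.5 are illustrative assumptions rather than settings reported in this chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(z, y, eps=1e-12):
    """Sum of binary cross-entropy terms over all images and labels.

    z: (N, S) class scores from the FC layer; y: (N, S) binary label matrix.
    """
    p = np.clip(sigmoid(z), eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict_labels(z, threshold=0.5):
    """Assign every label whose posterior probability reaches the threshold."""
    return (sigmoid(z) >= threshold).astype(int)
```

Since every label has its own sigmoid, any number of labels can be assigned to an image, in contrast to a softmax layer, which forces the class probabilities to compete.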
3.3 Dataset Description and Experimental Design
3.3.1 Dataset Description
We conducted all experiments on BigEarthNet-S2, while we utilized multi-labels
based on the CLC class nomenclature. In the experiments, 70,987 image patches
that are fully covered by seasonal snow, cloud and cloud shadow were not used.
To the best of our knowledge, BigEarthNet-S2 is the only archive in RS that includes
Sentinel-2 multispectral images, each of which is annotated with multi-labels. Thus,
it is the only archive used in the experiments in this chapter. The other benchmark archives,
e.g., DFC15 [35] and UC-Merced archives [93], consist of a very small number of RS
images that are annotated with multi-labels and contain only RGB bands. Thus, they
are not adequate to evaluate the proposed approach and are not considered in this
chapter.
The number of image patches associated with each BigEarthNet-S2 class varies
significantly in the archive. To construct the training set (which is used for training the
considered neural networks), the validation set (which is used for selecting
hyperparameters) and the test set (which is used for accuracy assessment), one could apply random
sampling. However, when images with multi-labels are considered, this approach
has a risk that randomly selected images may not represent all classes present in
the whole archive. There are also other approaches to divide a dataset into training,
validation and test sets; however, they are designed for images annotated by
single-labels and thus are not suitable for multi-label applications [116]. Therefore, we
develop an algorithm to represent each BigEarthNet-S2 class with a sufficient number
of images in the training, validation and test sets based on the label frequencies. The
algorithm starts by including all images in the training set. Let $c_{l_m} \in \mathbb{N}$ be the
number of images associated with the label $l_m$ in the training set, where $m \in \{1, \ldots, S\}$.
We define the frequency $\gamma_{l_m}$ of the label $l_m$ in the training set as follows:
$$\gamma_{l_m} = \frac{c_{l_m}}{\sum_{m=1}^{S} c_{l_m}}. \qquad (3.6)$$
Then, we define the cost of moving an image and its set of multi-labels from the
training set to either the validation or the test set as follows:
$$C_{x_i, y_i} = \sum_{m=1}^{S} \frac{\gamma'_{l_m} - \frac{1}{S}}{\gamma_{l_m}} \qquad (3.7)$$
where $\gamma'_{l_m}$ indicates the new frequency of the label $l_m$ after images are moved from
the training set to the validation or test sets. The algorithm first sorts the label list in
decreasing order based on the number of images associated with each class. Then, from
the sorted list, the images with the decreasing cost values associated with each class are
randomly selected and moved either to the validation set or to the test set. Since the
algorithm starts to operate on the images associated with the majority classes, most of
the images will be moved from the training set at the beginning. However, the cost
value will reach a stationary point when it operates on the images associated with
the minority classes. Application of this algorithm to BigEarthNet-S2 results in a
validation set of 198,762 images, a test set of 203,269 images, and a training set of
117,308 images. The algorithm is summarized in Algorithm 1.
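Algorithm 1 Our algorithm for the selection of training, validation and test sets.
Input: $X = \{x_1, \ldots, x_M\}$, $L = \{l_1, \ldots, l_S\}$, $Y = \{y_1, \ldots, y_M\}$
Assumption: $L$ is sorted in decreasing order based on the number of images associated
with each class.
function LABELFREQ($T$, $l_m$)
    $c_{l_m} \leftarrow |\{(x_i, y_i) \mid (x_i, y_i) \in T,\ y_{i,m} = 1\}|$
    $\gamma_{l_m} \leftarrow c_{l_m} / (\sum_{m=1}^{S} c_{l_m})$
    return $\gamma_{l_m}$
end function
function COST($T$, $(x_i, y_i)$, $S$, $\Gamma_L$)
    sum $\leftarrow 0$
    for $m \leftarrow 1$ to $S$ do
        $\gamma_{l_m} \in \Gamma_L$
        $\gamma'_{l_m} \leftarrow$ LABELFREQ($T \setminus (x_i, y_i)$, $l_m$)
        sum $\leftarrow$ sum $+ (\gamma'_{l_m} - \frac{1}{S}) / \gamma_{l_m}$
    end for
    return sum
end function
$T = \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y\}$    ▷ Initial training set.
$V = \emptyset$    ▷ Initial validation set.
$E = \emptyset$    ▷ Initial test set.
$S \leftarrow |L|$
state $\leftarrow$ COST($T$, $\emptyset$, $S$)
$\Gamma_L \leftarrow \bigcup_{m=1}^{S}$ LABELFREQ($T$, $l_m$)    ▷ Initial frequencies.
for $m \leftarrow 1$ to $S$ do
    for all $i$ such that $y_{i,m} = 1$ do
        if COST($T$, $(x_i, y_i)$, $S$, $\Gamma_L$) $<$ state then
            $T \leftarrow T \setminus (x_i, y_i)$
            $(V \leftarrow V + (x_i, y_i)) \lor (E \leftarrow E + (x_i, y_i))$
            state $\leftarrow$ COST($T$, $(x_i, y_i)$, $S$, $\Gamma_L$)
        end if
    end for
end for
return $T$, $V$, $E$    ▷ Resulting sets.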
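As a rough illustration, the procedure of Algorithm 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function names (`label_freq`, `cost`, `split`), the 50/50 random choice between validation and test, the guard against emptying the training set, and the requirement that every label occurs at least once are all assumptions introduced here. The cost is transcribed literally from Eq. (3.7), and label counts are updated incrementally instead of being recomputed from the set $T$.

```python
import numpy as np

def label_freq(counts):
    """Eq. (3.6): share of each label among all label occurrences."""
    return counts / counts.sum()

def cost(counts, y, gamma0):
    """Eq. (3.7), transcribed literally: cost after removing an image with labels y."""
    remaining = counts - y
    if remaining.sum() == 0:
        return float("inf")          # assumption: never empty the training set
    new_freq = label_freq(remaining)  # gamma' after the move
    return float(np.sum((new_freq - 1.0 / len(counts)) / gamma0))

def split(Y, seed=0):
    """Y: (M, S) binary label matrix; assumes every label occurs at least once."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    counts = Y.sum(axis=0)            # c_{l_m} with all images in training
    gamma0 = label_freq(counts)       # initial frequencies (Gamma_L)
    state = cost(counts, np.zeros(Y.shape[1]), gamma0)
    in_train = np.ones(len(Y), dtype=bool)
    val, test = [], []
    for m in np.argsort(-counts, kind="stable"):   # majority classes first
        for i in np.flatnonzero(Y[:, m] == 1):
            if not in_train[i]:
                continue              # already moved via another label
            c = cost(counts, Y[i], gamma0)
            if c < state:             # the move lowers the cost
                in_train[i] = False
                counts = counts - Y[i]
                # assumption: validation and test are chosen with equal probability
                (val if rng.random() < 0.5 else test).append(int(i))
                state = c
    train = [int(i) for i in np.flatnonzero(in_train)]
    return train, val, test
```

Calling `split(Y)` on an $M \times S$ binary label matrix returns three disjoint index lists that together cover all images.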
3.3.2 Experimental Design
After the selection of training, validation and test sets, we divided each image into
non-overlapping local areas. Then, we employed a three-branch CNN (i.e., $K = 3$ for the
$K$-Branch CNN) due to the three different spatial resolutions of Sentinel-2.
Accordingly, for each local area, we split the bands into three subsets. Then, we
stacked the bands of each subset to obtain a single volume for each CNN branch. In
detail, bands 2 to 4 and 8 (which have 10m spatial resolution) were fed into the
first branch, bands 5 to 7, 8A, 11 and 12 (which have 20m spatial resolution)
were fed into the second branch, and the remaining bands 1 and 9 (which have 60m
spatial resolution) were fed into the third branch. We selected the number of local
areas and all other hyperparameters with respect to the classification performance
on the validation set. To select the local area size $w \times w$, $w$ was tested within the range
$[18, 60]$ with a step size of 6. It is worth noting that, for local area sizes that do not
evenly divide the image size ($120 \times 120$ for 10m bands, $60 \times 60$ for 20m bands,
$20 \times 20$ for 60m bands), we applied zero padding to the image borders. Although the
same number of convolutional layers was used for all branches, the number of filters,
the pooling strategy and the filter sizes vary among branches. It is
worth noting that the number of convolutional layers in all branches can be increased
to a large extent to achieve deeper models. However, this would also increase the
number of model parameters and thus the computational complexity. Accordingly,
three convolutional layers were used for all branches. For the first branch, 32 filters
of size $5 \times 5$, 32 filters of size $5 \times 5$ and 64 filters of size $3 \times 3$ were selected
for the three layers, respectively. For the second branch, the same numbers of filters were
used, while $3 \times 3$ filters were employed in each layer. For the third branch, 32 filters
of size $2 \times 2$ were used in each layer. We used a stride of 1 and zero
padding in all convolutional layers to preserve the spatial dimensionality and to avoid
losing information. In addition, max-pooling was used in the first two branches to
provide partial translation invariance [117]; it was not used in the last branch to
avoid further decreasing the spatial resolution. For the LSTM networks, we used a
128-dimensional memory.
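The division into local areas with zero padding can be sketched as follows. This is only an illustrative NumPy sketch: the function name `split_into_local_areas` is hypothetical, and placing the padding at the bottom and right borders is an assumption (the text only states that zero padding is applied to the image borders).

```python
import numpy as np

def split_into_local_areas(image, w):
    """Zero-pad an (H, W, C) array so that H and W become multiples of w,
    then cut it into non-overlapping w x w blocks of shape (n_blocks, w, w, C)."""
    h, wd, c = image.shape
    pad_h = (-h) % w                   # rows missing to reach a multiple of w
    pad_w = (-wd) % w                  # columns missing to reach a multiple of w
    # assumption: padding is added at the bottom and right borders
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    rows, cols = padded.shape[0] // w, padded.shape[1] // w
    blocks = padded.reshape(rows, w, cols, w, c).swapaxes(1, 2)
    return blocks.reshape(rows * cols, w, w, c)

# 10m branch: bands 2, 3, 4 and 8 stacked into a 120 x 120 x 4 volume
bands_10m = np.zeros((120, 120, 4))
print(split_into_local_areas(bands_10m, 30).shape)  # 16 local areas of 30 x 30
print(split_into_local_areas(bands_10m, 48).shape)  # padded to 144 x 144, 9 local areas
```

With $w = 30$ for the 10m bands, the corresponding sizes for the 20m and 60m bands ($15 \times 15$ and $5 \times 5$) yield the same $4 \times 4$ grid of local areas in every branch.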
We jointly trained all CNN branches, FC layers and LSTM networks (i.e., end-to-end
learning of all steps was applied simultaneously). We used the Adam
variant of stochastic gradient descent [101] with an initial learning rate of $10^{-3}$ to
decrease the sigmoid cross entropy loss, which aims at maximizing the log-likelihood
of the multi-labels in the training set. For the initialization of the neural network weights,
we used the Xavier method [118] to keep the variance of the weights similar among all
layers. We selected an L2-regularization weight of $2 \times 10^{-5}$ to regularize the
weights layer-wise. A dropping out probability of 20% was chosen for Dropout regularization [119]
to avoid over-fitting of the proposed approach on the training set. In addition,
we used Batch Normalization [120] to decrease the effect of different spectral
band statistics.
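For reference, the sigmoid cross entropy loss for multi-label targets can be sketched in NumPy as below. This is a generic, numerically stable formulation and not code from the thesis; the function name and the choice to average over all images and labels are assumptions.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))] over all entries."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Stable rewrite: max(z, 0) - z*y + log(1 + exp(-|z|)) avoids overflow in exp
    per_label = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_label.mean())
```

Minimizing this quantity is equivalent to maximizing the log-likelihood of the multi-labels under independent per-class Bernoulli outputs.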
In the experiments, we compared the proposed approach with: 1) the Very Deep
Convolutional Networks (i.e., VGG networks) [121]; 2) the Deep Residual Nets (i.e.,
ResNet networks) [95]; and 3) the Class-Wise Attention-Based Convolutional and
Bidirectional LSTM Network [35] (denoted as CA-LSTM). For the VGG networks, we
selected the 16-layer (VGG16) and 19-layer (VGG19) versions. At depths similar
to those of the VGG networks, we selected the 18-layer (ResNet18) and 34-layer (ResNet34)
versions of the ResNet networks. These are widely used CNNs for image classification
problems in the CV literature. We used the same parameters presented
in [121] and [95] for the VGG networks and the ResNet networks, respectively, except
for the learning rates. CA-LSTM is one of the few DL-based approaches
proposed for the multi-label RS image scene classification task. For the CA-LSTM,
we used the same feature extraction module (which is ResNet50 [95]), the same LSTM
network (a bidirectional LSTM network with 2048-dimensional memory) and the same
parameters presented in [35], except for the learning rate.
We also evaluated the different steps of the proposed approach. To assess the effectiveness
of the first step of the proposed approach (that is the $K$-Branch CNN),
we compared it with different single branch CNN approaches. To this end, we
initially applied cubic interpolation to the 20m and 60m bands and stacked all bands
into one volume. Then, three different approaches were considered as follows: 1) a
single branch CNN that considers all the image bands as input and operates on
the whole images (denoted as SiB-CNN); 2) a single branch CNN that considers
all the image bands as input and operates on the local areas of images (denoted
as L-SiB-CNN); and 3) a single branch CNN that considers only the RGB image bands
as input and operates on the whole images (denoted as SiB-CNN$_{RGB}$). For these
approaches, the architecture of the first branch of the proposed $K$-Branch CNN is
used. To evaluate the effectiveness of the second step of the proposed approach
(that is the multi-attention strategy), we compared the results with those obtained
without using the multi-attention strategy (i.e., only the first step is used). For all
the experiments, we used the same training procedure from scratch with the same
number of epochs, learning rate and number of mini-batches to compare the different
approaches under the same setting. We performed our experiments on a cluster of 4
NVIDIA Tesla V100 GPUs.
Performance evaluation of any multi-label classification approach requires analyzing
several factors rather than only counting the number of correct predictions, and
thus needs a much more complex analysis than the single-label case [122].
Accordingly, we used different classification-based and ranking-based metrics
with varying characteristics to evaluate the accuracy of the proposed
approach. Classification-based metrics consider the list of predicted classes, whereas
ranking-based metrics focus on the ordered list of probabilities for all classes.
Under the category of classification-based metrics, the results of the experiments are
provided in terms of three performance metrics: 1) Recall ($R$); 2) $F^2$-Score ($F^2$); and 3)
Hamming loss ($HL$). Classification-based metrics can be calculated by: i) giving
equal importance to each sample of the test set (sample averaging); ii) giving equal
importance to each class (macro averaging); and iii) comparing the overall test set
with the ground reference without giving particular importance to either
individual samples or individual classes (micro averaging).
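Let $TP_{ij}$, $FP_{ij}$, $FN_{ij}$ and $TN_{ij}$ indicate the conditions of true positive, false positive,
false negative and true negative, respectively, for the $i$th image and $j$th label ($l_j$),
where each of them takes the value 0 or 1 and $TP_{ij} + FP_{ij} + FN_{ij} + TN_{ij} = 1$ holds. The recall
is expressed by the different averaging methods as follows:
$$R_{smpl} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{S} TP_{ij}}{\sum_{j=1}^{S} \left(TP_{ij} + FN_{ij}\right)} \tag{3.8}$$
$$R_{macr} = \frac{1}{S} \sum_{j=1}^{S} \frac{\sum_{i=1}^{M} TP_{ij}}{\sum_{i=1}^{M} \left(TP_{ij} + FN_{ij}\right)} \tag{3.9}$$
$$R_{micr} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{S} TP_{ij}}{\sum_{i=1}^{M} \sum_{j=1}^{S} \left(TP_{ij} + FN_{ij}\right)}. \tag{3.10}$$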
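The $F^2$-Score is the weighted harmonic mean of precision and recall (with recall weighted
more heavily than precision), computed between the considered ground reference and the
multi-label predictions. Thus, it is expressed by the different averaging techniques as
follows [123]:
$$F^2_{smpl} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{S} 5\,TP_{ij}}{\sum_{j=1}^{S} \left(5\,TP_{ij} + 4\,FN_{ij} + FP_{ij}\right)} \tag{3.11}$$
$$F^2_{macr} = \frac{1}{S} \sum_{j=1}^{S} \frac{\sum_{i=1}^{M} 5\,TP_{ij}}{\sum_{i=1}^{M} \left(5\,TP_{ij} + 4\,FN_{ij} + FP_{ij}\right)} \tag{3.12}$$
$$F^2_{micr} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{S} 5\,TP_{ij}}{\sum_{i=1}^{M} \sum_{j=1}^{S} \left(5\,TP_{ij} + 4\,FN_{ij} + FP_{ij}\right)}. \tag{3.13}$$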
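The three averaging schemes of Eqs. (3.8)-(3.13) differ only in the axis over which the counts are summed before the ratio is taken. The following NumPy sketch illustrates this; the function names and the convention of scoring a 0/0 ratio as 0 are assumptions made here, not details given in the text.

```python
import numpy as np

def _counts(y_true, y_pred):
    """Per-image, per-label TP, FP and FN indicators as float arrays."""
    t = np.asarray(y_true, dtype=bool)
    p = np.asarray(y_pred, dtype=bool)
    return (t & p).astype(float), (~t & p).astype(float), (t & ~p).astype(float)

def _ratio(num, den):
    """Element-wise num/den, with 0/0 scored as 0 (assumption)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# sample: sum over labels (axis 1); macro: sum over images (axis 0); micro: sum over both
_AXIS = {"sample": 1, "macro": 0, "micro": None}

def recall(y_true, y_pred, average):
    tp, fp, fn = _counts(y_true, y_pred)
    ax = _AXIS[average]
    return float(np.mean(_ratio(tp.sum(axis=ax), (tp + fn).sum(axis=ax))))

def f2_score(y_true, y_pred, average):
    tp, fp, fn = _counts(y_true, y_pred)
    ax = _AXIS[average]
    return float(np.mean(_ratio(5 * tp.sum(axis=ax),
                                (5 * tp + 4 * fn + fp).sum(axis=ax))))
```

For example, with ground reference `[[1, 0, 1], [0, 1, 0]]` and predictions `[[1, 0, 0], [0, 1, 1]]`, the sample-averaged recall is 0.75, while the macro- and micro-averaged recalls are both 2/3.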
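The Hamming loss is the average Hamming distance between the ground reference
labels and the predicted multi-labels. Thus, it is defined as follows [124]:
$$HL = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{S} \sum_{j=1}^{S} \left[\, l_j \in y_i \oplus l_j \in y_i^{*} \,\right] \tag{3.14}$$
where $\oplus$ is the XOR logical operation and $y_i^{*}$ denotes the set of multi-labels
predicted for the $i$th image.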
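On binary label matrices, Eq. (3.14) reduces to the mean of an element-wise XOR. A minimal NumPy sketch (the function name is hypothetical) is:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of (image, label) pairs on which reference and prediction disagree."""
    t = np.asarray(y_true, dtype=bool)
    p = np.asarray(y_pred, dtype=bool)
    return float(np.mean(np.logical_xor(t, p)))
```

Because every label of every image contributes equally, a classifier that predicts no label at all can still reach a low Hamming loss when labels are sparse, which is one reason to report it alongside the recall and the $F^2$-Score.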
Under the category of ranking-based metrics, the results of the experiments are provided
in terms of four performance evaluation metrics: 1) Ranking loss ($RL$); 2) One error
($OE$); 3) Coverage ($COV$); and 4) Label ranking average precision ($LRAP$). All the
ranking-based metrics are defined with respect to the rank of the $j$th label in the
class probabilities produced by a multi-label classification approach for the $i$th image,
which is defined as $\mathrm{rank}_{ij} = |\{k : P(l_k | x_i) \geq P(l_j | x_i)\}|$. Unlike the classification-based
metrics, ranking-based metrics are calculated only by giving equal importance to
each sample of the test set.
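Accordingly, the ranking loss is the rate of wrongly ordered label pairs (i.e., pairs in which
the probability of a label that is irrelevant to the image is higher than that of a ground
reference label), and is thus expressed as follows [125]:
$$RL = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{|y_i| \left(S - |y_i|\right)} \sum_{l_j \in y_i} \sum_{l_k \notin y_i} \left[\mathrm{rank}_{ik} \leq \mathrm{rank}_{ij}\right]. \tag{3.15}$$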
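The one error is the rate of test images for which the label with the highest ranking
(i.e., the smallest rank value) is not in the ground reference, and is thus defined as follows [122]:
$$OE = \frac{1}{M} \sum_{i=1}^{M} \left[\operatorname*{arg\,min}_{l_j} \mathrm{rank}_{ij} \notin y_i\right]. \tag{3.16}$$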
The coverage calculates the average number of labels that must be included in the
prediction list of a multi-label classifier such that all ground reference labels are
predicted. Accordingly, it is defined as follows [125]:
$$COV = \frac{1}{M} \sum_{i=1}^{M} \max_{l_j \in y_i} \mathrm{rank}_{ij}. \tag{3.17}$$
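TABLE 3.1: MULTI-LABEL CLASSIFICATION ACCURACIES AND THE NUMBER OF REQUIRED
MODEL PARAMETERS (NP) WHEN USING LOCAL AREAS WITH DIFFERENT SIZES FOR THE
PROPOSED APPROACH.
The first three columns give the local area size ($w \times w$) per band resolution. $R$, $F^2$ and $HL$ are classification-based metrics (%); $RL$, $OE$, $COV$ and $LRAP$ are ranking-based metrics.

| 10m | 20m | 60m | $R_{macr}$ | $R_{smpl}$ | $R_{micr}$ | $F^2_{macr}$ | $F^2_{smpl}$ | $F^2_{micr}$ | $HL$ | $RL$ (%) | $OE$ (%) | $COV$ | $LRAP$ (%) | NP ($\times 10^6$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18×18 | 9×9 | 3×3 | 51.0 | 68.1 | 61.7 | 46.9 | 68.9 | 64.2 | 4.1 | 2.7 | 6.5 | 5.8 | 85.3 | 0.71 |
| 24×24 | 12×12 | 4×4 | 50.0 | 66.2 | 59.2 | 46.0 | 67.4 | 62.2 | 4.1 | 2.8 | 6.8 | 5.9 | 85.1 | 0.93 |
| 30×30 | 15×15 | 5×5 | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 | 1.13 |
| 36×36 | 18×18 | 6×6 | 54.4 | 70.7 | 65.0 | 47.8 | 71.1 | 66.5 | 4.1 | 2.6 | 6.2 | 5.7 | 85.7 | 1.70 |
| 42×42 | 21×21 | 7×7 | 53.3 | 70.7 | 64.8 | 46.2 | 71.0 | 66.6 | 4.1 | 2.6 | 6.5 | 5.8 | 85.6 | 2.03 |
| 48×48 | 24×24 | 8×8 | 54.6 | 72.5 | 67.1 | 48.0 | 72.2 | 68.2 | 4.1 | 2.6 | 6.5 | 5.8 | 85.3 | 2.81 |
| 54×54 | 27×27 | 9×9 | 54.2 | 72.3 | 67.1 | 48.1 | 72.2 | 68.4 | 4.1 | 2.6 | 6.3 | 5.7 | 85.6 | 3.29 |
| 60×60 | 30×30 | 10×10 | 54.1 | 72.4 | 66.8 | 46.7 | 71.8 | 67.6 | 4.3 | 2.9 | 6.9 | 6.1 | 84.4 | 4.26 |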
For each ground reference label, the label ranking average precision calculates the
rate of higher-ranked ground reference labels. This is expressed as follows [122]:
$$LRAP = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{|y_i|} \sum_{l_j \in y_i} \frac{|\{l_k : \mathrm{rank}_{ik} \leq \mathrm{rank}_{ij},\ l_k \in y_i\}|}{\mathrm{rank}_{ij}}. \tag{3.18}$$
It is worth noting that, for any multi-label classifier, $LRAP$ provides scores strictly
greater than 0, unlike the other metrics [122]. Thus, small differences in the score of
this metric can be more informative compared to other metrics (e.g., recall). Smaller
values of the Hamming loss, ranking loss, one error and coverage indicate better
performance of an approach, whereas higher values of the recall, $F^2$-Score and
label ranking average precision are associated with better performance.
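The four ranking-based metrics can all be derived from the rank matrix $\mathrm{rank}_{ij}$. The NumPy sketch below illustrates one way to compute them; it is not the evaluation code of the thesis. The function names are hypothetical, ties in the ranking loss are counted as wrongly ordered pairs, the top-ranked label is taken as the one with the smallest rank value, and every image is assumed to have at least one relevant and one irrelevant label.

```python
import numpy as np

def ranks(scores):
    """rank_ij = |{k : P(l_k|x_i) >= P(l_j|x_i)}|, so rank 1 is the most probable label."""
    s = np.asarray(scores, dtype=float)
    # result[i, j, k] is True when label k scores at least as high as label j
    return (s[:, None, :] >= s[:, :, None]).sum(axis=2)

def ranking_metrics(y_true, scores):
    y = np.asarray(y_true, dtype=bool)
    r = ranks(scores)
    rl = oe = cov = lrap = 0.0
    for yi, ri in zip(y, r):
        pos, neg = ri[yi], ri[~yi]          # ranks of relevant / irrelevant labels
        # wrongly ordered pairs: an irrelevant label ranked at or above a relevant one
        rl += np.sum(neg[None, :] <= pos[:, None]) / (len(pos) * len(neg))
        oe += float(not yi[np.argmin(ri)])  # top-ranked label is irrelevant
        cov += float(pos.max())             # depth needed to cover all relevant labels
        lrap += float(np.mean([(pos <= rj).sum() / rj for rj in pos]))
    m = len(y)
    return {"RL": rl / m, "OE": oe / m, "COV": cov / m, "LRAP": lrap / m}
```

As a small worked case, for ground reference `[[1, 0, 1], [0, 1, 0]]` and class probabilities `[[0.9, 0.2, 0.6], [0.1, 0.7, 0.8]]`, the first image is ranked perfectly while the second has its only relevant label at rank 2, giving $RL = 0.25$, $OE = 0.5$, $COV = 2$ and $LRAP = 0.75$.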
3.4 Experimental Results
We carried out different kinds of experiments in order to: 1) perform a sensitivity
analysis with respect to different parameter settings and strategies; and 2) compare
the effectiveness of the proposed approach with the widely used deep CNNs and
one recent multi-label RS image scene classification approach [35].
3.4.1 Sensitivity Analysis of the Proposed Approach
In this section, we performed the sensitivity analysis of the proposed approach under
different parameter settings and strategies.
In the first set of trials, we analyzed the effect of utilizing local areas with different
sizes in terms of the multi-label classification accuracy and computational complexity.
Table 3.1 shows the results with the required number of parameters under different
sizes of local areas. By analyzing the table, one can see that the computational
complexity highly depends on the local area size $w \times w$. This is due to the fact
that enlarging the local areas increases the number of parameters to be learned.
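TABLE 3.2: RESULTS OBTAINED BY THE SIB-CNN$_{RGB}$, THE SIB-CNN, THE L-SIB-CNN AND
THE PROPOSED K-BRANCH CNN.
$R$, $F^2$ and $HL$ are classification-based metrics (%); $RL$, $OE$, $COV$ and $LRAP$ are ranking-based metrics.

| Method | $R_{macr}$ | $R_{smpl}$ | $R_{micr}$ | $F^2_{macr}$ | $F^2_{smpl}$ | $F^2_{micr}$ | $HL$ | $RL$ (%) | $OE$ (%) | $COV$ | $LRAP$ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SiB-CNN$_{RGB}$ | 33.6 | 53.7 | 45.6 | 35.1 | 56.0 | 49.8 | 4.8 | 4.1 | 13.5 | 7.1 | 80.0 |
| SiB-CNN | 39.1 | 60.5 | 52.8 | 40.9 | 62.4 | 56.7 | 4.4 | 3.4 | 9.6 | 6.5 | 83.0 |
| L-SiB-CNN | 44.0 | 65.7 | 58.8 | 41.2 | 66.2 | 62.9 | 4.1 | 2.8 | 7.4 | 5.9 | 84.8 |
| Proposed K-Branch CNN | 46.8 | 64.7 | 57.7 | 44.6 | 66.3 | 61.0 | 4.1 | 2.6 | 6.3 | 5.7 | 85.4 |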
As an example, using $18 \times 18$ local areas reduces the number of parameters by a
factor of six (0.71 vs. 4.26 million) compared to using $60 \times 60$ local areas.
From Table 3.1, one can also observe that the accuracies obtained with
different local area sizes are similar to each other under most of the metrics. As
an example, using $60 \times 60$ local areas provides almost the same $F^2_{macr}$ score
as using $30 \times 30$ local areas. In a few cases, there are
noticeable differences in the results. As an example, using
$48 \times 48$ local areas results in more than 7% higher $R_{micr}$ compared to using
$24 \times 24$ local areas. This is due to the fact that a smaller window size may
reduce the capability of describing the spatial information content. All these results
show that the selection of the local area size within a proper range does not significantly
affect the classification accuracy of the proposed approach, but considerably changes
the computational complexity. Accordingly, for the rest of the experiments, we used
$30 \times 30$ local areas for the 10m resolution bands, since this size provides the best values in
the ranking-based metrics and Hamming loss with a significantly reduced number of
parameters (less than half of those required for the $48 \times 48$, $54 \times 54$ and $60 \times 60$
local area sizes).
In the second set of trials, we analyzed the effect of the first step of the proposed
approach on the multi-label classification accuracy. To this end, we compared the
results of the $K$-Branch CNN (which is introduced in the first step) with those
obtained by the SiB-CNN$_{RGB}$ (which is a single branch CNN that considers RGB bands),
SiB-CNN (which is a single branch CNN that considers all bands) and L-SiB-CNN
(which is a single branch CNN that considers all bands and operates on the image
local areas). Table 3.2 shows the multi-label classification accuracies under different
metrics. From this table, one can observe that the proposed $K$-Branch CNN
provides the best scores under most of the metrics. As an example, the proposed
$K$-Branch CNN provides more than 9%, almost 4% and more than 3% higher $F^2_{macr}$
scores compared to the SiB-CNN$_{RGB}$, SiB-CNN and L-SiB-CNN, respectively. In
greater detail, the SiB-CNN provides more than 6% higher $F^2_{smpl}$ score and achieves a
reduction of about 4% in one error compared to the SiB-CNN$_{RGB}$. This shows that
using the spectral bands associated with 20m and 60m spatial resolutions improves the
multi-label classification accuracy. Moreover, the L-SiB-CNN provides almost 4%
higher $F^2_{smpl}$ score and achieves a reduction of more than 9% in coverage compared to
the SiB-CNN. This indicates that exploiting local areas of images also improves the
multi-label classification accuracy. In addition, the proposed $K$-Branch CNN leads to
a reduction of about 7% in Hamming loss and more than 7% higher $R_{macr}$ compared
to the SiB-CNN. All these results show that the $K$-Branch CNN characterizes
the spectral content of RS images much more accurately by utilizing all spectral bands with
different spatial resolutions in a branch-wise CNN architecture compared to single
branch CNN approaches (which require interpolation of the lower resolution
bands).

TABLE 3.3: MULTI-LABEL CLASSIFICATION ACCURACIES OBTAINED BY USING DIFFERENT
STEPS OF THE PROPOSED APPROACH.
The first three columns indicate which steps of the proposed approach are used. $R$, $F^2$ and $HL$ are classification-based metrics (%); $RL$, $OE$, $COV$ and $LRAP$ are ranking-based metrics.

| 1st | 2nd | 3rd | $R_{macr}$ | $R_{smpl}$ | $R_{micr}$ | $F^2_{macr}$ | $F^2_{smpl}$ | $F^2_{micr}$ | $HL$ | $RL$ (%) | $OE$ (%) | $COV$ | $LRAP$ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✓ | 46.8 | 64.7 | 57.7 | 44.6 | 66.3 | 61.0 | 4.1 | 2.6 | 6.3 | 5.7 | 85.4 |
| ✓ | ✓ | ✓ | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 |

TABLE 3.4: RESULTS OBTAINED BY THE RESNET18, RESNET34, VGG16, VGG19, CA-LSTM
AND THE PROPOSED APPROACH TOGETHER WITH THE NUMBER OF REQUIRED MODEL
PARAMETERS (NP).
$R$, $F^2$ and $HL$ are classification-based metrics (%); $RL$, $OE$, $COV$ and $LRAP$ are ranking-based metrics.

| Method | $R_{macr}$ | $R_{smpl}$ | $R_{micr}$ | $F^2_{macr}$ | $F^2_{smpl}$ | $F^2_{micr}$ | $HL$ | $RL$ (%) | $OE$ (%) | $COV$ | $LRAP$ (%) | NP ($\times 10^6$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet18 [95] | 36.1 | 59.9 | 52.6 | 34.8 | 61.1 | 55.5 | 4.9 | 8.1 | 12.8 | 11.4 | 75.5 | 11.2 |
| ResNet34 [95] | 37.0 | 64.4 | 57.8 | 35.7 | 64.6 | 59.8 | 4.9 | 6.3 | 12.6 | 9.6 | 77.6 | 21.3 |
| VGG16 [121] | 37.8 | 62.7 | 55.6 | 36.1 | 64.2 | 58.7 | 4.5 | 3.3 | 10.1 | 6.3 | 82.4 | 134.4 |
| VGG19 [121] | 41.5 | 61.5 | 54.2 | 38.1 | 63.0 | 57.4 | 4.6 | 3.5 | 11.1 | 6.4 | 81.5 | 139.8 |
| CA-LSTM [35] | 43.5 | 64.8 | 58.5 | 40.4 | 65.5 | 60.7 | 4.7 | 3.7 | 9.9 | 6.8 | 81.5 | 33.5 |
| Proposed Approach | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 | 1.1 |
In the third set of trials, we evaluated the effect of the second step of the proposed
approach. To this end, we compared the results of the proposed approach with those
obtained by neglecting the multi-attention strategy (i.e., only the first step is used).
When the second step is neglected, global descriptors are obtained by concatenating
the local descriptors without weighting them by attention scores. Table 3.3 shows
the multi-label classification accuracies under different metrics. From this table, one
can observe that the use of the multi-attention strategy improves the
classification accuracy under most of the metrics, while the ranking loss and coverage
remain unchanged. As an example, the improvements are
6% in $R_{macr}$ and more than 3% in $F^2_{smpl}$ score. This shows the effect of modeling the
importance scores of image local areas for the characterization of a global descriptor.
3.4.2 Comparison Among the Existing Approaches
In the fourth set of trials, we compared the effectiveness of the proposed approach
with the ResNet architectures at the depths of 18 and 34 (ResNet18 and ResNet34),
the VGG architectures at the depths of 16 and 19 (VGG16 and VGG19) and the CA-LSTM
(which is a recent multi-label RS scene classification approach). Table 3.4 shows
the multi-label classification results of these methods under different metrics. By
analyzing the table, one can observe that our proposed approach leads to the highest
accuracies with the lowest number of parameters. As an example, the proposed
approach provides 15% higher $R_{macr}$, more than 5% higher $F^2_{smpl}$ score and a reduction
of more than 21% in ranking loss compared to the VGG16 (which is one of the well-known
CNNs for image classification problems). Moreover, the proposed approach
requires a number of parameters that is more than two orders
of magnitude lower than that of the VGG16. Even with the deeper architecture (VGG19),
the VGG approach is not capable of increasing the classification accuracy: compared
to the VGG16, the VGG19 provides worse scores under all metrics except $R_{macr}$ and
$F^2_{macr}$, and requires the highest number of parameters to learn. As an
example, the VGG16 leads to a reduction of about 6% in ranking loss compared to the
VGG19. This shows that increasing the depth of a CNN is not sufficient to obtain accurate
multi-label RS classification results. In addition, the proposed approach leads to more than 11%
higher $F^2_{macr}$ score and more than 8% higher $LRAP$ score with a number of
parameters that is more than an order of magnitude lower compared to the ResNet34
(which is one of the most popular CNNs due to the integration of residual connections
with convolutional layers). The proposed approach also provides better metric values
compared to the CA-LSTM (e.g., more than 9% higher $R_{macr}$, almost 7% higher $F^2_{macr}$
score and a reduction of about 30% in ranking loss). This has been achieved with a
number of parameters that is lower by more than an order of magnitude. All these results
show that the proposed approach
reduces the need for very deep CNNs to achieve a high classification accuracy.
This is an important advantage, since reducing the number of model parameters to
achieve promising performance is as important as the classification accuracy for DL-based
approaches.

FIGURE 3.6: An example of the BigEarthNet-S2 images with the true multi-labels and
the multi-labels assigned by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the
proposed approach. (Figure columns: RS Images, Multi-Labels, ResNet18, ResNet34, VGG16,
VGG19, CA-LSTM, Proposed Approach.)

Figure 3.6 shows an example of BigEarthNet images with the
true multi-labels and the multi-labels assigned by the ResNet18, ResNet34, VGG16,
VGG19, CA-LSTM and the proposed approach. By analyzing the figure, one can see
that our proposed approach accurately predicts all classes without predicting any
wrong ones. Unlike the proposed approach, the VGG16 and VGG19 predict several
unrelated classes. As an example, both of these approaches predict the broad-leaved forest
and mixed forest classes for the first image, although this image does not contain these
classes. ResNet18 and ResNet34 are able to accurately predict only some of the
multi-labels. As an example, for the image in the center, the ResNet networks correctly
predict the pastures and land principally occupied by agriculture classes; however, the coniferous
forest and transitional woodland/shrub classes are not predicted and thus missed. These
results indicate that the VGG and ResNet networks are less accurate in the prediction
of all classes present in the images than the proposed approach. From the
figure, one can see that the CA-LSTM provides accurate results for the top image
without any wrong classification. However, for more complex images, this approach
is not capable of identifying some classes. As an example, for the image in the center,
the CA-LSTM wrongly predicts the mixed forest and natural grassland classes instead
of the coniferous forest and transitional woodland/shrub classes. However, these classes
are accurately predicted by the proposed approach. As another example, for the
bottom image, the CA-LSTM does not provide any correct prediction, whereas the
proposed approach correctly predicts all the classes. These results again indicate that
the proposed approach more accurately describes the complex spatial and spectral
content of RS images compared to the CA-LSTM.
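The comparisons above appear to mix two kinds of differences: gains in recall, $F^2$-Score and $LRAP$ match absolute differences in percentage points, whereas reductions in ranking loss match relative changes. This reading is inferred from the values in Table 3.4 and is not stated explicitly in the text. The short Python sketch below, with hypothetical helper names, reproduces three of the quoted figures under that assumption.

```python
import math

def point_gain(new, old):
    """Absolute difference in percentage points."""
    return new - old

def relative_reduction(new, old):
    """Relative decrease of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old

def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the ratio between two parameter counts."""
    return math.log10(larger / smaller)

# Proposed approach vs. VGG16, values taken from Table 3.4
print(point_gain(52.8, 37.8))            # R_macr gain, about 15 points
print(relative_reduction(2.6, 3.3))      # ranking loss reduction, slightly above 21%
print(orders_of_magnitude(134.4, 1.1))   # parameter ratio, slightly above 2 orders
```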
3.5 Conclusion
In this chapter, we have introduced a novel DL based approach for multi-label remote
sensing image scene classification. The proposed approach is made up of three
main steps. The first step achieves spatial and spectral characterization of image
local areas by a novel $K$-Branch CNN, which includes spatial resolution specific
CNN branches. The second step initially estimates the multiple attention scores
to identify the importance levels (i.e., scores) of different image local areas. This
is achieved by the novel bidirectional LSTM-based multi-attention strategy. Then,
each image is represented by a global descriptor defined on the basis of the attention
scores. In the third step, images modeled by the multi-attention driven global
descriptors are classified and multi-label predictions are obtained. Experimental
results obtained on the BigEarthNet (which is a large-scale Sentinel-2 benchmark
archive) demonstrate that the proposed approach significantly improves the multi-
label scene classification accuracy compared to the well known deep CNNs and
the state-of-the-art attention driven multi-label RS image classification approach.
Moreover, the proposed approach provides a computationally more efficient solution
for multi-label classification problems due to the significant reduction in the number
of model parameters. Decreasing the model complexity reduces the risk of over-
fitting (which also contributes to the improvement in the classification accuracy). All
the results confirm that the proposed approach is much more suitable to be used
within the operational RS scene classification scenarios, where the images contain
highly complex spatial and spectral information content. The main reasons for the
success of the proposed approach are summarized as follows:
1. Due to the proposed $K$-Branch CNN (which includes, for each set of image bands
with the same spatial resolution, a branch specialized in terms of the DL techniques
utilized throughout its layers), the proposed approach significantly
improves the characterization of the complex spatial and spectral content of
high-dimensional RS images with high spatial resolution. Moreover, the $K$-Branch CNN
leads to a significant reduction in the computational complexity of the entire
approach by reducing the number of model parameters.
2. Due to the proposed multi-attention strategy (which efficiently exploits the
bidirectional LSTM sequences on the local descriptors of each RS image to
estimate the multi-attention scores), the proposed approach accurately extracts
and exploits the importance levels of image local areas, which are then used to
define the global descriptors.
It is worth noting that although in our experiments we have used the Sentinel-2
multispectral images (which include 13 bands associated with three different spatial
resolutions), the proposed approach can be used with any multispectral RS image.
This can be achieved by selecting: i) the number $K$ of branches as the total number of
different spatial resolutions associated with the considered RS image bands; and ii)
proper values of the hyperparameters for each branch in the $K$-Branch CNN. If all
the image bands are associated with the same spatial resolution, the $K$-Branch
CNN turns into a single branch CNN (i.e., $K = 1$). It is also important to note that
when RS image bands with varying spatial resolutions are considered, the most
straightforward way is to apply interpolation to the lower spatial resolution bands
and then to use a single-branch CNN. However, the experimental results show that
the use of interpolation may lead to a loss in the scene classification accuracy.
As a final remark, it is worth noting that to define the local areas of each image,
we simply divide images into non-overlapping blocks. As future work, we plan
to apply a strategy for an adaptive definition of local areas based on the semantic
content of RS images, which can further improve the classification accuracy. Moreover,
we also plan to develop a data summarization strategy [126] instead of stacking local
descriptors in the second step of the proposed approach.
Chapter 4
Remote Sensing Image Similarity Learning Through Informative and Representative Triplets for Multi-Label Image Retrieval
Learning the similarity between RS images forms the foundation for CBIR. Recently,
deep metric learning approaches that map the semantic similarity of images into an
embedding (metric) space have become very popular in RS. A common approach
for learning the metric space relies on the selection of triplets of similar (positive)
and dissimilar (negative) images to a reference image called an anchor. Choosing
triplets is a difficult task, particularly for multi-label RS CBIR, where each training
image is annotated by multiple class labels. To address this problem, in this chapter,
we propose a novel triplet sampling method in the framework of DNNs defined for
multi-label RS CBIR problems. The proposed method selects a small set of the most
representative and informative triplets based on two main steps. In the first step, a
set of anchors that are diverse to each other in the embedding space is selected from
the current mini-batch using an iterative algorithm. In the second step, different
sets of positive and negative images are chosen for each anchor by evaluating the
relevancy, hardness and diversity of the images among each other based on a novel
strategy. Experimental results obtained on two multi-label benchmark archives
show that the selection of the most informative and representative triplets in the
context of DNNs results in: i) a reduction of the computational complexity of the training
phase of the DNNs without any significant loss in performance; and ii) an
increase in learning speed, since informative triplets allow fast convergence. The code
of the proposed method is publicly available at
https://git.tu-berlin.de/rsim/image-retrieval-from-triplets. This chapter is mainly based
on the following publications:
G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representative
triplet selection for multilabel remote sensing image retrieval,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI: 10.1109/TGRS.2021.3124326.
G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and retrieval
in large remote sensing archives,” in Deep Learning for the Earth Sciences: A
Chapter 4. Informative & Representative Triplet Selection for CBIR 44
comprehensive approach to remote sensing, climate science and geosciences, Hoboken,
NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.
G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse
triplet sampling method for multi-label remote sensing image retrieval,” in
Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote
Sensing Symposium, 2022, pp. 5–8. DOI:10.1109/M2GARSS52314.2022.9839759.
4.1 Introduction
One of the most prominent emerging applications in RS is the accurate retrieval of RS images
from fast-growing archives. Thus, the development of content-based image retrieval
(CBIR) methods, which aim to search for RS images similar to a query image based
on their semantic content, has recently attracted great attention. The performance
of any CBIR system relies on its capability to learn discriminative and robust image
representations to describe the complex semantic content of RS images.
Conventional CBIR systems exploit hand-crafted features to describe the content of
images. As an example, Wang and Newsam present a retrieval system employing
the well-known scale-invariant feature transform (SIFT) to extract bag-of-visual-
words representations of image features [127]. Aptoula introduces the use of bag-
of-morphological-words representations for local texture descriptors [128]. In [129],
a comparative analysis of local binary patterns (LBP) that capture local patterns
between neighboring pixels is presented. Chaudhuri et al. present a method that
represents image content by a graph, where the graph nodes describe the image
region properties and the edges represent the spatial relationships among the regions
[93]. Binary hash codes obtained through kernel-based hashing methods are found
effective for describing RS images in [130]. After extracting the image features, the
most similar images with respect to a query image can be found by performing the k-nearest neighbor (k-nn) search algorithm. In the case of graph-based image representations, graph comparison methods such as the inexact graph matching approach proposed by Chaudhuri et al. [131] can be used. The images represented by binary hash codes can be searched and retrieved by using the computationally efficient Hamming distance [130].
The above-mentioned CBIR systems cannot simultaneously optimize feature learning
and image retrieval, and thus result in a limited capability to represent the high-
level semantic content of RS images. This issue leads to insufficient search and
retrieval performance. To overcome this problem, CBIR systems based on DNNs
have been recently presented in RS. As an example, Li et al. propose a method
that fuses deep features and hand-crafted features [132]. This method exploits four
convolutional neural networks (CNNs) to extract features at different steps and
with different coarse levels. Then, these deep features are fused with traditional
image descriptors such as LBPs and SIFT to be used in the retrieval process. A
convolutional autoencoder is used by Tang et al. to obtain deep bag-of-words image
descriptors in [28]. To this end, a reconstruction loss function that minimizes the
error between the input and the extracted descriptors is considered. Imbriaco et
al. extract local convolutional features and aggregate them into a global descriptor,
FIGURE 4.1: An example of three triplets consisting of images from BigEarthNet-S2. Each triplet, given in a separate row, consists of an anchor (in blue frame), a positive image (in green frame), and a negative image (in red frame). The associated multi-labels are given below the respective images. Multi-labels shown, per row: (1) "Arable land, Pastures, Coniferous forest"; "Arable land, Pastures"; "Mixed forest, Inland wetlands"; (2) "Arable land, Pastures, Complex cultivation patterns"; "Arable land, Pastures, Complex cultivation patterns"; "Broad-leaved forest, Coniferous forest"; (3) "Pastures, Coniferous forest"; "Arable land, Pastures, Coniferous forest"; "Marine waters".
where the deep features are extracted through a pre-trained model without any fine-
tuning [9]. Boualleg and Farrah address the semantic gap between low-level features
and high-level perception of semantic similarity in [133]. This is achieved by using a
CNN to detect semantic concepts and a relevance feedback strategy to ensure that
CBIR results match with a query image. Sabahi et al. address the above-mentioned
semantic gap by employing a recurrent neural network to model the human visual
memory [134].
In recent years, deep metric learning (DML) based methods that aim at learning a
feature space (in which similar images are close to each other) have attracted attention
in RS. Current DML models are mostly trained using a triplet loss function defined over triplets of three images: i) an anchor image; ii) a positive image that is similar to the anchor;
and iii) a negative image that is dissimilar to the anchor [38]. An example of triplets
constructed from BigEarthNet-S2 can be seen in Fig. 4.1. A difficult task in DML is to
construct the set of triplets. A simple strategy is to define triplets from an existing
training set of labeled images. Roy et al. apply a strategy that: i) randomly selects
an anchor from a mini-batch of training images; and then ii) randomly chooses one
positive image that has the same class label as the anchor, while selecting one negative
image that has a different class label [27]. Similarly, Lai et al. select triplets randomly
based on the class labels of training images to train an end-to-end model for hashing
[135]. For each anchor image, there can be several positive and negative images.
Thus, random selection does not guarantee the selection of the most representative
and informative images to the anchor and can result in the construction of so-called
trivial triplets (see Section 4.2 for details). We would like to note that one can also
exploit all the images in the mini-batch to construct triplets, as suggested in [15].
However, this choice significantly increases the total number of triplets and thus the
computational complexity of the training phase of the retrieval system [39], [40].
To overcome the limitation of random selection, DML methods that evaluate the hardness of images during the sampling process have been introduced in the computer vision (CV) literature (see Section 4.2 for details). To the best of our knowledge, most
of the triplet sampling methods in CV assume that each image is annotated by a
single label associated with the most significant content of the considered image and
thus rely on single-label image annotations to decide which images are positive or
negative for a given anchor image. However, RS images typically consist of multiple
classes and thus can simultaneously be associated with different class labels (i.e.,
multi-labels). From the DML perspective, the selection of triplets from training
images annotated by multi-labels is more complex than that from training images
labeled by single-labels. To achieve accurate DML in multi-label RS CBIR, methods
that accurately select a set of triplets from multi-label training images are needed.
To address this problem, we propose a novel triplet sampling method in the frame-
work of DML designed for multi-label RS CBIR problems. Unlike the existing triplet
sampling methods, the proposed method aims to select a small set of triplets from
each mini-batch of multi-label training images. To this end, the proposed method
consists of two consecutive steps. In the first step, a small number of diverse anchors
is selected based on a simple but efficient iterative algorithm. In the second step,
relevant, hard and diverse positive and negative images with respect to each anchor
are chosen based on a novel strategy. Then, the triplets are constructed from the
selected anchors and their respective positive and negative images. Based on these
consecutive steps, the proposed method constructs a small number of the most infor-
mative and representative triplets to drive DML, resulting in an accurate CBIR and
also in a reduced training complexity for the considered DNN. It is worth noting that
the proposed triplet sampling method is independent of the considered DNN archi-
tecture, and therefore can be used within any DNN presented in the literature. In the
experiments, different DNN architectures are considered, while the k-nn algorithm is used for the retrieval process after the characterization of the image descriptors
through the considered method. Experiments carried out on two multi-label RS
benchmark archives demonstrate the effectiveness of the proposed method.
The rest of the chapter is organized as follows: Section 4.2 presents the related
works on triplet sampling. Section 4.3 introduces the proposed method. Section
4.4 describes the considered datasets and the experimental setup, while Section 4.5
provides the experimental results. Section 4.6 concludes the chapter.
4.2 Related Works
The development of DML methods that aim to learn a metric space (in which seman-
tically similar images are close to each other) is important for an accurate CBIR. It
FIGURE 4.2: An abstract representation of triplet selection and the progress of the feature space update. Blue arrows indicate reducing distances for updating the embedding, while red arrows indicate increasing the distances. $X_a$ marks a chosen anchor, $P_1$, $P_2$, and $P_3$ are positive images, and $N_1$, $N_2$, and $N_3$ are negative images in different triplets. The triplet $(X_a, P_1, N_1)$ is trivial because it already satisfies the margin, and thus the corresponding distances are not updated. The triplet $(X_a, P_2, N_2)$ leads to a relatively small error and the images are pushed and pulled a little. The triplet $(X_a, P_3, N_3)$ violates the margin greatly and causes a significant error. $P_3$ is a positive image, but very far from the anchor, so it is considered a hard positive image. Correspondingly, $N_3$ is a hard negative image.
has been shown that the triplet-based DML methods perform considerably well for
the CBIR tasks [27], [136]. The triplet-based DML methods use triplets of images to
learn a metric space by means of the triplet loss [38]. The optimization objective is to
minimize the feature distance between the anchor and its positive sample (i.e., image)
while maximizing the feature distance between the anchor and the negative sample.
The goal is to ensure that the positive sample is closer to the anchor than the negative
sample by at least a margin. During the training of a triplet-based DML method, a zero triplet loss is obtained for the triplets that consist of a positive image inside the margin and a negative image outside the margin. Such triplets contribute no gradient and thus slow down convergence. For the triplets that consist of a positive image visually less similar to the anchor (i.e., outside the margin) and a negative image visually more similar to the anchor (i.e., inside the margin), a high triplet loss value is obtained. High loss values lead to large gradient values, and thus the parameters of the model are updated. When a positive image is far from the anchor (i.e., well outside the margin), it is called a hard positive image. A negative image is called a hard negative if it is inside the margin and very close to the anchor. If the distance between the anchor and positive image of a
triplet is higher than the distance between the anchor and negative image, the triplet
is considered as a hard triplet. In Fig. 4.2, an abstract representation of the triplet selection and the feature space update is demonstrated. The images $P_1$, $P_2$ and $P_3$ are the positive images for the anchor $X_a$ in different triplets, while images $N_1$, $N_2$ and $N_3$ are the negative images for the anchor $X_a$. After updating the embedding (metric) space using the selected triplets, $P_2$ and $P_3$ are pulled closer to the anchor $X_a$, while $N_2$ and $N_3$ are pushed far away from the anchor $X_a$ towards outside the margin. The positive image $P_1$ is inside the margin and negative image $N_1$ is outside the margin, and thus triplet $(X_a, P_1, N_1)$ is a trivial triplet. The positive image $P_3$ is a hard positive image for anchor $X_a$, since it is outside the margin and far from the anchor image. The negative image $N_3$ is a hard negative image, as it is very close to the anchor. The triplet $(X_a, P_3, N_3)$ is a hard triplet, and causes a high loss value to update the
parameters of the model. Since the trivial triplets are not sufficiently informative and
lead to slow convergence, the use of hard triplets has been considered to overcome
this problem.
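As a brief illustration of the three cases discussed above, the following Python sketch evaluates the standard triplet loss for a trivial, a moderately violating and a hard triplet. The distance values are hypothetical and chosen only for illustration; the margin of 0.2 is the value used later in the experiments.

```python
def triplet_loss(d_ap: float, d_an: float, margin: float = 0.2) -> float:
    """Hinge-type triplet loss for one triplet.

    d_ap: distance between anchor and positive image
    d_an: distance between anchor and negative image
    """
    return max(d_ap - d_an + margin, 0.0)


# Hypothetical normalized distances, for illustration only.
trivial = triplet_loss(d_ap=0.10, d_an=0.90)   # positive close, negative far
moderate = triplet_loss(d_ap=0.40, d_an=0.50)  # margin slightly violated
hard = triplet_loss(d_ap=0.90, d_an=0.10)      # positive far, negative close

print(trivial, moderate, hard)
```

The trivial triplet yields a zero loss and therefore no parameter update, whereas the hard triplet yields the largest loss of the three.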
Most of the methods in RS do not consider the hardness of the images in the selected
triplets and exploit the random triplet selection strategy as mentioned in the introduction [15], [19], [27]. In contrast, in the CV community, the use of triplets is more widespread and the importance of hardness is widely studied [137]–[140]. As an example, Xuan et al. propose a triplet selection strategy that selects the closest positive
sample (easy positive) and the closest negative (hard negative) for each anchor [137].
Yuan et al. propose a hard-aware deeply cascaded (HDC) embedding method [140].
For each anchor and a selected positive sample, HDC selects the negative samples at
multiple hardness levels to construct different triplets. Hardness levels are defined
based on the distances in the embedding space. Yang et al. investigate the impor-
tance of hard positive images by combining a positive image with all negative image
pairs in the batch [40]. Then, the positive images are weighted and hard positives
are preferred. Ge et al. propose a hard triplet selection method that constructs a
class-level hierarchical tree of image features for the whole dataset, where visually
similar classes are merged recursively [139]. Then, the selection of the triplets is
done based on a distance computed between an anchor image and different pairs
of image classes through the hierarchical tree. In addition to the methods that aim
to select triplets, there are also several works that focus on reformulating the triplet
loss function to emphasize the effect of hard triplets [141]–[143]. As an example,
Zhang et al. adapt the focal loss that is initially defined for classification problems
and propose an extended version for triplets as an alternative to the triplet loss [142].
This loss function ensures that more importance is given to hard triplets than easier
ones, and thus the model can learn from the most informative triplets and converge
faster. Kim et al. develop an adapted version of the triplet loss for pose estimation
[141]. This loss function preserves the distance ratios from the label space in the
embedding space. In [143], the multi-similarity loss function is proposed to refor-
mulate the triplet loss with a weighting strategy. By using the weighting strategy,
this loss function considers the relative similarity of all positive and all negative
samples in a mini-batch. In [144], the multi-class N-pair loss function is proposed
to generalize the triplet loss function for multiple negative images associated with
an anchor. In detail, for each anchor image, one positive image and several negative
images are selected as hard negatives from different negative classes. In [19], the
dual-anchor triplet loss function is introduced as an extension of the triplet loss. In
addition to the objectives of the triplet loss, this loss function also aims at increasing
the distance between the positive and negative images for a given anchor. Wang et al.
extend the concept of triplets to the whole mini-batch, where all available images are
first sorted and then divided into a positive set and a negative set [145]. Afterward,
an extension of the triplet loss is used to force a margin between the two sets by
using all the images. This loss function employs a weighting strategy to increase the
importance of the hard negative images. In [146], it is shown that when an accurate
sampling strategy is considered, deep learning (DL) models with different modified
loss functions provide similar accuracies. This indicates that triplet selection is as important as the loss function in the framework of DML. Most of the triplet-based
methods in CV assume that a single label is associated with each image. However,
FIGURE 4.3: A block scheme of the proposed triplet sampling method to drive the training phase of a DNN for multi-label CBIR problems. Blocks shown: image embedding extraction; the proposed informative and representative triplet selection method, consisting of diverse anchor selection (DAS) and relevant, hard and diverse positive and negative image selection (RHDIS); triplet loss calculation; and DNN model updating.
RS images typically consist of multiple classes and are associated with multi-labels,
which makes selecting triplets more complex than in the single-label scenario.
4.3 Proposed Method
4.3.1 Problem Formulation
Let $\mathcal{X} = \{X_1, \ldots, X_M\}$ be an archive consisting of $M$ images, where $X_m$ is the $m$-th image in the archive. We assume that a training set $\mathcal{X}_T \subset \mathcal{X}$ is available. Each image in $\mathcal{X}_T$ is annotated with a set of class labels, which describe the content of the image. Let $\mathcal{L} = \{1, 2, \ldots, N\}$ be the set of all possible class labels. Each image $X_j \in \mathcal{X}_T$ is associated with a multi-label vector $L_j = \{l_j^1, l_j^2, \ldots, l_j^N\}$, where $l_j^i = 1$ if the class label $i \in \mathcal{L}$ is associated with the image $X_j$, and $l_j^i = 0$ otherwise. Each training image $X_j$ is annotated with at least one class label.
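The multi-label vector defined above can be illustrated with a short Python sketch. The function name and the example labels are hypothetical; only the encoding rule (an entry is 1 if the class is assigned and 0 otherwise, with at least one label per image) is taken from the formulation.

```python
def to_multi_label_vector(labels: set[int], num_classes: int) -> list[int]:
    """Encode a set of 1-based class labels as a binary vector of length N."""
    if not labels:
        raise ValueError("each training image needs at least one class label")
    if min(labels) < 1 or max(labels) > num_classes:
        raise ValueError("class labels must lie in {1, ..., N}")
    return [1 if i in labels else 0 for i in range(1, num_classes + 1)]


# Hypothetical image annotated with classes 1 and 3 out of N = 5 classes.
vector = to_multi_label_vector({1, 3}, num_classes=5)
print(vector)
```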
We propose a novel triplet sampling method in the framework of DL-based multi-
label CBIR. The proposed method aims: i) to select a small set of informative as
well as representative triplets from each training mini-batch $\mathcal{B}$; and ii) to accurately
describe the complex semantic content of RS images. To this end, it consists of two
consecutive steps: 1) selection of anchors that are diverse to each other in the feature
space; 2) selection of positive and negative images with respect to each selected
anchor. To achieve the latter step, we jointly evaluate the relevancy, hardness and
diversity of the images during the selection (see Fig. 4.3). The proposed method
is independent of the considered DL model and can be used with any DL model
designed for CBIR problems. In the following subsections, the two steps of the
proposed method are described in detail.
4.3.2 Diverse Anchor Selection (DAS)
The first step of the proposed method aims to find a small set of the most representative anchors. As mentioned before, all samples (i.e., images) in the mini-batch $\mathcal{B}$ could be selected as anchors. However, such an approach results in a large and
redundant set of triplets and increases the computational complexity of the training. In detail, the complexity of the training grows cubically with the mini-batch size if all possible triplets are exploited [142]. Selecting a small set of anchors can significantly reduce the computational complexity of the training. To this end, we introduce a simple but efficient diverse anchor selection (DAS) strategy. The DAS strategy aims to select diverse anchors from the mini-batch that, when included in the set of triplets, can improve the retrieval performance. To this end, it employs an iterative algorithm to evaluate the diversity in the feature space among the samples from the mini-batch.
The algorithm starts with an empty set $\mathcal{A} = \emptyset$. The first anchor is selected randomly from the current mini-batch $\mathcal{B}$ and added into $\mathcal{A}$. At each iteration, a new anchor that is associated with the highest distance from all already selected anchors is selected from $\mathcal{B}$. In detail, at the $h$-th iteration, the $h$-th anchor image $X_h$ is selected as:

$$X_h = \operatorname*{argmax}_{X_b \in \mathcal{B} \setminus \mathcal{A}} \; \max_{X_a \in \mathcal{A}} D(X_b, X_a), \tag{4.1}$$

where $D(\cdot, \cdot)$ is the feature similarity measure, defined as the Euclidean distance between two images in the feature space. It is worth noting that the Euclidean distances are normalized based on min-max normalization. The steps are iterated until $H$ anchors are selected. Due to the selection of anchors that are as distant as
possible to each other in the feature space, the diversity among the selected anchors
with respect to their correlation in the feature space is maximized. This results
in selecting a representative set of anchors, forming the basis for the positive and
negative image selection step.
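The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function names, the toy embeddings and the seed are hypothetical, and the min-max normalization is assumed to be taken over all pairwise distances of the mini-batch. The selection rule follows Eq. (4.1) as written.

```python
import math
import random


def normalized_distances(features):
    """Pairwise Euclidean distances, min-max normalized to [0, 1].

    Assumption: normalization is computed over all pairs in the mini-batch.
    """
    n = len(features)
    dist = [[math.dist(features[i], features[j]) for j in range(n)] for i in range(n)]
    flat = [d for row in dist for d in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero if all features coincide
    return [[(d - lo) / scale for d in row] for row in dist]


def diverse_anchor_selection(dist, num_anchors, rng):
    """Greedy anchor selection following Eq. (4.1)."""
    n = len(dist)
    selected = [rng.randrange(n)]  # the first anchor is chosen at random
    while len(selected) < min(num_anchors, n):
        candidates = [b for b in range(n) if b not in selected]
        # Score of a candidate: its largest distance to any selected anchor.
        best = max(candidates, key=lambda b: max(dist[b][a] for a in selected))
        selected.append(best)
    return selected


# Toy mini-batch of 2-D embeddings (hypothetical values).
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]]
dist = normalized_distances(features)
anchors = diverse_anchor_selection(dist, num_anchors=3, rng=random.Random(0))
print(anchors)
```

The function returns indices into the mini-batch, so the selected anchors can be passed directly to the subsequent positive and negative image selection step.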
4.3.3 Relevant, Hard and Diverse Positive-Negative Image Selection (RHDIS)
The second step of the proposed method aims to select positive and negative images
for each anchor that are informative (i.e., relevant and hard) and representative (i.e.,
diverse to each other in the feature space). This is achieved by a novel relevant, hard
and diverse positive and negative image selection strategy (RHDIS). The relevancy
of an image to an anchor is defined based on its multi-label similarity with respect
to the considered anchor. In detail, a positive image can be associated with high
relevancy to an anchor if their class label similarity is high and vice versa. A negative
image can be relevant to an anchor if its class label similarity is small and vice versa.
The hardness of an image is associated with its distance to the considered anchor in
the feature space. In detail, a positive image can be hard if its distance to the anchor
in the embedding space is high, whereas a negative image can be considered hard if
its distance to the anchor is small.
The proposed RHDIS strategy initially evaluates the informativeness (i.e., relevancy
and hardness) of the images to select the candidates for positive and negative images
related to each anchor image. Then, the representative (diverse) ones among the
most informative positive and negative images are selected to construct the triplets.
To this end, for each image $X_b$ in the mini-batch $\mathcal{B}$, informativeness scores $I_p(X_a, X_b)$ (which shows if $X_b$ is a candidate positive image) and $I_n(X_a, X_b)$ (which shows if $X_b$ is a candidate negative image) with respect to anchor $X_a$ are initially computed as:

$$I_p(X_a, X_b) = \beta \times S(X_a, X_b) + (1 - \beta) \times D(X_a, X_b), \tag{4.2}$$

$$I_n(X_a, X_b) = \beta \times [1 - S(X_a, X_b)] + (1 - \beta) \times [1 - D(X_a, X_b)], \tag{4.3}$$
where $S(X_a, X_b)$ shows the class label similarity between the image $X_b$ and $X_a$. $S(X_a, X_b) \in [0, 1]$ is calculated based on the soft pair-wise similarity measure (i.e., the distance between the multi-label vector $L_a$ of $X_a$ and $L_b$ of $X_b$) [147]. If $S(X_a, X_b)$ is high, $X_b$ can be considered as a relevant positive image, whereas if $[1 - S(X_a, X_b)]$ is high, $X_b$ can be considered as a relevant negative image. $D(X_a, X_b)$ is the distance between $X_b$ and $X_a$ in the embedding space and measures the hardness of images as mentioned before. If both $D(X_a, X_b)$ and $S(X_a, X_b)$ are high, the image $X_b$ can be considered as a relevant and hard positive image. If both $[1 - S(X_a, X_b)]$ and $[1 - D(X_a, X_b)]$ are high, the image $X_b$ can be considered as a relevant and hard negative image. $\beta \in [0, 1]$ is the weighting parameter and can be adjusted to give more importance to either the relevancy or the hardness of the image.
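The two informativeness scores of Eqs. (4.2) and (4.3) can be sketched as follows. The exact form of the soft pair-wise similarity from [147] is not reproduced here; the Jaccard similarity used below is only a stand-in assumption for $S$, and the function names and example vectors are hypothetical. The default weight of 0.5 is the value of $\beta$ used in the experiments.

```python
def label_similarity(la, lb):
    """Stand-in for S: Jaccard similarity of two binary multi-label vectors.

    Assumption: the soft pair-wise similarity of [147] is replaced by Jaccard.
    """
    inter = sum(1 for x, y in zip(la, lb) if x and y)
    union = sum(1 for x, y in zip(la, lb) if x or y)
    return inter / union if union else 0.0


def informativeness(s, d, beta=0.5):
    """Return (I_p, I_n) as in Eqs. (4.2) and (4.3); s and d must lie in [0, 1]."""
    i_p = beta * s + (1.0 - beta) * d
    i_n = beta * (1.0 - s) + (1.0 - beta) * (1.0 - d)
    return i_p, i_n


# Hypothetical anchor and candidate with similar labels but a large
# normalized embedding distance: a relevant and hard positive candidate.
s = label_similarity([1, 1, 0, 1], [1, 1, 0, 0])
i_p, i_n = informativeness(s, d=0.9)
print(s, i_p, i_n)
```

Note that, for a given pair and a fixed $\beta$, the two scores always sum to one, so an image that is a strong positive candidate is automatically a weak negative candidate and vice versa.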
To construct a set $\mathcal{P}_{X_a} = \{P_1, P_2, \ldots, P_C\}$ of $C$ positive images for an anchor $X_a$, the image in the mini-batch associated with the highest $I_p$ score with respect to $X_a$ is chosen as the first positive image. Then, the next images are iteratively selected. We apply an iterative approach similar to the DAS introduced in the first step to select the most representative images. At the $t$-th iteration, the $t$-th positive image $P_t$ is selected as:

$$P_t = \operatorname*{argmax}_{X_b \in \mathcal{B} \setminus \mathcal{P}_{X_a}} \Big[ \gamma \times I_p(X_a, X_b) + (1 - \gamma) \times \max_{P_c \in \mathcal{P}_{X_a}} D(X_b, P_c) \Big]. \tag{4.4}$$

This process is repeated until the desired number of positive images is selected. The parameter $\gamma \in [0, 1]$ controls the influence of the diversity term.
To construct a set $\mathcal{N}_{X_a} = \{N_1, N_2, \ldots, N_C\}$ of $C$ negative images for each anchor $X_a$, the image with the highest $I_n$ score in the mini-batch with regard to $X_a$ is selected as the first negative image. Afterward, the subsequent negative images are iteratively selected. At the $t$-th iteration, the $t$-th negative image $N_t$ is selected as:

$$N_t = \operatorname*{argmax}_{X_b \in \mathcal{B} \setminus \mathcal{N}_{X_a}} \Big[ \gamma \times I_n(X_a, X_b) + (1 - \gamma) \times \max_{N_c \in \mathcal{N}_{X_a}} D(X_b, N_c) \Big]. \tag{4.5}$$
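Since Eqs. (4.4) and (4.5) share the same form, both selections can be sketched with a single greedy routine that receives either the $I_p$ or the $I_n$ scores. This is a minimal illustration under stated assumptions: the function name and the toy values are hypothetical, and the caller is assumed to pass only candidate images (i.e., the anchor itself is excluded beforehand).

```python
def select_images(info, dist, count, gamma=0.1):
    """Greedy selection following the form of Eqs. (4.4) and (4.5).

    info[b]    : informativeness of candidate b w.r.t. the anchor (I_p or I_n)
    dist[b][c] : normalized embedding distance between candidates b and c
    count      : number of images to select (C)
    """
    n = len(info)
    # The first image is the one with the highest informativeness score.
    selected = [max(range(n), key=lambda b: info[b])]
    while len(selected) < min(count, n):
        candidates = [b for b in range(n) if b not in selected]
        best = max(
            candidates,
            key=lambda b: gamma * info[b]
            + (1.0 - gamma) * max(dist[b][c] for c in selected),
        )
        selected.append(best)
    return selected


# Three hypothetical candidates: 0 and 1 are informative but nearly identical
# in the embedding space, while 2 is less informative but far from both.
info = [0.9, 0.8, 0.2]
dist = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [1.0, 0.9, 0.0]]
print(select_images(info, dist, count=2, gamma=0.1))
print(select_images(info, dist, count=2, gamma=1.0))
```

With the small value of $\gamma$ used in the experiments (0.1), the second pick favors the distant candidate, whereas with $\gamma = 1$ the diversity term is ignored and the second most informative candidate is picked.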
This selection strategy ensures that the selected positive and negative images for
each anchor are informative (i.e., hard and relevant) and representative (i.e., diverse
among each other in the feature space). After selecting the final set of triplets from
the mini-batch B, the triplet loss function is calculated as:
$$L = \sum_{\substack{X_a \in \mathcal{A} \\ P_t \in \mathcal{P}_{X_a} \\ N_t \in \mathcal{N}_{X_a}}} \max\big( D(X_a, P_t) - D(X_a, N_t) + \alpha,\ 0 \big), \tag{4.6}$$

where $\alpha$ is a margin enforced between positive and negative images for an anchor image. After an end-to-end training of the whole neural network by minimizing the triplet loss and learning the network parameters, the descriptors (i.e., features) of the images in $\mathcal{X} \setminus \mathcal{X}_T$ are obtained. Then, the $k$ most semantically similar images with regard to a given query image $X_q \in \mathcal{X}$ are selected by comparing their descriptors
based on the k-nn algorithm.
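The retrieval step can be illustrated with a brute-force k-nn search over image descriptors. The sketch below assumes the Euclidean distance between descriptors, in line with the distance used during training; the function name and the toy descriptors are hypothetical, and a large archive would typically call for a more efficient search structure.

```python
import math


def knn_retrieve(query, archive, k):
    """Return the indices of the k archive descriptors closest to the query."""
    order = sorted(range(len(archive)), key=lambda i: math.dist(query, archive[i]))
    return order[:k]


# Hypothetical 2-D descriptors of four archive images and one query image.
archive = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
query = [0.0, 0.0]
retrieved = knn_retrieve(query, archive, k=2)
print(retrieved)
```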
FIGURE 4.4: An example of images from the UCMerced Land Use archive and the multi-labels associated with them: (a) sand, sea; (b) airplane, cars, grass, pavement; (c) bare-soil, buildings, grass; (d) buildings, cars, pavement, trees.
4.4 Dataset Description and Experimental Design
4.4.1 Dataset Description
To evaluate the proposed method, we conducted experiments on two different multi-
label RS archives: BigEarthNet-S2 and UC Merced Land Use (UCMerced) archive [84].
In the experiments, for BigEarthNet-S2, we considered the images acquired over
Ireland in the summer of 2017 (denoted as IRS-BigEarthNet). IRS-BigEarthNet
contains 15,894 Sentinel-2 images, each of which is made up of 120 × 120 pixels for 10 meter bands, 60 × 60 pixels for 20 meter bands and 20 × 20 pixels for 60 meter bands. In the experiments, we excluded the 60 meter bands and applied bicubic interpolation to the 20 meter bands, which results in 10 bands, each of which has a size of 120 × 120 pixels. The class labels of the images were used based on the 19 class nomenclature. Images with snow cover, cloud cover and cloud shadows were excluded from training and evaluation.
The UCMerced archive consists of 2100 images selected from aerial orthoimagery with a spatial resolution of 30 cm. Each image has a size of 256 × 256 pixels. The
images are annotated with multi-labels by Chaudhuri et al. [93]. There are 17 classes
in total, with at least one and a maximum of seven class labels per image. Fig.
4.4 shows an example of images from this archive along with their multi-label
annotations.
The two benchmark archives differ greatly in size, complexity and characteristics.
This allows us to demonstrate the general applicability and success of the proposed
triplet sampling method in different scenarios. We randomly split UCMerced images
into 60% for training, 20% for validation and 20% for testing. For IRS-BigEarthNet,
the officially provided splits into training, validation and evaluation sets were used.
During the training step, all triplets were sampled from the training set. Query
images were taken from the validation set, while image retrieval was applied to the
evaluation set.
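For the UCMerced archive, the random 60%/20%/20% split described above can be sketched as follows. The function name and the seed are hypothetical and serve only to make the illustration reproducible; they are not taken from the experimental setup.

```python
import random


def split_indices(num_images, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split image indices into training, validation and test sets."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * num_images)
    n_val = round(fractions[1] * num_images)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder forms the test set
    return train, val, test


train_ids, val_ids, test_ids = split_indices(2100)
print(len(train_ids), len(val_ids), len(test_ids))
```

For the 2100 UCMerced images, this yields 1260 training, 420 validation and 420 test images.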
4.4.2 Experimental Design
In the experiments, different CNN architectures were considered as backbones, while
an additional fully connected layer was added to produce image embeddings. The
resulting CNNs were trained for image retrieval by means of the triplet loss. It is
worth noting that our method does not depend on a specific DL model architecture.
In our experiments, we evaluated three different CNN architectures: i) a shallow
convolutional neural network (S-CNN); ii) DenseNet-121 [148]; and iii) ResNet-50
[95]. S-CNN consists of three convolutional layers with 32, 32 and 64 filters having 5 × 5, 5 × 5 and 3 × 3 filter sizes, respectively. We added one fully connected (FC) layer and one classification layer to the output of the last convolutional layer, while zero padding for convolution operations and max-pooling between layers were used.
The last two architectures are well-known deep models, while the first architecture
is an explicitly shallow model. All models were used without pre-training. The mini-batch size for IRS-BigEarthNet and UCMerced was set to 300 and 100, respectively. The training was performed for 100 epochs with the Adam optimizer, using an initial learning rate of 0.001 (which was exponentially decayed every 5 epochs by 5%). The margin parameter $\alpha$ of the triplet loss was set to 0.2. The values of $\beta$ and $\gamma$ were set to 0.5 and 0.1, respectively, based on a grid search strategy. All the experiments were conducted on NVIDIA Tesla V100 GPUs with 32 GB of memory. The results are reported in terms of four evaluation metrics: accuracy, precision, recall and $F_1$ score [93]. The reported values are the averages of the values obtained by retrieving the 30 and 10 most similar images for IRS-BigEarthNet
and UCMerced, respectively.
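One plausible reading of the learning rate schedule mentioned above (an initial rate of 0.001, decayed by 5% every 5 epochs) is a staircase exponential decay. The sketch below encodes this reading; the function name is hypothetical and the staircase form is an assumption, so the schedule used in the actual experiments may differ in detail.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.05, step=5):
    """Staircase exponential decay: multiply by (1 - decay) every `step` epochs."""
    return initial_lr * (1.0 - decay) ** (epoch // step)


# Learning rate at the start of selected epochs of a 100-epoch training run.
for epoch in (0, 5, 50, 99):
    print(epoch, learning_rate(epoch))
```

Under this assumption, the rate stays at 0.001 for epochs 0 to 4 and drops to 0.00095 at epoch 5; by the last epoch it has been reduced 19 times.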
We carried out different kinds of experiments in order to: 1) perform a sensitivity
analysis with respect to different network architectures and embedding sizes; 2)
conduct an ablation study of the proposed triplet sampling method; 3) compare our
method with different triplet sampling methods; and 4) compare our method with
state-of-the-art DML-based methods. To perform the ablation study, we compared the
proposed diverse anchor selection (DAS) strategy (see Section 4.3.2 for the details)
with two frequently used anchor selection strategies that are:
Batch anchor selection (BAS): This strategy selects each image in the mini-batch
as an anchor once and can be considered an upper bound strategy for the triplet
selection. This strategy does not miss any information provided by specific
triplets. However, it leads to a very high number of final triplets that can be
redundant.
Random anchor selection (RAS): This strategy selects a fixed number of anchors
from the mini-batch without any prior assumption. It is simple, but there is
no guarantee that the randomly chosen anchors provide a good basis for the
triplets.
In the experiments, 10% of all possible anchors from the mini-batch were chosen for
the RAS and the proposed DAS strategies. We compared the proposed relevant, hard
and diverse positive-negative image selection (RHDIS) strategy (see Section 4.3.3 for
the details) with two baselines that are:
Batch positive and negative image selection (BIS): This strategy uses all images
in the mini-batch. Each image is used as the positive and the negative images
once. It covers all possible triplets, leading to a very high number of final
triplets.
Random positive and negative image selection (RIS): This strategy randomly
selects sets of positive and negative images and combines all of them into
triplets. Many of the resulting triplets may be trivial, but it requires no prior
knowledge and provides a lower bound baseline.
In the experiments, we also assessed the effectiveness of the joint use of the above-mentioned strategies with the proposed DAS and RHDIS for the selection of anchors as well as positive and negative images. This is important as the anchor selection step is independent of the positive and negative image selection step, and thus the proposed selection strategies can be combined with other well-known strategies.
In the experiments, we also compared the proposed DAS-RHDIS method with two
triplet sampling methods: 1) the deep metric learning using triplet network, which
uses RAS for anchor selection and RIS for positive and negative image selection
(denoted as TNDML) [149]; and 2) enhancing remote sensing image retrieval using a
triplet deep metric learning network, which employs BAS for the anchor selection
and BIS for positive and negative image selection (denoted as RSDML) [15]. We also
compared the proposed DAS-RHDIS method with state-of-the-art DML methods for
CBIR: 1) the content-based medical image retrieval (CBMIR) system, which utilizes
a pair-wise similarity loss function to force all positive images to be close, while
separating all the negative images with a fixed distance [150]; 2) the multi-similarity
loss with general pair weighting for deep metric learning (denoted as MSL) [143]; 3)
the dual-anchor triplet loss (denoted as DATL) proposed in [19]; and 4) the improved
deep metric learning with multi-class N-pair loss objective (denoted as NPL) [144].
For all the methods, we used the same CNN architecture and training setup as in our
method.
4.5 Experimental Results
4.5.1 Sensitivity Analysis of the Proposed Method
In this sub-section, we present the results of the sensitivity analysis for the proposed
triplet sampling method (denoted as DAS-RHDIS) in terms of different DL model
architectures and different embedding sizes. To analyze the proposed DAS-RHDIS
method in the framework of different DL models designed for multi-label RS CBIR,
we selected the CNN architectures of: i) S-CNN; ii) DenseNet-121; and iii) ResNet-50.
The embedding size for each architecture was set to 256. In Table 4.1, the results are
shown for the UCMerced archive. By assessing the table, one can observe that all
the considered DL model architectures provide a high performance. As an example,
although S-CNN is an explicitly shallow architecture, it achieves an F1 score of more
than 50%, as do DenseNet-121 and ResNet-50. This shows that the proposed DAS-RHDIS
method is architecture-independent. One can also see from the table that the best
scores under all metrics were obtained when ResNet-50 was utilized. As an example,
ResNet-50 provides almost 9% higher precision and 8.5% higher recall compared
to DenseNet-121. When compared with S-CNN, ResNet-50 leads to at least 14%
higher F1 score and accuracy. These results show that a proper selection of a DL
model architecture can improve performance. For the rest of the experiments, we
provided the results obtained with ResNet-50 due to its proven success.
TABLE 4.1: THE PERFORMANCE OF DIFFERENT DL MODEL ARCHITECTURES FOR THE
UCMERCED ARCHIVE.

Architecture    Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
S-CNN           40.5           48.9            51.9         50.3
DenseNet-121    45.5           54.4            58.0         56.1
ResNet-50       54.5           63.3            66.5         64.8

TABLE 4.2: THE EFFECT OF VARYING EMBEDDING SIZES ON THE RETRIEVAL PERFORMANCE
FOR THE UCMERCED ARCHIVE.

Embedding Size   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
256              54.5           63.3            66.5         64.8
512              56.2           64.6            69.0         66.7
1024             56.8           65.3            70.0         67.5
2048             50.3           58.4            62.8         60.5
In Table 4.2, the results obtained by using different embedding sizes are shown for
the UCMerced archive. We evaluated the effect of the embedding sizes 256, 512, 1024
and 2048 used in the proposed DAS-RHDIS method. From the table, one can see that
the highest scores under all metrics are obtained when the embedding size is 1024.
A further increase of the embedding size to 2048 does not improve the performance
and in fact yields the lowest scores. As an example, the proposed method with the
embedding size of 1024 provides a 7% higher F1 score compared to that of 2048. This
is in line with the works in the literature, which demonstrate that beyond a certain
size, adding any new embedding dimension may not improve the performance
[151]–[153]. By analyzing the table, one can also observe that the performance
decreases when the embedding size is reduced to 256. In this case, the F1 score is
reduced by almost 3% compared to the embedding size of 1024. Accordingly, for the
rest of the experiments, we set the embedding size to 1024. These results were also
confirmed through experiments obtained by using the IRS-BigEarthNet archive (not
reported for space constraints).
4.5.2 Ablation Study
In this sub-section, we perform an ablation study to analyze the effectiveness of
the proposed DAS and RHDIS strategies. To demonstrate the effectiveness of the
proposed DAS strategy, we compare it with the RAS and BAS strategies. Table 4.3 shows
the results associated with the different anchor selection strategies for the UCMerced
archive when the proposed RHDIS strategy is used for positive and negative image
selection. By analyzing the table, one can observe that the proposed DAS strategy
provides the highest scores under all the metrics compared to RAS and BAS. As an
example, the proposed DAS strategy provides more than 7% higher accuracy compared
to RAS under the same number of anchors (which is set to 10 in the experiments). In
addition, the proposed DAS strategy leads to 3.5% higher recall with a smaller number
of anchors compared to BAS. It is worth noting that BAS uses all the possible anchors
from the mini-batch (i.e., 100 anchors). This shows the success of the proposed
DAS strategy in selecting diverse and representative anchors with respect to random
sampling and batch selection strategies.

TABLE 4.3: RESULTS OBTAINED BY THE DIFFERENT ANCHOR SELECTION STRATEGIES (RAS,
BAS AND PROPOSED DAS) UNDER DIFFERENT METRICS FOR THE UCMERCED ARCHIVE
WHEN PROPOSED RHDIS IS USED FOR POSITIVE AND NEGATIVE IMAGE SELECTION.

Anchor Selection Strategy   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
RAS                         49.2           58.1            61.9         60.0
BAS                         53.5           62.0            66.5         64.2
Proposed DAS                56.8           65.3            70.0         67.5

TABLE 4.4: RESULTS OBTAINED BY THE DIFFERENT POSITIVE AND NEGATIVE IMAGE SELECTION
STRATEGIES (RIS, BIS AND PROPOSED RHDIS) UNDER DIFFERENT METRICS FOR THE
UCMERCED ARCHIVE WHEN PROPOSED DAS IS USED FOR ANCHOR SELECTION.

Positive and Negative Selection Strategy   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
RIS                                        48.6           57.4            60.1         58.7
BIS                                        48.9           57.6            61.4         59.4
Proposed RHDIS                             56.8           65.3            70.0         67.5
In order to demonstrate the effectiveness of the proposed RHDIS strategy, we com-
pare it with RIS and BIS strategies. Table 4.4 shows the results associated with the
different positive and negative image selection strategies for the UCMerced archive
when the proposed DAS strategy is used for anchor selection. From the table, one
can see that the proposed RHDIS strategy achieves the highest performance under all
metrics compared to RIS and BIS. As an example, the recall of the proposed RHDIS
strategy is more than 8% higher compared to that of BIS when the anchor selection
strategy is set to proposed DAS. It is worth noting that BIS exploits all positive and
negative images in the batch, while RHDIS relies on a much smaller number of
triplets to achieve this result. The performance of RIS is lower than RHDIS and BIS
under each metric when the anchor selection strategy is set to proposed DAS. For
example, the recall obtained by RIS is about 10% lower than that of proposed RHDIS
under the same number of triplets. This shows the effectiveness of the proposed
RHDIS selection strategy to select relevant, hard and diverse positive-negative im-
ages compared to random sampling and batch selection strategies for a given set of
anchors. These results were also confirmed through experiments obtained by using
the IRS-BigEarthNet archive.
4.5.3 Comparison with Different Triplet Sampling Methods
In this sub-section, we evaluate the effectiveness of the proposed DAS-RHDIS method
compared to different triplet selection methods, which are TNDML [149] and
RSDML [15]. Table 4.5 shows the corresponding image retrieval performances on the
IRS-BigEarthNet and the UCMerced archives. By analyzing the table, one can see
that the proposed DAS-RHDIS method leads to the highest scores under all metrics
for both archives. For example, DAS-RHDIS outperforms TNDML by 4% in precision
and more than 3% in accuracy for the IRS-BigEarthNet archive, and by more than 13%
in F1 score and more than 14% in recall for the UCMerced archive. The proposed
DAS-RHDIS method provides about 2% higher and 8% higher F1 scores compared to the
RSDML method for IRS-BigEarthNet and UCMerced, respectively. These results
demonstrate the success of the proposed DAS-RHDIS method compared to other triplet
sampling methods.

TABLE 4.5: THE PERFORMANCE OF DIFFERENT TRIPLET SELECTION METHODS FOR THE
IRS-BIGEARTHNET AND UCMERCED ARCHIVES.

Archive           Method                Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
IRS-BigEarthNet   TNDML [149]           59.3           73.7            73.8         73.8
IRS-BigEarthNet   RSDML [15]            60.2           75.4            73.9         74.6
IRS-BigEarthNet   Proposed DAS-RHDIS    62.7           77.7            75.7         76.7
UCMerced          TNDML [149]           44.0           52.6            55.8         54.2
UCMerced          RSDML [15]            48.4           56.3            61.9         59.0
UCMerced          Proposed DAS-RHDIS    56.8           65.3            70.0         67.5
Fig. 4.5 shows an example of images retrieved from IRS-BigEarthNet by TNDML,
RSDML and the proposed DAS-RHDIS when the query image contains Arable land,
Pastures and Complex cultivation patterns. The retrieval order of images is given below
the query image. By analyzing the figure, one can observe that the classes of Pastures
and Arable land are very prominent in all images retrieved by RSDML and DAS-RHDIS,
while TNDML provides images similar to the query only at the retrieval
orders of 5 and 10. When DAS-RHDIS is compared with RSDML, the proposed
method retrieves semantically more similar images. One of the reasons is that
RSDML relies only on the class label similarity, while the proposed DAS-RHDIS
method: i) extracts and exploits the semantic content of the images; and ii) considers
the diversity and hardness of images during triplet selection. We observed similar
behavior for the UCMerced archive. Fig. 4.6 shows an example of images retrieved
from UCMerced. The query image for this example only contains the Field class.
Most of the images retrieved by the proposed method (except the 20th image) belong
to the same class as the query (see Fig. 4.6-d). However, only a small number of
images retrieved by the TNDML and the RSDML methods contain the Field class
(see Fig. 4.6-b and 4.6-c).
FIGURE 4.5: An image retrieval example: (a) query image; (b) images retrieved by TNDML;
(c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method
(IRS-BigEarthNet archive). Retrieval orders shown: 2nd, 5th, 10th, 15th and 20th.

During the learning of a metric space by using the triplet loss, a small subset of the
available triplets carries the information needed to learn an accurate representation
for image retrieval. The proposed DAS-RHDIS identifies these triplets and only
learns from a subset of selected informative and representative samples, reducing the
number of training triplets. Fig. 4.7 shows the performance of TNDML, RSDML and
the proposed DAS-RHDIS method in terms of the number of accumulated training
triplets under the same number of epochs (which is set to 100 in the experiments) for
the UCMerced archive. The horizontal axis shows the number of triplets on a
logarithmic scale, while the vertical axis shows the corresponding F1 scores. The
performance is associated with the number of triplets utilized by the considered triplet
selection method. The annotation points indicate the number of triplets needed for
the considered method to reach at least 90% of its final performance. From the figure,
one can observe that even after the last training epoch of the proposed DAS-RHDIS
method, the total number of triplets is significantly smaller than that of the first epoch
of the RSDML method. During training, RSDML selects more triplets at each epoch
compared to the other two methods. This is because RSDML selects all the possible
triplets from a mini-batch, and the number of such triplets grows cubically with the
mini-batch size. The final F1 score of our proposed method is more than 8% higher
than that of RSDML with a significantly smaller total number of triplets. One can also
see from the figure that TNDML (which uses random triplet selection) under the same
number of triplets as our method leads to a significant performance drop. The F1 score
obtained by TNDML is 13% lower than the F1 score obtained by the proposed
DAS-RHDIS method. These results show the effectiveness of our method in selecting a
subset of informative triplets during training, resulting in faster convergence and a
performance gain in the retrieval.
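The two quantities discussed above can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, the F1 curve is made-up data rather than values read from Fig. 4.7, and the triplet count is a loose upper bound (all ordered triples of distinct images), since the exact number of valid triplets depends on the labels in the mini-batch.

```python
def all_triplets_upper_bound(batch_size):
    """Number of ordered (anchor, positive, negative) triples of distinct
    images in a mini-batch; an O(B^3) upper bound on the valid triplets."""
    return batch_size * (batch_size - 1) * (batch_size - 2)


def triplets_to_reach(f1_per_epoch, triplets_per_epoch, fraction=0.9):
    """Accumulated number of triplets at the first epoch whose F1 score
    reaches `fraction` of the final F1 score."""
    target = fraction * f1_per_epoch[-1]
    accumulated = 0
    for f1, n in zip(f1_per_epoch, triplets_per_epoch):
        accumulated += n
        if f1 >= target:
            break
    return accumulated


# Cubic growth of the exhaustive triplet set with the mini-batch size.
for b in (10, 100, 1000):
    print(b, all_triplets_upper_bound(b))

# Made-up training curve (NOT taken from Fig. 4.7): five epochs with
# 1000 selected triplets each.
f1_curve = [0.30, 0.50, 0.62, 0.66, 0.675]
print(triplets_to_reach(f1_curve, [1000] * 5))
```

The second function mirrors how the annotation points in such a plot can be derived: accumulate the triplets used per epoch and stop at the first epoch that reaches 90% of the final score.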
4.5.4 Comparison with the State-of-the-Art DML Approaches
FIGURE 4.6: An image retrieval example: (a) query image; (b) images retrieved by TNDML;
(c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method
(UCMerced archive). Retrieval orders shown: 2nd, 5th, 10th, 15th and 20th.

In this sub-section, we assess the effectiveness of the proposed DAS-RHDIS method
compared to the state-of-the-art deep metric learning approaches, which are: CBMIR
[150], MSL [143], DATL [19] and NPL [144]. Table 4.6 shows the results under
different metrics for the IRS-BigEarthNet and UCMerced archives. By analyzing
the table, one can see that the proposed DAS-RHDIS method leads to the highest
scores under all metrics for both archives. As an example, the proposed DAS-RHDIS
method provides 2% higher and 8% higher accuracy compared to the DATL method
for IRS-BigEarthNet and UCMerced, respectively. The table also shows that the
CBMIR and the MSL methods obtain the lowest scores in most of the metrics. For
example, CBMIR provides more than 4% lower and 14% lower precision than the
proposed DAS-RHDIS for IRS-BigEarthNet and UCMerced, respectively. Since the
loss function in CBMIR forces a fixed distance for all images, it is more restrictive
compared to the triplet-based DML losses. This can lead to learning a metric space
in which the similarity between the images is not properly characterized [146].
When compared with the MSL method, DAS-RHDIS achieves 7% higher recall and
more than 4% higher accuracy for the IRS-BigEarthNet archive, and more than 7%
higher precision and 8% higher F1 score for the UCMerced archive. Despite the proven
success of the MSL method for single-label images, we observed that the full capacity
of this method is not applicable to multi-label images. Since the MSL method
considers all the possible negatives and positives and their relative feature distances
among each other, its performance is very sensitive to the proper definition of the
positive and the negative sets for a given anchor. However, a clear distinction
between these sets is difficult to achieve for multi-label images. When compared with
the NPL method, the proposed DAS-RHDIS method provides 2% higher and 7% higher
F1 scores for IRS-BigEarthNet and UCMerced, respectively. It is worth noting that
NPL obtains results relatively close to those of the proposed DAS-RHDIS due to its
negative mining strategy. NPL uses an extension of the triplet loss, which selects
multiple negative images from different negative classes for each anchor and positive
image. This negative mining strategy allows NPL to include class-based diversity among
the negative samples. However, in NPL, the hardness and diversity in the positive
samples are not considered, resulting in the selection of trivial triplets. This can affect
its performance for the retrieval task. The proposed DAS-RHDIS identifies informative
and representative triplets by relying on the relevancy, hardness and diversity of
images. This allows us to reach more effective image retrieval performance compared
to the other methods.

FIGURE 4.7: F1 scores obtained by different triplet sampling strategies and the number of
accumulated triplets during the training (the UCMerced archive). Horizontal axis: number
of accumulated triplets (logarithmic scale, 10^4 to 10^8); vertical axis: F1 score (0.3 to
0.7); curves: TNDML, RSDML and proposed DAS-RHDIS; annotated points: 690×10^5
triplets and 4×10^5 triplets.

TABLE 4.6: THE PERFORMANCE OF DIFFERENT DEEP METRIC LEARNING METHODS FOR THE
IRS-BIGEARTHNET AND UCMERCED ARCHIVES.

Archive           Method                Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
IRS-BigEarthNet   CBMIR [150]           59.6           73.2            74.6         73.9
IRS-BigEarthNet   MSL [143]             57.9           75.0            68.7         71.7
IRS-BigEarthNet   DATL [19]             60.6           75.3            74.0         74.7
IRS-BigEarthNet   NPL [144]             60.8           76.5            72.6         74.5
IRS-BigEarthNet   Proposed DAS-RHDIS    62.7           77.7            75.7         76.7
UCMerced          CBMIR [150]           42.0           50.9            53.0         51.9
UCMerced          MSL [143]             46.6           58.1            61.0         59.5
UCMerced          DATL [19]             48.7           57.2            60.7         58.9
UCMerced          NPL [144]             51.8           61.5            58.7         60.1
UCMerced          Proposed DAS-RHDIS    56.8           65.3            70.0         67.5
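To illustrate how an N-pair style objective uses several negatives per anchor-positive pair, as described for NPL [144] above, the following Python sketch evaluates the commonly used form log(1 + sum_i exp(a·n_i - a·p)) for a single anchor. The function name and the toy embeddings are assumptions for illustration; the exact configuration used in the comparison is not reproduced here.

```python
import numpy as np


def n_pair_loss(anchor, positive, negatives):
    """N-pair style loss for one anchor with one positive and several
    negatives: log(1 + sum_i exp(a.n_i - a.p)), using dot-product similarity."""
    pos_sim = anchor @ positive        # similarity to the positive
    neg_sims = negatives @ anchor      # similarity to each negative
    return float(np.log1p(np.sum(np.exp(neg_sims - pos_sim))))


# Toy 2-D embeddings (made-up data).
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
hard_negs = np.array([[1.0, 0.0]] * 3)    # negatives as close as the positive
easy_negs = np.array([[-1.0, 0.0]] * 3)   # negatives far from the anchor

print(n_pair_loss(a, p, hard_negs))   # log(1 + 3): every negative is hard
print(n_pair_loss(a, p, easy_negs))   # much smaller: negatives are already separated
```

The loss is driven only by how the negatives compare with the single positive, which is consistent with the observation above that hardness and diversity among the positives are not taken into account.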
4.6 Conclusion
This chapter introduces a novel method to select a set of informative and representative
triplets from multi-label training images to achieve deep metric learning for
multi-label CBIR problems in RS. The proposed triplet sampling method is defined
based on a two-step procedure and applied to each training mini-batch of a DL-based
retrieval system. In the first step, diverse anchor images are selected based on a
simple but efficient iterative algorithm. Then, in the second step, sets of positive and
negative images for each anchor are selected based on the relevancy, hardness and
diversity of the positive and negative images. Finally, the triplets are constructed from
the selected anchors and their respective positive and negative images. Through the
above-mentioned steps, the proposed method selects a compact subset of
informative and representative triplets, which enables accurate and efficient learning
of DL models for multi-label CBIR in RS. Experimental results obtained on two
multi-label RS benchmark archives under different DL architectures show the
effectiveness of the proposed method in CBIR problems. In detail, the results have
demonstrated that most of the available triplets do not contribute to the learning
progress and can be safely discarded. Focusing on a small informative and
representative subset is sufficient to achieve performance comparable to that obtained
when all possible triplets are used. It is worth noting that the proposed triplet sampling
method does not rely on a specific DL architecture and can be adapted to any metric
learning method.
As a final remark, we would like to point out that the proposed method currently
relies on the class labels to select positive and negative images for each anchor. As
future work, we plan to develop an unsupervised strategy that can select informative
positive and negative images without requiring any land-use land-cover class label.
Chapter 5
Towards Simultaneous Image
Compression and Indexing for Scalable
Content-Based Retrieval in Remote
Sensing
Due to the rapidly growing RS image archives, images are usually stored in a
compressed format to reduce their storage sizes. Thus, most of the existing CBIR
systems require fully decoding images (i.e., decompression), which is computationally
demanding for large-scale archives. To address this issue, in this chapter, we
introduce a novel approach devoted to simultaneous RS image compression and indexing
for scalable content-based image retrieval (denoted as SCI-CBIR). The proposed
SCI-CBIR removes the need to decode RS images prior to image search and
retrieval. To this end, it includes two main steps: i) deep learning-based compression;
and ii) deep hashing-based indexing. The first step effectively compresses RS images
by employing a pair of deep encoder and decoder neural networks and an entropy
model. The second step produces hash codes with a high discrimination capability
for RS images by employing pairwise, bit-balancing and classification loss functions.
For the training of the SCI-CBIR approach, we also introduce a novel multi-stage
learning procedure with automatic loss weighting techniques to characterize RS
image representations that are appropriate for both RS image indexing and compression.
The proposed learning procedure enables automatic weighting of the different
loss functions considered for the proposed approach instead of a computationally
demanding grid search. Experimental results show the effectiveness of the proposed
approach when compared to widely used approaches in RS. The code of the proposed
approach is available at https://git.tu-berlin.de/rsim/SCI-CBIR. This chapter
is mainly based on the following publications:
G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression
and indexing for scalable content-based retrieval in remote sensing,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022.
DOI: 10.1109/TGRS.2022.3204914.
G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to
jointly compress and index remote sensing images for efficient content-based
retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing
Symposium, 2022, pp. 251–254. DOI: 10.1109/IGARSS46834.2022.9884146.
5.1 Introduction
For large-scale CBIR, fast and accurate indexing methods that allow approximate
nearest neighbor search are fundamental. In this perspective, hashing-based
indexing has recently attracted attention for solving large-scale approximate nearest
neighbor search problems in RS CBIR due to its efficiency (in terms of
both storage and speed) and accurate search capability within huge image archives.
Hashing methods map high-dimensional image features into compact binary hash
codes [154]. Then, image retrieval can be achieved by calculating the Hamming
distances with simple bit-wise XOR operations [41]. Several hashing methods have
been presented in RS [20], [21], [27], [42]–[46], [130], [155], [156]. The traditional
hashing methods extract hand-crafted image features and map them into low-dimensional
binary codes by using hashing functions [130], [155], [156]. In these methods, image
feature extraction and hash code generation are applied separately. Thus, they are
not capable of simultaneously optimizing feature learning and hash code learning,
which limits the capability of the generated hash codes to represent the high-level
semantic content of RS images. Recently, several deep hashing-based indexing
methods have been introduced in RS to address this issue. As an example, in [21] a deep
hashing neural network (DHNN) is introduced to learn high-level semantic features
and compact hash codes in an end-to-end manner. To improve the training stability
of deep neural networks (DNNs) while learning hash codes, DHNN generates
continuous approximations of hash codes during training while exploiting a
quantization loss to push the approximated hash codes towards the discrete values. In
greater detail, the likelihood pairwise loss is utilized in DHNN to preserve the
similarity of images in their hash codes. However, the pairwise loss can lead similar
images to cluster together in a small portion of the Hamming space, which prevents the
generation of discriminative hash codes. To avoid this problem, in [42], a deep hashing
convolutional neural network (DHCNN) is introduced to employ image labels for
learning more discriminative hash codes. To this end, DHCNN learns to predict
image labels together with generating hash codes by jointly optimizing the cross-entropy
loss with pairwise and quantization losses. Despite the success of the pairwise loss in
these methods, the triplet loss has been found to be more effective than the pairwise
loss because it introduces a margin threshold between the similar and dissimilar images.
Accordingly, in [27], a metric learning-based deep hashing network (MiLaN) is introduced
to combine the quantization loss with the triplet loss. In addition, MiLaN also employs
a bit-balancing loss for maximizing code variance and information by forcing each bit
to have an equal chance of being 0 or 1. Unlike the above-mentioned methods, which
utilize pre-trained convolutional neural networks (CNNs), in [43], a semi-supervised
hashing adversarial autoencoder (SSHAAE) is introduced to employ an adversarial
autoencoder network for generating discriminative and similarity-preserving hash
codes with low quantization errors by end-to-end training. In addition to the losses used
by DHCNN, SSHAAE also employs bit-balancing and reconstruction losses. In [44],
a generative adversarial network is exploited for hash code learning, where losses
similar to those of DHCNN are used for the generator and a sigmoid function is used
for the discriminator to determine whether the generated codes are true codes that
comply with the bit-balancing rule. In [45], a meta-hashing algorithm is introduced to
increase the generalization capability of DNNs utilized for hash code generation under
a small number of training samples. To this end, this algorithm employs few-shot meta
learning for hash code generation by dividing a learning objective into multiple
sub-tasks and using all training samples multiple times. In [46], an asymmetric
hash code learning (AHCL) method is proposed to increase the training efficiency
of DNNs for hash code learning. To this end, AHCL learns a deep hashing function
only for query images, while hash codes of archive images are obtained from query
hash codes based on class label similarity.
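The hashing pipeline summarized above (continuous code approximations, quantization towards discrete values, bit balancing, and Hamming-distance search with XOR) can be sketched in a few lines of Python. This is a minimal illustration, not the formulation of any cited method: the function names, the specific penalty forms and the toy codes are assumptions.

```python
import numpy as np


def binarize(codes):
    """Map continuous code approximations to bits (1 where the value is > 0)."""
    return (codes > 0).astype(np.uint8)


def quantization_penalty(codes):
    """Assumed form: mean squared gap between |code| and 1, which is zero
    when every entry is already -1 or +1."""
    return float(np.mean((np.abs(codes) - 1.0) ** 2))


def bit_balance_penalty(codes):
    """Assumed form: squared batch mean of each bit, averaged over bits.
    It is small when each bit is positive about as often as negative."""
    return float(np.mean(np.mean(codes, axis=0) ** 2))


def hamming_distances(query_bits, archive_bits):
    """Hamming distances computed with XOR on packed bytes and a bit count."""
    q = np.packbits(query_bits)             # 1-D array of bytes
    a = np.packbits(archive_bits, axis=1)   # one packed row per archive image
    xor = np.bitwise_xor(a, q)              # differing bits are set to 1
    return np.unpackbits(xor, axis=1).sum(axis=1)


# Toy continuous codes for three archive images with 10-bit hash codes.
codes = np.array([[ 0.9, -0.8,  0.7,  0.9, -0.6, -0.9,  0.8, -0.7,  0.9,  0.6],
                  [-0.9, -0.8,  0.7,  0.9, -0.6, -0.9,  0.8, -0.7,  0.9,  0.6],
                  [-0.9,  0.8, -0.7, -0.9,  0.6,  0.9, -0.8,  0.7, -0.9, -0.6]])
bits = binarize(codes)
print(hamming_distances(bits[0], bits))   # distance of image 0 to every image
print(quantization_penalty(codes), bit_balance_penalty(codes))
```

Ranking the archive by these distances gives the retrieval result; because only XOR and bit counts are involved, the search cost per image is small compared to distance computations on real-valued features.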
The above-mentioned hashing-based indexing methods are potentially effective for
RS CBIR. However, RS images are usually stored in a compressed format in archives
to reduce their storage sizes [47]. Thus, image decoding (i.e., decompression) is
required before applying any hashing method. This is computationally demanding and
impractical in the case of large-scale CBIR problems. To the best of our knowledge,
there is no hashing-based indexing method in RS that can be applied in the compressed
domain efficiently and effectively. To address this issue, in this chapter we introduce
a novel approach devoted to simultaneous RS image compression and indexing for
scalable content-based image retrieval (denoted as SCI-CBIR). Unlike the existing CBIR
approaches in RS, the proposed approach simultaneously indexes RS images with
hash codes while effectively compressing them. To this end, the proposed SCI-CBIR
is made up of two main steps: i) deep learning-based compression; and ii) deep
hashing-based indexing. The first step applies image feature extraction and image
reconstruction based on a pair of encoder and decoder DNNs, while a probabilistic
entropy model is employed to optimize the length of the compressed bitstreams. The
second step employs pairwise, bit-balancing and classification loss functions for the
generation of hash codes based on image features characterized by the first step. To
effectively characterize image features for both image indexing and compression,
we propose a novel multi-stage learning procedure for the training of the proposed
SCI-CBIR approach, which allows the different loss functions considered in both steps
to be weighted automatically. Please note that the aim of this study is to introduce
neither a new compression algorithm nor a new hashing algorithm, but to propose a
novel approach that simultaneously indexes and compresses RS images. With the
proposed approach, the need for decompressing images prior to indexing, unlike the
existing CBIR approaches in RS, is fully eliminated. The main contributions of this
work are summarized as follows:
For the first time in RS, the proposed SCI-CBIR approach simultaneously applies
RS image compression and indexing, and thus does not require RS image
decoding prior to CBIR, which can save a significant amount of time for operational
applications.
The proposed multi-stage learning procedure automatically weights all the
considered loss functions, which makes it possible to: i) learn appropriate RS image
representations for both image compression and indexing; ii) eliminate the
computationally demanding grid search; and iii) automatically achieve different
rate-distortion trade-off points.
The proposed SCI-CBIR approach is independent of the image compression
and indexing methods being selected, and can operate with any DNN-based
method.
The rest of this chapter is organized as follows: Section 5.2 presents the related works
on RS image compression and RS CBIR on compressed domain. Section 5.3 introduces
the proposed SCI-CBIR approach. Section 5.4 describes the considered RS image
archives and the experimental setup, while Section 5.5 provides the experimental
results. Section 5.6 concludes the chapter.
5.2 Related Works
In this section, we survey the existing methods for RS image compression and RS
CBIR in the compressed domain. Traditional RS image compression methods are
categorized into three groups: i) prediction-based methods, which predict each spectral
band based on the other bands and encode the prediction residuals to bitstreams
(e.g., CCSDS-123 multi- and hyperspectral image compression standard [157]); ii)
vector quantization methods, which independently reduce the clusters of image
pixels with similar characteristics by grouping them together (e.g., mean-normalized
vector quantization [158]); and iii) transform-based methods, which map RS images
to transform domain (e.g., Karhunen-Loève transform [159], discrete cosine
transform [160], discrete wavelet transform [161] etc.) representations, and thus
reduce the correlation among image pixels. Although prediction-based compression
methods apply lossless compression and embody a low computational complexity,
their compression ratio is generally low, which makes them infeasible for large-scale
RS archives. Vector quantization methods provide a higher compression ratio than
the prediction-based methods. However, training these methods and generating
the required codebooks can be computationally demanding. Transform-based methods
generally provide a high compression ratio and speed of computation, and thus
are widely used for RS image compression on operational archives. Among several
transform-based methods, JPEG 2000 [162] became very popular in RS due to its
multiresolution paradigm, scalability and high compression ratio. The JPEG 2000
algorithm is widely used to compress RS images acquired by most of the recent
satellites (such as Sentinel-2 [163]).
Recent studies on learning-based compression show that deep learning (DL) based
compression methods preserve the perceptual quality of images at lower bit rates
compared to traditional methods such as JPEG 2000 [164]. DL-based image compression
methods usually consist of a pair of encoder and decoder DNNs for feature
extraction and image reconstruction, and an entropy model for bit-rate optimization.
According to the type of the DNN, recent DL-based image compression methods can
be divided into one-time feed-forward and multi-stage recurrent compression
methods [165]. One-time feed-forward DNNs (e.g., CNNs) apply image encoding and
decoding only once, and thus need to be trained multiple times for
different bit-rates. However, for multi-stage recurrent DNNs (e.g., recurrent neural
networks), image encoding is iteratively applied, while the number of iterations
determines a variable range of bit-rates within a single training. In RS, few DL-based
image compression methods have been proposed in the framework of a standard
CNN-based image compression, where a piecewise linear approximation to the
occurrences of pixel values is used as an entropy model. In [166], a residual network
framework is introduced to adapt the standard CNN-based compression for
multispectral RS images by characterizing RS image representations with residual blocks
and a weighted feature channel module. In [164], spectral–spatial feature partitioned
extraction is integrated into the standard CNN-based compression to characterize
spatial and spectral content of RS images in a parallel fashion. In [167], polydirectional
CNNs are introduced in the standard CNN-based compression to separately
extract the spectral and spatial RS image features for preventing the dominance of
either spatial or spectral content. In the computer vision community, generalized
divisive normalization [168], residual blocks [169], attention modules [170] and non-local
networks [171] have been employed in the context of CNN-based compression to
further reduce the spatial redundancy when characterizing image latents, which results
in a lower bit-rate for entropy encoding. To further improve the compression ratio,
hyperpriors [172], autoregressive context models [173] and discretized Gaussian
mixture likelihoods [170] are incorporated into the entropy model for more accurate
bit-rate optimization. The reader is referred to [165] for recent advances on DL-based
image compression.
To our knowledge, there is only one study in RS that is devoted to applying
CBIR in the compressed domain [174]. To reduce the time required for fully decoding
images, [174] proposes a coarse-to-fine progressive RS image description and retrieval
system in the partially decoded JPEG 2000 compressed domain. In that
system, only the code-blocks associated with the coarse wavelet resolution are initially
decoded. Then, the images most irrelevant to the query image are discarded based
on the similarities computed on the coarse-resolution wavelet features of the query
and archive images. Code-block decoding and elimination of irrelevant images are
iterated until the codestreams associated with the highest wavelet
resolution are decoded. Finally, the most similar images to the query are chosen.
Although that system significantly reduces the retrieval time compared to systems
that require full decoding, it still requires partial decompression, which may take
significant time in operational CBIR applications.
As mentioned above, DL-based image compression methods are much more successful
in preserving the perceptual quality of images at lower bit-rate values compared
to JPEG2000 [164]. To our knowledge, our SCI-CBIR approach is the first
study on scalable CBIR in the DL-based compressed domain in RS.
5.3 Proposed SCI-CBIR Approach
Let $X = \{x_1, \ldots, x_M\}$ be an RS image archive that includes $M$ non-compressed images,
where $x_t$ is the $t$-th image in the archive. We assume that a training set $T \subset X$ is
available, where each $x_i \in T$ is associated with a set of class labels $l_i \in \{0,1\}^K$ and $K$ is
the number of classes.
[Figure 5.1 block diagram. First step (DL-based compression): image encoding, entropy modelling, compression decoding. Second step (deep hashing-based indexing): index decoding, hash code generation, class prediction.]
FIGURE 5.1: Illustration of the proposed SCI-CBIR approach.
The proposed SCI-CBIR approach aims to achieve accurate CBIR in a scalable way
without any need for decompression of RS images prior to CBIR. Accordingly, SCI-CBIR
simultaneously: i) compresses each image $x_i \in X$ into a bitstream; and ii)
indexes each image through a $q$-bit hash code $b_i$ (which is stored in a hash table
for scalable CBIR). This is achieved in two steps: i) DL-based compression;
and ii) deep hashing-based indexing. For the training of SCI-CBIR, we introduce a
multi-stage learning procedure to automatically define different loss weights and
rate-distortion trade-off points. Fig. 5.1 shows an illustration of the proposed SCI-CBIR
approach, which is explained in detail in the following subsections.
5.3.1 First Step: DL-Based Compression
The DL-based compression step of the proposed SCI-CBIR approach aims to compress
each RS image to a minimum length bitstream, which is efficiently stored and utilized
for reconstructing the image with a minimum amount of distortion. By following
the recent advances in DL-based image compression, this step employs a pair of
encoder and decoder DNNs for learning to reconstruct RS images and an entropy model
for reducing the length of bitstreams (i.e., bit-rate optimization). Accordingly, this
step includes three main blocks: i) image encoding; ii) compression decoding; and iii)
entropy modelling.
Let $f: X \to Y$ be an image encoder that maps the image $x_i$ to its latent $y_i$, where $Y$ is
the set of all latents for $X$. The first block of this step transforms $x_i$ into its quantized
latent representation $\hat{y}_i$ as follows:
$$y_i = f(x_i; \theta_f); \quad \hat{y}_i = Q(y_i), \quad (5.1)$$
where $Q(a) = \lfloor a \rceil$ is a rounding function that converts $a$ into its nearest integer (i.e.,
quantization) and $\theta_f$ is the encoder parameters. During training, $Q(a)$ is replaced by
$\mathcal{U}(a - \tfrac{1}{2}, a + \tfrac{1}{2})$, where $\mathcal{U}$ is a uniform distribution. Let $g: Y \to \hat{X}$ be a decoder that
maps the latent $y_i$ into the reconstructed image $\hat{x}_i$, where $\hat{X}$ is the set of reconstructed
images. The second block of this step reconstructs $x_i$ from its quantized representation
as follows:
$$\hat{x}_i = g(\hat{y}_i; \theta_g) = g(Q(f(x_i; \theta_f)); \theta_g), \quad (5.2)$$
where $\theta_g$ is the decoder parameters. The third block of this step estimates the required
number of bits to encode $\hat{y}_i$, which is defined according to the mutual information
between $x_i$ and $\hat{x}_i$. Since the actual distribution of image latents $p_{\hat{Y}}$ is unknown, its
inference is intractable. Accordingly, the entropy modelling block estimates $p_{\hat{Y}}$ with
an entropy model $q_{\hat{Y}|\theta_e}$, where $\theta_e$ is the entropy model parameters. This block also
employs an arithmetic coding algorithm $A$, which consists of an arithmetic encoder $A_e$
and an arithmetic decoder $A_d$ for generating compressed bitstreams from quantized
representations.
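The two quantization modes described above can be illustrated with a minimal sketch. It assumes latents are given as plain Python lists; the function name `quantize` and the example values are hypothetical and not taken from the thesis.

```python
import random

def quantize(latent, training=False, rng=None):
    """Quantize a list of latent values.

    Inference: round each value to the nearest integer, i.e. Q(a).
    Training: replace rounding by a sample from U(a - 1/2, a + 1/2),
    which is the same as adding U(-1/2, 1/2) noise to a.
    """
    if training:
        rng = rng or random.Random()
        return [a + rng.uniform(-0.5, 0.5) for a in latent]
    # Python's round() resolves exact .5 ties to the nearest even integer.
    return [float(round(a)) for a in latent]

y = [0.2, 1.7, -2.4]                      # illustrative latent values
y_hat_eval = quantize(y)                  # hard rounding
y_hat_train = quantize(y, training=True, rng=random.Random(0))
```

The noisy variant keeps the operation differentiable with respect to the latent during training, while its values stay within half a unit of the unquantized latent.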
To achieve minimum image compression distortion at a minimum length of bitstream,
the image compression objective $L_C$ is defined according to a rate-distortion
optimization problem [175] as follows:
$$L_C = L_R + \lambda L_D,$$
$$L_C = \mathbb{E}_{x_i \sim p_x}\left[-\log\left(q_{\hat{y}_i}(\hat{y}_i)\right)\right] + \lambda\, \mathbb{E}_{x_i \sim p_x}\left[d(x_i, \hat{x}_i)\right], \quad (5.3)$$
where $p_x$ is approximated over the images of $T$. The rate term $L_R$ is the cross entropy
between the entropy model $q_{\hat{y}}$ and the marginal distribution of quantized image
latents $\mathbb{E}_{x_i \sim p_x}\, p_{\hat{y}|x}$. $d$ is the distortion metric, for which we utilize the multiscale structural
similarity index (MS-SSIM) [176] as $d(x_i, \hat{x}_i) = 1 - \text{MS-SSIM}(x_i, \hat{x}_i)$. $\lambda$ controls the
rate-distortion trade-off points.
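As a rough numerical illustration of (5.3), the sketch below evaluates the rate and distortion terms on a mini-batch. It assumes that the entropy-model likelihoods of the quantized latent elements and the MS-SSIM values are already available as plain numbers; the function name and all input values are hypothetical. The natural logarithm is used, so the rate is measured in nats (a base-2 logarithm would give bits).

```python
import math

def rate_distortion_loss(likelihoods, ms_ssim_values, lam):
    """Return (L_C, L_R, L_D) with L_C = L_R + lam * L_D, averaged per image.

    likelihoods: per image, the probabilities assigned by the entropy model
        to the elements of its quantized latent (illustrative inputs).
    ms_ssim_values: per image, MS-SSIM(x, x_hat) in [0, 1].
    """
    n = len(likelihoods)
    # Rate term: negative log-likelihood of the quantized latents.
    rate = sum(-math.log(p) for probs in likelihoods for p in probs) / n
    # Distortion term: d(x, x_hat) = 1 - MS-SSIM(x, x_hat).
    distortion = sum(1.0 - s for s in ms_ssim_values) / n
    return rate + lam * distortion, rate, distortion

loss, rate, distortion = rate_distortion_loss(
    likelihoods=[[0.5, 0.25], [0.5, 0.5]],
    ms_ssim_values=[0.9, 0.8],
    lam=10.0,
)
```

A larger `lam` puts more weight on reconstruction quality, which pushes the optimum towards higher bit-rates.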
It is worth noting that the proposed SCI-CBIR approach is independent of the
image compression method utilized in the first step, as long as the method
employs a pair of encoder and decoder DNNs. Recent studies on DL-based image
compression have focused on enhancing the capacity of the considered entropy model
for an accurate estimation of $p_{\hat{y}}$, while operating image reconstruction based on
encoder-decoder architectures. In this chapter, for the entropy modelling block of
this step, we consider the context-adaptive hyper-prior based Gaussian mixture
model introduced in [170] as the entropy model due to its proven success in
spatial redundancy reduction. In this entropy model, the probability estimation of
quantized image latents is conditioned on a hyper-prior (which is defined by a
factorized density model) and an autoregressive context model to capture the spatial
dependencies among the elements of quantized latents. The reader is referred to [170]
for the details of this entropy model.
5.3.2 Second Step: Deep Hashing-Based Indexing
The deep hashing-based indexing step of the proposed SCI-CBIR approach aims
to map the latent representation of each RS image (which is characterized in the
first step) into its discriminative hash code, which preserves the semantic image
content. Then, hash codes are indexed in a hash table for all RS images in the archive,
where semantically similar images are in the same hash bucket. To this end, this
step includes three main blocks: i) index decoding; ii) hash code generation; and
iii) class prediction. Let $t: Y, \theta_t \to E$ be a decoder that maps the latent $y_i$ into the
corresponding image embedding for indexing $e_i$ associated with the image $x_i$ (i.e.,
$t(y_i) = e_i$), where $\theta_t$ is the decoder parameters. The index decoding block employs
$t$ for characterizing image embeddings by extracting and decoding semantically
informative features specific to indexing based on the latent representations of images.
Accordingly, $t$ is composed of the attention layer introduced in [170] followed by
convolutional layers. Let $b: E, \theta_b \to \{-1,1\}^q$ be a binarizer that maps the image
embedding $e_i$ into the binary hash code $b_i$ of $x_i$, where $\theta_b$ is the binarizer parameters.
Let $k: E, \theta_k \to \{0,1\}^K$ be a classifier that maps the image embedding $e_i$ into the class
prediction $\hat{l}_i$ of $x_i$, where $\theta_k$ is the classifier parameters. Once the image embedding
$e_i$ is characterized for $x_i$, the class prediction and hash code generation blocks operate
$k$ and $b$ on $e_i$ to generate the corresponding class prediction and hash code, respectively.
To characterize discriminative hash codes that preserve the semantic similarity of
images, we employ the soft pairwise loss (SPL) [147] $L_P$, bit-balancing loss [177] $L_B$,
and a classification loss $L_N$. SPL considers the rank difference of semantic pairwise
similarities of images. To this end, image pairs are grouped into images with hard
similarity and images with soft similarity. An image pair shares either no common
labels or all its labels for hard similarity, while an image pair shares some of its labels
for soft similarity. Let $J = \{(x_i, x_j) \mid x_i \in T, x_j \in T, i \neq j\}$ be a set of all image pairs
in $T$. The SPL function is defined as follows:
$$L_P = \sum_{(x_i,x_j) \in J} m_{ij}\left[\log\left(1 + e^{s^h_{ij}}\right) - s^h_{ij} s^o_{ij}\right] + \gamma (1 - m_{ij}) \left\lVert \frac{1}{2}\left(s^h_{ij} + q\right) - s^o_{ij} q \right\rVert_2^2,$$
$$s^o_{ij} = \frac{\langle l_i, l_j \rangle}{\lVert l_i \rVert_2 \lVert l_j \rVert_2}, \quad s^h_{ij} = \langle b_i, b_j \rangle,$$
$$m_{ij} = \begin{cases} 1, & s^o_{ij} \in \{0, 1\} \\ 0, & 0 < s^o_{ij} < 1 \end{cases} \quad (5.4)$$
where $s^o_{ij}$ and $s^h_{ij}$ are the pairwise similarities between $x_i$ and $x_j$ and their hash codes,
respectively. $m_{ij}$ defines whether $(x_i, x_j)$ is associated with soft similarity ($m_{ij} = 0$)
or hard similarity ($m_{ij} = 1$). $\gamma$ is a weighting parameter between different types of
similarities. For balancing the distribution of hash code bits by maximizing their
variance, we adapt the bit-balancing loss [177] for image pairs as follows:
$$L_B = \sum_{(x_i,x_j) \in J} \left\lVert b_i^T \mathbf{1} \right\rVert_2^2 + \left\lVert b_j^T \mathbf{1} \right\rVert_2^2, \quad (5.5)$$
where $\mathbf{1}$ is a vector with all elements 1. $L_B$ enforces the hash codes to contain
equal numbers of $-1$ and $1$. To further enhance the discriminative capability of hash
codes, we formulate the classification loss over image pairs as follows:
$$L_N = \sum_{(x_i,x_j) \in J} \left\lVert \hat{l}_i - l_i \right\rVert_2^2 + \left\lVert \hat{l}_j - l_j \right\rVert_2^2. \quad (5.6)$$
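The three losses in (5.4)-(5.6) can be checked on a toy example with the following sketch. It is only an illustration under stated assumptions: labels, hash codes and predictions are plain Python lists, every label vector contains at least one active class, the sum runs over all ordered pairs with $i \neq j$ (as the definition of $J$ reads), and hard pairs are detected with a small numerical tolerance. The helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hashing_losses(labels, codes, preds, gamma):
    """Return (L_P, L_B, L_N) summed over all ordered pairs (i, j), i != j.

    labels: multi-hot label vectors l_i in {0, 1}^K (at least one 1 each).
    codes:  hash codes b_i in {-1, +1}^q.
    preds:  class predictions l_hat_i.
    """
    q = len(codes[0])
    ones = [1.0] * q
    l_p = l_b = l_n = 0.0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if i == j:
                continue
            # s^o_ij: cosine similarity of the label vectors.
            norm = math.sqrt(dot(labels[i], labels[i]) * dot(labels[j], labels[j]))
            s_o = dot(labels[i], labels[j]) / norm
            # s^h_ij: inner product of the hash codes.
            s_h = dot(codes[i], codes[j])
            if s_o < 1e-9 or s_o > 1.0 - 1e-9:
                # Hard similarity (m_ij = 1): cross-entropy style term.
                l_p += math.log1p(math.exp(s_h)) - s_h * s_o
            else:
                # Soft similarity (m_ij = 0): squared-error term.
                l_p += gamma * ((s_h + q) / 2.0 - s_o * q) ** 2
            l_b += dot(codes[i], ones) ** 2 + dot(codes[j], ones) ** 2
            l_n += sq_dist(preds[i], labels[i]) + sq_dist(preds[j], labels[j])
    return l_p, l_b, l_n

labels = [[1, 0], [0, 1], [1, 1]]       # toy labels with K = 2
codes = [[1, -1], [-1, 1], [1, 1]]      # toy hash codes with q = 2
l_p, l_b, l_n = hashing_losses(labels, codes, preds=labels, gamma=0.1 / 2)
```

In this toy case the first two images share no labels (hard similarity), while each of them shares one label with the third image (soft similarity). Only the third code is unbalanced, so it is the only contributor to the bit-balancing term.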
By considering the above-mentioned losses defined for the second step of our SCI-CBIR
approach, the final hashing objective is formulated as follows:
$$L_H = w_P L_P + w_B L_B + w_N L_N, \quad (5.7)$$
where $w_P$, $w_B$ and $w_N$ are the loss weights. We note that the proposed SCI-CBIR approach
is independent of the DNN architecture utilized in this step, and thus hash codes
can be obtained through different DNN architectures.
5.3.3 Multi-Stage Learning Procedure
The objectives of both steps of the proposed approach, $L_C$ and $L_H$, enforce the
image encoder $f$ to encode different information in the image latents $Y$. The first
step enforces image latents to embody the maximum information required for
reconstructing images, while the second step enforces image latents to preserve the most
discriminative image features for hash code learning. For the training of the proposed
SCI-CBIR, one could optimize the aggregation of the different losses considered for both
steps in a single learning procedure, as is widely done for combining different
objectives in DL. However, due to the different characteristics of image compression and
indexing tasks, this learning procedure may lead to: i) the competition of the learning
objectives of image compression and indexing tasks; ii) the dominance of one of
the objectives; and iii) limited characterization of each task compared to separately
learning each objective. In addition, in this case, either multiple instances of the considered DNN
need to be trained with different $\lambda$ values or recurrent models need to be integrated
to achieve a variable range of rate-distortion trade-off points [165]. Accordingly, to
prevent these limitations, we propose a multi-stage learning procedure for the training
of our SCI-CBIR that aims to: i) learn RS image latents compatible with both RS image
compression and indexing; ii) automatically weight different losses; and iii) automatically
achieve different rate-distortion trade-off points for compression without
applying a computationally demanding grid search of $\lambda$. To this end, the proposed
learning procedure is made up of three consecutive stages: i) learning reconstruction;
ii) bit-rate optimization; and iii) learning hash codes.
Learning Reconstruction: The compression objective in (5.3) involves a conflict between
the bit-rate and distortion terms: decreasing the bit-rate term increases the distortion
term, and vice versa (i.e., rate-distortion trade-off). Accordingly, to achieve an
effective reconstruction capability without being affected by the rate-distortion trade-off, in
the first stage, only the distortion loss $L_D$ is optimized until its convergence with a
learning rate $\eta_1$, which is gradually decreased based on the value of $L_D$.
Bit-Rate Optimization: To accurately achieve different rate-distortion points, in
the second stage, the optimization of $L_D$ is continued together with the bit-rate loss $L_R$ with
a learning rate $\eta_2$. Most of the existing DL-based image compression methods
require multiple trainings with different $\lambda$ values to achieve different trade-off points
for (5.3). Unlike them, in this stage, we reformulate (5.3) as a multi-objective
optimization problem and employ the multiple-gradient descent algorithm (MGDA) [178]
for automatically achieving the set of optimal trade-off points as the set of Pareto
optimal solutions. Let $g_D = \nabla_{\theta_C} L_D$ and $g_R = \nabla_{\theta_C} L_R$ be the gradient vectors of $L_D$
and $L_R$, respectively, over the parameters $\theta_C = \theta_f \cup \theta_g \cup \theta_e$. The gradient descent
direction for a Pareto optimal solution (which leads to an optimal trade-off point) is
obtained by optimizing the following problem:
$$\min \left\{ \lVert u \rVert_2^2 \;\middle|\; u = w_D g_D + w_R g_R,\; w_D + w_R = 1,\; w_D \geq 0,\; w_R \geq 0 \right\}, \quad (5.8)$$
where $w_D$ and $w_R$ are estimated as follows:
$$w_R = 1 - w_D,$$
$$w_D = \begin{cases} 1, & g_D^T g_R \geq g_D^T g_D \\ 0, & g_D^T g_R \geq g_R^T g_R \\ \dfrac{(g_R - g_D)^T g_R}{\lVert g_R - g_D \rVert_2^2}, & \text{otherwise} \end{cases} \quad (5.9)$$
After obtaining $u$ by solving (5.8), the parameters $\theta_C$ are updated as follows:
$$\theta_C = \theta_C - \eta_2 u = \theta_C - \eta_2 (w_D g_D + w_R g_R). \quad (5.10)$$
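The closed-form weights in (5.9) and the update in (5.10) can be sketched for two gradient vectors as follows. This is a minimal illustration with gradients flattened into plain Python lists; the function names and the example gradients are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mgda_two_task_weights(g_d, g_r):
    """Closed-form (w_D, w_R) minimizing ||w_D g_D + w_R g_R||^2 on the simplex."""
    dr = dot(g_d, g_r)
    if dr >= dot(g_d, g_d):
        w_d = 1.0          # g_D alone is the minimum-norm point
    elif dr >= dot(g_r, g_r):
        w_d = 0.0          # g_R alone is the minimum-norm point
    else:
        diff = [r - d for r, d in zip(g_r, g_d)]
        w_d = dot(diff, g_r) / dot(diff, diff)
    return w_d, 1.0 - w_d

def mgda_step(theta, g_d, g_r, lr):
    """One update theta <- theta - lr * (w_D g_D + w_R g_R)."""
    w_d, w_r = mgda_two_task_weights(g_d, g_r)
    u = [w_d * d + w_r * r for d, r in zip(g_d, g_r)]
    return [t - lr * ui for t, ui in zip(theta, u)], (w_d, w_r)

# Orthogonal gradients of equal norm are weighted equally.
weights_orthogonal = mgda_two_task_weights([1.0, 0.0], [0.0, 1.0])
# A small distortion gradient aligned with a large rate gradient gives w_D = 1.
weights_small_gd = mgda_two_task_weights([0.1, 0.0], [1.0, 0.0])
theta_new, _ = mgda_step([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], lr=0.1)
```

The second case mirrors the situation in which the distortion gradient is already small compared to the rate gradient: the minimum-norm combination then places all the weight on $g_D$.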
Since the distortion loss has converged in the learning reconstruction stage, $w_D \approx 1$
and $w_R \approx 0$ at the beginning of this stage. As this stage continues, $L_R$ decreases
until the first Pareto solution is found by (5.9). Then, by increasing $\eta_2$, $w_R$ is gradually
increased to reach another Pareto solution. Thus, by adjusting the learning rate itself,
this stage allows obtaining the set of optimal rate-distortion trade-off points without
multiple trainings or a computationally demanding grid search
of $\lambda$.
Learning Hash Codes: The last stage involves optimizing all the losses associated
with both steps of our approach to learn RS image latents compatible with both RS
image indexing and compression. To this end, this stage employs two learning
rates $\eta_3^C$ and $\eta_3^H$ for the losses of the first and second steps, respectively. It is worth
noting that since the losses $L_D$ and $L_R$ are optimized in the first two stages, we
keep $\eta_3^C < \eta_3^H$ to prevent the domination of image compression over image indexing.
Since the different rate-distortion points are achieved in the second stage, the overall
objective is written for a rate-distortion point as follows:
$$L = L_C + L_H,$$
$$L_C = w_D L_D + w_R L_R,$$
$$L_H = w_P L_P + w_B L_B + w_N L_N, \quad (5.11)$$
where $w_D$ and $w_R$ are estimated for the specific rate-distortion point in the previous
stage. To automatically find the weights $w_P$, $w_B$ and $w_N$ instead of a time-demanding grid
search, we utilize automatic loss weighting techniques. Accordingly, the update rules
for the SCI-CBIR parameters are written as follows:
$$\theta_C = \theta_C - \eta_3^C \left(\nabla_{\theta_C} L_C + \nabla_{\theta_C} L_H\right),$$
$$\theta_H = \theta_H - \eta_3^H \nabla_{\theta_H} L_H, \quad (5.12)$$
where $\theta_H = \theta_t \cup \theta_k \cup \theta_b$.
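The two-learning-rate update in (5.12) can be written as a short sketch. It assumes that the parameters and gradients are flattened into plain Python lists and that the gradients have already been computed elsewhere; all names and numbers are hypothetical.

```python
def third_stage_update(theta_c, theta_h, grad_lc_c, grad_lh_c, grad_lh_h,
                       lr_c, lr_h):
    """Apply the update rules of the third stage.

    theta_c: compression parameters, updated with lr_c using the gradients
        of both L_C and L_H with respect to theta_c.
    theta_h: hashing parameters, updated with lr_h using the gradient of L_H.
    """
    new_c = [t - lr_c * (gc + gh)
             for t, gc, gh in zip(theta_c, grad_lc_c, grad_lh_c)]
    new_h = [t - lr_h * g for t, g in zip(theta_h, grad_lh_h)]
    return new_c, new_h

# With lr_c = 0 the compression parameters stay frozen and only the
# hashing parameters move.
frozen_c, moved_h = third_stage_update(
    theta_c=[1.0, 2.0], theta_h=[0.5],
    grad_lc_c=[0.3, -0.2], grad_lh_c=[0.1, 0.4], grad_lh_h=[1.0],
    lr_c=0.0, lr_h=1e-4,
)
```

Choosing `lr_c` much smaller than `lr_h` lets the hashing loss adapt the shared latents only slightly, which is the motivation given above for keeping $\eta_3^C < \eta_3^H$.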
5.4 Dataset Description and Experimental Design
5.4.1 Dataset Description
To evaluate the proposed approach, experiments were conducted on: i) BigEarthNet-S2;
and ii) MLRSNet [179]. In the experiments, we considered a subset of BigEarthNet-S2
acquired over Serbia in the summer season that includes 14,832 images, each of
which is made up of 120 × 120 pixels for 10m bands, 60 × 60 pixels for 20m bands and
20 × 20 pixels for 60m bands. Cubic interpolation was applied to the
20m and 60m bands, which leads to 120 × 120 pixels for each band.
The 19-class nomenclature of BigEarthNet-S2 was exploited. MLRSNet is a multi-label
RS image archive that contains 109,161 images selected from aerial orthoimagery
with varying spatial resolutions from 10m to 0.1m. For the experiments, we randomly
selected a subset of MLRSNet that consists of 15,302 images, each of which has the
size of 256 × 256 pixels. Each image is annotated with multi-labels from 60 classes.
We divided BigEarthNet-S2 and MLRSNet into training, validation
and test sets with the ratios of 50%, 25%, 25% and 50%, 10%, 40%, respectively. To
apply CBIR, we selected queries from the validation set, while images were retrieved
from the test set of each archive.
5.4.2 Experimental Design
For the first step of the proposed approach, we utilize the auto-encoder DNN
architecture presented in [170]. The indexing decoder within the second step of the proposed
SCI-CBIR includes the attention layer from [170] followed by two convolutional
layers, each of which includes 512 hidden units with the ReLU activation function, while
their filter sizes are 5 × 5 and 3 × 3. The class prediction and hash code generation
blocks of the second step include single convolutional layers with the filter size of
1 × 1. We tested different activation functions for the hash code generation block
among sigmoid, tanh, softsign [180] and greedy hash [181] functions. The parameter
$\gamma$ was set to $0.1/q$, while the hash code length $q$ was varied as $q = 16, 32, 64$. The
mini-batch size was selected as 32 for both archives. While training the second step,
horizontal and vertical flipping were randomly applied to the training set. We trained
the proposed approach by using the stochastic gradient descent algorithm.
As discussed in Section 5.3.3, the training of the proposed approach is divided into
three stages. In the first stage, the first step of the proposed approach was optimized
for the distortion loss only and $\eta_1$ was updated according to the MS-SSIM value
averaged on the validation set $V$ as follows:
$$\eta_1 = \begin{cases} 10^{-4}, & \text{MS-SSIM}(V, \hat{V}) < 24 \\ 5 \times 10^{-5}, & 24 \leq \text{MS-SSIM}(V, \hat{V}) \leq 29 \\ 10^{-5}, & \text{MS-SSIM}(V, \hat{V}) > 29 \end{cases} \quad (5.13)$$
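The piecewise schedule in (5.13) maps directly to a small function. The sketch below takes the validation MS-SSIM in dB as input; the function name is hypothetical.

```python
def eta_1(ms_ssim_db):
    """Learning rate of the first stage as a function of validation MS-SSIM (dB)."""
    if ms_ssim_db < 24:
        return 1e-4
    if ms_ssim_db <= 29:
        return 5e-5
    return 1e-5

# Learning rates at a few illustrative validation scores.
schedule = [eta_1(v) for v in (20.0, 24.0, 29.0, 29.5)]
```

The learning rate is thus lowered in two steps as the reconstruction quality on the validation set improves.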
The second stage starts when the distortion loss value reaches its convergence. The
learning rate $\eta_2$ was set to $10^{-5}$ at the beginning of the stage. After the first Pareto
point was obtained, $\eta_2$ was increased to $9 \times 10^{-5}$. In the third stage, the second step
of the proposed approach was jointly trained with the first step, while the learning
rate $\eta_3^H$ was set to $10^{-4}$. $\eta_3^C$ was varied as $\eta_3^C = 0, 10^{-8}, 10^{-4}$, while the automatic loss
weighting technique was varied among projecting conflicting gradients (PCGrad)
[182], dynamic weight average (DWA) [183] and equal weighting. All the experiments
were conducted on NVIDIA Tesla V100 GPUs. Experimental results are provided in
terms of MS-SSIM and bit-rate (bpp) for compression performance, while precision
(P (%)), recall (R (%)), mean average precision (MAP (%)) and retrieval time are used
for comparing retrieval performance. It is worth noting that we mapped MS-SSIM
values into the decibel (dB) scale as suggested in [170]. The retrieval metrics P, R and
MAP were averaged on the 15 most similar images.
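The text does not spell out the dB mapping. A common convention in the learned image compression literature is $-10 \log_{10}(1 - \text{MS-SSIM})$; the sketch below assumes this convention, which should be checked against [170] before relying on it. The function name is hypothetical.

```python
import math

def ms_ssim_to_db(ms_ssim):
    """Map an MS-SSIM value in [0, 1) to dB, assuming -10 * log10(1 - MS-SSIM)."""
    return -10.0 * math.log10(1.0 - ms_ssim)

db_099 = ms_ssim_to_db(0.99)     # close to 20 dB under this convention
db_0999 = ms_ssim_to_db(0.999)   # close to 30 dB under this convention
```

Under this assumed mapping, the thresholds of 24 dB and 29 dB in (5.13) correspond to MS-SSIM values of roughly 0.996 and 0.9987.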
We conducted experiments to: 1) perform a sensitivity analysis; and 2) compare
the proposed SCI-CBIR approach with standard approaches. In detail, we compare
the results of the first step of SCI-CBIR with those obtained by applying image
compression with a recurrent neural network (denoted as IC-RNN) [184] and JPEG
2000 [162]. We compare the results of the second step of SCI-CBIR with those
obtained by the second step of our approach trained on fully decompressed data
(denoted as SI-CBIR). We compare the results of the proposed SCI-CBIR trained by using
our multi-stage learning procedure with those obtained by using the standard learning
procedure. For IC-RNN, we utilized MS-SSIM as the distortion measure and updated
the learning rate using (5.13). It was trained with 6 RNN iterations for 280 epochs. For
SI-CBIR, we trained the second step of our approach followed by the image encoder
of the first step with the same hyper-parameters and the loss functions $L_P$, $L_B$ and
$L_N$. SI-CBIR is not capable of simultaneous compression and indexing, and thus
requires decoding prior to indexing. For the standard learning procedure, we jointly
trained all the losses required for compression and indexing in a single learning
procedure. For the loss weights, we varied the weight of the distortion loss $L_D$ and
kept the rest equal to control the rate-distortion trade-off.
5.5 Experimental Results
5.5.1 Sensitivity Analysis of the Proposed SCI-CBIR Approach
In this sub-section, the results of the sensitivity analysis for the proposed SCI-CBIR
approach are presented in terms of: i) different values of the learning rate $\eta_3^C$; ii) the
effectiveness of the attention layer applied in the second step; iii) different activation
functions of the hash code generation block within the second step; iv) different
automatic loss weighting techniques applied in the third stage of our multi-stage
learning procedure; and v) different values of $q$. It is worth noting that during the
sensitivity analysis, we set default values for the following hyper-parameters: i)
$q = 64$; and ii) the bpp value as 0.63 and 0.33 on BigEarthNet-S2 and MLRSNet,
respectively, for the first two stages of our learning procedure. We also set PCGrad as
the default automatic loss weighting technique and Greedy hash as the default activation
function.
In the first set of trials, we analyzed the effect of the learning rate $\eta_3^C$ (which is utilized
in the third stage of the proposed multi-stage learning procedure). Table 5.1 shows
the corresponding results for the BigEarthNet-S2 archive when different values of $\eta_3^C$
are used and the first two stages of our learning procedure are achieved at different
bpp values. By analyzing the table, one can observe that using a higher value of $\eta_3^C$
TABLE 5.1: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT VALUES OF $\eta_3^C$ WHEN THE FIRST TWO STAGES OF OUR LEARNING PROCEDURE ARE ACHIEVED AT DIFFERENT BIT-RATES (BIGEARTHNET-S2 ARCHIVE)
| $\eta_3^C = 0$ | | | | | $\eta_3^C = 10^{-8}$ | | | | | $\eta_3^C = 10^{-4}$ | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MS-SSIM | bpp | P | R | MAP | MS-SSIM | bpp | P | R | MAP | MS-SSIM | bpp | P | R | MAP |
| 26.6 | 0.63 | 74.1 | 69.1 | 73.8 | 26.7 | 0.62 | 74.2 | 70.1 | 74.1 | 15.2 | 0.08 | 77.9 | 73.1 | 75.4 |
| 27.8 | 0.79 | 73.3 | 70.1 | 73.1 | 27.9 | 0.78 | 74.5 | 70.0 | 74.3 | 14.9 | 0.08 | 75.7 | 74.2 | 72.7 |
| 28.8 | 0.96 | 72.9 | 70.3 | 72.7 | 29.0 | 0.94 | 73.9 | 70.0 | 73.7 | 14.5 | 0.08 | 76.2 | 75.2 | 75.3 |
| 29.3 | 1.07 | 73.0 | 69.7 | 72.7 | 29.5 | 1.05 | 73.8 | 69.5 | 73.4 | 15.0 | 0.08 | 76.1 | 75.2 | 75.3 |
| 30.1 | 1.39 | 73.5 | 69.9 | 73.2 | 30.3 | 1.34 | 74.2 | 69.7 | 73.8 | 14.1 | 0.05 | 76.3 | 73.6 | 75.6 |
| 30.2 | 1.45 | 73.2 | 70.2 | 73.0 | 30.5 | 1.38 | 73.8 | 69.9 | 73.4 | 14.0 | 0.05 | 74.9 | 74.2 | 73.7 |
| 30.6 | 1.66 | 73.0 | 68.1 | 72.6 | 30.8 | 1.56 | 73.8 | 69.9 | 73.5 | 14.3 | 0.06 | 77.1 | 73.4 | 76.2 |
TABLE 5.2: RESULTS OBTAINED BY PROPOSED SCI-CBIR WITH AND WITHOUT THE ATTENTION LAYER WHEN THE FIRST TWO STAGES OF OUR LEARNING PROCEDURE ARE ACHIEVED AT DIFFERENT BIT-RATES (BIGEARTHNET-S2 ARCHIVE)
| MS-SSIM | bpp | With Attention Layer | | | Without Attention Layer | | |
|---|---|---|---|---|---|---|---|
| | | P | R | MAP | P | R | MAP |
| 26.6 | 0.63 | 74.2 | 70.1 | 74.1 | 73.7 | 69.7 | 73.3 |
| 27.8 | 0.79 | 74.5 | 70.0 | 74.3 | 73.3 | 68.8 | 73.0 |
| 28.8 | 0.96 | 73.9 | 70.0 | 73.7 | 72.8 | 68.1 | 72.5 |
| 29.3 | 1.07 | 73.8 | 69.5 | 73.4 | 73.1 | 68.1 | 72.6 |
| 30.1 | 1.39 | 74.2 | 69.7 | 73.8 | 72.2 | 68.6 | 71.9 |
| 30.2 | 1.45 | 73.8 | 69.9 | 73.4 | 71.7 | 68.0 | 71.3 |
| 30.6 | 1.66 | 73.8 | 69.9 | 73.5 | 73.1 | 68.1 | 72.6 |
as $10^{-4}$ leads to a significant reduction in compression performance (MS-SSIM drops to
about 14 to 15 dB at bit-rates below 0.1 bpp) while providing the
highest retrieval scores. One can see from the table that when the first step of our
approach is not optimized ($\eta_3^C = 0$), our approach achieves the lowest retrieval scores
compared to using $\eta_3^C > 0$. However, when $\eta_3^C$ is set to a small value higher than zero
($\eta_3^C = 10^{-8}$), the proposed SCI-CBIR approach achieves comparable compression and
retrieval performances. This shows that our approach is capable of simultaneously
learning image representations for both indexing and compression in the third stage
of our multi-stage learning procedure when $\eta_3^C$ is properly set. Accordingly, we set
$\eta_3^C$ to $10^{-8}$ for the rest of the experiments. We observed a similar effect of $\eta_3^C$ for the
MLRSNet archive.
In the second set of experiments, we assessed the effectiveness of the attention layer,
which is used in the second step of our approach. Table 5.2 shows the retrieval results
obtained by the proposed SCI-CBIR approach with and without the attention layer
for BigEarthNet-S2 when different bpp values are achieved in the first two stages
of our learning procedure. From the table one can see that the scores obtained
with the attention layer are consistently higher than those obtained without it,
independently of the bpp values (e.g., MAP is 0.8% to 2.1% higher). This shows the
effectiveness of the attention layer, which
increases the capability of our approach to accurately decode image representations
for indexing in the second step, and thus to learn discriminative hash codes. A
similar behaviour of the attention layer has been observed for the MLRSNet archive.
In the third set of trials, we analyzed the effect of different activation functions of the
TABLE 5.3: RESULTS OBTAINED BY PROPOSED SCI-CBIR UNDER DIFFERENT ACTIVATION FUNCTIONS (THE BIGEARTHNET-S2 ARCHIVE)
| Activation Function | P | R | MAP |
|---|---|---|---|
| Sigmoid | 71.4 | 71.8 | 71.4 |
| Tanh | 72.1 | 71.3 | 72.4 |
| Softsign [180] | 71.8 | 70.5 | 71.0 |
| Greedy Hash [181] | 74.2 | 70.1 | 74.1 |
TABLE 5.4: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT AUTOMATIC LOSS WEIGHTING TECHNIQUES (BIGEARTHNET-S2 ARCHIVE)
| Automatic Loss Weighting Technique | P | R | MAP |
|---|---|---|---|
| DWA [183] | 73.3 | 69.7 | 72.9 |
| PCGrad [182] | 74.2 | 70.1 | 74.1 |
| Equal Weighting | 73.3 | 69.3 | 73.2 |
hash code generation block. Table 5.3 shows the corresponding retrieval results for
BigEarthNet-S2. One can observe from the table that using the Greedy hash activation
function achieves the highest precision and MAP scores with a comparable recall
score. This is because the Greedy hash function does not require applying a
quantization loss on the discrete hash codes. Accordingly, this function minimizes
the quantization error compared to the other activation functions [181]. Thus, we set
Greedy hash as the activation function for the rest of the experiments. We observed
a similar behaviour for the MLRSNet archive.
In the fourth set of experiments, we assessed the effect of different automatic loss
weighting techniques (which are applied in the third stage of our learning procedure)
on retrieval performance. Table 5.4 shows the corresponding retrieval performances
for the BigEarthNet-S2 archive. From the table one can observe that the proposed
SCI-CBIR approach achieves the highest scores when PCGrad is chosen as the automatic
loss weighting technique. When DWA and the equal weighting technique (which
equally weights different losses) are used, SCI-CBIR leads to similar retrieval
performances. It is worth noting that the hashing objective in (5.7) is made up of different
types of losses. PCGrad projects the gradient of a loss function onto
the normal plane of the gradient of another loss function when the two gradients conflict. This reduces gradient
interference among different loss functions, which allows more effective optimization
of the hashing objective compared to DWA. Accordingly, for the rest of the experiments,
we utilized PCGrad as the automatic loss weighting technique applied in the third
stage of our learning procedure. A similar behaviour of these techniques on our
approach has been observed for the MLRSNet archive.
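The projection step of PCGrad can be sketched as follows for gradients stored as plain Python lists. This is a simplified illustration: the original PCGrad procedure visits the other tasks in random order, whereas this sketch uses a fixed order unless a random generator is supplied. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcgrad_project(g_i, g_j):
    """Project g_i onto the normal plane of g_j if the two gradients conflict."""
    inner = dot(g_i, g_j)
    if inner >= 0.0:
        return list(g_i)               # no conflict: keep g_i unchanged
    scale = inner / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]

def pcgrad(grads, rng=None):
    """Combine per-loss gradients after removing pairwise conflicts."""
    projected = []
    for i, g in enumerate(grads):
        g = list(g)
        others = [j for j in range(len(grads)) if j != i]
        if rng is not None:
            rng.shuffle(others)        # random visiting order, if requested
        for j in others:
            g = pcgrad_project(g, grads[j])
        projected.append(g)
    # Sum the projected gradients coordinate by coordinate.
    return [sum(col) for col in zip(*projected)]

g_p = [1.0, 0.0]                        # illustrative gradient of one loss
g_b = [-1.0, 1.0]                       # conflicting gradient of another loss
g_p_projected = pcgrad_project(g_p, g_b)
combined = pcgrad([g_p, g_b])
```

After the projection, `g_p_projected` is orthogonal to `g_b`, so following it no longer increases the other loss to first order.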
In the fifth set of trials, we analyzed the effect of the hash code length. Table 5.5 shows
the corresponding retrieval performances at different values of $q$ for the BigEarthNet-S2
and MLRSNet archives. One can observe from the table that, by increasing $q$,
most of the metric values increase for both archives. Accordingly, the
proposed SCI-CBIR achieves the highest scores under all the metrics except recall on
BigEarthNet-S2 when $q = 64$
compared to other values of $q$. As an example, the proposed SCI-CBIR with $q = 64$
achieves almost 14% higher precision and 15% higher recall compared to SCI-CBIR
TABLE 5.5: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT VALUES OF $q$
| $q$ | BigEarthNet-S2 | | | MLRSNet | | |
|---|---|---|---|---|---|---|
| | P | R | MAP | P | R | MAP |
| 16 | 72.2 | 69.0 | 70.6 | 46.7 | 45.0 | 44.7 |
| 32 | 72.5 | 70.9 | 72.6 | 57.7 | 57.3 | 56.5 |
| 64 | 74.2 | 70.1 | 74.1 | 60.6 | 59.8 | 58.9 |
[Figure 5.2 plots: MS-SSIM (dB) versus bpp for SCI-CBIR, IC-RNN and JPEG2000; panels (a) and (b).]
FIGURE 5.2: Multi-scale structural similarity index (MS-SSIM) in dB versus bpp obtained by the proposed SCI-CBIR approach, IC-RNN and JPEG2000 for (a) BigEarthNet-S2 and (b) MLRSNet archives.
with $q = 16$ for the MLRSNet archive. Thus, for the rest of the experiments, we set $q$ as 64.
5.5.2 Comparison with Standard Approaches
In this sub-section, we compare the performance of the first and second steps
of our approach and our multi-stage learning procedure with the standard ap-
proaches. Accordingly, we evaluated the effectiveness of: i) the first step compared
to JPEG2000 [162] and IC-RNN [184]; ii) the second step compared to SI-CBIR; and
iii) the multi-stage learning procedure compared to standard learning procedure.
In the first set of trials, we compare the DL-based compression step of our approach
with JPEG2000 and IC-RNN. Fig. 5.2 shows the compression results at different
bpp values for BigEarthNet-S2 and MLRSNet archives. By assessing the figure, one
FIGURE 5.3: An RS image compression example: (a) original image; reconstructed image
at 0.7 bits per pixel (bpp) by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed
SCI-CBIR approach (BigEarthNet-S2 archive).
FIGURE 5.4: An RS image compression example: (a) original image; reconstructed image at
0.3 bpp by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach
(MLRSNet archive).
can observe that our SCI-CBIR approach achieves the highest MS-SSIM at each bpp
value for both archives. This shows that the first step of our approach is capable of
effectively decoding RS images at varying rate-distortion points while RS image
compression and indexing are simultaneously learnt in our approach. In greater
detail, the proposed SCI-CBIR approach and IC-RNN significantly outperform
the JPEG2000 algorithm. This shows the effectiveness of DL-based compression
compared to conventional methods for RS images. Figs. 5.3 and 5.4 show examples
of reconstructed RS images after they are compressed by the proposed SCI-CBIR,
IC-RNN and JPEG2000 for the BigEarthNet-S2 and MLRSNet archives, respectively.
One can see from the figures that the proposed SCI-CBIR approach is as capable as
IC-RNN of reconstructing images without significant loss of spatial information.
When compared to JPEG2000, our approach provides higher reconstruction quality.
As an example, when JPEG2000 is utilized to compress the original image given in
Fig. 5.3-a at 0.7 bpp, it is not able to reconstruct the spatial details of the original
image (see Fig. 5.3-b) in contrast to our approach.
In the second set of experiments, we assessed the effectiveness of the deep hashing-based
indexing step of our approach compared to SI-CBIR. Fig. 5.5 shows the
corresponding CBIR results for both archives when the first step of our approach
was used to decode RS images at different bpp values for SI-CBIR. By analyzing
the figure one can see that the proposed SCI-CBIR approach achieves similar CBIR
performance compared to SI-CBIR under different bpp values. In greater detail,
one can also observe that the CBIR performance of our approach is not significantly
affected by the changes in bpp values. This shows that when compression and
indexing are simultaneously learnt, the proposed SCI-CBIR approach is capable
of indexing RS images as accurately as SI-CBIR, which does not learn image compression
during training. Figs. 5.6 and 5.7 show examples of RS images retrieved by
both approaches for the BigEarthNet-S2 and MLRSNet archives, respectively. One
can see from the figures that the proposed SCI-CBIR approach retrieves images similar
to the query images as SI-CBIR does, independently of the bpp values. This
Chapter 5. Towards Simultaneous Image Compression & Indexing for CBIR 78
FIGURE 5.5: MAP versus bpp obtained by the proposed SCI-CBIR approach and SI-CBIR for
(a) BigEarthNet-S2 and (b) MLRSNet archives.
[Plots of MAP@15 (%) versus bpp for SCI-CBIR and SI-CBIR: (a) bpp axis from 0.7 to 1.5,
MAP@15 axis from 72 to 75; (b) bpp axis from 0.4 to 1, MAP@15 axis from 58 to 64.]
TABLE 5.6: RETRIEVAL TIME PER IMAGE (IN MILLISECONDS) OBTAINED BY SI-CBIR AND
THE PROPOSED SCI-CBIR APPROACH
Archive          Approach   Decoding (ms)   Indexing (ms)   Total (ms)
BigEarthNet-S2   SI-CBIR    970             149             1119
BigEarthNet-S2   SCI-CBIR   N/A             149             149
MLRSNet          SI-CBIR    5287            733             6020
MLRSNet          SCI-CBIR   N/A             733             733
is in line with our conclusion from Fig. 5.5. Table 5.6 shows the required CBIR time
for both approaches. It can be seen from the table that, under similar CBIR scores,
the retrieval time per image of the proposed SCI-CBIR approach is roughly one-eighth
of that of SI-CBIR for both archives (149 ms versus 1119 ms for BigEarthNet-S2 and
733 ms versus 6020 ms for MLRSNet). This is due to the fact that the retrieval time
of SI-CBIR also includes the image decoding time, which is not required for the
proposed SCI-CBIR approach. In detail, since RS image compression and indexing
are simultaneously learnt by our approach during training, hash codes (which are
generated by our deep hashing-based indexing step) are directly utilized for CBIR
without any need for decompressing RS images. Due to this, during large-scale RS
image indexing, the proposed SCI-CBIR approach saves the significant amount of
time required for computationally demanding decompression of images.
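To make the retrieval step concrete, the following minimal Python sketch ranks archive images by comparing binary hash codes only, so that no image has to be decoded at query time. It assumes Hamming distance ranking, which is the usual choice in hashing-based CBIR but is not spelled out in this section; the function names and the toy 8-bit codes are hypothetical and are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of bit positions in which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def retrieve(query_code: Sequence[int],
             archive_codes: List[Sequence[int]],
             top_k: int) -> List[Tuple[int, int]]:
    """Return (archive index, Hamming distance) pairs of the top_k closest codes."""
    distances = [(index, hamming_distance(query_code, code))
                 for index, code in enumerate(archive_codes)]
    # list.sort is stable, so ties keep their archive order.
    distances.sort(key=lambda pair: pair[1])
    return distances[:top_k]


# Toy archive of 8-bit hash codes (illustrative values only).
archive = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 1],
]
query = [0, 1, 1, 0, 1, 0, 0, 1]
ranking = retrieve(query, archive, top_k=3)
print(ranking)
```

Because only short binary codes are compared, the cost of a query in this sketch does not depend on the image size or on the bit-rate at which the images were compressed.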
In the third set of experiments, we analyzed the effectiveness of the proposed multi-stage
learning procedure by comparing it with the standard learning procedure. Table 5.7
shows the compression and retrieval results obtained by the proposed SCI-CBIR
approach trained with the proposed multi-stage and standard learning procedures
for the BigEarthNet-S2 archive. By assessing the table, one can see that the proposed
SCI-CBIR approach with our multi-stage procedure provides higher scores of CBIR
FIGURE 5.6: (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR
at 0.62 bpp; (d) the proposed SCI-CBIR at 0.78 bpp; (e) the proposed SCI-CBIR at 1.05 bpp;
and (f) the proposed SCI-CBIR at 1.56 bpp (BigEarthNet-S2 archive).
[Image panels (a)-(f); columns show the 1st, 5th, 10th and 15th retrieved images.]
metrics and MS-SSIM values compared to SCI-CBIR with the standard learning
procedure at similar bpp values. This is due to the fact that when a single learning
procedure with equal loss weights is utilized, as in the standard learning procedure,
the learning objectives for indexing and compression conflict with each other
independently of the rate-distortion trade-off point (which is controlled by λ in the
standard learning procedure). This prevents RS image compression from being
accurately learnt together with RS image indexing. Unlike the standard learning
procedure, the proposed multi-stage learning procedure makes our approach capable
of simultaneously learning both tasks in an effective way by automatically: i) weighting
different loss functions; and ii) finding rate-distortion trade-off points. Similar
behaviour of the proposed multi-stage learning procedure has been observed for the
MLRSNet archive.
FIGURE 5.7: (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR
at 0.33 bpp; (d) the proposed SCI-CBIR at 0.56 bpp; (e) the proposed SCI-CBIR at 0.69 bpp;
and (f) the proposed SCI-CBIR at 0.85 bpp (MLRSNet archive).
[Image panels (a)-(f); columns show the 1st, 5th, 10th and 15th retrieved images.]
5.6 Conclusion
This chapter introduces a novel approach (denoted as SCI-CBIR) to simultaneously
compress and index RS images for scalable CBIR. The SCI-CBIR approach is charac-
terized by two steps that are simultaneously applied based on a novel multi-stage
learning procedure. The first step is the DL-based compression step, where RS images
are first mapped into their latent representations, and then reconstructed back from
the latents by exploiting a pair of encoder and decoder DNNs. An entropy model
is utilized to generate bitstreams for a rate-distortion trade-off point. The second
step is the deep hashing-based indexing step, where hash codes of RS images are
generated from their latent representations. With the proposed multi-stage learning
procedure, all the parameters of SCI-CBIR are learnt within three consecutive stages
as: i) minimizing a distortion loss to model reconstruction; ii) finding the Pareto
optimal solutions of a multi-objective optimization problem to achieve a variable
range of bit-rates; and iii) minimizing soft pairwise, bit-balancing and classification
TABLE 5.7: RESULTS OBTAINED BY PROPOSED SCI-CBIR TRAINED WITH OUR MULTI-STAGE
LEARNING PROCEDURE AND STANDARD LEARNING PROCEDURE ASSOCIATED TO SIMILAR
BIT-RATES (THE BIGEARTHNET-S2 ARCHIVE)
Our Multi-Stage Learning Procedure     | Standard Learning Procedure
MS-SSIM   bpp    P      R      MAP     | λ      MS-SSIM   bpp    P      R      MAP
26.7      0.62   74.2   70.1   74.1    | 150    22.3      0.63   70.8   67.6   70.2
27.9      0.78   74.5   70.0   74.3    | 200    23.0      0.71   70.8   67.8   70.3
29.0      0.94   73.9   70.0   73.7    | 500    26.3      0.87   70.1   68.0   70.0
29.5      1.05   73.8   69.5   73.4    | 700    26.9      1.08   70.6   68.2   70.0
30.3      1.34   74.2   69.7   73.8    | 1000   27.6      1.29   70.0   68.0   69.3
30.5      1.38   73.8   69.9   73.4    | 1250   27.9      1.44   70.5   67.9   70.1
30.8      1.56   73.8   69.9   73.5    | 1500   28.0      1.64   70.1   67.2   69.6
losses with automatic loss weighting techniques to characterize hash codes. This
allows the proposed SCI-CBIR approach to: i) obtain different bit-rates without a
need for training the considered DNN multiple times; and ii) automatically find the
weights for the five different losses considered in both steps without any need for
computationally expensive grid search.
Experimental results obtained on two benchmark archives show that the proposed
approach provides high compression performance, while resulting in high retrieval
accuracy without any need for decompressing the images prior to indexing (which
is required for most CBIR systems in RS). We underline that this is a
very important advantage, particularly for large-scale CBIR, and thus the proposed
approach is convenient for possible operational applications. It is worth noting
that the archives used in our experiments are benchmarks. However, in many real
applications we expect that CBIR is applied to much larger archives. For large-scale
CBIR, the gain in retrieval time achieved by our approach is expected to increase
significantly compared to the existing approaches. In the case of compressing and
indexing very large RS image scenes, we suggest utilizing light-weight DNNs
(such as Zoom-In [185] and ESPNetv2 [186]) that allow training and inference
of our approach to be performed in a computationally efficient manner.
It is worth noting that the proposed approach can be easily adapted to the CBIR
problems for which: i) images are compressed by other DL-based compression algo-
rithms; and also ii) hash codes are obtained through different DL-based architectures.
As a final remark, we would like to point out that the development of DL-based
image compression methods is becoming an increasingly important topic. In this
context, the proposed approach is very promising as it enables RS CBIR when
images are compressed by using DNNs. As a future development, we plan to
study the development of DL-based 3D compression models where not only spatial
but also spectral redundancies are compressed. Moreover, we plan to explore RS
CBIR in the 3D compressed domain, which is expected to be particularly relevant for
search and retrieval from hyperspectral image archives.
Chapter 6
Generative Reasoning Integrated Label
Noise Robust Deep Image
Representation Learning in Remote
Sensing
Most DL-based IRL methods require a large set of high-quality annotated training
RS images, which can be time-consuming, complex and costly to gather. To reduce
labeling costs, publicly available thematic
maps, automatic labeling procedures or crowdsourced data can be used. However,
such approaches increase the risk of including label noise in training data. It may
result in overfitting on noisy labels when discriminative reasoning is employed as
in most of the existing methods. This leads to sub-optimal learning procedures,
and thus inaccurate characterization of RS images. In this chapter, we introduce
a generative reasoning integrated label noise robust deep representation learning
(GRID) approach. The proposed GRID approach aims to model the complementary
characteristics of discriminative and generative reasoning for IRL under noisy labels.
To this end, we first integrate generative reasoning into discriminative reasoning
through a supervised variational autoencoder. This allows the proposed GRID ap-
proach to automatically detect training samples with noisy labels. Then, through
our label noise robust hybrid representation learning strategy, GRID adjusts the
whole learning procedure for IRL of these samples through generative reasoning and
that of the other samples through discriminative reasoning. Our approach learns
discriminative RS image representations while preventing interference of noisy labels
during training independently from the IRL method being selected. Thus, unlike the
existing label noise robust methods, GRID does not depend on the type of annotation,
label noise, neural network architecture, loss function or learning task, and thus can
be directly utilized for various RS image understanding problems. Experimental
results show the effectiveness of the proposed GRID approach compared to the
state-of-the-art methods. The code of the proposed approach will be publicly available
at https://git.tu-berlin.de/rsim/GRID. This chapter is mainly based on the
following publications:
G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust
deep image representation learning,” IEEE Transactions on Image Processing,
2023. DOI:10.1109/TIP.2023.3293776.
Chapter 6. Label Noise Robust Deep Image Representation Learning in RS 83
G. Sumbul and B. Demir, “Label noise robust image representation learning
based on supervised variational autoencoders in remote sensing,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2023.
6.1 Introduction
DL-based IRL of RS images is generally achieved in a supervised way through the
optimization of a loss function based on the characteristics of a learning task (e.g.,
single/multi-label classification, semantic segmentation, etc.). To effectively learn
DL model parameters, a large number of high-quality annotated training RS images
is required. Depending on the considered learning task, annotations of training RS
images can be given at scene-level or pixel-level. For scene-level annotations, each
training image is annotated by either a single label, which is associated with the
most significant content of the image, or multi-labels. In general, the
manual collection of RS image annotations by domain experts for large-scale data
can be time-consuming, complex and costly. To address this issue, publicly available
thematic maps (e.g., the CORINE Land Cover inventory [49]), automatic labeling
procedures and volunteered geographic information (VGI) as crowdsourced data can
be used. These strategies provide RS image annotations at zero cost. However, the
considered thematic map or VGI source can be outdated with respect to RS images
due to possible changes on the ground, or there can be annotation errors. Thus,
these strategies increase the risk of including noisy labels in training data. It is worth
noting that for a scene-level single-label and a pixel-level noisy annotation, label
noise occurs as an incorrect label associated with an image and a pixel, respectively.
However, for a scene-level multi-label noisy annotation, it can emerge as a missing
label (i.e., a class is present in an image while the corresponding label is not assigned
to that image), a wrong label (i.e., a class is not present in an image while its label is
assigned to the image) or a combination of both missing and wrong labels.
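The two multi-label noise types described above can be illustrated with a short Python sketch that corrupts a binary multi-label vector. The function name, the two rate parameters and the independent per-class flipping are assumptions made for illustration; they are not the noise injection protocol used in the experiments.

```python
import random
from typing import List


def inject_multilabel_noise(labels: List[int],
                            missing_rate: float,
                            wrong_rate: float,
                            rng: random.Random) -> List[int]:
    """Return a noisy copy of a binary multi-label vector.

    A present class (1) is dropped with probability missing_rate (missing label);
    an absent class (0) is added with probability wrong_rate (wrong label).
    """
    noisy = []
    for label in labels:
        if label == 1 and rng.random() < missing_rate:
            noisy.append(0)  # missing label: class is present but its label is removed
        elif label == 0 and rng.random() < wrong_rate:
            noisy.append(1)  # wrong label: class is absent but its label is assigned
        else:
            noisy.append(label)
    return noisy


clean_labels = [1, 0, 1, 0, 0, 1]
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
print(inject_multilabel_noise(clean_labels, missing_rate=0.3, wrong_rate=0.1, rng=rng))
```

Setting only `missing_rate` or only `wrong_rate` to a non-zero value gives the pure missing-label or pure wrong-label case, while setting both gives the combined case.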
Most of the existing DL-based IRL methods in RS employ discriminative learning
(i.e., discriminative reasoning) of image representations. This is based on directly
modeling a posterior data distribution $p(y|x)$ by utilizing $(x, y)$ image annotation
pairs from training data. The effectiveness of discriminative reasoning has been
proven compared to generative reasoning (which is based on modelling the joint data
distribution $p(x, y)$) when training data is abundant [187]. However, discriminative
models are more sensitive to label noise compared to generative models. Accordingly,
discriminative learning of RS image representations with noisy labels may result in
overfitting of the considered deep neural network (DNN) to noisy labels and a lack
of generalization capability, and thus inaccurate characterization of RS images
during both training and inference [50], [51].
To address this problem, several methods have been presented, mostly in the computer
vision (CV) community, to improve the robustness of discriminative IRL when training
data includes noisy labels. All these methods are potentially effective for DL-based
IRL under noisy labels in RS. However, most of them are dependent on the type
of: i) label noise present in training data; ii) image annotation; iii) loss function
(e.g., cross-entropy, focal loss, etc.); iv) DNN architecture; or v) learning task. Some
methods also require the availability of a subset of the training set, which includes
clean labels, or require computationally demanding noise correction strategies
prior to training. Thus, they may not be directly integrated into different scenarios
associated with IRL in RS.
To overcome this issue, in this chapter we introduce a Generative Reasoning Integrated
Label Noise Robust Deep Representation Learning (denoted as GRID hereafter)
approach. The proposed GRID approach aims to model the complementary
characteristics of discriminative and generative reasoning for IRL under noisy labels.
To this end, for discriminative reasoning, we first employ a DNN composed of an RS
image encoder (i.e., CNN backbone) and a discriminative task head for modelling the
posterior distribution of labeled RS images, as in most supervised DL-based IRL
methods in RS. Then, we integrate generative reasoning into discriminative reasoning
through a supervised variational autoencoder (which includes a variational encoder,
a feature decoder and a generative task head) that follows the CNN backbone for
modelling the joint distribution of labeled RS images. This allows the proposed GRID
approach to automatically detect training samples with noisy labels based on the loss
values acquired from the discriminative and generative task heads. Then, through our
label noise robust hybrid representation learning strategy, the model parameters of
the considered DNN are updated through: i) generative reasoning for the samples with
noisy labels; and ii) discriminative reasoning for the remaining samples in training
data. Accordingly, our approach allows learning discriminative RS image
representations through the CNN backbone, while preventing overfitting on noisy labels
during training, independently of the IRL method being selected. Thus, unlike the
existing label noise robust methods, GRID does not depend on the type of annotation,
label noise, DNN architecture, loss function or learning task. It also requires neither
a trustworthy subset of a training set nor a computationally demanding noise
correction strategy prior to training. Thus, our approach can be directly utilized
for various scenarios for IRL in RS. In this chapter, we consider two IRL scenarios,
where training RS images are annotated with: 1) scene-level noisy multi-labels; and
2) pixel-level noisy labels. Under these scenarios, we consider three learning tasks
with the corresponding loss functions and DNN architectures. For different scenarios
and learning tasks, we conduct experiments on a single RS application for the sake of
simplicity. This application is selected as content-based image retrieval (CBIR) due to
the importance of employing accurate image features for similarity matching in CBIR.
We would like to note that, to the best of our knowledge, GRID is the first approach
in RS that combines generative and discriminative reasoning for supervised IRL
under noisy labels, which leads to accurate characterization of RS image
representations while preventing interference of noisy labels during training.
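The detection idea outlined above, i.e., comparing the loss values acquired from the two task heads for each training sample, can be sketched in Python as follows. This is an illustrative sketch only: the function name is hypothetical, the loss values are toy numbers, and the `threshold` argument is a placeholder supplied by the caller; how the threshold is actually obtained from the mini-batch is defined later in the chapter and is not reproduced here.

```python
from typing import List, Tuple


def split_noisy_clean(disc_losses: List[float],
                      gen_losses: List[float],
                      threshold: float) -> Tuple[List[int], List[int]]:
    """Split sample indices into (noisy, clean) by the per-sample loss difference.

    A sample is flagged as noisy when its loss from the discriminative task head
    exceeds its loss from the generative task head by more than the threshold.
    """
    if len(disc_losses) != len(gen_losses):
        raise ValueError("both loss lists must cover the same samples")
    noisy, clean = [], []
    for index, (loss_d, loss_g) in enumerate(zip(disc_losses, gen_losses)):
        if loss_d - loss_g > threshold:
            noisy.append(index)
        else:
            clean.append(index)
    return noisy, clean


# Toy per-sample loss values for a mini-batch of four samples (illustrative only).
disc = [0.2, 2.5, 0.4, 1.9]
gen = [0.3, 0.6, 0.5, 0.7]
noisy_ids, clean_ids = split_noisy_clean(disc, gen, threshold=1.0)
print(noisy_ids, clean_ids)
```

In a training loop, the indices in the first list would be routed to generative reasoning and those in the second list to discriminative reasoning when the backbone parameters are updated.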
The rest of the chapter is organized as follows: Section 6.2 presents the related works
on DL-based label noise robust IRL in CV and RS. Section 6.3 introduces the proposed
GRID approach. Section 6.4 describes the considered RS image data archives and the
experimental setup, while Section 6.5 provides the experimental results. Section 6.6
concludes the chapter.
6.2 Related Works
A few methods for label noise robust DL-based IRL have recently been presented in RS for
image classification [188]–[192] and semantic segmentation problems [31], [193]. As
an example for RS image classification problems, a noisy label distillation method
is introduced in [188] to leverage the knowledge learnt through a teacher model on
images with noisy labels for a student model. In this method, two convolutional
neural networks (CNNs) are employed as a teacher-student framework, while a clean
and trustworthy subset of a training set is assumed to be available for the student
CNN. In [189], a down-weighting factor is integrated into the normalized softmax loss
function to reduce the effect of wrongly classified images (which are assumed to be
associated with noisy labels) on the model parameter updates. It is noted that these
methods are designed for RS images associated with single labels. For RS images
annotated by multi-labels, a collaborative learning framework is proposed in [192]
to identify and exclude images with noisy multi-labels during training. To this end,
it employs two CNNs operating collaboratively, which are forced to characterize
distinct image representations and to produce similar predictions. In [191], the effects
of different label noise types in multi-label RS image classification problems are
investigated, while different noise robust methods are integrated from single-label to
multi-label classification problems in RS. Apart from scene-level image classification,
label noise robust land-cover map generation through semantic segmentation has
also attracted attention in RS. As an example, in [31], an online noise correction
approach is proposed to detect and correct pixel-level noisy labels via information
entropy at the early stage of training, and thus to continue training with corrected
labels.
It is noted that, for label noise robust IRL, the research is more extensive in CV,
but mostly dedicated to single-label image classification problems. Recent research
directions in the CV community mainly concentrate on the development of: i) deep
architectures [194]; ii) loss functions [195]; iii) regularization strategies [196]; and iv)
sample selection and label adjustment techniques [197], while aiming to achieve
learning procedures that are more robust to label noise. The methods in the first category
focus on designing DNN architectures specific to training data with noisy labels.
As an example, in [194], a contrastive-additive noise network is proposed to model
trustworthiness of noisy labels in the context of image classification. To this end,
it includes a probabilistic latent variable model as a contrastive layer to estimate
the quality of labels and an additive layer to aggregate the class predictions and
noisy labels. The methods in the second category are mostly devoted to utilizing
loss functions that are robust when used with noisy labels. For instance, the
asymmetric loss function introduced in [198] allows the weights of negative classes
in multi-labels to be dynamically decreased. This decreases the effect of images
with missing labels on IRL. The methods in the third category aim at regularizing
the whole learning procedure to prevent overfitting on noisy labels. As an example,
in [196], a regularization term is integrated into the cross-entropy loss to guide the
learning process with the class predictions from an early stage of training to prevent
memorization of noisy labels. The methods in the last category focus on first detecting
images associated with correct labels or adjusting noisy labels, and then learning
through those samples or adjusted labels. For instance, in [197], a joint training
FIGURE 6.1: An illustration of the training of our GRID approach that jointly leverages the
robustness of generative reasoning towards noisy labels and the effectiveness of discriminative
reasoning on image representation learning. During the forward pass on a mini-batch
$\mathcal{B}$, the loss values $O_d(\mathcal{B})$, $O_g(\mathcal{B})$ and the predicted labels
$\hat{\mathcal{Y}}^d$, $\hat{\mathcal{Y}}^g$ are obtained through discriminative and generative
reasoning for a given learning task. Then, the set $\mathcal{W}$ of training samples with
noisy labels (i.e., noisy samples) and the set $\mathcal{C}$ of training samples with correct
labels (i.e., clean samples) are constructed through our automatic noisy sample detection
procedure based on the values of the loss function $\mathcal{L}$ associated with the learning
task. During the backward pass, the model parameters except the CNN backbone parameters are
updated with all samples based on $\nabla_{\gamma} O_d(\mathcal{B})$ and
$\nabla_{\beta} O_g(\mathcal{B})$. The parameters of the CNN backbone are updated through:
i) the generative task head for the noisy samples based on $\nabla_{\theta} O_g(\mathcal{W})$;
and ii) the discriminative task head for the clean samples based on
$\nabla_{\theta} O_d(\mathcal{C})$.
[Diagram: a mini-batch $\mathcal{B}$ is fed to the CNN backbone, which branches into the
discriminative task head (yielding $O_d(\mathcal{B})$) and into the VAE encoder, latent variable
sampling, feature decoder and generative task head (yielding $O_g(\mathcal{B})$). Automatic noisy
sample detection assigns a sample to $\mathcal{W}$ if
$\mathcal{L}(\hat{y}^d_i, y_i) - \mathcal{L}(\hat{y}^g_i, y_i) > a_\lambda$ and to $\mathcal{C}$ if
$\mathcal{L}(\hat{y}^d_i, y_i) - \mathcal{L}(\hat{y}^g_i, y_i) \leq a_\lambda$. Arrows mark the
forward pass and the backward passes for all samples, clean samples and noisy samples.]
with co-regularization approach employs collaborative learning of two CNNs for
the selection of correct labels by an agreement strategy. For a detailed summary of
DL-based label noise robust IRL methods in CV, we refer the reader to [51].
6.3 Proposed Approach
Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an RS image archive that includes $M$ images,
where $x_t$ is the $t$th image in the archive. We assume that a training set
$\mathcal{T} = \{(x_1, y_1), \ldots, (x_K, y_K)\}$ that includes $K$ i.i.d. samples of random
variables $x$ and $y$ is available. $x_i$ is the $i$th image and $y_i$ is the corresponding
image annotation. Annotations of training images can be given at pixel-level or scene-level.
An image can be annotated by a broad category label (i.e., single-label) or multi-labels.
We assume that the labels in the set $\mathcal{Y}$ of training image annotations can be
noisy. For a scene-level single-label or a pixel-level noisy annotation, label noise may
occur as an incorrect label associated with an image or a pixel, respectively. For a
scene-level multi-label noisy annotation, label noise may occur as a missing label, a wrong
label or a combination of both missing and wrong labels.
The proposed GRID approach aims to jointly leverage the robustness of generative
reasoning towards noisy labels and the effectiveness of discriminative reasoning on
IRL. This is achieved by first integrating generative reasoning into discriminative
reasoning through a supervised variational autoencoder, and then characterizing
discriminative RS image representations while preventing interference of noisy labels
through our label noise robust hybrid representation learning strategy. Fig. 6.1 shows
an illustration of the proposed GRID approach. We first provide general information
on discriminative reasoning, and then present our approach in detail in the following
subsections.
6.3.1 Basics on Discriminative Reasoning
DL-based IRL methods through discriminative reasoning aim to employ the discriminative
capabilities of DNNs for the characterization of RS image features. This is achieved by
maximizing the posterior distribution of labeled RS images $p(y|x)$ during training. To
this end, the considered DNN typically includes an image encoder (i.e., a CNN backbone)
and a discriminative task head including fully connected or convolutional layers (which is
branched out from the CNN backbone). Let $\phi: \theta, \mathcal{X} \rightarrow \mathcal{F}$
be any type of CNN backbone that maps the image $x_i$ into the corresponding image
descriptor $f_i$, which is a sample of random variable $f$. $\theta$ is the set of CNN
parameters and $\mathcal{F}$ is the set of all descriptors for $\mathcal{X}$. Let
$t_d: \gamma, \mathcal{F} \rightarrow \hat{\mathcal{Y}}^d$ be a discriminative task head that
maps the image descriptor into the corresponding label prediction associated with the image
$x_i$ [i.e., $t_d(\phi(x_i; \theta); \gamma) = \hat{y}^d_i$], where $\gamma$ is the task head
parameters and $\hat{\mathcal{Y}}^d$ is the set of all predicted image labels. The CNN
backbone models the global image representation space, while the overall DNN models the
posterior distribution $p(y|x)$. In this formulation, the model parameters
$\theta \cup \gamma$ are updated to maximize $\mathbb{E}_{\mathcal{T}}[\log p(y|x)]$.
Accordingly, the objective function $O_d$ associated with discriminative reasoning for a set
of samples $\mathcal{S}$ is written as follows:
\[ O_d(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}(t_d(\phi(x_i; \theta); \gamma), y_i); \quad \mathcal{L} \in \mathbb{L}, \tag{6.1} \]
where $\mathbb{L}$ is the set of all loss functions, each element of which is capable of
measuring how different the prediction $\hat{y}^d_i$ is from $y_i$. Accordingly, any loss
function that can measure the sample-wise error can be used for $\mathcal{L}$.
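As a minimal Python sketch of (6.1), the objective is simply the mean of a sample-wise loss over the set of samples. The sketch assumes that the predictions of the discriminative task head are already available as lists of class probabilities, and it uses binary cross-entropy as one possible choice of loss for multi-label classification; the function names are hypothetical and the backbone and task head themselves are not modelled.

```python
import math
from typing import Callable, List, Sequence


def binary_cross_entropy(pred: Sequence[float], target: Sequence[int]) -> float:
    """Sample-wise binary cross-entropy, averaged over the classes."""
    eps = 1e-12  # clipping constant that avoids log(0)
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(target)


def discriminative_objective(predictions: List[Sequence[float]],
                             targets: List[Sequence[int]],
                             loss: Callable[[Sequence[float], Sequence[int]], float]) -> float:
    """Mean of the sample-wise loss over all (prediction, annotation) pairs."""
    losses = [loss(pred, target) for pred, target in zip(predictions, targets)]
    return sum(losses) / len(losses)


# Two samples with two classes each; the head is maximally uncertain (p = 0.5).
preds = [[0.5, 0.5], [0.5, 0.5]]
labels = [[1, 0], [0, 1]]
print(discriminative_objective(preds, labels, binary_cross_entropy))
```

Since the loss is passed in as an argument, any other sample-wise loss can be substituted without changing `discriminative_objective`, which mirrors the statement that any loss measuring the sample-wise error can be used.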
The discriminative learning of RS image representations has been found successful
for many applications when the labeled training data is abundant [187]. However,
learning image representations via modeling the posterior distribution of training
data can be sensitive to noisy labels included in the calculation of the loss function.
When the ratio of noisy labels in $\mathcal{Y}$ is significantly high, the considered DNN
can suffer from overfitting on noisy labels, leading to inaccurate IRL and a lack of
generalization capability [50], [51].
6.3.2 Integration of Generative Reasoning
Generative learning of image representations via modeling the joint data distribution
$p(x, y)$ limits the overfitting of the considered DNN on noisy labels during training.
Thus, it has been shown to be more robust to noisy labels compared to discriminative
learning [199]. However, learning image representations via generative reasoning
may limit the accurate characterization of discriminative image descriptors, and thus
may lead to inaccurate IRL. Accordingly, the proposed GRID approach aims at
effectively integrating generative reasoning into discriminative reasoning to achieve
discriminative and generative modelling of RS images in a single learning procedure.
To model $p(x, y)$, we assume that $x$ and $y$ are generated through a latent variable
$z$. Each sample $z_i$ of the latent variable is generated from a prior distribution
$p(z)$, while $x_i$ and $y_i$ are generated from $p(x, y|z)$. It is worth noting that the
marginal likelihood over the latent variable $\int p(z) p(x, y|z)\,dz$ is intractable for
DNNs since it is hard to find an analytical solution for the posterior distribution of the
latent variable $p(z|x, y)$. To this end, we utilize a variational auto-encoder (VAE)
introduced in [200] as a latent variable model. Accordingly, we approximate the true
posterior distribution of the latent variable with a variational approximate posterior
$q(z|x, y)$ of known functional form (e.g., a Gaussian distribution parameterized by the
encoder of a VAE). Then, the variational lower bound on the marginal log-likelihood (i.e.,
evidence lower bound [ELBO]) is defined as follows:
\[ \log p_{\beta_d}(x, y) \geq \mathbb{E}_{q_{\beta_e}(z|x, y)}[\log p_{\beta_d}(x, y|z)] - D_{KL}(q_{\beta_e}(z|x, y) \,\|\, p_{\beta_d}(z)), \tag{6.2} \]
where $D_{KL}(\cdot \| \cdot)$ is the Kullback-Leibler (KL) divergence [201], $\beta_e$ is
the VAE encoder parameters and $\beta_d$ is the VAE decoder parameters. It is worth noting
that $q_{\beta_e}(z|x)$ is a sufficient statistic for $q_{\beta_e}(z|x, y)$. It guarantees
that $z$ generated from $x$ embodies the same information as when it is jointly generated
from $x$ and $y$ [202]. Since $p_{\beta_d}(x, y|z)$ can be factorized into
$p_{\beta_d}(x|z) p_{\beta_d}(y|z)$ (i.e., conditional independence), (6.2) can be written
as follows:
\[ \log p_{\beta_d}(x, y) \geq \mathbb{E}_{q_{\beta_e}(z|x)}[\log p_{\beta_d}(x|z)] + \mathbb{E}_{q_{\beta_e}(z|x)}[\log p_{\beta_d}(y|z)] - D_{KL}(q_{\beta_e}(z|x) \,\|\, p_{\beta_d}(z)). \tag{6.3} \]
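The step from (6.2) to (6.3) can be spelled out in two lines. The block below is only a worked restatement under the two conditions stated above (the approximate posterior is conditioned on $x$ alone, and $x$ and $y$ are conditionally independent given $z$); it does not introduce an additional result.

```latex
\begin{align*}
\mathbb{E}_{q_{\beta_e}(z|x)}\big[\log p_{\beta_d}(x, y|z)\big]
  &= \mathbb{E}_{q_{\beta_e}(z|x)}\big[\log \big(p_{\beta_d}(x|z)\, p_{\beta_d}(y|z)\big)\big] \\
  &= \mathbb{E}_{q_{\beta_e}(z|x)}\big[\log p_{\beta_d}(x|z)\big]
   + \mathbb{E}_{q_{\beta_e}(z|x)}\big[\log p_{\beta_d}(y|z)\big].
\end{align*}
```

The first equality uses the factorization of the likelihood, and the second uses the logarithm of a product together with the linearity of the expectation; the KL divergence term is unaffected apart from the change of the approximate posterior.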
We define the variational approximate posterior and the latent prior as multivariate
Gaussian distributions as follows:
\[ z_i \sim q_{\beta_e}(z|x_i) = \mathcal{N}(z|\mu_i, \sigma_i^2 I), \tag{6.4} \]
\[ p_{\beta_d}(z) = \mathcal{N}(z|0, I). \tag{6.5} \]
Since $f$ is the representative of $x$, we define the variational generative process based
on $f$ rather than $x$. Let $e$ be a VAE encoder that maps the image descriptor $f_i$ into
the parameters $\mu_i$ and $\sigma_i$ of the $q_{\beta_e}$ distribution for $x_i$. To
prevent the stochastic sampling of $z$ from interfering with the DNN training, we utilize
the reparameterization trick introduced in [200] to generate $z_i$ as follows:
\[ z_i = \mu_i + \sigma_i \cdot \epsilon_i; \quad \epsilon_i \sim \mathcal{N}(0, I). \tag{6.6} \]
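The sampling step in (6.6) can be sketched in plain Python as follows. The sketch assumes that the product in (6.6) is applied element by element and that $\mu_i$ and $\sigma_i$ are given as lists of equal length; the function name is hypothetical, and in an actual DNN the same computation would be carried out on tensors so that gradients can flow through $\mu_i$ and $\sigma_i$.

```python
import random
from typing import List


def reparameterize(mu: List[float], sigma: List[float], rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element by element."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    # All randomness comes from eps; mu and sigma enter deterministically.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
z = reparameterize([0.5, -1.0, 2.0], [0.1, 0.2, 0.3], rng)
print(z)
```

Because the noise is drawn independently of the encoder outputs, the sample is a deterministic function of $\mu_i$ and $\sigma_i$ once $\epsilon_i$ is fixed, which is what keeps the sampling step from blocking the training.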
Let $t_g: \beta_t, \mathcal{Z} \rightarrow \hat{\mathcal{Y}}^g$ be a generative task head
that maps the latent into the corresponding label prediction associated with the image
$x_i$ [i.e., $t_g(z_i; \beta_t) = \hat{y}^g_i$], where $\beta_t$ is the task head parameters
and $\hat{\mathcal{Y}}^g$ is the set of all predicted image labels. $t_g$ is chosen as the
duplicate of $t_d$, but they are associated with different model parameters (i.e.,
$\beta_t \neq \gamma$). Let $r: \beta_r, \mathcal{Z} \rightarrow \hat{\mathcal{F}}$ be a
feature decoder that maps the latent into the
reconstructed image descriptor $\hat{f}_i$ for $x_i$, where $\beta_r$ is the feature
decoder parameters. $t_g$ models $p_{\beta_d}(y|z)$, while $r$ models $p_{\beta_d}(x|z)$.
Accordingly, $t_g$ and $r$ together form the VAE decoder (i.e.,
$\beta_d = \beta_t \cup \beta_r$).
To accurately model $p(x, y)$, the VAE parameters $\beta = \beta_e \cup \beta_d$ can be
learned by maximizing the ELBO defined in (6.3). To this end, we define: i) the first term
of the ELBO based on the mean squared error loss function $\mathcal{L}_{MSE}$; ii) the
second term of the ELBO based on the loss function $\mathcal{L}$ considered for
discriminative reasoning; and iii) the third term of the ELBO based on the known functional
forms of $q_{\beta_e}(z|x_i)$ and $p_{\beta_d}(z)$. Accordingly, the objective function
$O_g$ associated with generative reasoning for a set of samples $\mathcal{S}$ is written as
follows:
\[ O_g(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}_{MSE}(r(z_i; \beta_r), f_i) + \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}(t_g(z_i; \beta_t), y_i) + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{i,j}^2) - \mu_{i,j}^2 - \sigma_{i,j}^2\right), \tag{6.7} \]
where $\mu_{i,j}$ and $\sigma_{i,j}$ are the $j$th elements of the vectors $\mu_i$ and
$\sigma_i$, respectively, while $J$ is their length. For the derivation of the KL
divergence term in the ELBO, the reader is referred to [200].
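The three terms of the generative objective can be sketched in plain Python as follows. The sketch takes the outputs of the feature decoder, the generative task head and the VAE encoder as given lists, so no network is modelled, and all function names are hypothetical. The KL term is written here as the non-negative closed form $D_{KL}(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)) = -\frac{1}{2}\sum_{j}(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$ from [200] and is added as a penalty, so that minimizing the value returned by the sketch corresponds to maximizing the ELBO in (6.3); this sign convention and the averaging of the KL term over the samples are assumptions of the sketch.

```python
import math
from typing import Callable, List, Sequence


def mse(reconstructed: Sequence[float], descriptor: Sequence[float]) -> float:
    """Mean squared error between a reconstructed and an original descriptor."""
    return sum((r - f) ** 2 for r, f in zip(reconstructed, descriptor)) / len(descriptor)


def kl_to_standard_normal(mu: Sequence[float], sigma: Sequence[float]) -> float:
    """Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I)."""
    return -0.5 * sum(1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mu, sigma))


def generative_objective(reconstructions: List[Sequence[float]],
                         descriptors: List[Sequence[float]],
                         predictions: List[Sequence[float]],
                         targets: List[Sequence[float]],
                         mus: List[Sequence[float]],
                         sigmas: List[Sequence[float]],
                         task_loss: Callable[[Sequence[float], Sequence[float]], float]) -> float:
    """Reconstruction term + task term + KL penalty, each averaged over the samples."""
    n = len(descriptors)
    reconstruction_term = sum(mse(r, f) for r, f in zip(reconstructions, descriptors)) / n
    task_term = sum(task_loss(p, y) for p, y in zip(predictions, targets)) / n
    kl_term = sum(kl_to_standard_normal(m, s) for m, s in zip(mus, sigmas)) / n
    return reconstruction_term + task_term + kl_term


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Simple sample-wise loss used only to keep the toy example short."""
    return sum(abs(p - y) for p, y in zip(pred, target))


value = generative_objective(
    reconstructions=[[1.0, 2.0]], descriptors=[[1.0, 2.0]],
    predictions=[[1.0, 0.0]], targets=[[1.0, 0.0]],
    mus=[[1.0, 0.0]], sigmas=[[1.0, 1.0]],
    task_loss=l1_loss,
)
print(value)
```

In the toy call the reconstruction and the prediction are perfect, so the returned value consists of the KL penalty alone, which shows how the third term pulls the approximate posterior towards the prior even when the other two terms vanish.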
It is worth noting that the proposed integration of generative reasoning into
discriminative reasoning does not depend on the selection of the loss function
$\mathcal{L}$ and discriminative task head, and thus can be applied to most of the
supervised DL-based IRL methods. It also does not require an additional CNN backbone as
image encoder since we define the variational generative process based on $f$. Thus, the
considered VAE is directly branched out from the CNN backbone to learn RS image
representations based on generative and discriminative reasoning together.
6.3.3 Label Noise Robust Hybrid Representation Learning
The proposed GRID approach aims to jointly model the posterior and joint distributions of annotated RS images in a single learning procedure, while achieving label noise robust IRL. To this end, we introduce a label noise robust hybrid representation learning strategy to model RS images through: i) generative reasoning for the training samples with noisy labels; and ii) discriminative reasoning for the remaining samples in the training data. For the sake of simplicity, we refer to training samples with noisy labels as noisy samples, and to those with correct labels as clean samples hereafter. It is noted that generative reasoning is less annotation dependent than discriminative reasoning because it models the joint distribution $p(x, y)$ in a probabilistic generative process. Thus, the loss value differences between noisy samples and clean samples are higher for discriminative reasoning than for generative reasoning. The proposed integration of generative reasoning into discriminative reasoning allows noisy samples to be detected automatically based on the loss values of $\mathcal{L}$ incurred through generative and discriminative reasoning. Accordingly, we decide whether a training sample is noisy or clean based on the loss values acquired from discriminative and generative task heads. To this end, we define our automatic noisy sample detection procedure as follows. A training sample is considered as noisy if it leads to a significantly higher loss value from the discriminative task head compared to the generative task head. For a given mini-batch $\mathcal{B}$, we first sort the differences of normalized loss values acquired from discriminative and generative task heads. This can be defined as a non-decreasing sequence $A$ as follows:
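$$A = (a_k)_{k=1}^{|\mathcal{B}|}, \quad a_k \leq a_{k+1}, \quad a_k \in \{\mathcal{L}(\hat{y}_i^d, y_i) - \mathcal{L}(\hat{y}_i^g, y_i)\}_{(x_i, y_i) \in \mathcal{B}} \;\; \forall k, \tag{6.8}$$
where loss values are normalized based on the min-max scaling strategy. Then, we divide $\mathcal{B}$ into the set $\mathcal{W}$ of noisy samples and the set $\mathcal{C}$ of clean samples, where $\mathcal{B} = \mathcal{W} \cup \mathcal{C}$; $\mathcal{W} \cap \mathcal{C} = \emptyset$, as follows:
$$\mathcal{W} = \{(x_i, y_i) \mid (x_i, y_i) \in \mathcal{B} \wedge \mathcal{L}(\hat{y}_i^d, y_i) - \mathcal{L}(\hat{y}_i^g, y_i) > a_\lambda\} \tag{6.9}$$
$$\mathcal{C} = \{(x_i, y_i) \mid (x_i, y_i) \in \mathcal{B} \wedge \mathcal{L}(\hat{y}_i^d, y_i) - \mathcal{L}(\hat{y}_i^g, y_i) \leq a_\lambda\}. \tag{6.10}$$
$\mathcal{W}$ includes the samples from $\mathcal{B}$ associated with the $\lambda \in \{1, 2, \ldots, |\mathcal{B}|\}$ largest elements of $A$ (i.e., the $\lambda$ highest loss value differences), while $\mathcal{C}$ includes the rest of the samples from $\mathcal{B}$. $\lambda$ is a hyper-parameter of the proposed GRID approach.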
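The detection procedure described above can be sketched as follows. This is an illustrative NumPy sketch rather than the exact implementation: the helper names `min_max` and `split_noisy_clean` are hypothetical, the min-max scaling is assumed to be applied per task head over the mini-batch, and the split follows the prose description (the $\lambda$ highest loss value differences form the noisy set).

```python
import numpy as np

def min_max(values):
    """Min-max scale a vector of loss values to [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def split_noisy_clean(loss_d, loss_g, lam):
    """Return indices of noisy and clean samples within one mini-batch.

    loss_d / loss_g: per-sample losses of the discriminative / generative task heads.
    lam: number of samples to be treated as noisy.
    """
    diff = min_max(loss_d) - min_max(loss_g)
    order = np.argsort(diff, kind="stable")  # non-decreasing order, like the sequence A
    split = len(diff) - lam
    noisy_idx = np.sort(order[split:])       # the lam largest differences
    clean_idx = np.sort(order[:split])       # the remaining samples
    return noisy_idx, clean_idx

# Toy mini-batch of five samples: samples 1 and 4 have a much higher
# discriminative loss than generative loss after scaling.
loss_d = [0.2, 2.5, 0.3, 0.1, 1.9]
loss_g = [0.3, 0.4, 0.5, 0.2, 0.35]
noisy_idx, clean_idx = split_noisy_clean(loss_d, loss_g, lam=2)
```

In this toy example the two samples with the largest scaled differences (indices 1 and 4) are assigned to the noisy set, and the other three to the clean set.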
To learn the model parameters associated with discriminative and generative reasoning, one could directly apply optimization to jointly minimize $\mathcal{O}_g(\mathcal{B})$ and $\mathcal{O}_d(\mathcal{B})$. This leads to optimization of the objectives for all samples in $\mathcal{B}$ based on both generative and discriminative reasoning. When it is applied to the parameters $\theta$ of the CNN backbone $\phi$, the interference of different learning characteristics can limit how effectively generative reasoning is exploited for noisy samples and discriminative reasoning for clean samples. Therefore, the model parameters $\theta$ are updated based on whether a sample is assigned to $\mathcal{W}$ or $\mathcal{C}$. Accordingly, the update rule for $\theta$ is written as follows:
$$\theta \leftarrow \theta - \eta \nabla_\theta \frac{|\mathcal{W}| \mathcal{O}_g(\mathcal{W}) + |\mathcal{C}| \mathcal{O}_d(\mathcal{C})}{|\mathcal{B}|}, \tag{6.11}$$
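where $\eta$ is the learning rate. It is noted that we define the variational generative process based on the image descriptors. Accordingly, for the backbone parameters, the first and the third terms of the ELBO are assumed to be 0 (see (6.3)). Then, the update rule can be written based on only $\mathcal{L}$ as follows:
$$\theta \leftarrow \theta - \frac{\eta \nabla_\theta}{|\mathcal{B}|} \left[ \sum_{(x_i, y_i) \in \mathcal{W}} \mathcal{L}(\hat{y}_i^g, y_i) + \sum_{(x_i, y_i) \in \mathcal{C}} \mathcal{L}(\hat{y}_i^d, y_i) \right]. \tag{6.12}$$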
Based on this update rule, the CNN backbone parameters are updated only to minimize $\mathcal{L}$, whose values are obtained from the generative task head for noisy samples and the discriminative task head for clean samples. Accordingly, RS image representations are learned based on: i) generative reasoning for noisy training samples; and ii) discriminative reasoning for clean samples. However, for the remaining model parameters, it is important to maintain the characteristics of discriminative and generative reasoning throughout the training. Accordingly, discriminative task head parameters $\gamma$ are updated based on $\mathcal{O}_d(\mathcal{B})$, while the VAE parameters $\beta$ are updated based on $\mathcal{O}_g(\mathcal{B})$ as follows:
$$\begin{aligned} \gamma &\leftarrow \gamma - \eta \nabla_\gamma \mathcal{O}_d(\mathcal{B}) \\ \beta &\leftarrow \beta - \eta \nabla_\beta \mathcal{O}_g(\mathcal{B}). \end{aligned} \tag{6.13}$$
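The quantity that drives the backbone update in (6.12) can be illustrated with a few lines of NumPy. This is a hedged sketch, with the hypothetical function name `backbone_objective`; it only evaluates the scalar objective, whereas in practice its gradient with respect to $\theta$ would be obtained by the automatic differentiation of a DL framework.

```python
import numpy as np

def backbone_objective(loss_d, loss_g, noisy_idx):
    """Scalar inside the gradient of the backbone update rule.

    For samples flagged as noisy, the loss of the generative task head is used;
    for all other (clean) samples, the loss of the discriminative task head is used.
    The sum is divided by the mini-batch size.
    """
    loss_d = np.asarray(loss_d, dtype=float)
    loss_g = np.asarray(loss_g, dtype=float)
    is_noisy = np.zeros(len(loss_d), dtype=bool)
    is_noisy[np.asarray(noisy_idx, dtype=int)] = True
    per_sample = np.where(is_noisy, loss_g, loss_d)
    return float(per_sample.sum() / len(loss_d))

# Toy mini-batch of five samples in which samples 1 and 4 were flagged as noisy.
value = backbone_objective(
    loss_d=[0.2, 2.5, 0.3, 0.1, 1.9],
    loss_g=[0.3, 0.4, 0.5, 0.2, 0.35],
    noisy_idx=[1, 4],
)
```

Here the large discriminative losses of the two noisy samples (2.5 and 1.9) do not reach the backbone; they are replaced by the corresponding generative task head losses (0.4 and 0.35).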
By automatically detecting training samples associated with noisy annotations while learning the RS image representation space (characterized by the CNN backbone), the proposed GRID approach leverages the effectiveness of both discriminative and generative reasoning. This leads to learning RS image representations that are robust to label noise, without the overfitting to noisy labels that occurs in discriminative learning. It is worth mentioning that the proposed GRID approach is independent from the DNN architecture, the loss function $\mathcal{L}$, the learning task, the annotation type being considered and the type of label noise present in training data. In this chapter, we assess our approach under two scenarios, where training RS images are annotated with: 1) scene-level noisy multi-labels; and 2) pixel-level noisy labels. Under these scenarios, we consider three learning tasks with the corresponding loss functions and DNN architectures (see Section 6.4 for the details).
6.4 Dataset Description and Experimental Design
6.4.1 Dataset Description
We conducted experiments on the BigEarthNet-S2 and the DLRSD [203] RS image archives. We employed a subset of BigEarthNet-S2 that includes images acquired over Serbia in summer. It consists of 14,832 Sentinel-2 multispectral images. Each image is a section of: i) 120 × 120 pixels for 10m bands; ii) 60 × 60 pixels for 20m bands; and iii) 20 × 20 pixels for 60m bands. It is noted that bicubic interpolation is applied to 20m bands, while 60m bands are excluded from the experiments. For the experiments, we utilized the 19 class nomenclature of BigEarthNet-S2. We also extracted the CLC land cover map of each image for the selection of $\mathcal{L}$ (which requires the availability of land-cover maps during training). The DLRSD archive includes 2,100 aerial images. Each image has the size of 256 × 256 pixels with a spatial resolution of 30 cm, and is annotated with both multi-labels and pixel-level labels, where the class nomenclature is defined in [93]. For the experiments, these archives were divided into training, validation and test sets with the ratios of 70%, 10%, 20% for DLRSD and 52%, 24%, 24% for BigEarthNet-S2.
6.4.2 Experimental Design
To conduct experiments, we considered two different scenarios, where training
images are annotated with: 1) scene-level noisy multi-labels; and 2) pixel-level noisy
labels. For these scenarios, we tested our approach under three learning tasks with
their corresponding loss functions and DNN architectures that are explained in detail
in the following.
In the first scenario, IRL is achieved based on supervised multi-label RS image classification. For this scenario, binary cross entropy (BCE) loss function was chosen as $\mathcal{L}$ of the proposed GRID approach. Accordingly, each of the generative and discriminative task heads includes an FC layer as a classifier that produces multi-label class probabilities. The proposed approach applied to this scenario is denoted as GRID (BCE) hereafter.
In the second scenario, IRL is achieved by: 1) semantic segmentation for land
cover map generation based on pixel-wise cross entropy loss function (denoted
as GRID (PCE) hereafter); and 2) multi-label co-occurrence prediction based on
region representation learning (RRL) loss function introduced in [79] (denoted as
GRID (RRL) hereafter). For GRID (PCE), each of the generative and discriminative
task heads consists of three transposed convolutional layers with the filters of 64, 32
and the number of considered classes. For GRID (RRL), each of the generative and
discriminative task heads includes an FC layer that produces the prediction for graph
driven region-based representations. The reader is referred to [79] for the details.
For both scenarios, we employed the DenseNet-121 architecture [148] as the CNN backbone, and utilized the latent dimension of 128 for the VAE encoder. The feature decoder of VAE employs an FC layer with the hidden unit size of image descriptor dimension (which is 1024 for DenseNet-121) for GRID (BCE) and GRID (RRL), while the FC layer is replaced with a convolutional layer with the kernel size of 1 × 1 for GRID (PCE). The parameter $\lambda = k|\mathcal{B}|/100$ was varied as $k \in \{10, 20, \ldots, 90\}$, where $k$ denotes the percentage of each mini-batch that is identified as the set of noisy samples (denoted as $\lambda_{k\%}$ hereafter).
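As a small worked example of this parameterization, the number of samples treated as noisy can be computed from $k$ and the mini-batch size. The sketch below uses a hypothetical helper `lambda_from_percentage`; since $k|\mathcal{B}|/100$ is not an integer for every $k$ when $|\mathcal{B}| = 128$, rounding down is an assumption made here for illustration only.

```python
def lambda_from_percentage(k, batch_size):
    """Number of samples per mini-batch treated as noisy for a given percentage k.

    Rounding down to an integer is an assumption of this sketch.
    """
    return (k * batch_size) // 100

# Values of lambda for k = 10, 20, ..., 90 and a mini-batch size of 128.
lambdas = {k: lambda_from_percentage(k, 128) for k in range(10, 100, 10)}
```

For instance, $k = 50$ corresponds to 64 of the 128 samples in a mini-batch, while $k = 20$ gives $25.6$, which this sketch rounds down to 25.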
To assess the robustness of our approach to label noise for both scenarios, we applied synthetic label noise injection to the training sets in the range of [10%, 60%] with the step size of 10%. In particular, for scene-level annotations, a set of class labels is randomly chosen from the training label set $\mathcal{Y}$ based on a synthetic label noise injection ratio (SLNIR). Then, each selected class label is randomly changed into one of the other class labels, which are not associated with the corresponding image. This ensures that both missing and wrong labels are considered as noisy annotations. For pixel-level annotations, $\mathcal{Y}$ is converted to the set of unique class labels associated with each image prior to random selection based on SLNIR. Then, the changed classes are propagated to all relevant pixel labels.
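For scene-level multi-labels, the injection procedure described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the experiments: the function name `inject_multilabel_noise` is hypothetical, labels are assumed to be stored as a multi-hot matrix, and the SLNIR is interpreted as the fraction of all assigned class labels that are changed.

```python
import numpy as np

def inject_multilabel_noise(labels, slnir, rng):
    """Swap a fraction `slnir` of the assigned class labels with absent classes.

    labels: (num_images, num_classes) multi-hot matrix.
    Each selected label is removed (missing label) and replaced by a class
    that is not associated with the image (wrong label).
    """
    labels = np.asarray(labels, dtype=bool)
    noisy = labels.copy()
    assigned = np.argwhere(labels)  # all (image, class) pairs that are present
    n_flip = int(round(slnir * len(assigned)))
    chosen = assigned[rng.choice(len(assigned), size=n_flip, replace=False)]
    for i, c in chosen:
        # Candidate replacements: classes absent from the original labels
        # of image i and not already added by a previous swap.
        candidates = np.flatnonzero(~labels[i] & ~noisy[i])
        if candidates.size == 0:
            continue  # no absent class left for this image; keep the label
        noisy[i, c] = False
        noisy[i, rng.choice(candidates)] = True
    return noisy.astype(int)

rng = np.random.default_rng(0)
clean_labels = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
])
noisy_labels = inject_multilabel_noise(clean_labels, slnir=0.5, rng=rng)
```

Because every swap removes one assigned label and adds one class that was absent, the number of labels per image is preserved, while half of the originally assigned labels are replaced in this toy example.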
We conducted experiments related to all scenarios and learning tasks on a single RS application for the sake of simplicity. This application was selected as content-based image retrieval (CBIR) since learning accurate image features is of great importance for similarity matching in CBIR. To apply CBIR after learning RS image representations, for each archive, we employed the training set for selecting query images, while images were retrieved from the test set. We performed the hyper-parameter selection of our approach on the validation set in the context of CBIR. We trained our approach for 100 epochs by using the Adam gradient descent optimization algorithm with the initial learning rate of $10^{-3}$ and the mini-batch size of 128. After RS IRL is achieved by the proposed approach, we obtained the features of query and archive images from the last layer of the backbone. Then, to apply CBIR, similarity matching of these features was performed by using the $\chi^2$-distance measure. CBIR results are provided in terms of normalized discounted cumulative gain (NDCG), which was averaged over the 20 and 30 most similar images for DLRSD and BigEarthNet-S2, respectively.
TABLE 6.1: RESULTS (%) OBTAINED BY THE PROPOSED GRID (BCE) APPROACH FOR DIFFERENT VALUES OF λ AND SLNIR (%) (DLRSD ARCHIVE)
SLNIR λ10% λ20% λ30% λ40% λ50% λ60% λ70% λ80% λ90%
0 66.4 65.9 64.5 64.4 64.5 57.9 59.3 60.4 58.5
10 63.1 64.2 62.9 63.6 60.7 60.2 56.2 60.2 57.8
20 61.6 60.2 64.5 62.5 59.9 60.8 58.1 55.7 54.3
30 57.0 56.0 56.8 57.4 57.5 59.2 54.9 58.9 55.0
40 51.9 54.6 57.5 54.7 56.1 55.5 54.6 50.0 55.7
50 51.6 51.1 51.1 54.0 53.3 54.3 50.8 47.8 52.7
60 49.3 48.7 47.9 50.7 51.4 51.2 48.2 49.0 47.2
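The two ingredients of the retrieval evaluation, $\chi^2$-distance based similarity matching and NDCG, can be sketched in NumPy as follows. The sketch rests on several assumptions that are not specified in the text: the symmetric form of the $\chi^2$ distance with a factor of 1/2 and a small `eps` for numerical stability, non-negative feature vectors, and NDCG with linear gains and a base-2 logarithmic discount. The function names are hypothetical, and how the relevance of a retrieved image is graded (e.g., by label overlap with the query) is left to the caller.

```python
import numpy as np

def chi2_distance(a, b, eps=1e-12):
    """Symmetric chi-square distance between two non-negative feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(0.5 * np.sum((a - b) ** 2 / (a + b + eps)))

def rank_archive(query, archive):
    """Indices of archive features sorted from most to least similar to the query."""
    distances = np.array([chi2_distance(query, feat) for feat in archive])
    return np.argsort(distances, kind="stable")

def ndcg_at_k(relevance_in_ranked_order, k):
    """NDCG over the k top-ranked images, given their graded relevance scores."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    top = rel[:k]
    dcg = np.sum(top * discounts[: len(top)])
    ideal = np.sort(rel)[::-1][:k]
    idcg = np.sum(ideal * discounts[: len(ideal)])
    return float(dcg / idcg) if idcg > 0 else 0.0

# Toy archive of three feature vectors; the second one equals the query.
query = np.array([0.2, 0.5, 0.3])
archive = np.array([[0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]])
ranking = rank_archive(query, archive)
```

In the toy example, the archive image identical to the query is ranked first, followed by the one with the most similar feature vector. A ranking that lists images in decreasing order of relevance yields an NDCG of 1, and any other ordering yields a lower score.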
For the two above-mentioned scenarios, we carried out experiments to: 1) perform a sensitivity analysis; 2) conduct an ablation study; and 3) compare our approach with the state-of-the-art methods in the framework of CBIR. Under the first scenario, we compared our GRID (BCE) approach with: 1) the early-learning regularization (ELR) framework [196]; 2) the joint training with co-regularization (JoCoR) approach [197]; and RS IRL with multi-label classification by using 3) focal loss (denoted as FL) [195]; 4) asymmetric loss (denoted as ASL) [198]; and 5) the standard binary cross entropy (BCE) loss. It is worth noting that ELR, JoCoR and FL were originally introduced for single-label classification problems. By following [191], we adapted them to multi-label classification. Under the second scenario, we compared our GRID (PCE) approach with: 1) the learning with noise correction method for high-resolution land cover mapping [31] (denoted as LNC); and 2) RS IRL with semantic segmentation by the standard pixel-wise cross-entropy loss (PCE). For the second scenario, we also compared our GRID (RRL) approach with RS IRL with multi-label co-occurrence prediction based on RRL loss [79] (denoted as RRL). For each comparison with our approach, we used the same CNN backbone and the same task heads.
6.5 Experimental Results
6.5.1 Sensitivity Analysis of the Proposed Approach
In this sub-section, we present the results of the sensitivity analysis of the proposed approach under scene-level noisy labels (i.e., first scenario) and pixel-level noisy labels (i.e., second scenario) in terms of different values of the $\lambda$ hyper-parameter at different values of SLNIR. We also assessed the effectiveness of our automatic noisy sample detection procedure for both scenarios in terms of the noisy sample detection accuracy.
1st Scenario (Scene-Level Noisy Labels): Tables 6.1 and 6.2 show the results of GRID (BCE) for the DLRSD and BigEarthNet-S2 archives, respectively. One can see from Table 6.1 that when the level of training label noise increases, our approach generally achieves higher scores by detecting more training samples as noisy with higher values of $\lambda$ for the DLRSD archive. However, as can be seen from Table 6.2, our GRID (BCE) approach achieves the highest scores when 20% of each mini-batch is identified as noisy ($\lambda_{20\%}$) for most of the SLNIR values under BigEarthNet-S2. It is worth noting that BigEarthNet-S2 includes a higher number of RS images compared to DLRSD, and thus there is a lower risk of overfitting to noisy labels. Accordingly, when the training set is sufficiently large, as in BigEarthNet-S2, our approach is capable of achieving a high performance with lower values of $\lambda$ even under a high label noise rate. However, when the rate of label noise in a training set is high for a small dataset like DLRSD, our approach needs to increase the effect of generative reasoning by detecting a higher number of noisy samples (i.e., a high value of $\lambda$) for more accurate IRL. Since no single $\lambda$ value provides the highest scores under all SLNIR values for DLRSD, we set it based on the results on BigEarthNet-S2. Accordingly, for the rest of the experiments, we set $\lambda$ of GRID (BCE) to $\lambda_{20\%}$.
TABLE 6.2: RESULTS (%) OBTAINED BY THE PROPOSED GRID (BCE) APPROACH FOR DIFFERENT VALUES OF λ AND SLNIR (%) (BIGEARTHNET-S2 ARCHIVE)
SLNIR λ10% λ20% λ30% λ40% λ50% λ60% λ70% λ80% λ90%
0 67.6 67.3 66.6 66.4 65.3 63.7 63.7 61.9 59.9
10 66.4 67.9 67.2 66.4 65.1 64.7 63.2 62.8 62.5
20 65.4 65.9 65.5 65.1 64.0 61.4 61.6 61.9 61.3
30 64.9 65.2 64.6 63.7 61.5 60.6 62.0 58.6 59.2
40 63.8 64.4 63.3 62.3 60.1 59.8 60.0 59.3 59.9
50 63.1 62.1 62.4 62.0 58.3 60.9 59.5 57.8 56.3
60 61.5 62.0 61.8 61.2 59.4 58.7 58.5 58.9 58.6
FIGURE 6.2: Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive). [Panels (a)-(f): x-axis Epoch (0 to 100), y-axis Accuracy (%); legend: $\lambda_{10\%}$, $\lambda_{20\%}$, $\lambda_{30\%}$, $\lambda_{40\%}$, $\lambda_{50\%}$, $\lambda_{60\%}$, Random Selection.]
We would like to note that if the value of $\lambda$ is high, there is a risk of detecting training samples with correct labels (i.e., clean samples) as noisy samples. To analyze the effectiveness of our automatic noisy sample detection procedure, Figures 6.2 and 6.3 show the noisy sample detection accuracy when $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (e.g., $\lambda_{20\%}$ for SLNIR = 20%) for DLRSD and BigEarthNet-S2, respectively, under the first scenario. One can observe from the figures that our approach detects noisy samples more accurately than random selection under each SLNIR value. This shows the effectiveness of our automatic noisy sample detection procedure in the proposed approach. It can also be seen from the figures that after a certain number of training epochs, noisy sample detection accuracy starts to decrease for most of the SLNIR values. This is due to the fact that, as the proposed approach combines generative and discriminative reasoning during training, the image representation space encoded by the CNN backbone starts to become robust to noisy samples. Then, detecting noisy samples based on the image features from the backbone becomes progressively harder for our approach as training continues. This leads to a decrease in noisy sample detection accuracy after a certain number of epochs. In greater detail, our approach trained on BigEarthNet-S2 provides higher detection accuracy compared to that on DLRSD, especially at higher SLNIR values. This is due to the higher number of training samples in BigEarthNet-S2 compared to DLRSD, which allows our approach to learn model parameters and to detect noisy samples more accurately.
FIGURE 6.3: Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive). [Panels (a)-(f): x-axis Epoch (0 to 100), y-axis Accuracy (%); legend: $\lambda_{10\%}$, $\lambda_{20\%}$, $\lambda_{30\%}$, $\lambda_{40\%}$, $\lambda_{50\%}$, $\lambda_{60\%}$, Random Selection.]
TABLE 6.3: RESULTS (%) OBTAINED BY THE PROPOSED GRID (PCE) APPROACH FOR DIFFERENT VALUES OF λ AND SLNIR (%) (DLRSD ARCHIVE)
SLNIR λ10% λ20% λ30% λ40% λ50% λ60% λ70% λ80% λ90%
0 65.8 63.5 63.3 62.2 59.5 63.3 62.5 61.1 61.5
10 62.7 61.2 61.6 58.1 60.1 57.2 59.1 60.0 59.2
20 59.7 59.8 59.5 58.3 55.0 57.9 56.1 55.6 57.4
30 59.0 57.1 56.5 57.5 53.6 53.0 55.5 56.1 55.0
40 56.1 55.9 56.5 56.2 55.1 53.4 55.0 54.0 52.6
50 53.9 52.1 52.2 54.4 50.5 52.1 49.7 51.4 51.9
60 49.4 49.1 46.7 44.4 47.5 47.4 48.2 50.8 47.1
2nd Scenario (Pixel-Level Noisy Labels): Tables 6.3 and 6.4 show the results of GRID (PCE) for the DLRSD and BigEarthNet-S2 archives, respectively. By assessing the tables, one can observe that as the SLNIR value increases, the proposed approach achieves higher scores with higher values of $\lambda$ for DLRSD. However, for BigEarthNet-S2, the proposed GRID (PCE) approach achieves the highest scores when $\lambda$ is set to $\lambda_{10\%}$ for most of the SLNIR values. This is in line with our conclusion from the first scenario. In greater detail, for most of the SLNIR values, GRID (PCE) achieves higher scores with lower values of $\lambda$ compared to GRID (BCE) for both archives. This is due to the fact that the semantic segmentation task of the 2nd scenario is more complex than the multi-label image classification task. Accordingly, our GRID (PCE) approach requires increasing the effect of discriminative reasoning over generative reasoning compared to GRID (BCE) to overcome the complexity of the semantic segmentation task. This can be achieved by decreasing the value of $\lambda$, as can be seen from the results. For the rest of the experiments, we set $\lambda$ of GRID (PCE) to $\lambda_{10\%}$ based on the BigEarthNet-S2 results, similar to the first scenario.
TABLE 6.4: RESULTS (%) OBTAINED BY THE PROPOSED GRID (PCE) APPROACH FOR DIFFERENT VALUES OF λ AND SLNIR (%) (BIGEARTHNET-S2 ARCHIVE)
SLNIR λ10% λ20% λ30% λ40% λ50% λ60% λ70% λ80% λ90%
0 64.7 64.7 64.6 64.1 63.8 62.5 61.0 61.5 61.2
10 63.7 63.4 62.9 62.6 60.7 60.2 60.7 61.0 58.8
20 62.1 60.7 61.8 60.4 60.3 55.8 55.8 55.6 61.2
30 61.8 62.0 60.2 60.6 59.2 54.2 60.4 60.2 57.8
40 60.6 60.0 59.9 58.7 57.8 53.8 55.0 53.9 53.5
50 59.4 59.1 58.3 57.0 52.8 53.9 55.7 56.8 52.6
60 59.8 58.6 52.0 54.5 53.2 53.5 52.7 54.2 54.5
FIGURE 6.4: Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive). [Panels (a)-(f): x-axis Epoch (0 to 100), y-axis Accuracy (%); legend: $\lambda_{10\%}$, $\lambda_{20\%}$, $\lambda_{30\%}$, $\lambda_{40\%}$, $\lambda_{50\%}$, $\lambda_{60\%}$, Random Selection.]
Figures 6.4 and 6.5 show the noisy sample detection accuracy of the proposed GRID (PCE) approach for DLRSD and BigEarthNet-S2, respectively, when $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value under the second scenario. One can see from the figures that GRID (PCE) is capable of detecting noisy samples with higher accuracy than random selection under each SLNIR value for both archives. In particular, the proposed approach achieves higher detection accuracy on BigEarthNet-S2 than on DLRSD. These observations are in line with our conclusions from the first scenario. This shows that our approach is capable of accurately detecting noisy samples independently from the considered loss function, learning task, DNN and training sample annotation type. In greater detail, unlike the first scenario, after a certain number of training epochs the noisy sample detection accuracy of GRID (PCE) becomes non-decreasing for some SLNIR values. This is due to the relative complexity of the semantic segmentation task compared to the multi-label image classification task, which may require more training epochs for our approach, especially under high SLNIR values. Since the label noise rate of a training set is assumed to be unknown for our approach, we avoided over-tuning hyper-parameters such as the number of training epochs. It is noted that the results for the sensitivity analysis of the 2nd scenario were also confirmed through experiments for our GRID (RRL) approach on both archives (not reported for space constraints).
FIGURE 6.5: Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive). [Panels (a)-(f): x-axis Epoch (0 to 100), y-axis Accuracy (%); legend: $\lambda_{10\%}$, $\lambda_{20\%}$, $\lambda_{30\%}$, $\lambda_{40\%}$, $\lambda_{50\%}$, $\lambda_{60\%}$, Random Selection.]
6.5.2 Ablation Study of the Proposed Approach
In this sub-section, we present an ablation study of our approach to analyze the effectiveness of our label noise robust hybrid representation learning compared to using: i) only discriminative reasoning; ii) only generative reasoning; and iii) their standard joint learning under both first and second scenarios. For the standard joint learning of discriminative and generative reasoning, we jointly minimize $\mathcal{O}_g$ and $\mathcal{O}_d$ for all the training samples without the detection of noisy samples. This leads to optimization of all the model parameters based on both generative and discriminative reasoning of noisy samples and clean samples together. Figure 6.6 shows the results of using: i) discriminative reasoning; ii) generative reasoning; iii) the standard joint learning of discriminative and generative reasoning; and iv) our label noise robust hybrid representation learning strategy under different SLNIR values for both archives. By assessing the figure, one can observe that our label noise robust hybrid representation learning strategy provides the highest scores for most of the SLNIR values independently from the considered scenarios. This shows that our approach is capable of: i) accurately combining generative and discriminative reasoning independently from the considered loss function, learning task and type of annotation; and ii) effectively adjusting the whole learning procedure accordingly for label noise robust IRL. In greater detail, generative reasoning alone achieves the lowest scores under most of the SLNIR values and considered scenarios. However, its performance is less affected by the increase in label noise rate compared to discriminative reasoning. This shows the capability of generative reasoning to allow robust learning of image representations under label noise. One can see from the figure that the standard joint learning provides lower scores compared to using only discriminative reasoning for most of the SLNIR values under both scenarios. Learning image representations based on discriminative and generative reasoning on all the training samples may not be accurately achieved due to interference of different learning characteristics. However, when the complementary characteristics of discriminative and generative reasoning are modeled based on our hybrid representation learning strategy, the proposed approach is capable of overcoming this limitation. This shows the importance of the label noise robust hybrid representation learning strategy in our approach.
FIGURE 6.6: Results obtained by using: 1) discriminative reasoning; 2) generative reasoning; 3) their standard joint learning; and 4) our label noise robust hybrid representation learning strategy for different values of SLNIR when RS IRL is achieved by: i) multi-label classification on (a) DLRSD and (b) BigEarthNet-S2; ii) semantic segmentation on (c) DLRSD and (d) BigEarthNet-S2; and iii) multi-label co-occurrence prediction on (e) DLRSD and (f) BigEarthNet-S2. [Panels (a)-(f): x-axis SLNIR (%) (0 to 60), y-axis NDCG (%); legend: Discriminative reasoning, Generative reasoning, The standard joint learning, Our hybrid representation learning.]
TABLE 6.5: RESULTS (%) OBTAINED BY BCE, ELR [196], FL [195], ASL [198], JOCOR [197]
AND THE PROPOSED GRID (BCE) APPROACH UNDER DIFFERENT VALUES OF SLNIR (%)
(DLRSD ARCHIVE)
SLNIR BCE ELR FL ASL JOCOR GRID (BCE)
0 62.7 63.8 62.8 63.5 61.7 67.2
10 60.3 62.0 58.7 57.2 61.7 64.2
20 55.9 59.2 55.3 51.4 59.3 62.5
30 55.6 54.4 52.6 50.6 55.5 62.2
40 50.1 50.9 48.6 46.3 51.8 55.7
50 48.3 50.0 46.4 43.7 49.1 53.8
60 47.1 47.2 46.1 43.6 47.9 50.6
TABLE 6.6: RESULTS (%) OBTAINED BY BCE, ELR [196], FL [195], ASL [198], JOCOR [197]
AND THE PROPOSED GRID (BCE) APPROACH UNDER DIFFERENT VALUES OF SLNIR (%)
(BIGEARTHNET-S2 ARCHIVE)
SLNIR BCE ELR FL ASL JOCOR GRID (BCE)
0 67.6 68.9 66.2 65.6 66.6 68.2
10 65.7 66.6 64.1 63.6 66.4 66.7
20 63.3 64.2 63.2 62.7 64.5 65.7
30 62.6 63.1 61.6 62.3 62.9 63.4
40 61.6 61.9 61.3 61.7 62.1 63.3
50 60.2 60.4 59.8 60.4 61.1 61.6
60 59.6 59.8 59.9 60.1 60.2 60.3
6.5.3 Comparison Among the State-of-the-Art Methods
In this sub-section, we analyze the effectiveness of the proposed approach compared
to different state-of-the-art methods for both scenarios under different values of
SLNIR.
1st Scenario (Scene-Level Noisy Labels): We compared our GRID (BCE) approach with BCE, ELR [196], FL [195], ASL [198] and JoCoR [197] for the first scenario. Tables 6.5 and 6.6 show the corresponding results for DLRSD and BigEarthNet-S2 archives, respectively. By analyzing the tables, one can observe that the proposed GRID (BCE) approach leads to the highest scores for almost all the SLNIR values on both DLRSD and BigEarthNet-S2 archives. For example, our approach outperforms ELR by almost 8% NDCG score when SLNIR = 30% for DLRSD. Similarly, it provides more than 3% higher NDCG score compared to ASL when SLNIR = 10% for BigEarthNet-S2. As the SLNIR value increases, the reduction in the NDCG scores is higher for DLRSD compared to BigEarthNet-S2. This is due to the small number of images present in DLRSD, which leads to overfitting on noisy labels more easily than in BigEarthNet-S2. However, even under high SLNIR values for DLRSD, our approach achieves comparable results with other methods under smaller SLNIR values. As an example, our approach under SLNIR = 40% achieves similar performance to BCE under SLNIR = 30%. These results demonstrate the success of the proposed GRID (BCE) approach compared to other methods when the training images are annotated with scene-level noisy multi-labels.
2nd Scenario (Pixel-Level Noisy Labels): We compared our GRID (PCE) approach with PCE and LNC [31], while GRID (RRL) was compared with RRL [79] for the second scenario. Tables 6.7 and 6.8 show the corresponding results for DLRSD and BigEarthNet-S2 archives, respectively. One can see from the tables that both GRID (PCE) and GRID (RRL) achieve the highest scores compared to other methods under most of the SLNIR values. As an example, when SLNIR = 10% for DLRSD, GRID (PCE) achieves more than 1% higher NDCG score compared to LNC, which is specifically designed for pixel-wise label noise robust semantic segmentation of RS images. Even when synthetic pixel-wise label noise is not injected to the training sets (SLNIR = 0%), both GRID (PCE) and GRID (RRL) are capable of providing the highest scores for the BigEarthNet-S2 archive. This is due to the fact that even if SLNIR = 0%, our approach is learning RS image representations robust to label noise already present in the original training sets. In greater detail, only when SLNIR equals 50% and 0% for DLRSD, our GRID (PCE) approach is outperformed by LNC and PCE, respectively, with 1% difference of NDCG scores. However, this is specific to the DLRSD archive and not valid for the BigEarthNet-S2 archive. These results show the success of our approach compared to other methods when the training samples are annotated with pixel-level noisy labels. This is in line with our conclusion from the first scenario.
TABLE 6.7: RESULTS (%) OBTAINED BY PCE, LNC [31], RRL [79] AND THE PROPOSED GRID (PCE) AND GRID (RRL) APPROACHES UNDER DIFFERENT VALUES OF SLNIR (%) (DLRSD ARCHIVE)
SLNIR PCE LNC GRID (PCE) RRL GRID (RRL)
0 65.0 62.5 64.0 57.5 58.1
10 60.0 60.9 62.1 52.2 52.7
20 59.1 60.8 61.1 51.8 54.9
30 56.8 57.5 57.7 48.8 52.9
40 55.8 55.2 56.1 43.6 53.3
50 53.0 53.6 52.6 44.5 51.8
60 48.3 48.2 48.3 45.1 47.7
TABLE 6.8: RESULTS (%) OBTAINED BY PCE, LNC [31], RRL [79] AND THE PROPOSED GRID (PCE) AND GRID (RRL) APPROACHES UNDER DIFFERENT VALUES OF SLNIR (%) (BIGEARTHNET-S2 ARCHIVE)
SLNIR PCE LNC GRID (PCE) RRL GRID (RRL)
0 63.5 62.5 64.9 62.4 63.8
10 61.8 61.7 61.8 60.1 62.5
20 61.2 61.3 61.8 58.8 62.1
30 61.1 61.2 61.5 59.3 61.1
40 59.9 60.0 60.0 58.5 61.2
50 58.9 58.8 59.0 57.4 60.5
60 58.2 58.3 58.6 57.5 59.9
It is worth noting that, under the two scenarios, we tested our approach with three
different loss functions (BCE, PCE and RRL), three different learning tasks (multi-label
image classification, semantic segmentation and multi-label co-occurrence prediction)
with the corresponding DNN architectures, and two different annotation types
(scene-level and pixel-level), and compared it to state-of-the-art methods. The results
show that the proposed approach is capable of accurately learning RS image representations
under label noise independently of the considered DNN architecture, loss function,
learning task and annotation type. This is due to the capability of our approach to
simultaneously leverage the robustness of generative reasoning to noisy labels and
the effectiveness of discriminative reasoning for IRL.
6.6 Conclusion
In this chapter, we have introduced a novel generative reasoning integrated label
noise robust deep representation learning (GRID) approach to model the comple-
mentary characteristics of discriminative and generative reasoning for IRL under
noisy labels. To achieve this, the proposed GRID approach first integrates generative
reasoning into discriminative reasoning through a supervised VAE as the probabilistic
generative process. Due to this integration, both generative and discriminative
reasoning share the same CNN backbone, which allows us to: 1) model the posterior and
joint distributions of annotated RS images in a single learning procedure; and
2) automatically detect training samples with noisy labels based on the loss values acquired
from the discriminative and generative task heads. This is achieved by the label noise
robust hybrid representation learning strategy of our approach, which models RS images
through generative reasoning for the training samples with noisy labels and through
discriminative reasoning for the remaining training samples. In this way, the proposed
GRID approach learns discriminative RS image representations through the CNN backbone
while preventing interference of noisy labels during training.
It is worth noting that our approach is independent of the type of DNN architecture,
loss function, learning task, annotation and label noise being considered, and can
operate with any DL-based IRL method. In addition, GRID
does not require the availability of a clean subset of a training set. In this chapter,
we consider two different scenarios, where training samples are annotated with: 1)
scene-level noisy multi-labels; and 2) pixel-level noisy labels. Experimental analysis
conducted on two RS image archives shows the effectiveness of our approach for
these scenarios. In particular, the success of our approach is shown under three learn-
ing tasks with the corresponding loss functions and DNN architectures at different
synthetic label noise injection rates while considering both wrong and missing labels.
This shows that the proposed approach accurately learns discriminative RS image
representations, while ensuring the robustness of the whole learning procedure to
noisy labels independently of the IRL method being considered. We underline
that this is a very important advantage for operational RS applications, which deal
with noisy annotations and require different IRL scenarios.
We would like to point out that our automatic noisy sample detection procedure is
controlled by the hyper-parameter $\lambda$. Its selection may depend on the level
of noisy labels in a training set, which is unknown most of the time in operational
scenarios. Accordingly, as a future development of this work, we plan to investigate
strategies for automatically detecting the level of noise in training data, and then
integrating it into our automatic noisy sample detection procedure for the proposed
approach.
Chapter 7
Plasticity-Stability Preserving
Multi-Task Image Representation
Learning in Remote Sensing
DL-based multi-task learning (MTL) methods have recently attracted attention for
IRL in RS. For a given set of tasks (e.g., scene classification, semantic segmentation,
image reconstruction, etc.), existing MTL methods employ a joint optimization al-
gorithm on the direct aggregation of task-specific loss functions. Such an approach
may provide limited performance when: i) tasks compete or even distract each other;
ii) one of the tasks dominates the whole learning procedure; or iii) characterization
of each task is under-performed compared to single-task learning. This is mainly
due to the lack of: i) the plasticity condition (which is associated with sensitivity to new
information); or ii) the stability condition (which is associated with protection from radical
disruptions by new information) of the whole learning procedure. To avoid this issue,
in this chapter, we propose a novel plasticity-stability preserving multi-task learning
(PLASTA-MTL) approach to ensure the plasticity and the stability conditions of the
whole learning procedure independently of the number and type of tasks. This is
achieved by defining two novel loss functions. The first loss function is the plasticity
preserving loss (PPL) function that aims to enforce the global image representation
space to be sensitive to new information learned with each task. This is achieved by
minimizing the difference of gradient magnitudes for the global representation and
task-specific embedding spaces. The second loss function is the stability preserving
loss (SPL) function that aims to protect the global representation space from being
radically disrupted by a new task. This is achieved by minimizing the angular distances
between the task gradients over the global representation space. To effectively employ the
proposed loss functions, we also introduce a novel sequential optimization algorithm.
Experimental results show the effectiveness of the proposed approach compared to
the state-of-the-art MTL methods. This chapter is mainly based on the following
publication:
G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning
for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.
Chapter 7. Plasticity-Stability Preserving Multi-Task IRL in Remote Sensing 103
7.1 Introduction
As highlighted in the previous chapter, in DL-based IRL methods, image representa-
tions are automatically learned during the optimization of an objective function based
on the characteristics of a learning task. Most of the existing methods in RS utilize
the following learning tasks: 1) scene classification [4]–[14]; 2) similarity learning
[15]–[27]; 3) image reconstruction [28], [29]; and 4) semantic segmentation [30], [79].
Each task has a different objective, which leads to a different optimization procedure
throughout the training of the considered deep neural network (DNN). Accordingly,
learned image representations have different characteristics for different tasks, and
thus carry different information to be utilized in the final application. As an example,
when the task is scene classification, RS image representations can be learned with
convolutional neural networks (CNNs) by optimizing entropy-based loss functions.
In this way, image representations are encoded to separate pre-defined classes, which
maximizes inter-class distances in the image representation space. For the similarity
learning task, on the other hand, image representations are learned to discriminate
dissimilar RS images, which minimizes intra-class distances in the image representation
space [34]. This can be achieved by employing Siamese CNNs on tuples of RS images
to optimize triplet or contrastive loss functions. If image reconstruction is chosen as
the task, auto-encoder neural networks can be used first to construct the representations
and then to recover RS images with a reconstruction loss. In this way, the
resulting image representations are robust to noise in RS images [204]. In RS, it is
common to use the above-mentioned tasks in the framework of single-task learning
(STL).
However, using a single task may not be sufficient to describe the complex content
of RS images. To address this issue, multiple tasks can be jointly utilized for
image representation learning. When image representation learning is based on
multiple tasks, the resulting latent space can better represent the complex
semantic content of RS images. Accordingly, a few DL-based multi-task learning (MTL)
methods have been recently introduced in RS. As an example, in [52], RS image
similarity learning based on triplet loss is combined with the scene classification
task. In this method, task-specific heads are combined with the CNN backbone
shared by the two tasks, while the task-specific loss functions are jointly optimized
by minimizing their sum. In this way, MTL is regarded as a
joint optimization problem based on the aggregation of task-specific loss functions.
This formulation is followed by most of the MTL methods in RS.
Due to the complexity of the MTL problem, it is common that: i) tasks compete or
even distract each other during training; ii) one of the tasks dominates the whole
learning procedure; or iii) the characterization of each task under-performs
compared to STL [58]. These problems undermine the effectiveness of the whole
representation learning procedure [59]. They occur due to the stability-plasticity
constraint of MTL [60]. MTL methods need to be sensitive to new information
learned from each task, which allows each task to contribute to improving
the image characterization. This condition is known as plasticity [60].
If there is a lack of plasticity in response to new information, the image
representation space will be only slightly affected while learning a new task, and thus
will barely reflect the different characteristics of representations learned via different
tasks. If the considered DNN suffers from the lack of plasticity, information
specific to each task will be encoded only in the corresponding task-specific head.
The possible drawbacks of this issue are twofold. First, only the general features of
RS images can be encoded in the CNN backbone, and thus image features extracted
from the considered DNN will have lower discrimination capability compared to
STL. Second, one of the tasks can dominate the global image representation space. In
this case, all tasks except the one that dominates the image representation space
learned via the backbone will not significantly affect the image features. For MTL,
during the learning process of a new task, new information encoded in the considered
DNN should not radically disrupt what is already characterized based on the other
tasks. This condition is known as stability. When there is a lack of stability
in response to new information captured via a new task, there is a risk that previous
information encoded by the considered DNN is forgotten. Thus, the global image
representation space will be mainly characterized by the
representations learned via a single task. This risk is more evident when some of the
tasks compete with each other. In this case, since every task aims to radically change the
global image representation space compared to other tasks, tasks may distract each
other, which leads to less accurate RS image characterization for MTL compared to STL.
The MTL formulation of the existing DL-based methods (which is based on joint
optimization) offers limited control over the learning of each task. Thus, it does not allow
controlling the plasticity and stability of the whole learning procedure. It is also worth noting
that, in the above-mentioned MTL formulation, the whole learning procedure is sensitive
to the proper selection of a loss function weight for each task, which generally requires a
computationally demanding grid search [61]. Thus, MTL methods that
can effectively combine multiple tasks without the need for selecting loss weights
while considering the stability-plasticity problem are needed for accurate RS
image representation learning.
To avoid the above-mentioned problems, we propose, for the first time, a novel
PLAsticity-STAbility preserving Multi-Task Learning (PLASTA-MTL) approach. The
PLASTA-MTL approach aims to preserve: 1) the plasticity for each task; and 2) the stability
in between learning consecutive tasks for the whole learning procedure, independently
of the number and type of tasks. To this end, we introduce novel plasticity
preserving and stability preserving loss functions. The plasticity preserving loss
(PPL) function enforces the global image representation space (which is shared by all
the tasks) to be sensitive to new information learned with each task during training.
This is achieved by minimizing the gradient magnitude differences between the global
image representation and task-specific embedding spaces. The stability preserving
loss (SPL) function protects the image representation space from being radically
disrupted by each task during training. This is achieved by minimizing the angular distances
between task gradients over the global image representation space. To effectively apply
these two loss functions, unlike most of the existing MTL methods, we also
propose a sequential optimization algorithm. The proposed algorithm aims to adaptively
adjust the interactions between task-specific learning procedures, which allows
ensuring the plasticity and stability conditions for all the tasks. To this end, instead of
joint optimization of all loss functions, task-specific objectives together with the PPL
function are sequentially optimized. In this algorithm, the SPL function is optimized
at the end of the task sequence for all the considered tasks.
The novelty of the proposed PLASTA-MTL approach consists in: 1) the adaptive
adjustment of interactions between task-specific learning procedures by the proposed
sequential optimization algorithm; 2) the protection of the image representation space
from radical disruptions caused by each task through the proposed SPL function; and
3) the assurance that the image representation space is sensitive to new information from
each task through the proposed PPL function. Due to the proposed sequential optimization
algorithm, our PLASTA-MTL approach does not need to select loss function weights
for each task. Due to its stability and plasticity preserving capabilities, our
PLASTA-MTL approach overcomes the above-mentioned problems of the joint optimization
algorithm in MTL, which are mainly conflicts between tasks, the dominance of one of the
tasks and the under-performance of tasks compared to STL. It is worth noting that the
proposed PLASTA-MTL approach is independent of the number of considered
tasks and their types. In this chapter, we consider different combinations of four
learning tasks: 1) supervised scene classification; 2) supervised similarity learning;
3) supervised multi-label co-occurrence prediction; and 4) unsupervised similarity
learning. For different combinations of these learning tasks, we conduct experiments
on a single RS application for the sake of simplicity. This application is selected as
content-based image retrieval (CBIR) due to the importance of employing accurate
image features for similarity matching in CBIR.
The rest of this chapter is organized as follows. Section 7.2 provides the related
works. Section 7.3 presents the proposed PLASTA-MTL approach. Section 7.4
describes the considered datasets and the experimental setup. Section 7.5 provides
the experimental results, while Section 7.6 concludes this chapter.
7.2 Related Works
In this section, we initially present the recent advances in single-task driven IRL
methods in RS based on the considered learning tasks and then survey the existing
DL based MTL methods for RS IRL.
7.2.1 Single-Task Driven Methods
In the context of DL based single-task driven IRL, an objective function is usually
selected on the basis of the characteristics of the considered learning task, and thus
image features are automatically learned during the optimization of this objective
function. We categorize the existing methods into four groups based on the tasks that
they utilize and survey example studies in the following.
Scene Classification Driven Methods: The task of scene classification aims at
automatically assigning single labels or multi-labels to image scenes. In [13], land-use
class probabilities obtained by a CNN are exploited for weighting the distance
between a query image and the archive images for CBIR applications, while in [6], the
distance between an image and its land-use class is used to re-rank the
retrieved images. In [9], aggregated deep local features are utilized for query
sensitive CBIR on RS images. To this end, vectors of locally aggregated descriptors
obtained via multiplicative and additive attention mechanisms are used to construct
a memory vector for expanded image description. In [5], a fuzzy distance calculation
based on fuzzy rules is introduced for the definition of RS image similarity, while
image descriptors are extracted from a CNN. In [7], a query-adaptive feature fusion
technique is introduced to employ different hierarchical image representations from
a CNN.
Similarity Learning Driven Methods: DL-based similarity learning aims to
automatically identify image similarity based on an image representation space, where
semantically similar images are located close to each other. In [21], a twin CNN
is introduced for the prediction of pairwise image similarity during the hash code
generation of RS images. In [15], a triplet deep metric learning network (TDMLN)
is introduced for RS image similarity learning. TDMLN utilizes three CNNs with
shared model parameters that allow learning RS image similarity through a triplet loss
function on image triplets, each of which includes an anchor, a positive and a negative
image. TDMLN aims at learning a metric space where the distance between an anchor
and its positive image is minimized and that between the anchor and its negative
image is maximized. In [17], a Siamese graph convolutional network is proposed to
employ region adjacency graph based image descriptors for the characterization of
pairwise image similarity with a contrastive loss function. In [27], RS image similarity
learning based on image triplets is utilized for hash code generation of RS images. In
[16], a distribution consistency loss function is proposed in the context of deep metric
learning to make use of multiple positive and negative images for each anchor image,
unlike the triplet loss function. In [22], a quantized deep learning to hash approach is
introduced for efficient CBIR. In this approach, DNN weights and activation functions
are binarized while pairwise image similarity characterization is used for hash
code generation of RS images. In [24], a generative adversarial network regularization
based deep metric learning method is introduced to model pairwise image similarity,
where a generative adversarial network is used to mitigate the overfitting problem.
In [25], a global optimization algorithm is introduced to jointly employ different
metric learning based loss functions on image representations for the consistency
between the loss reduction direction and the optimization direction. In [26], a weighted
Wasserstein ordinal loss function is proposed for Siamese CNNs to formulate the
image similarity learning problem as an unsupervised deep ordinal classification
problem. In [19], a dual-anchor triplet loss function is introduced to make use of more
than one anchor for each image triplet (which is achieved by considering the positive
image as the second anchor).
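The triplet objective that TDMLN [15] optimizes can be illustrated with a minimal sketch. It is not the implementation of [15]: the Euclidean distance, the hinge form with a `margin` value and the function names are assumptions made only for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on one (anchor, positive, negative) triplet.

    `margin` is an assumed hyper-parameter. The loss is zero once the
    negative is farther from the anchor than the positive by at least
    `margin`, and grows when the positive is the farther of the two.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this quantity pulls the positive image towards the anchor and pushes the negative image away, which is the behaviour described for the learned metric space.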
Image Reconstruction Driven Methods: The DL-based image reconstruction task aims
at automatically reconstructing input images based on unsupervised image
representation learning. In [28], a deep bag-of-words method is introduced. In this method,
a convolutional autoencoder (CAE) is utilized to: i) encode the RS image local areas
into a representation space; and ii) decode local descriptors to the image space. A
reconstruction loss function is employed between an image local area and the CAE output,
while k-means clustering is used with a bag-of-words approach to define the global
image representation. In [29], residual-dyad units (each of which is the combination of a
full pre-activation block and a convolutional shortcut block) are proposed for CAEs to
avoid the diminishing feature reuse problem of conventional residual connections.
Semantic Segmentation Driven Methods: The semantic segmentation task aims to
automatically identify pixel-based class labels, which are associated with RS images.
As an example of such methods, in [30], a fully convolutional network (FCN) is
proposed to characterize local areas of multi-label RS images. The FCN is first trained
to predict land-cover maps of RS images, which are then used to characterize
convolutional descriptors of image local areas. The set of final local descriptors is utilized
for region-based RS image matching. In [79], a graph-theoretic deep representation
learning method is introduced to characterize multi-label co-occurrence relationships
associated with each RS image in an archive. To this end, a CNN is employed for
the automatic prediction of graph-driven region-based image representations with a
region representation learning loss function.
7.2.2 Multi-Task Driven Methods
MTL aims at enhancing the effectiveness of image representation learning and the
prediction accuracy of each task compared to using a separate learning procedure for
each task [205]. To this end, the DL-based MTL problem is formulated as learning the
model parameters of a DNN with respect to multiple loss functions, each of which is
associated with a task. In RS, DL-based MTL has been applied to various applications
(e.g., motion deblurring [53], building damage mapping [54], change detection [55],
road extraction [206], etc.). As an example, in the context of CBIR, a few DL-based MTL
methods have been recently proposed in RS, which combine two tasks: i) scene
classification; and ii) similarity learning. In [56], a wide-context attention network is
introduced to learn the correlation of local descriptors with wide context information
by employing channel dependence-attention and spatial context-attention modules.
In [52], a center-metric learning method, which employs a positive-negative center
loss function for modeling the metric space, is proposed to characterize within-class
variations. In [57], a discriminative distillation network is introduced to increase the
interclass variations and to reduce the intraclass differences. In [207], a deep hashing
CNN is employed for simultaneously generating hash codes and predicting land-use
classes of RS images. All the above-mentioned deep MTL methods in RS utilize a CNN
backbone (which is shared by all tasks) followed by task-specific heads, while image
representation learning is done by jointly optimizing the aggregation of task-specific
loss functions. Although the main problems of this MTL formulation are separately
addressed in the computer vision domain by automatically selecting loss weights with
gradient adjustment strategies (e.g., [205], [182], [208], [183]), these methods are still
based on the joint optimization algorithm.
7.3 Proposed Approach
Let $X = \{x_1, \ldots, x_M\}$ be an archive that includes $M$ images, where $x_i$ is the $i$th RS
image in the archive $X$. Let $\phi: \theta, X \rightarrow \mathbb{R}^{\gamma}$ be any type of DNN that maps the image
$x_i$ to $\gamma$-dimensional image descriptor $\phi(x_i; \theta)$, where $\theta$ is the set of DNN parameters.
Let $T = \{T_1, \ldots, T_N\}$ be a set of $N$ tasks, where the $i$th task $T_i$ is associated with a loss
function $L_{T_i}$. When image representation learning is achieved based on multiple
tasks, the objective function consists of multiple loss functions $\{L_{T_i}\}_{i=1}^{N}$. In this
chapter, MTL is performed by the hard parameter sharing technique [58], which allows
characterizing a global descriptor for each image based on the multiple tasks. In this
way, the considered DNN typically includes an encoder (i.e., a CNN backbone), which
is shared by all the tasks, and task-specific heads, which branch out from the
CNN backbone. Each task-specific head characterizes the task-specific embedding
space based on the characteristics of each task. The CNN backbone models the global
image representation space. Let $G \subset \theta$ be the set of DNN parameters that is used
for defining the global image representation space. $G$ is chosen as the parameters of
the last layer of the CNN backbone shared by all the tasks. Let $E_{T_i} \subset \theta$ be the
set of parameters that is used to construct the task-specific embedding for the $i$th
task $T_i$. Accordingly, after learning DNN parameters $\theta$, $G$ is used to obtain image
representations.
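The hard parameter sharing layout described above can be sketched with a toy model. This is only an illustration under stated assumptions: dense layers stand in for the CNN backbone and the heads, and the parameter values, the task names and the helper names (`linear`, `relu`, `forward`) are hypothetical.

```python
def linear(weights, x):
    """Apply a bias-free dense layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


# Toy parameter set theta (values are arbitrary, for illustration only).
# "G": last shared layer, defining the global image representation space.
# "E": one task-specific head per task, branching out from the shared layer.
theta = {
    "G": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    "E": {
        "classification": [[1.0, -1.0]],
        "similarity": [[0.2, 0.4], [-0.3, 0.9]],
    },
}


def forward(x, params):
    """Return the global descriptor and every task-specific embedding."""
    g = relu(linear(params["G"], x))  # shared by all tasks
    heads = {name: linear(w, g) for name, w in params["E"].items()}
    return g, heads


descriptor, embeddings = forward([1.0, 2.0, 3.0], theta)
```

After training, only `descriptor` (the output of the shared layer) would be kept as the image representation, while the task-specific embeddings serve the individual loss functions.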
In the standard MTL formulation (which is based on the joint optimization algorithm), all
the model parameters $\theta$, including $G$ and $\{E_{T_i}\}_{i=1}^{N}$, are simultaneously updated based
on the gradients of aggregated loss functions ($\nabla_{\theta} \sum_{i} L_{T_i}$). This MTL formulation
offers limited control over the learning process of each task and thus over the plasticity
and stability conditions of the whole learning procedure. This leads to the problems
discussed in the first section of this chapter. To avoid these problems by preserving
the plasticity and stability capabilities for all the considered tasks, the proposed
PLASTA-MTL approach is characterized by two novel loss functions and a novel
optimization algorithm. By the proposed plasticity preserving loss (PPL) function,
the PLASTA-MTL approach minimizes the gradient magnitude differences between
global image representation space and task-specific embedding spaces for the sensi-
tivity of the global image representation space to new information learned via each
task. By the proposed stability preserving loss (SPL) function, the PLASTA-MTL
approach minimizes angular distances between task gradients over global image
representation space to protect it from radical disruptions by each task. To accurately
apply these loss functions, the proposed optimization algorithm sequentially opti-
mizes task-specific objectives together with the PPL function. In our algorithm, the
SPL function is optimized at the end of the task sequence for all the tasks. In the
following sections, we initially explain in detail the proposed PPL and SPL functions
and then introduce the proposed sequential optimization algorithm.
7.3.1 Plasticity Preservation
The proposed PLASTA-MTL approach aims to control the level of plasticity for each
task in the context of MTL, and thus to ensure the sensitivity to new information
learned via each task. The level of plasticity for each task is determined by the
extent to which information encoded in the task-specific embedding space is also encoded
in the global image representation space. To this end, we define the plasticity condition
for the $i$th task $T_i$ as how much change occurs in $G$ compared to that in $E_{T_i}$
while learning $T_i$ based on the corresponding loss function $L_{T_i}$. To measure the
change that occurs in $G$ and $E_{T_i}$ for $T_i$, we utilize the gradients of $L_{T_i}$ with respect to
the global image representation and task-specific embedding parameters ($\nabla_G L_{T_i}(\theta)$
and $\nabla_{E_{T_i}} L_{T_i}(\theta)$). Then, the gradient magnitude difference between global image
representation space $G$ and task-specific embedding space $E_{T_i}$ for task $T_i$ represents
the change occurred in $G$ and $E_{T_i}$ as follows: $\|\nabla_G L_{T_i}\| - \|\nabla_{E_{T_i}} L_{T_i}\|$. When this
difference increases throughout the learning procedure, information specific to task
$T_i$ is only encoded by task-specific embedding space. Then, the considered DNN
suffers from the lack of plasticity condition for global image representation space.
Accordingly, to minimize the degree of changes in global image representation space
$G$ and the task-specific embedding space $E_{T_i}$, we define the PPL function $L^{T_i}_{PPL}$ for
the task $T_i$ as follows:

$$L^{T_i}_{PPL} = \left| \frac{\|\nabla_G L_{T_i}(\theta)\|}{\dim(\nabla_G L_{T_i}(\theta))} - \frac{\|\nabla_{E_{T_i}} L_{T_i}(\theta)\|}{\dim(\nabla_{E_{T_i}} L_{T_i}(\theta))} \right|, \tag{7.1}$$

where $\dim$ function gives the dimensions of the gradient vectors that are used to
normalize the gradient magnitude difference. Since each task is associated with a
separate set of task-specific embedding parameters, PPL is defined for each task.
In detail, we define the PPL objective based on the gradients of a task-specific loss
function. It is worth noting that defining loss functions based on the task-specific
gradients is often considered in the framework of MTL (e.g., [61], [208], [209]) to
control the effect of each task on the weight update of a DNN [58].
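As a minimal sketch, Eq. (7.1) can be evaluated on already-computed gradients as follows. It reads the equation as the absolute difference of the dimension-normalized gradient magnitudes, and it assumes that the gradients are flattened into plain lists, that the magnitude is the L2 norm and that `dim` is the number of entries; the names `l2_norm` and `ppl` are hypothetical.

```python
import math


def l2_norm(grad):
    """L2 norm of a flattened gradient (assumed notion of 'magnitude')."""
    return math.sqrt(sum(v * v for v in grad))


def ppl(grad_global, grad_embedding):
    """Plasticity preserving loss for one task, following Eq. (7.1).

    grad_global:    gradient of the task loss w.r.t. the parameters G
    grad_embedding: gradient of the task loss w.r.t. the parameters E_Ti
    Each magnitude is divided by the number of gradient entries so that
    parameter sets of different sizes are comparable.
    """
    scaled_global = l2_norm(grad_global) / len(grad_global)
    scaled_embedding = l2_norm(grad_embedding) / len(grad_embedding)
    return abs(scaled_global - scaled_embedding)
```

The value is zero when both parameter sets receive gradients of the same normalized scale, and it grows when the task updates its own head much more strongly (or much more weakly) than the shared representation.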
Due to our PPL function, the proposed PLASTA-MTL approach keeps the gradient
magnitudes of $G$ and $E_{T_i}$ on the same scale while modelling the task $T_i$. This leads
the task-specific information to be characterized in both the global image representation
space and the task-specific embedding space. Thus, the global image representation
space (which is shared by all the tasks) is enforced to be sensitive to new information
learned with each task during training. Accordingly, the proposed PLASTA-MTL
approach prevents the considered DNN from lacking plasticity for
each considered task. It is worth noting that when the joint optimization algorithm is
employed on the aggregation of all task-specific loss functions, applying our
PPL function to all tasks can increase the complexity of the whole learning procedure.
In this case, the gradient magnitude of $G$ is forced to simultaneously have the same
scale as that of $E_{T_i}$ for each $i \in \{1, \ldots, N\}$, which can exacerbate confusion in the
whole learning procedure.
7.3.2 Stability Preservation
The proposed PLASTA-MTL approach aims to adjust the level of stability in between
consecutive tasks in the context of MTL, and thus to protect the whole learning
procedure from radical disruptions while learning multiple tasks. The level of stability
in between learning different tasks is characterized by the degree of change (which
occurs in the global image representation space) due to a new task with respect to
that of previous tasks. Accordingly, the level of stability for all the tasks
$\{T_1, \ldots, T_N\}$ can be defined as how much change occurs in $G$ in between learning
consecutive tasks based on their corresponding loss functions $\{L_{T_1}, \ldots, L_{T_N}\}$.
To this end, we define the relative change in $G$ between learning two consecutive
tasks $T_i$ and $T_{i+1}$ as the angular distance between the gradients of the associated
loss functions $\nabla_G L_{T_i}(\theta)$ and $\nabla_G L_{T_{i+1}}(\theta)$. If this angular distance between two
gradient vectors (which are associated with two consecutive tasks) becomes extremely high
throughout the learning procedure, the gradient of the latter task enforces the global
image representation to change in a very different direction compared to the former
task. In this way, the latter task radically changes the global image representation
space. This may lead to a lack of stability for the considered learning procedure.
Accordingly, to minimize the angular distances between the gradients of each pair of
consecutive tasks, we define the SPL function as follows:
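
$$L_{SPL} = \frac{1}{N-1} \sum_{i=1}^{N-1} \arccos\left( \frac{\nabla_G L_{T_i}(\theta) \cdot \nabla_G L_{T_{i+1}}(\theta)}{\|\nabla_G L_{T_i}(\theta)\| \, \|\nabla_G L_{T_{i+1}}(\theta)\|} \right), \tag{7.2}$$

where $\arccos\left(\frac{a \cdot b}{\|a\| \|b\|}\right)$ measures the angle between the vectors $a$ and $b$. To ensure the
stability condition for all the tasks $\{T_1, \ldots, T_N\}$, the proposed SPL function considers
the angular distances between all consecutive pairs in the task sequence.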
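A minimal sketch of Eq. (7.2) on already-computed task gradients is given below. It assumes that each gradient with respect to $G$ is flattened into a plain list and is non-zero; the clamping step and the names `angle` and `spl` are additions made for this illustration and are not part of the original formulation.

```python
import math


def angle(a, b):
    """Angle (in radians) between two non-zero vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding errors cannot leave the arccos domain.
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)


def spl(task_grads):
    """Stability preserving loss following Eq. (7.2).

    task_grads: list of N gradients w.r.t. G, ordered as in the task
    sequence. The result is the mean angle over the N-1 consecutive pairs.
    """
    n = len(task_grads)
    if n < 2:
        raise ValueError("at least two tasks are needed")
    total = sum(angle(task_grads[i], task_grads[i + 1]) for i in range(n - 1))
    return total / (n - 1)
```

Gradients that point in the same direction give a value of zero, whereas orthogonal or opposing task gradients increase the loss, which matches the interpretation of large angular distances as radical changes of the global representation space.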
Owing to the SPL function, the proposed PLASTA-MTL approach keeps the angular distances between the gradients of different tasks small while learning all the tasks $\{T_1, \ldots, T_N\}$. Thus, the directions of the task gradients over the global image representation space are forced to remain stable throughout the whole learning procedure. This prevents radical changes in the global image representation space due to learning any task. Accordingly, the proposed PLASTA-MTL approach prevents the considered DNN from lacking stability for any of the tasks. We would like to point out that if the conventional optimization algorithm of MTL is applied, all loss functions are optimized simultaneously. In this way, there is a single change in $G$ based on the gradient of the aggregated loss functions of all tasks, and it is hard to model the relative changes in $G$ with respect to different tasks.
7.3.3 Sequential Optimization Algorithm
For the whole learning procedure, the proposed sequential optimization algorithm aims to adaptively adjust the interactions between task-specific learning procedures, and thus allows the proposed PLASTA-MTL approach to ensure the plasticity and stability conditions for all the tasks. As in most DL-based MTL methods, the parameters of the considered DNN for the tasks $\{T_i\}_{i=1}^{N}$ can be learned based on the following empirical risk minimization formulation:

$$\min_{\theta} \sum_{i=1}^{N} \lambda_i \mathcal{L}_{T_i}(\theta), \tag{7.3}$$

where $\lambda_i$ is the weight parameter of the task $T_i$. In this formulation, for a given mini-batch of training images, there is a single optimization procedure, in which all the model parameters are jointly updated to minimize the aggregation of all loss functions.
This formulation limits the control over the plasticity and stability conditions of each task, as explained in the previous sections of this chapter. Unlike the existing MTL methods, the proposed sequential optimization algorithm applies one optimization procedure for each task-specific loss function together with the corresponding PPL function. At the end of the task sequence, this algorithm applies one additional optimization procedure for SPL by considering all the tasks. To this end, we first formulate (7.3) as a multi-level optimization problem as follows:

$$\begin{aligned}
&\min_{G, \theta_{T_N}} \mathcal{L}_{T_N}(G, \theta_{T_N}) \\
&\quad \text{s.t. } G \in \arg\min_{G, \theta_{T_{N-1}}} \mathcal{L}_{T_{N-1}}(G, \theta_{T_{N-1}}) \\
&\qquad \ldots \\
&\qquad \text{s.t. } G \in \arg\min_{G, \theta_{T_1}} \mathcal{L}_{T_1}(G, \theta_{T_1}),
\end{aligned} \tag{7.4}$$

where $\theta_{T_i} \subset \theta$ is the set of task-specific parameters associated with the task $T_i$ (i.e., task-specific head parameters). The reader is referred to [210] for the details of the multi-level optimization formulation. For (7.4), the set of all tasks $T$ is regarded as a sequence $\langle T_i \mid i \in \{1, \ldots, N\} \rangle$. Accordingly, instead of jointly optimizing all the tasks, every task $T_i$ in the sequence is optimized sequentially. In this way, the global image representation space (which is defined by $G$) is always affected by the optimization of the last task in the sequence. This makes it possible to adaptively adjust the interactions between task-specific learning procedures, and thus to integrate the plasticity and stability preserving capabilities of the proposed PLASTA-MTL approach into the whole learning procedure. To this end, for each task, we minimize the corresponding PPL function $\mathcal{L}^{T_i}_{PPL}$ together with the task-specific loss function $\mathcal{L}_{T_i}$ by integrating the multi-objective optimization of the two loss functions into (7.4), as follows:
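$$\begin{aligned}
&\min_{G, \theta_{T_N}} \left(\mathcal{L}_{T_N}(G, \theta_{T_N}),\ \mathcal{L}^{T_N}_{PPL}(\nabla_G \mathcal{L}_{T_N}, \nabla_{E_{T_N}} \mathcal{L}_{T_N})\right) \\
&\quad \text{s.t. } G \in \arg\min_{G, \theta_{T_{N-1}}} \left(\mathcal{L}_{T_{N-1}},\ \mathcal{L}^{T_{N-1}}_{PPL}\right) \\
&\qquad \ldots \\
&\qquad \text{s.t. } G \in \arg\min_{G, \theta_{T_1}} \left(\mathcal{L}_{T_1},\ \mathcal{L}^{T_1}_{PPL}\right).
\end{aligned} \tag{7.5}$$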
It is worth noting that during the optimization of $\mathcal{L}^{T_i}_{PPL}$, $\nabla_{E_{T_i}} \mathcal{L}_{T_i}$ is regarded as constant. Due to this, the global image representation space (which is defined by $G$) is affected by the optimization of the last task in the sequence together with the corresponding PPL function. Since the SPL function $\mathcal{L}_{SPL}$ is applied for all the tasks, it is optimized at the end of the sequence, as follows:
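$$\begin{aligned}
&\min_{G} \mathcal{L}_{SPL}\left(\{\nabla_G \mathcal{L}_{T_i}\}_{i=1}^{N}\right) \\
&\quad \text{s.t. } G \in \arg\min_{G, \theta_{T_N}} \left(\mathcal{L}_{T_N},\ \mathcal{L}^{T_N}_{PPL}\right) \\
&\qquad \text{s.t. } G \in \arg\min_{G, \theta_{T_{N-1}}} \left(\mathcal{L}_{T_{N-1}},\ \mathcal{L}^{T_{N-1}}_{PPL}\right) \\
&\qquad\quad \ldots \\
&\qquad\quad \text{s.t. } G \in \arg\min_{G, \theta_{T_1}} \left(\mathcal{L}_{T_1},\ \mathcal{L}^{T_1}_{PPL}\right),
\end{aligned} \tag{7.6}$$

where $\nabla_G \mathcal{L}_{T_i}$ is stored in each minimization step to be utilized for the optimization of $\mathcal{L}_{SPL}$.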
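The contrast between the joint formulation in (7.3) and the sequential treatment of tasks in (7.4) can be illustrated with a toy example. The sketch below is not the PLASTA-MTL algorithm: it omits the PPL and SPL functions, and it assumes hypothetical quadratic task losses $\frac{1}{2}\lVert G - t_i \rVert^2$ on a shared parameter vector so that the gradients are available in closed form. It only shows that, with sequential updates, the shared parameters are pulled more strongly towards the last task in the sequence.

```python
import numpy as np

def grad_quadratic(params, target):
    """Gradient of the toy loss 0.5 * ||params - target||^2 w.r.t. params."""
    return params - target

def joint_step(params, targets, weights, lr):
    """One joint update on the weighted sum of all task losses, as in (7.3)."""
    total_grad = sum(w * grad_quadratic(params, t)
                     for w, t in zip(weights, targets))
    return params - lr * total_grad

def sequential_step(params, targets, lr):
    """One pass over the task sequence with a separate update per task."""
    for target in targets:
        params = params - lr * grad_quadratic(params, target)
    return params

# Two hypothetical tasks that pull the shared parameters in different directions.
start = np.zeros(2)
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

print(joint_step(start, targets, weights=[1.0, 1.0], lr=0.5))
print(sequential_step(start, targets, lr=0.5))
```

With these values the joint update moves the parameters equally towards both targets, whereas the sequential pass ends closer to the target of the second task, which matches the observation that the global image representation space is always affected by the last task in the sequence.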
It is worth noting that, depending on the selection of tasks, ensuring the stability condition for the considered DNN may decrease the level of the plasticity condition, and vice versa. In this way, the lack of one of the stability and plasticity conditions is associated with an excess of the other. As an example, if some of the considered tasks are in heavy competition during training and one of the tasks can distract the other tasks, there is a lack of stability, which is also due to an excess of plasticity. In this case, increasing the level of stability decreases the level of plasticity, which may in turn lead to a lack of plasticity. Under such conditions, the stability-plasticity constraint of a DNN becomes a dilemma between these two capabilities of the DNN. If this dilemma exists, it can be misleading to address both stability and plasticity capabilities at the same time, since this may lead to an ineffective characterization of one of the conditions. This drawback can be more evident if preserving one of the capabilities is more important than preserving the other. Accordingly, in the proposed PLASTA-MTL approach, we aim to automatically detect which capability should be preserved if only one of them needs to be selected. To this end, we define the importance level of the stability condition for the considered DNN and tasks based on the $L_2$-norm of the gradient of SPL. Accordingly, for a given set of tasks, we define the set of all the loss functions to be considered based on two different levels of importance for $\mathcal{L}_{SPL}$ as follows:
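$$\mathcal{L} := \begin{cases}
\mathcal{L}_{T_1}, \ldots, \mathcal{L}_{T_N}, \mathcal{L}_{SPL}, & \text{if } \lVert \nabla_G \mathcal{L}_{SPL} \rVert_2 \geq \alpha \\
\mathcal{L}_{T_1}, \mathcal{L}^{T_1}_{PPL}, \ldots, \mathcal{L}_{T_N}, \mathcal{L}^{T_N}_{PPL}, & \text{if } \lVert \nabla_G \mathcal{L}_{SPL} \rVert_2 \leq \beta \\
\mathcal{L}_{T_1}, \mathcal{L}^{T_1}_{PPL}, \ldots, \mathcal{L}_{T_N}, \mathcal{L}^{T_N}_{PPL}, \mathcal{L}_{SPL}, & \text{otherwise}
\end{cases} \tag{7.7}$$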
where $\alpha$ and $\beta$ control the importance limits, with $\alpha > \beta$. If the $L_2$-norm of the gradient $\nabla_G \mathcal{L}_{SPL}$ is significantly high (higher than $\alpha$), we assume that there is no need to apply $\mathcal{L}_{PPL}$. The same applies to $\mathcal{L}_{SPL}$ if the $L_2$-norm of the gradient $\nabla_G \mathcal{L}_{SPL}$ is significantly low (lower than $\beta$). If the $L_2$-norm is between $\beta$ and $\alpha$, we regard this interval as the condition in which the stability-plasticity constraint is no longer a dilemma, and thus both capabilities can be preserved in the proposed PLASTA-MTL approach. It is worth noting that since $\nabla_G \mathcal{L}_{SPL}$ depends on the normalized gradients of consecutive task-specific loss functions (see (7.2)), it is mostly affected by which tasks are jointly considered. It is less affected by the considered data set, since the input samples only indirectly change the gradient of the SPL function. The proposed sequential algorithm automatically decides whether to apply PPL, SPL or both loss functions together depending on the parameters $\alpha$ and $\beta$. Accordingly, (7.5) is used to apply only the PPL function, (7.6) is used without $\mathcal{L}^{T_i}_{PPL}$ to apply only the SPL function, and (7.6) is used to apply both loss functions together. In practice, this decision can be made at the end of the first epoch of training based on the parameters $\alpha$ and $\beta$. The
proposed sequential optimization algorithm is summarized in Algorithm 2. To better understand the operations applied in it, Fig. 7.1 illustrates the training of the proposed PLASTA-MTL approach with the proposed optimization algorithm. For simplicity, the forward and backward passes applied in our optimization algorithm are visualized for two tasks. For the first task, while $\nabla_\theta \mathcal{L}_{T_1}$ is propagated back (which is visualized with red arrows in (a)), $\mathcal{L}^{T_1}_{PPL}$ is calculated. Then, the backward pass for $\mathcal{L}^{T_1}_{PPL}$ is applied (which is illustrated with purple arrows in (a)). During the plasticity preservation for the first task, the change over the gradient vector $\nabla_G \mathcal{L}_{T_1}$ is visualized in (b). The same steps are presented for the second task in (c) and (d). After the plasticity preservation is employed for both tasks, $\mathcal{L}_{SPL}$ is calculated (see (e)). At the end, the backward pass for the SPL function is applied (which is visualized with blue arrows). During the stability preservation for both tasks, the changes over the gradient vectors of both tasks are presented in (f).

FIGURE 7.1: An illustration of the training of the proposed plasticity-stability preserving multi-task learning (PLASTA-MTL) approach when two tasks $T_1$ and $T_2$ are considered. Standard and plasticity preservation backward passes for (a) $T_1$ and (c) $T_2$ are shown, while the changes over the gradient vectors (b) $\nabla_G \mathcal{L}_{T_1}$ and (d) $\nabla_G \mathcal{L}_{T_2}$ during the plasticity preservation of these tasks are visualized. (e) The backward pass for the stability preservation of all the tasks is given with (f) the illustration of the changes over their gradient vectors.
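The selection rule in (7.7) can be sketched as a small function. This is a minimal illustration under two assumptions: the norm of the SPL gradient is already available as a scalar (e.g., averaged over the first epoch), and the boundary cases are resolved with non-strict comparisons. The function name `select_losses` and the returned dictionary keys are hypothetical.

```python
def select_losses(spl_grad_norm, alpha, beta):
    """Decide which preservation losses to apply, following the cases of (7.7).

    spl_grad_norm: L2-norm of the gradient of the SPL function.
    alpha, beta:   upper and lower importance limits, with alpha > beta.
    """
    if not alpha > beta:
        raise ValueError("alpha must be larger than beta")
    if spl_grad_norm >= alpha:
        # High norm: stability is the critical capability, so PPL is skipped.
        return {"ppl": False, "spl": True}
    if spl_grad_norm <= beta:
        # Low norm: plasticity is the critical capability, so SPL is skipped.
        return {"ppl": True, "spl": False}
    # In between: both capabilities can be preserved together.
    return {"ppl": True, "spl": True}

# Example norm values with hypothetical limits alpha = 0.3 and beta = 0.1.
for norm in (0.33, 0.04, 0.13):
    print(norm, select_losses(norm, alpha=0.3, beta=0.1))
```

Returning flags rather than a list of loss objects keeps the sketch independent of any particular deep learning framework.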
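Algorithm 2 The proposed sequential optimization algorithm to train the proposed PLASTA-MTL approach
Require: Mini-batch $B \subset X$, set of tasks $T = \{T_1, \ldots, T_N\}$, set of model parameters $\theta$, $\alpha$, $\beta$
1: for $i \leftarrow 1$ to $N$ do
2:   Compute $\mathcal{L}_{T_i}(\theta)$
3:   Compute $\nabla_\theta \mathcal{L}_{T_i}(\theta)$, $\nabla_G \mathcal{L}_{T_i}(\theta)$ and $\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta)$
4:   Compute $\mathcal{L}^{T_i}_{PPL} = \left| \frac{\lVert \nabla_G \mathcal{L}_{T_i}(\theta) \rVert}{\dim(\nabla_G \mathcal{L}_{T_i}(\theta))} - \frac{\lVert \nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta) \rVert}{\dim(\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta))} \right|$
5:   Compute $\nabla_G \mathcal{L}^{T_i}_{PPL}$
6:   Update $\theta$ using $\nabla_\theta \mathcal{L}_{T_i}(\theta)$
7:   if $\lVert \nabla_G \mathcal{L}_{SPL} \rVert_2 < \alpha$ then
8:     Update $G$ using $\nabla_G \mathcal{L}^{T_i}_{PPL}$
9:   end if
10: end for
11: Compute $\mathcal{L}_{SPL} = \frac{1}{N-1} \sum_{i=1}^{N-1} \arccos\left(\frac{\nabla_G \mathcal{L}_{T_i}(\theta) \cdot \nabla_G \mathcal{L}_{T_{i+1}}(\theta)}{\lVert \nabla_G \mathcal{L}_{T_i}(\theta) \rVert \, \lVert \nabla_G \mathcal{L}_{T_{i+1}}(\theta) \rVert}\right)$
12: Compute $\nabla_G \mathcal{L}_{SPL}$
13: if $\lVert \nabla_G \mathcal{L}_{SPL} \rVert_2 > \beta$ then
14:   Update $G$ using $\nabla_G \mathcal{L}_{SPL}$
15: end if

Since the proposed algorithm applies a task-specific optimization procedure for each task, unlike the joint optimization algorithm, the PLASTA-MTL approach is capable of effectively preserving the plasticity and stability capabilities for each task in the context of MTL. We would like to point out that this algorithm does not require the selection of any loss function weights (which generally requires a computationally demanding grid search in the joint optimization algorithm). It is also worth noting that the proposed algorithm works independently of the number and the type of the considered tasks.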
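The PPL value computed inside the loop of Algorithm 2 compares the size-normalized gradient norm of the shared parameters with that of the task-specific parameters. A minimal sketch is given below. The type of norm cannot be recovered from the listing, so it is exposed as the parameter `norm_ord`, and its default value of 1 is an assumption rather than the choice made in the thesis. The function name `ppl_value` is also illustrative.

```python
import numpy as np

def ppl_value(grad_shared, grad_head, norm_ord=1):
    """Absolute difference between two size-normalized gradient norms.

    grad_shared: gradient of the task loss w.r.t. the shared parameters G.
    grad_head:   gradient of the task loss w.r.t. the task-specific
                 parameters E_Ti.
    norm_ord:    order of the vector norm (assumed, not taken from the thesis).
    """
    g = np.ravel(np.asarray(grad_shared, dtype=float))
    e = np.ravel(np.asarray(grad_head, dtype=float))
    shared_term = np.linalg.norm(g, norm_ord) / g.size
    head_term = np.linalg.norm(e, norm_ord) / e.size
    return float(abs(shared_term - head_term))

# Hypothetical gradients with different numbers of entries.
print(ppl_value([3.0, -4.0], [1.0, 1.0, 1.0, 1.0]))               # L1 norm
print(ppl_value([3.0, -4.0], [1.0, 1.0, 1.0, 1.0], norm_ord=2))   # L2 norm
```

Dividing by the number of entries makes the two terms comparable even though the backbone and the task heads have very different parameter counts.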
7.4 Dataset Description and Experimental Design
7.4.1 Dataset Description
The experiments were performed on the DLRSD [203] and the BigEarthNet-S2 benchmark archives. The DLRSD archive includes the same images as the UC Merced archive [84], which consists of 2,100 aerial images, each of which has a size of 256 × 256 pixels with a spatial resolution of 30 cm. In the DLRSD archive, the images are associated with multi-labels and pixel-based labels, where the set of class labels is defined in [93]. We utilized the Serbia subset of the BigEarthNet-S2 benchmark archive, in which the images were acquired during the summer season. This subset includes 14,832 Sentinel-2 images, each of which is a section of: 1) 120 × 120 pixels for 10 m bands; 2) 60 × 60 pixels for 20 m bands; and 3) 20 × 20 pixels for 60 m bands. For the experiments, we applied bicubic interpolation to the 20 m bands, excluded the 60 m bands, and utilized the 19-class nomenclature of BigEarthNet-S2. For the tasks that require the availability of land-cover maps, we extracted the CLC land-cover map of each image.
To perform the experiments, we divided the DLRSD and the BigEarthNet-S2 archives into training, validation and test sets with the ratios of 70%, 10%, 20% and 52%, 24%, 24%, respectively. To apply CBIR, the training set of the DLRSD archive and the validation set of the BigEarthNet-S2 archive were used for selecting query images, while images were retrieved from the test set for both archives.
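The split ratios above translate into set sizes as sketched below. The helper `split_sizes` is hypothetical, and its rounding rule (integer division, with the remainder assigned to the last set) is an assumption, since the text does not state how fractional counts were handled.

```python
def split_sizes(n_images, percentages):
    """Split n_images into consecutive sets according to integer percentages.

    All sets except the last use floor division; the last set receives the
    remaining images so that the sizes always add up to n_images.
    """
    if sum(percentages) != 100:
        raise ValueError("percentages must add up to 100")
    sizes = [n_images * p // 100 for p in percentages[:-1]]
    sizes.append(n_images - sum(sizes))
    return sizes

# DLRSD: 2,100 images divided into training, validation and test sets.
print(split_sizes(2100, (70, 10, 20)))
```

For the DLRSD archive the ratios divide the 2,100 images exactly into 1,470 training, 210 validation and 420 test images. The BigEarthNet-S2 ratios do not divide 14,832 images evenly, so the exact set sizes there depend on the rounding rule actually used.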
7.4.2 Experimental Design
In the experiments, we utilized the DenseNet-121 CNN architecture [148] as the MTL backbone shared by all the tasks. We utilized different combinations of four tasks: 1) supervised scene classification; 2) supervised similarity learning; 3) supervised multi-label co-occurrence prediction; and 4) unsupervised similarity learning. For each task, we added a task-specific head to the CNN backbone. Each task head includes a fully connected (FC) layer that: i) takes the global image representation from the CNN backbone; and ii) produces a 64-dimensional task-specific embedding. The supervised scene classification task (denoted as $T_1$) aims to automatically assign multi-labels to image scenes. To this end, the task head of $T_1$ also includes a classification layer that produces multi-label class probabilities. For this task, the cross-entropy loss function is selected as the task-specific loss function $\mathcal{L}_{T_1}$. For the details of this task, the reader is referred to Chapter 3. The supervised similarity learning task (denoted as $T_2$) aims to automatically identify image similarities. To this end, we selected a triplet loss function as the task-specific loss function $\mathcal{L}_{T_2}$. The triplet loss function directly operates on the task-specific embeddings and requires the availability of image triplets (each of which includes anchor, positive and negative images). For this task, image triplets are selected by using a hard triplet sampling technique based on multi-label similarities. The reader is referred to Chapter 4 for the details of the triplet loss function and the triplet sampling techniques. The supervised multi-label co-occurrence prediction task (denoted as $T_3$) aims to predict the co-occurrence relationships of multiple classes present in an image. To this end, by following the method presented in [79], the task head of $T_3$ also includes an FC layer that takes the task-specific embeddings and produces the prediction for graph-driven region-based image representations. For this task, the region representation learning loss function [79] is selected as the task-specific loss function $\mathcal{L}_{T_3}$. It minimizes the prediction error of the task-specific head with respect to the image graphs, which are obtained based on the image land-cover maps. The unsupervised similarity learning task (denoted as $T_4$) aims at learning image representations by maximizing the similarity between different views of the same image without relying on any ground truth information. To this end, by following the self-supervised contrastive learning strategy presented in [211], we used a set of data augmentation techniques to generate different views of each training image. Then, a contrastive loss function is selected as the task-specific loss function $\mathcal{L}_{T_4}$, which operates on the task-specific embeddings of two different augmented views of each image. It maximizes the similarity between the augmented views of an image relative to the rest of the images. The reader is referred to [211] for the details of the contrastive loss and the set of data augmentation techniques applied to generate different views of images.
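To make the role of the task-specific embeddings in $T_2$ more concrete, the sketch below shows a commonly used margin-based form of the triplet loss on 64-dimensional embeddings. It is an illustration only: the exact formulation, the margin value and the hard triplet sampling used in the thesis are described in Chapter 4 and may differ. The function name `triplet_loss`, the squared Euclidean distance and `margin=1.0` are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean margin-based triplet loss over a batch of embeddings.

    Each argument has shape (batch_size, embedding_dim). The loss is zero for
    a triplet when the negative is farther from the anchor than the positive
    by at least the margin (in squared Euclidean distance).
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    dist_pos = np.sum((anchor - positive) ** 2, axis=1)
    dist_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(dist_pos - dist_neg + margin, 0.0)))

# Hypothetical batch of eight triplets with 64-dimensional embeddings.
rng = np.random.default_rng(0)
anchor, positive, negative = (rng.normal(size=(8, 64)) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```

Because the loss acts only on distances between embeddings, it shapes the global image representation through the shared backbone without requiring a classification layer.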
We trained the proposed PLASTA-MTL approach for 100 epochs. For training, we utilized the Adam variant of stochastic gradient descent with an initial learning rate of $10^{-3}$. All the experiments were performed on 4 NVIDIA Tesla V100 GPUs. After training with the above-mentioned tasks in the context of MTL, we extracted the features of query and archive images from the last layer of the CNN backbone. To apply CBIR, we performed similarity matching of the extracted image features based on the $\chi^2$-distance measure. CBIR results are provided in terms of two evaluation metrics: 1) normalized discounted cumulative gains (NDCG); and 2) mean average precision (mAP).
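The retrieval and evaluation steps described above can be sketched as follows. The snippet uses one common form of the $\chi^2$-distance for non-negative feature vectors and textbook definitions of average precision and NDCG. The exact variants used in the thesis, in particular how relevance is derived from multi-labels, are not specified in this section, so all function names and definitions here are assumptions. The mAP score is the mean of the average precision values over all query images.

```python
import numpy as np

def chi2_distance(x, y, eps=1e-10):
    """Chi-square distance between two non-negative feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - y) ** 2 / (x + y + eps)))

def rank_archive(query, archive):
    """Indices of archive images sorted from most to least similar."""
    distances = np.array([chi2_distance(query, feature) for feature in archive])
    return np.argsort(distances, kind="stable")

def average_precision(relevance):
    """Average precision for a ranked list of binary relevance flags."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(precision_at_k * relevance) / relevance.sum())

def ndcg(gains, k):
    """Normalized discounted cumulative gain over the top-k ranked images."""
    gains = np.asarray(gains, dtype=float)
    top = gains[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))
    dcg = np.sum(top * discounts)
    ideal = np.sort(gains)[::-1][:k]
    idcg = np.sum(ideal * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

# Hypothetical query and archive features.
archive = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(rank_archive(np.array([1.0, 0.0]), archive))
print(average_precision([1, 0, 1]), ndcg([1, 0, 1], k=3))
```

The small constant `eps` only protects against division by zero when both vectors have a zero entry at the same position.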
We carried out various experiments to: 1) perform a sensitivity analysis of the
proposed PLASTA-MTL approach; and 2) compare our approach with state-of-the-
art MTL methods in the context of CBIR. For the sensitivity analysis, we assessed: i)
the effectiveness of the selection of plasticity and stability preserving capabilities; ii)
the effect of task sequence order on the proposed sequential optimization algorithm;
iii) computational complexity of the PLASTA-MTL approach; and iv) the comparison
of utilizing multiple tasks in our approach with separately employing each task (that
is based on single-task learning (STL)).
We compared the proposed approach with: 1) conventional multi-task learning (equal
weighting); 2) multi-task learning using uncertainty to weigh losses (uncertainty
weighting) [205]; 3) projecting conflicting gradients (PCGrad) [182]; 4) gradient
normalization for adaptive loss balancing in deep multitask networks (GradNorm)
[208]; and 5) dynamic weight average (DWA) [183]. For all the methods, we used the
same CNN backbone and task-specific heads with our approach. For the first method,
we applied joint optimization on the summation of task-specific loss functions with
equal weights. For the other four methods, we used the same method-specific
parameters given in [205], [182], [208] and [183].
7.5 Experimental Results
We performed different kinds of experiments in order to: 1) carry out a sensitivity
analysis; and 2) compare the effectiveness of the proposed PLASTA-MTL approach
with the state-of-the-art MTL methods in the framework of CBIR.
7.5.1 Sensitivity Analysis of the Proposed Approach
In this sub-section, we performed the sensitivity analysis of the proposed PLASTA-MTL approach in terms of: i) the effectiveness of the automatic selection of plasticity and stability preserving capabilities; ii) the task sequence order utilized in our approach; iii) the computational complexity; and iv) the comparison with single-task learning.
In the first set of trials, we analyzed the effectiveness of automatically selecting which of the plasticity and stability capabilities are preserved in the proposed PLASTA-MTL approach. Table 7.1 shows the mAP scores for the DLRSD archive when different combinations of the tasks $\{T_1, T_2, T_3, T_4\}$ are utilized with different combinations of plasticity and stability preserving capabilities in the PLASTA-MTL approach.

TABLE 7.1: MEAN AVERAGE PRECISION (MAP) SCORES OBTAINED WHEN DIFFERENT COMBINATIONS OF TASKS ARE UTILIZED WITH DIFFERENT CAPABILITIES OF THE PLASTA-MTL APPROACH (THE DLRSD ARCHIVE)
Tasks               $\lVert \nabla_G \mathcal{L}_{SPL} \rVert$   mAP (%) with $\mathcal{L}_{PPL}$ only   mAP (%) with $\mathcal{L}_{SPL}$ only   mAP (%) with $\mathcal{L}_{PPL}$ and $\mathcal{L}_{SPL}$
T1, T2              0.33    95.0    95.7    94.0
T1, T3              0.43    96.0    97.6    96.7
T1, T4              0.13    95.2    91.0    96.0
T2, T3              0.18    94.8    95.2    95.5
T2, T4              0.09    86.1    84.5    85.4
T3, T4              0.09    95.4    94.8    93.8
T1, T2, T3          0.12    96.5    96.3    97.2
T1, T2, T4          0.04    96.7    94.7    94.8
T1, T3, T4          0.05    97.0    94.5    96.8
T2, T3, T4          0.06    95.5    93.4    95.2
T1, T2, T3, T4      0.13    97.5    97.0    97.6

By assessing the table, one can observe that the selection of which capabilities are preserved in our PLASTA-MTL approach is one of the most important factors affecting the overall CBIR performance. This becomes more evident under two scenarios. First, if some of the considered tasks are in competition during training, preserving both capabilities at the same time leads to an ineffective characterization of either the stability or the plasticity condition. This results in lower mAP scores compared to preserving only one of the capabilities. As an example, when the considered tasks include $T_1$ and $T_2$, employing only PPL or only SPL leads to 1% and 1.7% higher mAP scores, respectively, compared to utilizing both loss functions together in the proposed PLASTA-MTL approach. This is due to the fact that learning the task $T_1$
(which is supervised scene classification) forces the inter-class distances in the global image representation space to be maximized, while learning the task $T_2$ (which is supervised similarity learning) forces the intra-class distances to be minimized. These learning characteristics can easily result in competition between the two tasks. However, when the considered tasks include $T_2$ and $T_3$ (which are not in competition during training), preserving both capabilities further improves the CBIR performance. Second, when the number of considered tasks decreases, the effect of selecting one of the plasticity and stability preserving capabilities on the mAP scores increases. As an example, when the considered tasks include only $T_1$ and $T_4$, the difference between the mAP scores obtained by preserving only plasticity and only stability is more than 4%. However, when all the tasks $T_1$, $T_2$, $T_3$ and $T_4$ are considered, this difference is less than 1%. These two scenarios show that the accurate selection of which capabilities are preserved in our PLASTA-MTL approach is crucial for accurate CBIR performance. The proposed sequential optimization strategy automatically detects which capabilities should be preserved by controlling the importance level of the stability condition, which is defined based on the $L_2$-norm of the gradient of the SPL function. Table 7.1 also includes the average gradient norm values, which are obtained in the first epoch of training. By analyzing the table, one can observe that when the norm value is significantly high (e.g., $T = \{T_1, T_2\}$ and $T = \{T_1, T_3\}$), preserving only the stability capability in the PLASTA-MTL approach provides the highest mAP scores. When the norm value is significantly low (e.g., $T = \{T_1, T_2, T_4\}$ and $T = \{T_1, T_3, T_4\}$), preserving only the plasticity capability in the PLASTA-MTL approach provides the highest mAP scores. This shows the effectiveness of the automatic detection strategy of the proposed sequential optimization algorithm, which is utilized to identify which capabilities are preserved in our PLASTA-MTL approach. The average gradient norm values given in Table 7.1 show that the two importance levels of the stability condition can be defined as $\alpha = 0.3$ and $\beta = 0.1$. Accordingly, we used these parameters in the proposed sequential optimization algorithm for the rest of the experiments.

TABLE 7.2: MEAN AVERAGE PRECISION (MAP) SCORES WHEN THE TASKS $T_1$, $T_2$ AND $T_3$ ARE UTILIZED IN DIFFERENT ORDERS FOR THE PLASTA-MTL APPROACH (THE DLRSD ARCHIVE)
Task Order                                    mAP (%)
$T_1 \rightarrow T_2 \rightarrow T_3$         97.2
$T_1 \rightarrow T_3 \rightarrow T_2$         97.0
$T_2 \rightarrow T_1 \rightarrow T_3$         97.1
$T_2 \rightarrow T_3 \rightarrow T_1$         96.8
$T_3 \rightarrow T_1 \rightarrow T_2$         97.7
$T_3 \rightarrow T_2 \rightarrow T_1$         97.5
In the second set of trials, we analyzed the effect of the task sequence order utilized in the proposed PLASTA-MTL approach. Table 7.2 shows the mAP scores for the DLRSD archive when the tasks $\{T_1, T_2, T_3\}$ are utilized with all the possible orders in the task sequence of our approach. By analyzing the table, one can see that when the order of the considered tasks is changed, the proposed PLASTA-MTL approach provides different mAP scores. This is because all the tasks are learned sequentially in the proposed optimization algorithm, and thus different task sequence orders lead to changes in the whole learning procedure. However, from the table, one can also observe that the differences between the mAP scores of different task orders are small. The difference between the highest mAP score (which is obtained with the task order of $T_3 \rightarrow T_1 \rightarrow T_2$) and the lowest mAP score (which is obtained with the task order of $T_2 \rightarrow T_3 \rightarrow T_1$) is less than 1%. Figure 7.2 shows the NDCG scores of the same tasks and their orders for the DLRSD archive under different numbers of retrieved images. From the figure, one can see that increasing the number of retrieved images does not change our conclusion. These results show that utilizing different task orders does not significantly affect the CBIR performance of the proposed PLASTA-MTL approach. For the rest of the experiments, we employed the numerical order of tasks (i.e., $T_1 \rightarrow T_2 \rightarrow T_3 \rightarrow T_4$) for the proposed PLASTA-MTL approach.

FIGURE 7.2: Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when the tasks $T_1$, $T_2$ and $T_3$ are utilized in different orders for the PLASTA-MTL approach.
In the third set of trials, we assessed the computational complexity of the proposed PLASTA-MTL approach. To this end, in Table 7.3, we compared our approach with the equal weighting method in terms of the training time required per epoch when different combinations of the tasks $\{T_1, T_2, T_3, T_4\}$ are utilized on the DLRSD archive. It is worth noting that the equal weighting method jointly optimizes all the loss functions without any additional steps that may increase the computational complexity. Accordingly, this method can be regarded as one of the MTL methods associated with the lowest computational complexity. By assessing the table, one can observe that our approach requires a higher training time per epoch than the equal weighting method for each task combination. This is because the sequential optimization applied in the proposed PLASTA-MTL approach requires a higher number of forward and backward passes of the considered DNN than the joint optimization algorithm. This becomes more evident if the same batches of training images are used for all the tasks (e.g., $T = \{T_1, T_2\}$). In this condition, the equal weighting method requires one forward pass and one backward pass for each batch, while our approach requires at least two forward and backward passes depending on the number of tasks. When some of the considered tasks require different batches of training images, which leads to more than one forward pass, the computational complexity of the equal weighting method increases. However, this does not affect the computational complexity of our proposed approach. As an example, when the tasks $\{T_1, T_2\}$ are utilized, the training time per epoch of our approach is almost twice that of the equal weighting method. However, when the tasks $\{T_1, T_4\}$ are utilized, the task $T_4$ requires feeding the augmented views of images into the considered DNN, which costs an additional forward pass. In this case, the required training time per epoch of the proposed PLASTA-MTL approach is almost the same as that of the equal weighting method. It is worth noting that the overall computational complexity is affected by the total number of epochs in addition to the training time per epoch. Accordingly, Figure 7.3 shows the minimum numbers of training epochs at which the proposed PLASTA-MTL approach and the equal weighting method reach a range of mAP scores when different numbers of tasks are considered. By analyzing the figure, one can see that our approach is able to achieve the same mAP scores with fewer training epochs than the equal weighting method. As an example, when the tasks $\{T_1, T_2, T_3\}$ are considered, our approach achieves a 93% mAP score with 25 fewer training epochs than the equal weighting method. This leads to less total training time for our approach, although the corresponding training time per epoch is higher than that of the equal weighting method. This becomes more evident when the number of considered tasks increases. As an example, when all the tasks are utilized, the total training time of our approach to reach a 93% mAP score is significantly less than that of the equal weighting method. These results show that the learning efficiency of the proposed PLASTA-MTL approach is significantly higher than that of the equal weighting method. This leads to the reduction of the total training time (which is required to reach a high CBIR performance) for the proposed PLASTA-MTL approach.

TABLE 7.3: TRAINING TIMES PER EPOCH ON THE DLRSD ARCHIVE WHEN THE DIFFERENT COMBINATIONS OF TASKS ARE UTILIZED FOR THE PROPOSED PLASTA-MTL APPROACH AND EQUAL WEIGHTING.
Tasks               Training Time per Epoch (sec)
                    Equal Weighting    PLASTA-MTL
T1, T2              9.3                18.0
T1, T3              18.3               24.5
T1, T4              57.6               62.9
T2, T3              17.7               24.6
T2, T4              60.2               64.9
T3, T4              66.4               70.9
T1, T2, T3          15.7               32.6
T1, T2, T4          57.9               73.6
T1, T3, T4          69.1               77.7
T2, T3, T4          67.8               77.4
T1, T2, T3, T4      64.5               88.4

FIGURE 7.3: Mean Average Precision (mAP) versus the minimum number of training epochs for the DLRSD archive when the tasks: (a) $T_2$ and $T_3$; (b) $T_1$, $T_2$ and $T_3$; and (c) $T_1$, $T_2$, $T_3$ and $T_4$ are utilized for the PLASTA-MTL approach and the equal weighting method.
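The trade-off between training time per epoch and number of epochs can be made explicit with a short calculation. The sketch below takes the per-epoch times reported for all four tasks in Table 7.3 (64.5 s for equal weighting and 88.4 s for PLASTA-MTL), whereas the epoch counts are purely hypothetical placeholders, since the actual values can only be read from Figure 7.3. The function names are illustrative.

```python
def total_training_time(num_epochs, seconds_per_epoch):
    """Total training time in seconds."""
    return num_epochs * seconds_per_epoch

def break_even_epoch_ratio(fast_seconds_per_epoch, slow_seconds_per_epoch):
    """Largest epoch ratio (slow method / fast method) at which the method
    with the slower epochs is not more expensive overall."""
    return fast_seconds_per_epoch / slow_seconds_per_epoch

# Per-epoch times for all four tasks (Table 7.3).
equal_weighting_time, plasta_time = 64.5, 88.4

# Hypothetical epoch counts needed to reach the same mAP score.
equal_weighting_epochs, plasta_epochs = 80, 50

print(total_training_time(equal_weighting_epochs, equal_weighting_time))
print(total_training_time(plasta_epochs, plasta_time))
print(break_even_epoch_ratio(equal_weighting_time, plasta_time))
```

Under these assumptions, the approach with the higher per-epoch cost is cheaper overall whenever it needs fewer than about 73% of the epochs required by equal weighting.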
In the fourth set of trials, we analyzed the effectiveness of the proposed PLASTA-MTL approach compared to separately employing each task of the considered task set (i.e., single-task learning (STL)). For the DLRSD archive, Table 7.4 shows the mAP scores of the PLASTA-MTL approach for the different combinations of the tasks $\{T_1, T_2, T_3, T_4\}$ and the STL for each task. By analyzing the table, one can observe that, for each combination, our approach provides higher mAP scores compared to separately learning each task. As an example, when the tasks $\{T_1, T_2, T_4\}$ are considered, the proposed PLASTA-MTL approach provides almost 2%, 15% and 14% higher mAP scores compared to applying separate learning procedures for $T_1$, $T_2$ and $T_4$, respectively. This shows that our approach effectively combines multiple tasks, which leads to more accurate image representation learning compared to utilizing a single task.

TABLE 7.4: MEAN AVERAGE PRECISION (MAP) SCORES WHEN THE DIFFERENT COMBINATIONS OF TASKS ARE UTILIZED IN THE PLASTA-MTL APPROACH COMPARED TO SINGLE TASK LEARNING (THE DLRSD ARCHIVE)
Method        Tasks               mAP (%)
STL           T1                  94.9
STL           T2                  81.8
STL           T3                  95.4
STL           T4                  83.2
PLASTA-MTL    T1, T2              95.7
PLASTA-MTL    T1, T3              97.6
PLASTA-MTL    T1, T4              96.0
PLASTA-MTL    T2, T3              95.5
PLASTA-MTL    T2, T4              86.1
PLASTA-MTL    T3, T4              95.4
PLASTA-MTL    T1, T2, T3          97.2
PLASTA-MTL    T1, T2, T4          96.7
PLASTA-MTL    T1, T3, T4          97.0
PLASTA-MTL    T2, T3, T4          95.5
PLASTA-MTL    T1, T2, T3, T4      97.6

7.5.2 Comparison with Existing Methods
In the fifth set of trials, we analyzed the effectiveness of the proposed PLASTA-MTL approach compared to the state-of-the-art MTL methods in the context of CBIR under various combinations of the considered four tasks. These methods are: equal weighting, uncertainty weighting [205], PCGrad [182], GradNorm [208] and DWA [183].

FIGURE 7.4: Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when the tasks: (a) $T_1$ and $T_4$; (b) $T_2$ and $T_3$; (c) $T_1$, $T_2$ and $T_4$; (d) $T_2$, $T_3$ and $T_4$; and (e) $T_1$, $T_2$, $T_3$ and $T_4$ are used in the context of multi-task learning.

Tables 7.5 and 7.6 show the corresponding mAP scores on the DLRSD and the BigEarthNet-S2 archives, respectively. By assessing the tables, one can observe that the proposed PLASTA-MTL approach leads to the highest mAP scores on each task combination for both archives compared to the state-of-the-art MTL methods. As an example, the proposed PLASTA-MTL approach outperforms PCGrad by more than 4% for the DLRSD archive and more than 8% for the BigEarthNet-S2 archive when the tasks $\{T_2, T_3, T_4\}$ are utilized. When all the tasks $\{T_1, T_2, T_3, T_4\}$ are used, our approach provides almost 3% higher mAP scores for both archives compared to GradNorm. We observed similar behaviours while comparing
Chapter 7. Plasticity-Stability Preserving Multi-Task IRL in Remote Sensing 123
TABLE 7.5: MEAN AVERAGE PRECISION (MAP) SCORES ASSOCIATED TO THE DIFFERENT
COMBINATIONS OF TASKS (THE DLRSD ARCHIVE)
Tasks
Method mAP (%)
T1T2T3T4
Equal Weighting 90.1
Uncertainty Weighting [205] 94.4
PCGrad [182] 92.7
GradNorm [208] 94.3
DWA [183] 93.0
PLASTA-MTL 96.0
Equal Weighting 93.2
Uncertainty Weighting [205] 94.0
PCGrad [182] 92.6
GradNorm [208] 92.9
DWA [183] 92.5
PLASTA-MTL 95.5
Equal Weighting 91.6
Uncertainty Weighting [205] 95.4
PCGrad [182] 92.9
GradNorm [208] 93.8
DWA [183] 91.4
PLASTA-MTL 96.7
Equal Weighting 92.0
Uncertainty Weighting [205] 95.0
PCGrad [182] 91.2
GradNorm [208] 91.4
DWA [183] 90.9
PLASTA-MTL 95.5
Equal Weighting 92.6
Uncertainty Weighting [205] 95.8
PCGrad [182] 94.9
GradNorm [208] 95.0
DWA [183] 93.7
PLASTA-MTL 97.6
the methods of equal weighting, uncertainty weighting and DWA with our approach.
This shows that the proposed PLASTA-MTL approach provides more accurate RS
image representations, which lead to more effective CBIR compared to other methods.
This is due to the plasticity and stability preserving capabilities of our approach,
which overcome the well-known problems of MTL. Figures 7.4 and 7.5 show the NDCG
scores of the considered state-of-the-art methods and our approach under different
combinations of the tasks {T1, T2, T3, T4} and different numbers of retrieved images
for the DLRSD and the BigEarthNet-S2 archives, respectively. From the figures, one
can see that when the number of retrieved images is increased (from 1 to 50 for
DLRSD and 1 to 100 for BigEarthNet-S2), the proposed PLASTA-MTL approach
provides the highest NDCG scores for almost all task combinations at each number of
retrieved images on both archives. For the DLRSD archive, Fig. 7.6 shows an example
of a query image and the images retrieved by these methods and our approach,
when all the tasks are utilized and the query image contains the classes of buildings,
TABLE 7.6: Mean average precision (mAP) scores associated with the different
combinations of tasks (the BigEarthNet-S2 archive)
Tasks: T1 T2 T3 T4
Method mAP (%)
Equal Weighting 95.9
Uncertainty Weighting [205] 83.8
PCGrad [182] 96.3
GradNorm [208] 90.4
DWA [183] 94.7
PLASTA-MTL 97.2
Equal Weighting 87.7
Uncertainty Weighting [205] 92.0
PCGrad [182] 92.1
GradNorm [208] 84.0
DWA [183] 88.2
PLASTA-MTL 93.4
Equal Weighting 95.7
Uncertainty Weighting [205] 96.3
PCGrad [182] 87.0
GradNorm [208] 92.6
DWA [183] 94.7
PLASTA-MTL 97.4
Equal Weighting 80.4
Uncertainty Weighting [205] 90.7
PCGrad [182] 85.5
GradNorm [208] 89.4
DWA [183] 90.7
PLASTA-MTL 93.8
Equal Weighting 94.8
Uncertainty Weighting [205] 97.3
PCGrad [182] 93.9
GradNorm [208] 95.0
DWA [183] 95.2
PLASTA-MTL 97.7
cars, grass, pavement and trees. The retrieval orders of images are given above the
figure. By assessing the figure, one can observe that the proposed PLASTA-MTL
approach leads to retrieval of similar images at all retrieval orders (see Fig. 7.6g).
However, by using state-of-the-art MTL methods, retrieved images contain classes
which are not present in the query image. As an example, the equal weighting and
the DWA methods lead to retrieval of an image that includes only the field class at
the 5th and 4th retrieval orders, respectively (see Fig. 7.6b and 7.6d). We observed
similar behaviour of these methods for the BigEarthNet-S2 archive. We would like
to point out that these methods employ different gradient adjustment strategies for
overcoming the well-known problems of MTL. Accordingly, their success has been
proven for many MTL problems in the computer vision domain. However, since they do
not consider the stability-plasticity constraint of MTL and they are still based on the
joint optimization algorithm, they are limited in their ability to solve all possible
problems of MTL under various task combinations for RS images. This leads to less
accurate image
FIGURE 7.5: Normalized discounted cumulative gains (NDCG) versus the number of
retrieved images obtained for the BigEarthNet-S2 archive when the tasks: (a) T2 and
T3; (b) T2 and T4; (c) T1, T3 and T4; (d) T2, T3 and T4; and (e) T1, T2, T3 and T4
are used in the context of multi-task learning.
representations learned via these methods compared to the proposed PLASTA-MTL
approach. Accordingly, the image representations learned via our approach lead to
more effective CBIR results.
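As a rough illustration of how an NDCG score at a given number of retrieved images
can be computed, a minimal Python sketch is given here. It assumes a graded relevance
equal to the number of classes shared between the query and a retrieved image. This
relevance definition, the function names and the toy labels are assumptions made for
illustration and are not taken from the experimental setup of this chapter.

```python
import math


def shared_label_relevance(query_labels, ranked_labels):
    """Assumed graded relevance: number of classes shared with the query."""
    query = set(query_labels)
    return [len(query & set(labels)) for labels in ranked_labels]


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG at k divided by the DCG at k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical toy ranking for a query with the classes named in the text.
query = ["buildings", "cars", "grass", "pavement", "trees"]
ranking = [
    ["buildings", "cars", "trees"],  # 1st retrieved image
    ["field"],                       # 2nd retrieved image, no shared class
    ["grass", "pavement"],           # 3rd retrieved image
]
relevances = shared_label_relevance(query, ranking)
score = ndcg_at_k(relevances, k=3)
```

In this toy case the image containing only the field class contributes nothing to the
gain, so ranking it second lowers the score relative to the ideal ordering.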
7.6 Conclusion
In this chapter, we have proposed a novel plasticity-stability preserving multi-task
learning (PLASTA-MTL) approach for DL-based IRL. This approach is characterized
by novel: i) plasticity preserving loss (PPL) function; ii) stability preserving loss
(SPL) function; and iii) sequential optimization algorithm. The PPL function allows
our approach to minimize the differences in gradient magnitudes between the global
representation space and each of the task-specific embedding spaces of the considered
DNN. The use of the SPL function in the proposed PLASTA-MTL approach leads
to minimization of the angular distances between task gradients over global image
representation space. The proposed optimization algorithm sequentially optimizes:
i) each task-specific objective with the corresponding PPL function; and ii) the SPL
function for all the considered tasks. Experimental results obtained on two benchmark
archives show the effectiveness of the proposed PLASTA-MTL approach over
the state-of-the-art MTL methods in the context of CBIR. The main reasons for the
success of our approach are summarized as follows:
1. Due to the proposed PPL function, the PLASTA-MTL approach enforces the
global image representation space to be sensitive to new information learned
with each task, which leads to the preservation of the plasticity condition for
the considered DNN.
2. Due to the proposed SPL function, the PLASTA-MTL approach protects the
global image representation space from being radically disrupted by a new task,
which leads to the preservation of the stability condition for the considered DNN.
Retrieval orders: 1st, 2nd, 3rd, 4th, 5th, 10th, 20th
FIGURE 7.6: (a) Query image; and images retrieved by using (b) equal weighting; (c)
uncertainty weighting; (d) PCGrad; (e) GradNorm; (f) DWA; (g) the proposed PLASTA-MTL
approach when the tasks T1, T2, T3 and T4 are utilized for the DLRSD archive.
3. Due to the proposed sequential optimization algorithm, the PLASTA-MTL
approach accurately characterizes: i) the plasticity condition for each task; and
ii) the stability condition in between consecutive tasks.
4. Due to the effective combination of multiple tasks independently from the
number and type of tasks while considering the stability-plasticity constraint of
MTL without the need for selection of loss weights, the PLASTA-MTL approach
prevents: i) conflicts between tasks; ii) the dominance of one of the tasks; and
iii) under-performance of tasks compared to STL. This leads to more accurate
image representation learning compared to utilizing a single task and the
conventional deep multi-task learning procedures.
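The two quantities that the PPL and SPL functions are described as acting on, gradient
magnitudes and angular distances between task gradients, can be made concrete with a
small Python sketch. It is only an illustration of these quantities on plain lists of
numbers: the function names and toy gradients are hypothetical, and the code does not
reproduce the PPL or SPL definitions of this chapter.

```python
import math


def l2_norm(gradient):
    """Magnitude (L2 norm) of a gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in gradient))


def magnitude_gap(grad_a, grad_b):
    """Absolute difference between two gradient magnitudes."""
    return abs(l2_norm(grad_a) - l2_norm(grad_b))


def angular_distance(grad_a, grad_b):
    """Angle in radians between two non-zero gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    cosine = dot / (l2_norm(grad_a) * l2_norm(grad_b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Hypothetical gradients of two tasks with respect to shared parameters.
grad_task_1 = [1.0, 0.0]
grad_task_2 = [-1.0, 0.0]
conflict_angle = angular_distance(grad_task_1, grad_task_2)
```

Two task gradients pointing in opposite directions give the largest angular distance,
which corresponds to the conflict between tasks mentioned in the list above.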
It is worth noting that, in this chapter, we conducted experiments in the context of a
single RS application, CBIR, for the sake of simplicity. However, the global image
representation space learned via our approach can also be used for other applications,
since it is learned from the information provided by multiple tasks to represent the
complex semantic content of RS images. We would also like to point out that the set of
all tasks is assumed to be known during the training of our approach. However,
including new tasks in the set of considered tasks after training the PLASTA-MTL
approach can further improve the characterization of RS image content. Accordingly,
as a future development of this work, we plan to study continual learning to add new
tasks to the PLASTA-MTL approach after completing the whole learning procedure,
while preserving its plasticity and stability capabilities for these tasks as well.
Chapter 8
Conclusion and Outlook
In this chapter, we conclude this thesis with: i) a summary of presented methodolo-
gies in Section 8.1; and ii) an overview on the possible research directions comple-
mentary to the thesis in Section 8.2.
8.1 Conclusion
In this thesis, we have mainly presented six novel contributions to the state-of-the-
art DL-based representation learning of RS images to foster automatic knowledge
discovery from massive EO archives in effective and efficient ways.
As the first main contribution of this thesis, in Chapter 2 we have proposed a large-
scale benchmark RS image archive (which is denoted as BigEarthNet) to address
the limitations of existing benchmark datasets, which mostly include single-modal
RS images (e.g., multispectral or SAR) and single-label image annotations with an
insufficient amount of training data for recent DNNs. BigEarthNet includes 590,326
RS image pairs acquired over 10 different European countries. Each pair is made
up of two image patches from new generation satellites Sentinel-1 and Sentinel-
2 acquired in the same geographical area; and annotated by multiple land-cover
classes (i.e., multi-labels) from the CORINE Land Cover (CLC) database. We have
also proposed an alternative nomenclature for the characteristics of BigEarthNet
image pairs as an evolution of the original CLC labels. We would like to note that
BigEarthNet makes a significant advancement for DL-based IRL in RS as it fulfills
the requirement of training DNNs with a large number of annotated training images.
It also opens up promising research directions for DL-based IRL based on multiple
modalities. Our experimental analysis shows that IRL directly from BigEarthNet
provides more accurate characterization of RS images compared to a transfer learning
strategy (e.g., utilizing DNN models pre-trained on ImageNet). Together with all the
BigEarthNet data, we have also made several DL models pre-trained on it publicly
available. This eliminates the limitations of using DNNs that are pre-trained on
general purpose computer vision datasets for research studies on RS images.
The second main contribution of the thesis consists in our deep multi-attention driven
approach proposed for multi-label RS image scene classification problems in Chapter
3. The proposed approach is capable of efficiently and effectively describing the
spatial and spectral information content of high dimensional and high spatial resolu-
tion RS images based on three main steps: 1) spatial and spectral characterization of
image local areas through a novel K-Branch CNN; 2) definition of a multi-attention
driven global descriptor through a novel multi-attention strategy; and 3) classification
of RS image scenes with multi-labels. Due to the proposed K-Branch CNN of
the first step, our approach models the complex information content of RS images
for which the spectral bands can be associated with varying spatial resolutions, while
leading to a significant reduction in the computational complexity. Thanks to the
multi-attention strategy defined in the framework of RNNs, our approach accurately
identifies importance levels for different image local areas, and then defines image
representations based on these scores. Experimental results obtained on BigEarthNet
demonstrate that the proposed approach has a high potential for operational RS
scene classification scenarios, where EO data archives contain RS images with highly
complex spatial and spectral information content, as in new generation satellite image
archives such as Sentinel-2.
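The idea of defining an image representation from importance scores of local areas can
be illustrated with a short Python sketch. It only shows a generic softmax-weighted
aggregation of local descriptors; the scores are taken as given, whereas the approach
summarized above derives them with an RNN-based multi-attention strategy. All names
and values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw importance scores into weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(local_descriptors, scores):
    """Weighted sum of local-area descriptors using softmax attention weights."""
    weights = softmax(scores)
    dim = len(local_descriptors[0])
    return [sum(w * d[i] for w, d in zip(weights, local_descriptors))
            for i in range(dim)]


# Two hypothetical local areas with 2-dimensional descriptors.
descriptors = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_pool(descriptors, scores=[0.0, 0.0])
focused = attention_pool(descriptors, scores=[10.0, 0.0])
```

With equal scores the global descriptor is the plain average of the local descriptors,
while a much higher score for one area makes that area dominate the representation.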
As the third main contribution of the thesis, in Chapter 4 a novel triplet selection
method has been proposed for DL-based IRL through the characterization of image
similarities for multi-label CBIR problems in RS. Our method selects a small set of the
most representative and informative triplets by evaluating the relevancy, hardness
and diversity of multi-label RS images. With those image triplets, a metric space,
where semantically similar images are located close to each other, can be modeled
through triplet loss to perform CBIR in large-scale RS image archives. The selection
of a compact subset of informative and representative triplets in our method enables
effective learning of a metric space on DNNs for accurate multi-label CBIR in RS,
while reducing the total number of triplets and increasing the learning efficiency
in terms of the convergence speed. The experimental analysis in Chapter 4 confirms
that our triplet selection method is much more suitable for operational
CBIR applications compared to well-known methods, as it significantly reduces
the computational complexity of training DNNs without sacrificing CBIR
performance.
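For readers less familiar with the triplet loss mentioned above, a minimal Python
sketch of its standard margin-based form is given here. It uses squared Euclidean
distances and a margin of 1.0, both of which are assumptions for illustration; it does
not implement the relevancy, hardness and diversity criteria of the proposed triplet
selection method.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss for a single triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin.
    """
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# Hypothetical 2-dimensional embeddings.
anchor = [0.0, 0.0]
satisfied = triplet_loss(anchor, positive=[1.0, 0.0], negative=[3.0, 0.0])
violated = triplet_loss(anchor, positive=[2.0, 0.0], negative=[1.0, 0.0])
```

Triplets whose loss is already zero do not contribute gradients, which is one reason
why selecting a compact set of informative triplets can speed up training.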
The fourth main contribution of the thesis is composed of our SCI-CBIR approach
proposed in Chapter 5 to simultaneously characterize image representations through
hash codes and achieve image compression, and thus eliminate the need for decom-
pressing RS images prior to CBIR. Our SCI-CBIR approach employs first: i) DL-based
compression through an encoder-decoder DNN and a probabilistic entropy model;
and then ii) deep hashing-based indexing through pairwise, bit-balancing and
classification loss functions based on the encoded RS image representations. The novel
multi-stage learning procedure for the training of SCI-CBIR allows effective
characterization of image features for both image indexing and compression by
automatically weighting different loss functions and rate-distortion trade-off points.
We would like to emphasize that, due to the proposed approach, the need for
decompressing images prior to indexing is fully eliminated, unlike in the existing CBIR
approaches in RS. This can save a significant amount of time for large-scale CBIR
applications on massive RS image archives, as demonstrated by the experimental results
on benchmark archives.
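To make the notion of hash-code based indexing more tangible, the following Python
sketch shows a generic way of binarizing features, comparing codes by Hamming distance
and measuring how balanced each bit is. The sign-based binarization and the particular
balance penalty are commonly used forms assumed here for illustration; they are not
claimed to be the loss functions of SCI-CBIR, and all names are hypothetical.

```python
def binarize(features):
    """Map real-valued features to a hash code with bits in {-1, +1}."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def bit_balance_penalty(codes):
    """Average squared mean of each bit; zero when every bit is half +1, half -1."""
    n_codes = len(codes)
    n_bits = len(codes[0])
    means = [sum(code[i] for code in codes) / n_codes for i in range(n_bits)]
    return sum(m * m for m in means) / n_bits


# Hypothetical query and archive codes.
query_code = binarize([0.3, -1.2, 0.0])
archive_codes = [[1, 1, -1], [1, -1, 1], [-1, -1, -1]]
ranked = sorted(archive_codes, key=lambda c: hamming_distance(query_code, c))
```

Retrieval then amounts to sorting archive codes by their Hamming distance to the query
code, which is cheap compared to distances between real-valued features.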
As the fifth main contribution of the thesis, in Chapter 6 we have introduced the
GRID approach to accurately learn RS image representations when training images
are associated with noisy labels. The proposed approach models the complementary
characteristics of discriminative and generative reasoning for IRL under noisy labels
by integrating generative reasoning into discriminative reasoning through a supervised
variational autoencoder. Due to its label noise robust hybrid representation
learning strategy (which employs generative reasoning for the training samples with
noisy labels; and discriminative reasoning for the remaining samples in training
data), our approach allows learning discriminative RS image representations, while
preventing overfitting on noisy labels during training. GRID does not depend
on the type of annotation, label noise present in training data, DNN architecture,
loss function or learning task, and can operate with any DL-based IRL method. In
greater detail, unlike the existing methods, GRID requires neither a small clean subset
(training samples with clean labels) of a training set nor a computationally demanding
noise correction strategy prior to training. The experiments conducted for three
different learning tasks with the corresponding loss functions and DNN architectures
at different synthetic label noise injection rates show the success of GRID
independently of the IRL method being considered. This can be a very important
advantage for operational RS applications, which deal with noisy annotations in
training data and are required to perform under different IRL scenarios.
The sixth main contribution of the thesis consists in our PLASTA-MTL approach
introduced in Chapter 7 for effectively learning RS image representations when
multiple learning tasks are involved in IRL. Our approach: 1) adaptively adjusts the
interactions between task-specific learning procedures by the proposed sequential
optimization algorithm; 2) protects the image representation space from radical
disruptions caused by each task by the proposed stability preserving loss function;
and 3) assures the sensitivity of the image representation space to new information
from each task by the proposed plasticity preserving loss function. Due to its
stability and plasticity preserving capabilities, our PLASTA-MTL approach overcomes
the well-known multi-task learning problems, which are mainly conflicts between
tasks, the dominance of one of the tasks and under-performance of tasks compared
to single-task learning. Consequently, PLASTA-MTL is capable of learning an RS
image latent space that can better represent the complex semantic content of RS
images compared to IRL under a single learning task. Extensive experimental analysis
conducted for different combinations of four learning tasks confirms the potential
of our approach to describe the complex content of RS images by using multiple
learning tasks. This carries a huge potential for EO applications, which require
modelling the complex patterns of RS image semantics on a large scale.
In conclusion, this thesis tackles several challenges of learning RS image
representations that have arisen in recent years, which have witnessed the widespread
use of DNNs for a broad range of research problems. We hope that this thesis can be
regarded as an
important step for DL-based automatic knowledge discovery on massive RS im-
age archives by considering its: i) novel methodologies; ii) BigEarthNet benchmark
archive; iii) theoretical and experimental analyses of the proposed methodologies;
and iv) the public availability of research outcomes. We would like to note that while
our work provides new insights, it also leads to new research questions waiting to be
addressed. In the following section, we discuss two main directions of research as
future developments of this thesis.
8.2 Future Research Directions
As highlighted in Chapters 1 and 2, the availability of multi-source/multi-modal
RS images (e.g., multispectral, hyperspectral, SAR etc.) associated with the same
geographical area allows for rich characterization of the Earth’s surface and thus
learning more accurate image representations with DNNs when different data modalities
are jointly considered in a convenient way. To this end, BigEarthNet has been
proposed to contribute to the development of unsupervised, self-supervised and
semi-supervised multi-modal IRL methods for information discovery from big data
archives. However, the development of efficient and effective IRL methods, which
employ information from multiple RS image sensors for learning joint feature rep-
resentations among different data modalities, has not been addressed in this thesis.
To pave the way in this direction, in [80], we introduce a novel self-supervised
method designed only for cross-modal CBIR problems on multi-modal RS image
archives. This method is capable of simultaneously preserving intra- and inter-modal
similarities and eliminating inter-modal discrepancies without requiring annotated
training images. This is achieved by considering multi-modal RS images as multiple
views of the same geographical area, which allows IRL in an unsupervised way by
maximizing agreement between the multiple views of a shared context [211]. This
self-supervised strategy can be extended to utilize publicly available multi-sensor
RS images of next-generation Earth observation missions (e.g., Sentinels) on a large
scale for the joint use of RS image representations in atmospheric, oceanic, and land
monitoring. To support this research direction, we plan to enrich the BigEarthNet
archive by extending it to the whole of Europe with zero annotation cost, as the CORINE
Land Cover database is publicly available for all European countries. In addition to
this, BigEarthNet can also be easily extended on a world scale by enriching it with
Sentinel-1 and Sentinel-2 image patches without annotations or combining it with
other multi-modal benchmark archives. In parallel with this, we also plan to develop
self/semi-supervised IRL methodologies, which are capable of learning joint RS
image representations on such multi-sensor data for large-scale knowledge discovery
in EO applications through cross/multi-modal image classification and retrieval.
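The principle of maximizing agreement between multiple views of the same geographical
area can be sketched with a generic contrastive objective in Python. The InfoNCE-style
loss below, the temperature value and the variable names are assumptions chosen for
illustration; the sketch is not the loss function of the method in [80].

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss: the i-th items of both views form the positive pair."""
    losses = []
    for i, emb_a in enumerate(view_a):
        logits = [cosine_similarity(emb_a, emb_b) / temperature for emb_b in view_b]
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)


# Hypothetical embeddings of two areas seen by two sensors (e.g. SAR and multispectral).
sar_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(sar_embeddings, [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss(sar_embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when embeddings of the same area agree across the two views and large
when they agree with a different area instead, and no class annotations are involved.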
Throughout this thesis, we assume that the training stages of the proposed methods
are performed with pre-defined training sets, while full access to training data is
guaranteed. However, RS image archives of some data providers (e.g., commercial
providers) may not be accessible during training due to commercial concerns and
legal regulations, or it may not be feasible to gather all training data in a centralized
server due to data storage limitations. To address this issue, as a future development
of this thesis, we plan to make the proposed methodologies compatible with federated
and distributed learning of image representations to learn the model parameters
of DNNs on distributed servers without full access to training data of some data
providers. It may also happen that possible changes on the ground require re-learning
or updating already learned IRL models with new RS images. Re-training DNNs from
scratch with updated training data may not always be feasible due to the excessive
growth of RS image archives. Updating DNNs by fine-tuning with only more recent RS
images may lead to inaccurate learning of RS image
representations, as new training data may require the removal of previous knowledge
encoded by DNNs. To address this issue, as another extension of this thesis, we plan
to investigate continual life-long learning of RS image representations in effective and
efficient ways to extract information from dynamically growing RS image archives.
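As an indication of what learning model parameters without centralizing training data
could look like, a minimal Python sketch of size-weighted parameter averaging (in the
style of federated averaging) is given here. This is one commonly used aggregation
rule assumed for illustration; the thesis does not commit to a specific federated
learning algorithm, and the names and numbers are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local training set size.

    Only parameters and sample counts leave each client; the images stay local.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(size * params[i]
                for params, size in zip(client_params, client_sizes)) / total
            for i in range(dim)]


# Two hypothetical data providers with locally trained parameters.
provider_params = [[1.0, 2.0], [3.0, 4.0]]
provider_sizes = [1, 3]
global_params = federated_average(provider_params, provider_sizes)
```

A provider holding more training images has a proportionally larger influence on the
aggregated parameters in this scheme.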
Bibliography
[1] A. G. Castriotta, “Copernicus sentinel data access annual report,” European
Space Agency, Tech. Rep., 2021. [Online]. Available:
https://sentinels.copernicus.eu/web/sentinel/-/copernicus-sentinel-data-access-annual-report-2021.
[2]
C. Persello, J. D. Wegner, R. Hansch, D. Tuia, P. Ghamisi, M. Koeva, and G.
Camps-Valls, “Deep learning and earth observation to support the sustain-
able development goals: Current approaches, open challenges, and future
opportunities,” IEEE Geoscience and Remote Sensing Magazine, pp. 2–30, 2022.
DOI:10.1109/MGRS.2021.3136100.
[3]
Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review
and new perspectives,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. DOI:10.1109/TPAMI.2013.50.
[4]
M. Guo, C. Zhou, and J. Liu, “Jointly Learning of Visual and Auditory: A New
Approach for RS Image and Audio Cross-Modal Retrieval,” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 11,
pp. 4644–4654, 2019.
[5]
F. Ye, W. Luo, M. Dong, D. Li, and W. Min, “Content-Based Remote Sensing
Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance,” IEEE Geoscience
and Remote Sensing Letters, pp. 1–5, 2020. DOI:10.1109/LGRS.2020.3030858.
[6] F. Ye, M. Dong, W. Luo, X. Chen, and W. Min, “A New Re-Ranking Method
Based on Convolutional Neural Network and Two Image-to-Class Distances
for Remote Sensing Image Retrieval,” IEEE Access, vol. 7, pp. 141498–141507,
2019.
[7] F. Ye, X. Zhao, W. Luo, D. Li, and W. Min, “Query-Adaptive Remote Sensing
Image Retrieval Based on Image Rank Similarity and Image-to-Query Class
Similarity,” IEEE Access, vol. 8, pp. 116824–116839, 2020.
[8]
C. Liu, J. Ma, X. Tang, F. Liu, X. Zhang, and L. Jiao, “Deep Hash Learning for
Remote Sensing Image Retrieval,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 59, no. 4, pp. 3420–3443, 2021.
[9]
R. Imbriaco, C. Sebastian, E. Bondarev, and P. H. N. de With, “Aggregated
deep local features for remote sensing image retrieval,” Remote Sensing, vol. 11,
no. 5, p. 493, 2019. DOI:10.3390/rs11050493.
[10]
Y. Boualleg and M. Farah, “Enhanced Interactive Remote Sensing Image
Retrieval with Scene Classification Convolutional Neural Networks Model,”
in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium,
2018, pp. 4748–4751.
[11]
W. Zhou, S. Newsam, C. Li, and Z. Shao, “Learning Low Dimensional Con-
volutional Neural Networks for High-Resolution Remote Sensing Image Re-
trieval,” Remote Sensing, vol. 9, no. 5, p. 489, 2017.
[12]
F. Hu, X. Tong, G. Xia, and L. Zhang, “Delving into deep representations for
remote sensing image retrieval,” in International Conference on Signal Processing,
2016, pp. 198–203.
[13]
F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, “Remote Sensing
Image Retrieval Using Convolutional Neural Network Features and Weighted
Distance,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 10, pp. 1535–
1539, 2018.
[14]
C. Ma, F. Chen, J. Yang, J. Liu, W. Xia, and X. Li, “A remote-sensing image-
retrieval model based on an ensemble neural networks,” Big Earth Data, vol. 2,
no. 4, pp. 351–367, 2018.
[15]
R. Cao, Q. Zhang, J. Zhu, Q. Li, Q. Li, B. Liu, and G. Qiu, “Enhancing re-
mote sensing image retrieval using a triplet deep metric learning network,”
International Journal of Remote Sensing, vol. 41, no. 2, pp. 740–751, 2020.
[16]
L. Fan, H. Zhao, and H. Zhao, “Distribution Consistency Loss for Large-Scale
Remote Sensing Image Retrieval,” Remote Sensing, vol. 12, no. 1, p. 175, 2020.
[17]
U. Chaudhuri, B. Banerjee, and A. Bhattacharya, “Siamese graph convolu-
tional network for content based remote sensing image retrieval,” Computer
Vision and Image Understanding, vol. 184, pp. 22–30, 2019.
[18] U. Chaudhuri, B. Banerjee, A. Bhattacharya, and M. Datcu, “A Zero-Shot
Sketch-Based Intermodal Object Retrieval Scheme for Remote Sensing Images,”
IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2021.
DOI:10.1109/LGRS.2021.3056392.
[19]
M. Zhang, Q. Cheng, F. Luo, and L. Ye, “A triplet nonlocal neural network with
dual-anchor triplet loss for high-resolution remote sensing image retrieval,”
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 14, pp. 2711–2723, 2021.
[20]
Y. Li, Y. Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hash-
ing convolutional neural networks for cross-source remote sensing image
retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11,
pp. 6521–6536, 2018. DOI:10.1109/TGRS.2018.2839705.
[21]
Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma, “Large-scale remote sensing
image retrieval by deep hashing neural networks,” IEEE Transactions on Geo-
science and Remote Sensing, vol. 56, pp. 950–965, 2018.
[22]
P. Li, L. Han, X. Tao, X. Zhang, C. Grecos, A. Plaza, and P. Ren, “Hashing Nets
for Hashing: A Quantized Deep Learning to Hash Framework for Remote
Sensing Image Retrieval,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 58, no. 10, pp. 7331–7345, 2020.
[23]
W. Xiong, Z. Xiong, Y. Zhang, Y. Cui, and X. Gu, “A Deep Cross-Modality
Hashing Network for SAR and Optical Remote Sensing Images Retrieval,”
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 13, pp. 5284–5296, 2020.
[24]
Y. Cao, Y. Wang, J. Peng, L. Zhang, L. Xu, K. Yan, and L. Li, “DML-GANR:
Deep Metric Learning With Generative Adversarial Network Regularization
for High Spatial Resolution Remote Sensing Image Retrieval,” IEEE Transac-
tions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8888–8904, 2020.
[25] L. Fan, H. Zhao, and H. Zhao, “Global Optimization: Combining Local Loss
With Result Ranking Loss in Remote Sensing Image Retrieval,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 59, no. 8, pp. 7011–7026, 2021.
DOI:10.1109/TGRS.2020.3029334.
[26]
Y. Liu, L. Ding, C. Chen, and Y. Liu, “Similarity-Based Unsupervised Deep
Transfer Learning for Remote Sensing Image Retrieval,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 58, no. 11, pp. 7872–7889, 2020.
[27] S. Roy, E. Sangineto, B. Demir, and N. Sebe, “Metric-learning-based deep
hashing network for content-based retrieval of remote sensing images,” IEEE
Geoscience and Remote Sensing Letters, vol. 18, no. 2, pp. 226–230, 2021.
DOI:10.1109/LGRS.2020.2974629.
[28]
X. Tang, X. Zhang, F. Liu, and L. Jiao, “Unsupervised deep feature learning for
remote sensing image retrieval,” Remote Sensing, vol. 10, no. 8, p. 1243, 2018.
DOI:10.3390/rs10081243.
[29]
N. Khurshid, M. Tharani, M. Taj, and F. Z. Qureshi, “A Residual-Dyad En-
coder Discriminator Network for Remote Sensing Image Matching,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 58, no. 3, pp. 2001–2014,
2020.
[30]
Z. Shao, W. Zhou, X. Deng, M. Zhang, and Q. Cheng, “Multilabel Remote
Sensing Image Retrieval Based on Fully Convolutional Network,” IEEE Journal
of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13,
pp. 318–328, 2020.
[31] R. Dong, W. Fang, H. Fu, L. Gan, J. Wang, and P. Gong, “High-resolution land
cover mapping through learning with noise correction,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
DOI:10.1109/TGRS.2021.3068280.
[32]
G. Hoxha, F. Melgani, and B. Demir, “Toward Remote Sensing Image Retrieval
Under a Deep Image Captioning Perspective,” IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing, vol. 13, pp. 4462–4475, 2020.
[33] G. Hoxha, S. Chouaf, F. Melgani, and Y. Smara, “Change captioning: A new
paradigm for multitemporal remote sensing image analysis,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
DOI:10.1109/TGRS.2022.3195692.
[34]
G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learning meets
metric learning: Remote sensing image scene classification via learning dis-
criminative cnns,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56,
no. 5, pp. 2811–2821, 2018.
[35]
Y. Hua, L. Mou, and X. X. Zhu, “Recurrently exploring class-wise attention in
a hybrid convolutional and bidirectional lstm network for multi-label aerial
image classification,” ISPRS Journal of Photogrammetry and Remote Sensing,
vol. 149, pp. 188–199, 2019.
[36]
Y. Hua, L. Mou, and X. X. Zhu, “Relation network for multilabel aerial image
classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58,
no. 7, pp. 4558–4572, 2020. DOI:10.1109/TGRS.2019.2963364.
[37]
A. Alshehri, Y. Bazi, N. Ammour, H. Almubarak, and N. Alajlan, “Deep
attention neural network for multi-label classification in unmanned aerial
vehicle imagery,” IEEE Access, vol. 7, pp. 119873–119880, 2019.
[38]
F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding
for face recognition and clustering,” IEEE Conference on Computer Vision and
Pattern Recognition, pp. 815–823, 2015. DOI:10.1109/CVPR.2015.7298682.
[39] C. Zhou, L. Po, W. Y. F. Yuen, K. W. Cheung, X. Xu, K. W. Lau, Y. Zhao, M. Liu,
and P. H. W. Wong, “Angular deep supervised hashing for image retrieval,”
IEEE Access, vol. 7, pp. 127521–127532, 2019.
DOI:10.1109/ACCESS.2019.2939650.
[40]
X. Yang, P. Zhou, and M. Wang, “Person reidentification via structural deep
metric learning,” IEEE Transactions on Neural Networks and Learning Systems,
vol. 30, no. 10, pp. 2987–2998, 2019. DOI:10.1109/TNNLS.2018.2861991.
[41] Z. Li, J. Tang, L. Zhang, and J. Yang, “Weakly-supervised semantic guided
hashing for social image retrieval,” International Journal of Computer Vision,
vol. 128, no. 8–9, pp. 2265–2278, 2020.
[42]
W. Song, S. Li, and J. A. Benediktsson, “Deep hashing learning for visual and
semantic retrieval of remote sensing images,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 59, pp. 9661–9672, 2021.
[43]
X. Tang, C. Liu, X. Zhang, J. Ma, C. Jiao, and L. Jiao, “Remote sensing image
retrieval based on semi-supervised deep hashing learning,” in Proceedings of
the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 879–
882. DOI:10.1109/IGARSS.2019.8898676.
[44]
C. Liu, J. Ma, X. Tang, X. Zhang, and L. Jiao, “Adversarial hash-code learning
for remote sensing image retrieval,” in Proceedings of the IEEE International
Geoscience and Remote Sensing Symposium, 2019, pp. 4324–4327.
DOI:10.1109/IGARSS.2019.8900431.
[45]
X. Tang, Y. Yang, J. Ma, Y.-M. Cheung, C. Liu, F. Liu, X. Zhang, and L. Jiao,
“Meta-hashing for remote sensing image retrieval,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 60, pp. 1–19, 2022. DOI:10.1109/TGRS.2021.3136159.
[46]
W. Song, Z. Gao, R. Dian, P. Ghamisi, Y. Zhang, and J. A. Benediktsson,
“Asymmetric hash code learning for remote sensing image retrieval,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022. DOI:
10.1109/TGRS.2022.3143571.
[47]
H. Kramer, Observation of the Earth and Its Environment: Survey of Missions and
Sensors. Springer Berlin Heidelberg, 2019.
[48]
D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang,
“More diverse means better: Multimodal deep learning meets remote-sensing
imagery classification,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 59, no. 5, pp. 4340–4354, 2021.
[49]
J. Feranec, T. Soukup, G. Hazeu, and G. Jaffrain, European Landscape Dynamics:
CORINE Land Cover Data. CRC Press, 2016.
[50]
C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep
learning requires rethinking generalization,” in Proceedings of the International
Conference on Learning Representations, 2017.
[51]
H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels
with deep neural networks: A survey,” IEEE Transactions on Neural Networks
and Learning Systems, 2022. DOI:10.1109/TNNLS.2022.3152527.
[52]
Y. Liu, Z. Han, C. Chen, L. Ding, and Y. Liu, “Eagle-eyed multitask CNNs for
aerial image retrieval and scene classification,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 58, no. 9, pp. 6699–6721, 2020. DOI:10.1109/TGRS.2020.2979011.
[53]
J. Fang, X. Cao, D. Wang, and S. Xu, “Multitask Learning Mechanism for
Remote Sensing Image Motion Deblurring,” IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, vol. 14, pp. 2184–2193, 2021.
[54]
F. Chen and B. Yu, “Earthquake-Induced Building Damage Mapping Based
on Multi-Task Deep Learning Framework,” IEEE Access, vol. 7, pp. 181396–
181404, 2019.
[55]
R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Multitask learn-
ing for large-scale semantic change detection,” Computer Vision and Image
Understanding, vol. 187, p. 102783, 2019.
[56]
H. Wang, Z. Zhou, H. Zong, and L. Miao, “Wide-Context Attention Network
for Remote Sensing Image Retrieval,” IEEE Geoscience and Remote Sensing
Letters, pp. 1–5, 2020.
[57]
W. Xiong, Z. Xiong, Y. Cui, and Y. Lv, “A Discriminative Distillation Network
for Cross-Source Remote Sensing Image Retrieval,” IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 1234–1247,
2020.
[58]
S. Vandenhende, S. Georgoulis, W. V. Gansbeke, M. Proesmans, D. Dai, and
L. V. Gool, “Multi-Task Learning for Dense Prediction Tasks: A Survey,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
[59]
X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu, “A modulation module for
multi-task learning with applications in image retrieval,” in Proceedings of the
European Conference on Computer Vision, 2018, pp. 415–432.
[60]
R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in
Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
[61]
O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,”
in Proceedings of the Advances in Neural Information Processing Systems, 2018,
pp. 525–536.
[62]
G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-
label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–
95946, 2020. DOI:10.1109/ACCESS.2020.2995805.
[63]
G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides,
M. Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-
modal multi-label benchmark archive for remote sensing image classification
and retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3,
pp. 174–180, 2021. DOI:10.1109/MGRS.2021.3089174.
[64]
G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representa-
tive triplet selection for multilabel remote sensing image retrieval,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI:
10.1109/TGRS.2021.3124326.
[65]
G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning
for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.
[66]
G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression
and indexing for scalable content-based retrieval in remote sensing,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022. DOI:
10.1109/TGRS.2022.3204914.
[67]
G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust
deep image representation learning,” IEEE Transactions on Image Processing,
2023. DOI:10.1109/TIP.2023.3293776.
[68]
G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and re-
trieval in large remote sensing archives,” in Deep Learning for the Earth Sciences:
A comprehensive approach to remote sensing, climate science and geosciences,
Hoboken, NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.
[69]
G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale
benchmark archive for remote sensing image understanding,” in Proceedings of the
IEEE International Geoscience and Remote Sensing Symposium, 2019,
pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.
[70]
G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-
label remote sensing image classification,” in Proceedings of the IEEE Inter-
national Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI:
10.1109/IGARSS.2019.8898188.
[71]
G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse
triplet sampling method for multi-label remote sensing image retrieval,” in
Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote Sens-
ing Symposium, 2022, pp. 5–8. DOI:10.1109/M2GARSS52314.2022.9839759.
[72]
G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to
jointly compress and index remote sensing images for efficient content-based
retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing
Symposium, 2022, pp. 251–254. DOI:10.1109/IGARSS46834.2022.9884146.
[73]
G. Sumbul and B. Demir, “Label noise robust image representation learning
based on supervised variational autoencoders in remote sensing,” in Pro-
ceedings of the IEEE International Geoscience and Remote Sensing Symposium,
2023.
[74]
A. Preethy Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Remote-sensing im-
age scene classification with deep neural networks in JPEG 2000 compressed
domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4,
pp. 3458–3472, 2021. DOI:10.1109/TGRS.2020.3007523.
[75]
G. Sumbul, S. Nayak, and B. Demir, “SD-RSIC: Summarization-driven deep
remote sensing image captioning,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 59, no. 8, pp. 6922–6934, 2021. DOI:10.1109/TGRS.2020.3031111.
[76]
A. P. Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Approximating JPEG 2000
wavelet representation through deep neural networks for remote sensing
image scene classification,” in Proceedings of the Image and Signal Processing
for Remote Sensing Conference, vol. 11155, 2019, 111550S. DOI:10.1117/12.2534643.
[77]
K. Zhang, G. Sumbul, and B. Demir, “An approach to super-resolution of
Sentinel-2 images based on generative adversarial networks,” in Proceedings of
the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Sympo-
sium, 2020, pp. 69–72. DOI:10.1109/M2GARSS47143.2020.9105165.
[78]
H. Yessou, G. Sumbul, and B. Demir, “A comparative study of deep learning
loss functions for multi-label remote sensing image classification,” in Proceed-
ings of the IEEE International Geoscience and Remote Sensing Symposium, 2020,
pp. 1349–1352. DOI:10.1109/IGARSS39084.2020.9323583.
[79]
G. Sumbul and B. Demir, “A novel graph-theoretic deep representation learn-
ing method for multi-label remote sensing image retrieval,” in Proceedings of
the IEEE International Geoscience and Remote Sensing Symposium, 2021, pp. 266–
269. DOI:10.1109/IGARSS47720.2021.9554466.
[80]
G. Sumbul, M. Müller, and B. Demir, “A novel self-supervised cross-modal
image retrieval method in remote sensing,” in Proceedings of the IEEE International
Conference on Image Processing, 2022, pp. 2426–2430. DOI:10.1109/ICIP46576.2022.9897475.
[81]
A. Zell, G. Sumbul, and B. Demir, “Deep metric learning-based semi-supervised
regression with alternate learning,” in Proceedings of the IEEE International
Conference on Image Processing, 2022, pp. 2411–2415. DOI:10.1109/ICIP46576.2022.9897939.
[82]
B. Büyüktas, G. Sumbul, and B. Demir, “Learning across decentralized multi-
modal remote sensing archives with federated learning,” in Proceedings of the
IEEE International Geoscience and Remote Sensing Symposium, 2023.
[83]
J. Henkel, G. Hoxha, G. Sumbul, L. Möllenbrok, and B. Demir, “Annotation
cost efficient active learning for remote sensing image retrieval,” in Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium, 2023.
[84]
Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-
use classification,” in Proceedings of the International Conference on Advances in
Geographic Information Systems, 2010, pp. 270–279.
[85]
W. Shao, W. Yang, and G. S. Xia, “Extreme value theory-based calibration for
the fusion of multiple features in high-resolution satellite scene classification,”
International Journal of Remote Sensing, vol. 34, no. 23, pp. 8588–8602, 2013.
[86]
Q. Zou, L. Ni, T. Zhang, and Q. Wang, “Deep learning based feature selection
for remote sensing scene classification,” IEEE Geoscience and Remote Sensing
Letters, vol. 12, no. 11, pp. 2321–2325, 2015.
[87]
B. Zhao, Y. Zhong, G. Xia, and L. Zhang, “Dirichlet-derived multiple topic
scene classification model for high spatial resolution remote sensing imagery,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 2108–2123,
2016.
[88]
G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A
benchmark data set for performance evaluation of aerial scene classification,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981,
2017.
[89]
G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification:
Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10,
pp. 1865–1883, 2017.
[90]
H. Li, X. Dou, C. Tao, Z. Wu, J. Chen, J. Peng, M. Deng, and L. Zhao, “RSI-CB:
A large-scale remote sensing image classification benchmark using crowd-
sourced data,” Sensors, vol. 20, no. 6, 2020.
[91]
P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset
and deep learning benchmark for land use and land cover classification,”
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 12, no. 7, pp. 2217–2226, 2019. DOI:10.1109/JSTARS.2019.2918242.
[92]
W. Zhou, S. Newsam, C. Li, and Z. Shao, “PatternNet: A benchmark dataset
for performance evaluation of remote sensing image retrieval,” ISPRS Journal
of Photogrammetry and Remote Sensing, vol. 145, pp. 197–209, 2018.
[93]
B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabel remote
sensing image retrieval using a semisupervised graph-theoretic method,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1144–
1158, 2018.
[94]
L. Zhao, P. Tang, and L. Huo, “Feature significance-based multibag-of-visual-
words model for remote sensing image scene classification,” Journal of Applied
Remote Sensing, vol. 10, no. 3, pp. 1–21, 2016.
[95]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-
nition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 770–778.
[96]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” International Conference on Learning Representations,
2015.
[97]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A.
Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet
Large Scale Visual Recognition Challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211–252, 2015. DOI:10.1007/s11263-015-0816-y.
[98]
G. Jaffrain, C. Sannier, A. Pennec, and H. Dufourmont, “Corine land cover
2012 - final validation report,” European Environment Agency, Tech. Rep.,
2017. [Online]. Available:
https://land.copernicus.eu/user-corner/technical-library/clc-2012-validation-report-1.
[99]
C. Paris, L. Bruzzone, and D. Fernández-Prieto, “A novel approach to the
unsupervised update of land-cover maps by classification of time series of
multispectral images,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 57, no. 7, pp. 4259–4277, 2019, ISSN: 1558-0644. DOI:10.1109/TGRS.2018.2890404.
[100]
S. Arnold, B. Kosztra, G. Banko, G. Smith, G. Hazeu, M. Bock, and N. Valcarcel
Sanz, “The eagle concept—a vision of a future european land monitoring
framework,” in Proceedings of the EARSeL Symposium towards Horizon, vol. 2020,
2013, pp. 551–568.
[101]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in
Proceedings of the International Conference on Learning Representations, 2014,
pp. 1–41.
[102]
M. Tao, J. Su, Y. Huang, and L. Wang, “Mitigation of radio frequency inter-
ference in synthetic aperture radar data: Current status and future trends,”
Remote Sensing, vol. 11, no. 20, p. 2438, 2019.
[103]
F. Zhang, B. Du, and L. Zhang, “Scene classification via a gradient boosting
random convolutional network framework,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 54, no. 3, pp. 1793–1802, 2016.
[104]
K. Nogueira, O. A. B. Penatti, and J. A. Santos, “Towards better exploiting
convolutional neural networks for remote sensing scene classification,” Pattern
Recognition, vol. 61, pp. 539–556, 2017.
[105]
G. Sumbul, R. G. Cinbis, and S. Aksoy, “Multisource region attention net-
work for fine-grained object recognition in remote sensing imagery,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4929–4937,
2019.
[106]
S. Roy, E. Sangineto, N. Sebe, and B. Demir, “Semantic-fusion gans for semi-
supervised satellite image classification,” in Proceedings of the International
Conference on Image Processing, 2018, pp. 684–688.
[107]
X. Lu, H. Sun, and X. Zheng, “A feature aggregation convolutional neural net-
work for remote sensing scene classification,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 57, no. 10, pp. 7894–7906, 2019.
[108]
J. Xie, N. He, L. Fang, and A. Plaza, “Scale-free convolutional neural network
for remote sensing scene classification,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 57, no. 9, pp. 6916–6928, 2019.
[109]
A. Zeggada, F. Melgani, and Y. Bazi, “A deep learning approach to UAV
image multilabeling,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5,
pp. 694–698, 2017.
[110]
S. Koda, A. Zeggada, F. Melgani, and R. Nishii, “Spatial and structured SVM
for multilabel image classification,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 56, no. 10, pp. 5948–5960, 2018.
[111]
R. Stivaktakis, G. Tsagkatakis, and P. Tsakalides, “Deep learning for multilabel
land cover scene categorization using data augmentation,” IEEE Geoscience
and Remote Sensing Letters, vol. 16, no. 7, pp. 1031–1035, 2019.
[112]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu-
tation, vol. 9, no. 8, pp. 1735–1780, 1997.
[113]
F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual
prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471,
2000.
[114]
S. Masum, J. P. Chiverton, Y. Liu, B. Vuksanovic, and M. Petridis, “Investi-
gation of machine learning techniques in forecasting of blood pressure time
series data,” in Proceedings of the International Conference on Innovative Tech-
niques and Applications of Artificial Intelligence, 2019, pp. 269–282.
[115]
T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-
based neural machine translation,” in Proceedings of the Conference on Empirical
Methods in Natural Language Processing, 2015, pp. 1412–1421.
[116]
Z. A. Daniels and D. N. Metaxas, “Addressing imbalance in multi-label classifi-
cation using structured hellinger forests,” in Proceedings of the AAAI Conference
on Artificial Intelligence, 2017, pp. 1826–1832.
[117]
D. Mishkin, N. Sergievskiy, and J. Matas, “Systematic evaluation of CNN
advances on the ImageNet,” Computer Vision and Image Understanding, vol. 161,
no. C, pp. 11–19, 2017.
[118]
X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of the International Conference on
Artificial Intelligence and Statistics, 2010, pp. 249–256.
[119]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A simple way to prevent neural networks from overfitting,” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[120]
S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in Proceedings of the International
Conference on Machine Learning, 2015, pp. 448–456.
[121]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” in Proceedings of the International Conference on Learn-
ing Representations, 2015.
[122]
M. Zhang and Z. Zhou, “A review on multi-label learning algorithms,” IEEE
Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819–1837,
2014.
[123]
R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-
Wesley, 2011, pp. 327–328.
[124]
G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” Inter-
national Journal of Data Warehousing and Mining, vol. 3, no. 3, pp. 1–13, 2007.
[125]
G. Tsoumakas, I. Katakis, and I. Vlahavas, “Data mining and knowledge
discovery handbook,” in Springer, 2010, ch. Mining Multi-label Data, pp. 667–
685.
[126]
M. Ahmed, “Data summarization: A survey,” Knowledge and Information Sys-
tems, vol. 58, no. 2, pp. 249–273, 2019.
[127]
Y. Yang and S. Newsam, “Geographic image retrieval using local invariant
features,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 2,
pp. 818–832, 2013. DOI:10.1109/TGRS.2012.2205158.
[128]
E. Aptoula, “Remote sensing image retrieval with global morphological tex-
ture descriptors,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52,
no. 5, pp. 3023–3034, 2014. DOI:10.1109/TGRS.2013.2268736.
[129]
I. Tekeste and B. Demir, “Advanced local binary patterns for remote sensing
image retrieval,” IEEE International Geoscience and Remote Sensing Symposium,
pp. 6855–6858, 2018. DOI:10.1109/IGARSS.2018.8518856.
[130]
B. Demir and L. Bruzzone, “Hashing-based scalable remote sensing image
search and retrieval in large archives,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 54, no. 2, pp. 892–904, 2016. DOI:10.1109/TGRS.2015.2469138.
[131]
B. Chaudhuri, B. Demir, L. Bruzzone, and S. Chaudhuri, “Region-based re-
trieval of remote sensing images using an unsupervised graph-theoretic ap-
proach,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 7, pp. 987–991,
2016. DOI:10.1109/LGRS.2016.2558289.
[132]
Y. Li, Y. Zhang, C. Tao, and H. Zhu, “Content-based high-resolution remote
sensing image retrieval via unsupervised feature learning and collaborative
affinity metric fusion,” Remote Sensing, vol. 8, no. 9, p. 709, 2016.
DOI:10.3390/rs8090709.
[133]
Y. Boualleg and M. Farah, “Enhanced interactive remote sensing image re-
trieval with scene classification convolutional neural networks model,” IEEE
International Geoscience and Remote Sensing Symposium, pp. 4748–4751, 2018.
DOI:10.1109/IGARSS.2018.8518388.
[134]
F. Sabahi, M. O. Ahmad, and M. N. S. Swamy, “An unsupervised learn-
ing based method for content-based image retrieval using hopfield neural
network,” Proceedings of the International Conference of Signal Processing and
Intelligent Systems, pp. 1–5, 2016. DOI:10.1109/ICSPIS.2016.7869882.
[135]
H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash
coding with deep neural networks,” IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3270–3278, 2015. DOI:10.1109/CVPR.2015.7298947.
[136]
P. Zhu, Y. Tan, L. Zhang, Y. Wang, J. Mei, H. Liu, and M. Wu, “Deep learning
for multilabel remote sensing image annotation with dual-level semantic
concepts,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 6,
pp. 4047–4060, 2020.
[137]
H. Xuan, A. Stylianou, and R. Pless, “Improved embeddings with easy positive
triplet mining,” IEEE International Conference on Computer Vision, pp. 2474–
2482, 2020.
[138]
D. Zhang, Y. Li, and Z. Zhang, “Deep metric learning with spherical embed-
ding,” in Proceedings of the Advances in Neural Information Processing Systems,
vol. 33, 2020, pp. 18772–18783.
[139]
W. Ge, W. Huang, D. Dong, and M. R. Scott, “Deep metric learning with
hierarchical triplet loss,” in Proceedings of the European Conference on Computer
Vision, 2018, pp. 269–285.
[140]
Y. Yuan, K. Yang, and C. Zhang, “Hard-aware deeply cascaded embedding,”
IEEE International Conference on Computer Vision, pp. 814–823, 2017.
[141]
S. Kim, M. Seo, I. Laptev, M. Cho, and S. Kwak, “Deep metric learning beyond
binary supervision,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019, pp. 2283–2292. DOI:10.1109/CVPR.2019.00239.
[142]
S. Zhang, Q. Zhang, X. Wei, Y. Zhang, and Y. Xia, “Person re-identification
with triplet focal loss,” IEEE Access, vol. 6, pp. 78092–78099, 2018.
DOI:10.1109/ACCESS.2018.2884743.
[143]
X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss
with general pair weighting for deep metric learning,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5017–5025.
DOI:10.1109/CVPR.2019.00516.
[144]
K. Sohn, “Improved deep metric learning with multi-class n-pair loss objec-
tive,” in Proceedings of the Advances in Neural Information Processing Systems,
vol. 29, 2016.
[145]
X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson, “Ranked
list loss for deep metric learning,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 5202–5211.
DOI:10.1109/CVPR.2019.00535.
[146]
C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in
deep embedding learning,” in Proceedings of the IEEE International Conference
on Computer Vision, 2017, pp. 2840–2848.
[147]
Z. Zhang, Q. Zou, Y. Lin, L. Chen, and S. Wang, “Improved deep hashing with
soft pairwise similarity for multi-label image retrieval,” IEEE Transactions on
Multimedia, vol. 22, no. 2, pp. 540–553, 2020.
[148]
G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected
convolutional networks,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 2261–2269.
[149]
E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in
Proceedings of the International Conference on Learning Representations, 2015.
[150]
S. Deepak and P. Ameer, “Retrieval of brain MRI with tumor using con-
trastive loss based similarity on googlenet encodings,” Computers in Biology
and Medicine, vol. 125, p. 103993, 2020. DOI:10.1016/j.compbiomed.2020.103993.
[151]
H. Xuan, R. Souvenir, and R. Pless, “Deep randomized ensembles for metric
learning,” in Proceedings of the European Conference on Computer Vision, 2018,
pp. 723–734.
[152]
H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via
lifted structured feature embedding,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.
[153]
W. Chen, Y. Liu, W. Wang, E. M. Bakker, T. Georgiou, P. Fieguth, L. Liu, and
M. S. Lew, “Deep learning for instance retrieval: A survey,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, pp. 1–20, 2022.
DOI:10.1109/TPAMI.2022.3218591.
[154]
J. Lin, Z. Li, and J. Tang, “Discriminative deep hashing for scalable face
image retrieval,” in Proceedings of the International Joint Conference on Artificial
Intelligence, 2017, pp. 2266–2272.
[155]
P. Li and P. Ren, “Partial randomness hashing for large-scale remote sensing
image retrieval,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 3,
pp. 464–468, 2017. DOI:10.1109/LGRS.2017.2651056.
[156]
T. Reato, B. Demir, and L. Bruzzone, “An unsupervised multicode hash-
ing method for accurate and scalable remote sensing image retrieval,” IEEE
Geoscience and Remote Sensing Letters, vol. 16, no. 2, pp. 276–280, 2019. DOI:
10.1109/LGRS.2018.2870686.
[157]
E. Augé, J. E. Sánchez, A. Kiely, I. Blanes, and J. Serra-Sagristá, “Performance
impact of parameter tuning on the CCSDS-123 lossless multi- and hyperspec-
tral image compression standard,” Journal of Applied Remote Sensing, vol. 7,
no. 1, pp. 1–16, 2013. DOI:10.1117/1.jrs.7.074594.
[158]
M. Ryan and J. Arnold, “The lossless compression of AVIRIS images by vector
quantization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 3,
pp. 546–550, 1997. DOI:10.1109/36.581964.
[159]
P. Hao and Q. Shi, “Reversible integer KLT for progressive-to-lossless com-
pression of multiple component images,” in Proceedings of the International
Conference on Image Processing, vol. 1, 2003, pp. I–633. DOI:10.1109/ICIP.2003.1247041.
[160]
G. P. Abousleman, M. Marcellin, and B. R. Hunt, “Compression of hyper-
spectral imagery using the 3-D DCT and hybrid DPCM/DCT,” IEEE Trans-
actions on Geoscience and Remote Sensing, vol. 33, no. 1, pp. 26–34, 1995. DOI:
10.1109/36.368225.
[161]
W. Sweldens, “The lifting scheme: A custom-design construction of biorthog-
onal wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2,
pp. 186–200, 1996, ISSN: 1063-5203. DOI:10.1006/acha.1996.0015.
[162]
A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image
compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–
58, 2001. DOI:10.1109/79.952804.
[163]
European Space Agency (ESA), “Sentinel-2 user handbook,” Sentinel User
Handbook and Exploitation Tools, Tech. Rep., 2015. [Online]. Available:
https://sentinel.esa.int/documents/247904/685211/sentinel-2_user_handbook.
[164]
F. Kong, K. Hu, Y. Li, D. Li, and S. Zhao, “Spectral-spatial feature partitioned
extraction based on CNN for multispectral image compression,” Remote Sensing,
vol. 13, no. 1, 2021, ISSN: 2072-4292.
[165]
Y. Hu, W. Yang, Z. Ma, and J. Liu, “Learning end-to-end lossy image com-
pression: A benchmark,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 44, no. 8, pp. 4194–4211, 2022.
[166]
F. Kong, S. Zhao, Y. Li, D. Li, and Y. Zhou, “A residual network framework
based on weighted feature channels for multispectral image compression,”
Ad Hoc Networks, vol. 107, p. 102272, 2020, ISSN: 1570-8705.
[167]
F. Kong, K. Hu, Y. Li, D. Li, X. Liu, and T. S. Durrani, “A spectral-spatial
feature extraction method with polydirectional CNN for multispectral image
compression,” IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing, vol. 15, pp. 2745–2758, 2022. DOI:10.1109/JSTARS.2022.3158281.
[168]
J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear
transform codes for perceptual quality,” in Proceedings of the Picture Coding
Symposium, 2016, pp. 1–5.
[169]
L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression
with compressive autoencoders,” in Proceedings of the International Conference
on Learning Representations, 2017.
[170]
Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression
with discretized Gaussian mixture likelihoods and attention modules,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2020, pp. 7936–7945. DOI:10.1109/CVPR42600.2020.00796.
[171]
T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “End-to-end learnt
image compression via non-local attention optimization and improved context
modeling,” IEEE Transactions on Image Processing, vol. 30, pp. 3179–3191, 2021.
[172]
J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational
image compression with a scale hyperprior,” in Proceedings of the International
Conference on Learning Representations, 2018.
[173]
D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and hierarchical
priors for learned image compression,” in Proceedings of the Advances in Neural
Information Processing Systems, 2018, pp. 10794–10803.
[174]
A. Preethy Byju, B. Demir, and L. Bruzzone, “A progressive content-based
image retrieval in JPEG 2000 compressed remote sensing archives,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 58, no. 8, pp. 5739–5751,
2020. DOI:10.1109/TGRS.2020.2969374.
[175]
J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image com-
pression,” in Proceedings of the International Conference on Learning Representa-
tions, 2017.
[176]
Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for
image quality assessment,” in Proceedings of the Asilomar Conference on Signals,
Systems & Computers, vol. 2, 2003, pp. 1398–1402. DOI:10.1109/ACSSC.2003.1292216.
[177]
H. F. Yang, K. Lin, and C. S. Chen, “Supervised learning of semantics-preserving
hash via deep convolutional neural networks,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 40, no. 2, pp. 437–451, 2018.
[178]
J.-A. Désidéri, “Multiple-gradient descent algorithm (MGDA) for multiobjective
optimization,” Comptes Rendus Mathematique, vol. 350, no. 5, pp. 313–318,
2012, ISSN: 1631-073X. DOI:10.1016/j.crma.2012.03.014.
[179]
X. Qi, P. Zhu, Y. Wang, L. Zhang, J. Peng, M. Wu, J. Chen, X. Zhao, N. Zang,
and P. T. Mathiopoulos, “MLRSNet: A multi-label high spatial resolution
remote sensing dataset for semantic scene understanding,” ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 169, pp. 337–350, 2020.
[180]
J. Bergstra, G. Desjardins, G. Lamblin, and Y. Bengio, “Quadratic polynomials
learn better image features (technical report 1337),” Département d’Informatique
et de Recherche Opérationnelle, Université de Montréal, Tech. Rep., 2009.
[181]
S. Su, C. Zhang, K. Han, and Y. Tian, “Greedy hash: Towards fast optimization
for accurate hash coding in CNN,” in Proceedings of the Advances in Neural
Information Processing Systems, 2018, pp. 806–815.
[182]
T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient
surgery for multi-task learning,” in Proceedings of the Advances in Neural Infor-
mation Processing Systems, vol. 33, 2020, pp. 5824–5836.
[183]
S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with
attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2019, pp. 1871–1880. DOI:10.1109/CVPR.2019.00197.
[184]
K. Islam, L. M. Dang, S. Lee, and H. Moon, “Image compression with recurrent
neural network and generalized divisive normalization,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021,
pp. 1875–1879.
[185]
F. Kong and R. Henao, “Efficient classification of very large images with tiny
objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2022, pp. 2384–2394.
[186]
S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, “ESPNetv2: A light-
weight, power efficient, and general purpose convolutional neural network,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2019, pp. 9182–9192.
[187]
R. Raina, Y. Shen, A. McCallum, and A. Ng, “Classification with hybrid
generative/discriminative models,” in Proceedings of the Advances in Neural
Information Processing Systems, vol. 16, 2003.
[188]
R. Zhang, Z. Chen, S. Zhang, F. Song, G. Zhang, Q. Zhou, and T. Lei, “Remote
sensing image scene classification with noisy label distillation,” Remote Sensing,
vol. 12, no. 15, p. 2376, 2020.
[189]
J. Kang, R. Fernandez-Beltran, P. Duan, X. Kang, and A. J. Plaza, “Robust
normalized softmax loss for deep metric learning-based characterization of
remote sensing images with label noise,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 59, no. 10, pp. 8798–8811, 2021, ISSN: 1558-0644. DOI:
10.1109/TGRS.2020.3042607.
[190]
P. Li, X. He, X. Cheng, M. Qiao, D. Song, M. Chen, T. Zhou, J. Li, X. Guo, S.
Hu, and Z. Tian, “An improved categorical cross entropy for remote sensing
image classification based on noisy labels,” Expert Systems with Applications,
vol. 205, p. 117296, 2022. DOI: 10.1016/j.eswa.2022.117296.
[191]
T. Burgert, M. Ravanbakhsh, and B. Demir, “On the effects of different types
of label noise in multi-label remote sensing image classification,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022. DOI: 10.1109/TGRS.2022.3226371.
[192]
A. K. Aksoy, M. Ravanbakhsh, and B. Demir, “Multi-label noise robust col-
laborative learning method for remote sensing image classification,” IEEE
Transactions on Neural Networks and Learning Systems, 2022. DOI: 10.1109/TNNLS.2022.3209992.
[193]
N. Ahmed, R. M. Rahman, M. S. G. Adnan, and B. Ahmed, “Dense prediction
of label noise for learning building extraction from aerial drone imagery,”
International Journal of Remote Sensing, vol. 42, no. 23, pp. 8906–8929, 2021.
[194]
J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang, “Deep
learning from noisy image labels with quality embedding,” IEEE Transactions
on Image Processing, vol. 28, no. 4, pp. 1909–1922, 2019. DOI: 10.1109/TIP.2018.2877939.
[195]
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense
object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 42, no. 2, pp. 318–327, 2020. DOI: 10.1109/TPAMI.2018.2858826.
[196]
S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-learning
regularization prevents memorization of noisy labels,” Advances in Neural
Information Processing Systems, vol. 33, pp. 20331–20342, 2020.
[197]
H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement:
A joint training method with co-regularization,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2020, pp. 13723–13732.
DOI: 10.1109/CVPR42600.2020.01374.
[198]
T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, and L.
Zelnik-Manor, “Asymmetric loss for multi-label classification,” in Proceedings
of the IEEE International Conference on Computer Vision, 2021, pp. 82–91. DOI:
10.1109/ICCV48922.2021.00015.
[199]
K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference via gener-
ative classifiers for handling noisy labels,” in Proceedings of the International
Conference on Machine Learning, 2019, pp. 3763–3772.
[200]
D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings
of the International Conference on Learning Representations, 2014.
[201]
S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals
of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. DOI: 10.1214/aoms/1177729694.
[202]
H.-Z. Feng, K. Kong, M. Chen, T. Zhang, M. Zhu, and W. Chen, “SHOT-VAE:
Semi-supervised deep generative models with label-aware ELBO approxima-
tions,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[203]
Z. Shao, K. Yang, and W. Zhou, “Performance evaluation of single-label and
multi-label remote sensing image retrieval using a dense labeling dataset,”
Remote Sensing, vol. 10, no. 6, p. 964, 2018.
[204]
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[205]
A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to
weigh losses for scene geometry and semantics,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491. DOI:
10.1109/CVPR.2018.00781.
[206]
X. Lu, Y. Zhong, Z. Zheng, Y. Liu, J. Zhao, A. Ma, and J. Yang, “Multi-scale and
multi-task deep learning framework for automatic road extraction,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 9362–9377,
2019.
[207]
W. Song, S. Li, and J. A. Benediktsson, “Deep hashing learning for visual and
semantic retrieval of remote sensing images,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 59, no. 11, pp. 9661–9672, 2021.
[208]
Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “GradNorm: Gradient
normalization for adaptive loss balancing in deep multitask networks,”
in Proceedings of the International Conference on Machine Learning, vol. 80, 2018,
pp. 794–803.
[209]
K.-K. Maninis, I. Radosavovic, and I. Kokkinos, “Attentive single-tasking of
multiple tasks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2019, pp. 1851–1860.
[210]
A. Migdalas, P. Pardalos, and P. Värbrand, Multilevel Optimization: Algorithms
and Applications. Springer US, 2013.
[211]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for
contrastive learning of visual representations,” in Proceedings of the Interna-
tional Conference on Machine Learning, vol. 119, 2020, pp. 1597–1607.