
Deep Image Representation Learning for Knowledge Discovery from Earth Observation Data Archives

submitted by

M. Sc.

GENCER SÜMBÜL

ORCID: 0000-0003-3690-3052

to Faculty IV - Electrical Engineering and Computer Science

of the Technische Universität Berlin

in fulfilment of the requirements for the academic degree

Doctor of Engineering Sciences

- Dr.-Ing. -

approved dissertation

Doctoral committee:

Chair: Prof. Dr. Matthias Boehm

Reviewer: Prof. Dr. Begüm Demir

Reviewer: Prof. Dr. Farid Melgani

Reviewer: Prof. Dr. Claudio Persello

Date of the scientific defence: 9 May 2023

Berlin 2023

Abstract

Advances in remote sensing (RS) technology have increased the availability of images
regularly acquired by satellite-borne and airborne sensors, while free data policies
give researchers access to massive Earth observation data archives. To automatically
extract knowledge from these archives at large scale, deep learning (DL) based RS image
representation learning (IRL) has attracted great attention. However, existing methods
are limited in: i) accurate characterization of the high-level semantic content and
spectral information present in RS images; ii) modelling RS image similarities by
exploiting multi-label training images; iii) time-efficient and scalable information
extraction; iv) effective IRL under noisy training labels; and v) joint use of multiple
learning tasks for describing the complex content of RS images. This thesis aims to
develop advanced DL-based IRL methods to tackle these limitations, with particular
attention devoted to image scene classification and content-based image retrieval (CBIR)
problems due to their importance for large-scale knowledge discovery. In detail, we
propose five DL-based IRL methods throughout the thesis. First, a multi-label
classification approach is introduced to accurately describe the complex spatial and
spectral content of high-spatial-resolution RS images, where several spectral bands are
associated with varying spatial resolutions. Second, we propose an image triplet
sampling method for IRL through the characterization of RS image similarities, which
forms the foundation for CBIR. Among multi-label training images, this method selects a
small set of the most representative and informative image triplets, which decreases
computational complexity and increases learning speed without a significant loss in
performance. Third, an approach devoted to simultaneous RS image compression and
indexing is introduced for scalable CBIR. This approach characterizes hash codes of RS
images in a learning-based compression domain, and thus removes the need to decode
images prior to CBIR, which can save a significant amount of time. Fourth, we propose an
approach for IRL when training data include noisy labels. By integrating generative
reasoning into discriminative reasoning, our approach models the complementary
characteristics of discriminative and generative reasoning, and thus prevents the
interference of noisy labels during training. Fifth, a multi-task learning approach is
introduced to achieve IRL when multiple learning tasks are jointly utilized. Owing to
its loss functions and sequential optimization algorithm, this approach preserves the
plasticity for each task and the stability between consecutive learning tasks. For
benchmarking the proposed methods, we introduce a large-scale multi-modal multi-label
benchmark RS image archive (denoted as BigEarthNet). It includes 590,326 pairs of
Sentinel-1 and Sentinel-2 image patches acquired over 10 European countries. We make
BigEarthNet, its pre-trained DL models and the codes of all the methods publicly
available as open-source contributions of the thesis.

Zusammenfassung

Advances in remote sensing (RS) technologies have led to an increased availability of
imagery acquired by satellite-borne and airborne sensors; at the same time, the free
release of datasets gives researchers access to extensive archives of Earth observation
data. This creates a potential for deep learning (DL) based representation learning (RL)
studies for automatic knowledge discovery from these archives. However, existing methods
have limitations with respect to: i) the accurate characterization of the semantic
content and spectral information of RS images; ii) the correct use of RS images with
multiple labels during training; iii) time-efficient and scalable information
extraction; iv) effective RL under erroneous training labels; and v) the combined use of
multiple learning tasks for describing image content. This thesis aims to develop
DL-based RL methods to address these shortcomings, with particular attention given to
image scene classification and content-based image retrieval (CBIR). The first
contribution of this thesis is the development of a multi-label classification approach
for the accurate description of the complex spatial and spectral content of
high-resolution RS images. As the second contribution, we propose an image triplet
sampling method for RL. It is based on the characterization of image similarities, which
are fundamental to CBIR. From the training images, the method selects a small number of
diverse anchors as well as relevant, hard and diverse positive and negative images,
which leads to lower computational complexity without a significant loss in performance.
In the third contribution, an approach for simultaneous RS image compression and
indexing for scalable CBIR is presented. Our approach characterizes hash codes of RS
images in a learning-based compression domain and thus avoids decoding images prior to
CBIR, which can lead to considerable time savings. As the fourth contribution, we
propose an approach for RL for the case in which the training data contain erroneous
labels. By combining generative and discriminative modelling, our approach exploits
their complementary properties to prevent the interference of erroneous labels during
training. In the fifth contribution, a multi-task learning approach is introduced in
which multiple learning tasks are used in combination. Owing to its loss functions and
its sequential optimization algorithm, this approach preserves the plasticity for each
individual learning task and the stability between consecutive learning tasks. For
benchmarking the proposed methods, the final contribution of this thesis is the creation
of BigEarthNet, the first large-scale multi-modal multi-label benchmark archive in RS.
We make BigEarthNet, its pre-trained DL models and the codes of all methods publicly
available as open-source contributions of the dissertation.


Acknowledgements

This thesis has been made possible through the support of numerous people whom I
have had the privilege of meeting. First, I am deeply grateful to my supervisor, Prof.
Dr. Begüm Demir, for her time and encouragement from the first moment of my PhD
studies, and for being an exceptional mentor who was always available to share her
knowledge. Without her priceless guidance, this thesis would not have been possible.

I would like to thank the members of my doctoral committee, Prof. Dr. Matthias
Boehm, Prof. Dr. Begüm Demir, Prof. Dr. Farid Melgani and Prof. Dr. Claudio
Persello, for their interest in my studies, helpful feedback and thoughtful comments
that have helped me to improve my thesis significantly.

I also want to thank my colleagues from the Remote Sensing Image Analysis (RSiM)
Group of TU Berlin, Genc Hoxha, Minh Tai Le, Mahdyar Ravanbakhsh, Barış Büyüktaş,
Yeti Gürbüz, Bernhard Föllmer, Lars Möllenbrok, Akshara Preethy Byju, Tristan
Kreuziger, Kai Norman Clasen, Martin Hermann Paul Fuchs, Tom Burgert, Huma
Ghani Zada, Ahmet Kerem Aksoy, Georgii Mikriukov, Steve Ahlswede, Sayantan Sengupta,
Adina Zell, Leonard Hackel, David Mickisch, Jakob Hackstein, Julia Henkel,
Adela Westedt, Martha Domhöfer, Tim Siebert, Theresa Follath and Kiril Murschel
for their help and all the enjoyable moments in Berlin.

Last but most important, I would like to express my sincere and deepest gratitude to
my wife, Kimya, for her continuous support and encouragement despite innumerable
sacrifices, for believing in my success all the time, for being my best friend and
companion in my life, and most significantly for her inestimable love. I would not be
where I am today without her.

This thesis is supported by the European Research Council (ERC) through the
ERC-2017-STG BigEarth Project under Grant 759764.

Contents

Abstract
Zusammenfassung
Acknowledgements
Contents
List of Figures
List of Tables
List of Abbreviations

1 Introduction
1.1 Objectives and Novel Contributions of the Thesis
1.2 List of Publications
1.2.1 Contributions of the Thesis
1.2.2 Additional Contributions
1.3 Structure of the Thesis

2 BigEarthNet: A Large Scale Benchmark Archive for Remote Sensing Image Representation Learning
2.1 Introduction
2.2 Limitations of Existing Archives
2.3 BigEarthNet: A Large-Scale Benchmark Archive
2.4 Experimental Design
2.5 Experimental Results
2.5.1 Comparison with Transfer Learning from ImageNet
2.5.2 Comparison of State-of-the-Art CNN Models
2.6 Conclusion

3 A Deep Multi-Attention Driven Approach for Multi-Label Remote Sensing Image Classification
3.1 Introduction
3.2 Proposed Approach
3.2.1 Spatial and Spectral Characterization of Local Areas
3.2.2 Definition of a Multi-Attention Driven Global Descriptor
3.2.3 Classification of RS Image Scenes with Multi-Labels
3.3 Dataset Description and Experimental Design
3.3.1 Dataset Description
3.3.2 Experimental Design
3.4 Experimental Results
3.4.1 Sensitivity Analysis of the Proposed Approach
3.4.2 Comparison Among the Existing Approaches
3.5 Conclusion

4 Remote Sensing Image Similarity Learning Through Informative and Representative Triplets for Multi-Label Image Retrieval
4.1 Introduction
4.2 Related Works
4.3 Proposed Method
4.3.1 Problem Formulation
4.3.2 Diverse Anchor Selection
4.3.3 Relevant, Hard and Diverse Positive-Negative Image Selection
4.4 Dataset Description and Experimental Design
4.4.1 Dataset Description
4.4.2 Experimental Design
4.5 Experimental Results
4.5.1 Sensitivity Analysis of the Proposed Method
4.5.2 Ablation Study
4.5.3 Comparison with Different Triplet Sampling Methods
4.5.4 Comparison with the State-of-the-Art DML Approaches
4.6 Conclusion

5 Towards Simultaneous Image Compression and Indexing for Scalable Content-Based Retrieval in Remote Sensing
5.1 Introduction
5.2 Related Works
5.3 Proposed SCI-CBIR Approach
5.3.1 First Step: DL-Based Compression
5.3.2 Second Step: Deep Hashing-Based Indexing
5.3.3 Multi-Stage Learning Procedure
5.4 Dataset Description and Experimental Design
5.4.1 Dataset Description
5.4.2 Experimental Design
5.5 Experimental Results
5.5.1 Sensitivity Analysis of the Proposed SCI-CBIR Approach
5.5.2 Comparison with Standard Approaches
5.6 Conclusion

6 Generative Reasoning Integrated Label Noise Robust Deep Image Representation Learning in Remote Sensing
6.1 Introduction
6.2 Related Works
6.3 Proposed Approach
6.3.1 Basics on Discriminative Reasoning
6.3.2 Integration of Generative Reasoning
6.3.3 Label Noise Robust Hybrid Representation Learning
6.4 Dataset Description and Experimental Design
6.4.1 Dataset Description
6.4.2 Experimental Design
6.5 Experimental Results
6.5.1 Sensitivity Analysis of the Proposed Approach
6.5.2 Ablation Study of the Proposed Approach
6.5.3 Comparison Among the State-of-the-Art Methods
6.6 Conclusion

7 Plasticity-Stability Preserving Multi-Task Image Representation Learning in Remote Sensing
7.1 Introduction
7.2 Related Works
7.2.1 Single-Task Driven Methods
7.2.2 Multi-Task Driven Methods
7.3 Proposed Approach
7.3.1 Plasticity Preservation
7.3.2 Stability Preservation
7.3.3 Sequential Optimization Algorithm
7.4 Dataset Description and Experimental Design
7.4.1 Dataset Description
7.4.2 Experimental Design
7.5 Experimental Results
7.5.1 Sensitivity Analysis of the Proposed Approach
7.5.2 Comparison with Existing Methods
7.6 Conclusion

8 Conclusion and Outlook
8.1 Conclusion
8.2 Future Research Directions

Bibliography


List of Figures

2.1 An example of BigEarthNet image pairs and their multi-labels.

2.2 An example of the Sentinel-2 image patches of BigEarthNet that are fully covered by seasonal snow, cloud and cloud shadow.

2.3 An example of a query pair from the BigEarthNet archive and retrieved image pairs obtained by using: 1) direct learning from BigEarthNet; and 2) transfer learning from ImageNet in the framework of content-based multi-modal multi-label image retrieval.

3.1 Block diagram of the proposed approach for multi-label RS image scene classification.

3.2 The proposed K-Branch CNN introduced in the first step of the proposed approach. One local area is highlighted as an example to feed into the corresponding CNN.

3.3 Single LSTM cell with its inputs, gates and cell state, followed by two LSTM cells in a sequence. Without loss of generality, a particular sequence of the LSTM network (which starts with the first local area and ends with the last local area) is chosen in the figure.

3.4 Proposed multi-attention strategy with bidirectional LSTM networks for the second step of the proposed approach.

3.5 Detailed illustration of the three main steps of the proposed approach: (a) spatial and spectral characterization of local areas; (b) definition of a multi-attention driven global descriptor; (c) RS image scene classification with multi-labels.

3.6 An example of the BigEarthNet-S2 images with the true multi-labels and the multi-labels assigned by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the proposed approach.

4.1 An example of three triplets consisting of images from BigEarthNet-S2. Each triplet given in different rows consists of an anchor (in blue frame), a positive image (in green frame), and a negative image (in red frame). The associated multi-labels are given below the respective images.

4.2 An abstract representation of triplet selection and the progress of the feature space update. Blue arrows indicate reducing distances for updating the embedding, while red arrows indicate increasing the distances. $X_a$ marks a chosen anchor; $X_{P_1}$, $X_{P_2}$ and $X_{P_3}$ are positive images, and $X_{N_1}$, $X_{N_2}$ and $X_{N_3}$ are negative images in different triplets. The triplet $(X_a, X_{P_1}, X_{N_1})$ is trivial because it already satisfies the margin, and thus the corresponding distances are not updated. The triplet $(X_a, X_{P_2}, X_{N_2})$ leads to a relatively small error, and the images are pushed and pulled a little. The triplet $(X_a, X_{P_3}, X_{N_3})$ violates the margin greatly and causes a significant error. $X_{P_3}$ is a positive image, but very far from the anchor, so it is considered a hard positive image; $X_{N_3}$ is correspondingly a hard negative image.

4.3 A block scheme of the proposed triplet sampling method to drive the training phase of a DNN for multi-label CBIR problems.

4.4 An example of images from the UCMerced Land Use archive and the multi-labels associated with them: (a) sand, sea (b) airplane, cars, grass, pavement (c) bare-soil, buildings, grass (d) buildings, cars, pavement, trees.

4.5 An image retrieval example: (a) query image; (b) images retrieved by TNDML; (c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method (IRS-BigEarthNet archive).

4.6 An image retrieval example: (a) query image; (b) images retrieved by TNDML; (c) images retrieved by RSDML; (d) images retrieved by the proposed DAS-RHDIS method (UCMerced archive).

4.7 F1 scores obtained by different triplet sampling strategies and the number of accumulated triplets during the training (the UCMerced archive).

5.1 Illustration of the proposed SCI-CBIR approach.

5.2 Multi-scale similarity index (MS-SSIM) in dB versus bpp obtained by the proposed SCI-CBIR approach, IC-RNN and JPEG2000 for (a) BigEarthNet-S2 and (b) MLRSNet archives.

5.3 An RS image compression example: (a) original image; reconstructed image at 0.7 bits per pixel (bpp) by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach (BigEarthNet-S2 archive).

5.4 An RS image compression example: (a) original image; reconstructed image at 0.3 bpp by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach (MLRSNet archive).

5.5 MAP versus bpp obtained by the proposed SCI-CBIR approach and SI-CBIR for (a) BigEarthNet-S2 and (b) MLRSNet archives.

5.6 (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.62 bpp; (d) the proposed SCI-CBIR at 0.78 bpp; (e) the proposed SCI-CBIR at 1.05 bpp; and (f) the proposed SCI-CBIR at 1.56 bpp (BigEarthNet-S2 archive).

5.7 (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.33 bpp; (d) the proposed SCI-CBIR at 0.56 bpp; (e) the proposed SCI-CBIR at 0.69 bpp; and (f) the proposed SCI-CBIR at 0.85 bpp (MLRSNet archive).

6.1 An illustration of the training of our GRID approach that jointly leverages the robustness of generative reasoning towards noisy labels and the effectiveness of discriminative reasoning on image representation learning. During the forward pass on a mini-batch $B$, the loss values $O_d(B)$ and $O_g(B)$ and the predicted labels are obtained through discriminative and generative reasoning for a given learning task. Then, the set $W$ of training samples with noisy labels (i.e., noisy samples) and the set $C$ of training samples with correct labels (i.e., clean samples) are constructed through our automatic noisy sample detection procedure based on the values of the loss function associated with the learning task. During the backward pass, the model parameters except the CNN backbone parameters are updated with all samples based on $\nabla_\gamma O_d(B)$ and $\nabla_\beta O_g(B)$. The parameters of the CNN backbone are updated through: i) the generative task head for the noisy samples based on $\nabla_\theta O_g(W)$; and ii) the discriminative task head for the clean samples based on $\nabla_\theta O_d(C)$.

6.2 Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%, and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive).

6.3 Noisy sample detection accuracy of the proposed GRID (BCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%, and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive).

6.4 Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%, and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (DLRSD archive).

6.5 Noisy sample detection accuracy of the proposed GRID (PCE) approach versus epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%, and $k$ for $\lambda_{k\%}$ is set equal to the SLNIR value (BigEarthNet-S2 archive).

6.6 Results obtained by using: 1) discriminative reasoning; 2) generative reasoning; 3) their standard joint learning; and 4) our label noise robust hybrid representation learning strategy for different values of SLNIR when RS IRL is achieved by: i) multi-label classification on (a) DLRSD and (b) BigEarthNet-S2; ii) semantic segmentation on (c) DLRSD and (d) BigEarthNet-S2; and iii) multi-label co-occurrence prediction on (e) DLRSD and (f) BigEarthNet-S2.

7.1 An illustration of the training of the proposed plasticity-stability preserving multi-task learning (PLASTA-MTL) approach when two tasks $T_1$ and $T_2$ are considered. Standard and plasticity preservation backward passes for (a) $T_1$ and (c) $T_2$ are shown, while the changes over the gradient vectors (b) $\nabla_G L_{T_1}$ and (d) $\nabla_G L_{T_2}$ during the plasticity preservation of these tasks are visualized. (e) The backward pass for stability preservation of all the tasks is given with (f) the illustration of changes over their gradient vectors.

7.2 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when two tasks are utilized in different orders for the PLASTA-MTL approach.

7.3 Mean Average Precision (mAP) versus the minimum number of training epochs for the DLRSD archive when (a)-(c) different pairs of tasks are utilized for the PLASTA-MTL approach and the equal weighting method.

7.4 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when (a)-(d) different pairs of tasks and (e) $T_2$, $T_3$ and $T_4$ are used in the context of multi-task learning.

7.5 Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the BigEarthNet-S2 archive when (a)-(d) different pairs of tasks and (e) $T_1$, $T_2$, $T_3$ and $T_4$ are used in the context of multi-task learning.

7.6 (a) Query image; and images retrieved by using (b) equal weighting; and the proposed PLASTA-MTL approach when two tasks are utilized for the DLRSD archive.

List of Tables

2.1 A List of Existing RS Image Archives

2.2 The list of classes within CLC and proposed class nomenclatures and their associated numbers of image pairs. These numbers are obtained after eliminating Sentinel-2 image patches that are fully covered by seasonal snow, cloud, and cloud shadow.

2.3 Class-based Scores (%) obtained when: i) transfer learning from ImageNet and ii) direct learning from BigEarthNet are used for multi-modal multi-label image classification.

2.4 Overall Multi-Modal Multi-Label Classification Results Under Different Metrics and DL Models for BigEarthNet.

3.1 Multi-Label Classification Accuracies and the Number of Required Model Parameters (NP) When Using Local Areas With Different Sizes for the Proposed Approach.

3.2 Results Obtained by the SiB-CNN$_{RGB}$, the SiB-CNN, the L-SiB-CNN and the Proposed K-Branch CNN.

3.3 Multi-Label Classification Accuracies Obtained by Using Different Steps of the Proposed Approach.

3.4 Results Obtained by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the Proposed Approach Together With the Number of Required Model Parameters (NP).

4.1 The Performance of Different DL Model Architectures for the UCMerced Archive.

4.2 The Effect of Varying Embedding Sizes on the Retrieval Performance for the UCMerced Archive.

4.3 Results obtained by the different anchor selection strategies (RAS, BAS and proposed DAS) under different metrics for the UCMerced archive when proposed RHDIS is used for positive and negative image selection.

4.4 Results obtained by the different positive and negative image selection strategies (RIS, BIS and proposed RHDIS) under different metrics for the UCMerced archive when proposed DAS is used for anchor selection.

4.5 The performance of different triplet selection methods for the IRS-BigEarthNet and UCMerced archives.

4.6 The performance of different deep metric learning methods for the IRS-BigEarthNet and UCMerced archives.

5.1 Results Obtained by Proposed SCI-CBIR for Different Values of $\eta_C$ When the First Two Stages of Our Learning Procedure are Achieved at Different Bit-rates (BigEarthNet-S2 Archive)

5.2 Results Obtained by Proposed SCI-CBIR with and without the Attention Layer When the First Two Stages of Our Learning Procedure are Achieved at Different Bit-rates (BigEarthNet-S2 Archive)

5.3 Results Obtained by Proposed SCI-CBIR under Different Activation Functions (The BigEarthNet-S2 Archive)

5.4 Results Obtained by Proposed SCI-CBIR for Different Automatic Loss Weighting Techniques (BigEarthNet-S2 Archive)

5.5 Results Obtained by Proposed SCI-CBIR for Different Values of q.

5.6 Retrieval Time per Image (in milliseconds) Obtained by SI-CBIR and the Proposed SCI-CBIR Approach

5.7 Results Obtained by Proposed SCI-CBIR Trained with Our Multi-Stage Learning Procedure and Standard Learning Procedure Associated to Similar Bit-Rates (The BigEarthNet-S2 Archive)

6.1 Results (%) Obtained by the Proposed GRID (BCE) Approach for Different Values of λ and SLNIR (%) (DLRSD archive)

6.2 Results (%) Obtained by the Proposed GRID (BCE) Approach for Different Values of λ and SLNIR (%) (BigEarthNet-S2 archive)

6.3 Results (%) Obtained by the Proposed GRID (PCE) Approach for Different Values of λ and SLNIR (%) (DLRSD archive)

6.4 Results (%) Obtained by the Proposed GRID (PCE) Approach for Different Values of λ and SLNIR (%) (BigEarthNet-S2 archive)

6.5 Results (%) Obtained by BCE, ELR [196], FL [195], ASL [198], JoCoR [197] and the Proposed GRID (BCE) Approach Under Different Values of SLNIR (%) (DLRSD archive)

6.6 Results (%) Obtained by BCE, ELR [196], FL [195], ASL [198], JoCoR [197] and the Proposed GRID (BCE) Approach Under Different Values of SLNIR (%) (BigEarthNet-S2 archive)

6.7 Results (%) Obtained by PCE, LNC [31], RRL [79] and the Proposed GRID (PCE) and GRID (RRL) Approaches Under Different Values of SLNIR (%) (DLRSD archive)

6.8 Results (%) Obtained by PCE, LNC [31], RRL [79] and the Proposed GRID (PCE) and GRID (RRL) Approaches Under Different Values of SLNIR (%) (BigEarthNet-S2 archive)

7.1 Mean Average Precision (mAP) Scores When Different Combinations of Tasks are Utilized with Different Capabilities of the PLASTA-MTL Approach (The DLRSD Archive)

7.2 Mean Average Precision (mAP) Scores When Two Tasks are Utilized in Different Orders for the PLASTA-MTL Approach (The DLRSD Archive)

7.3 Training Times per Epoch on the DLRSD Archive When the Different Combinations of Tasks are Utilized for the Proposed PLASTA-MTL Approach and Equal Weighting.

7.4 Mean Average Precision (mAP) Scores When the Different Combinations of Tasks are Utilized in the PLASTA-MTL Approach Compared to Single Task Learning (The DLRSD Archive)

7.5 Mean Average Precision (mAP) Scores Associated to the Different Combinations of Tasks (The DLRSD Archive)

7.6 Mean Average Precision (mAP) Scores Associated to the Different Combinations of Tasks (The BigEarthNet-S2 Archive)


List of Abbreviations

AHCL Asymmetric Hash Code Learning

ASL Asymmetric Loss

BAS Batch Anchor Selection

BCE Binary Cross Entropy

BIS Batch Positive and Negative Image Selection

BigEarthNet-S1 Sentinel-1 Image Patches of BigEarthNet

BigEarthNet-S2 Sentinel-2 Image Patches of BigEarthNet

CA-LSTM Class-Wise Attention-Based Convolutional and Bidirectional LSTM Network

CBIR Content Based Image Retrieval

CLC CORINE Land Cover

CNN Convolutional Neural Network

CV Computer Vision

DAS Diverse Anchor Selection

DATL Dual Anchor Triplet Loss

DHCNN Deep Hashing Convolutional Neural Network

DHNN Deep Hashing Neural Network

DL Deep Learning

DML Deep Metric Learning

DNN Deep Neural Network

DWA Dynamic Weight Average

DenseNet Densely Connected Convolutional Network

ELBO Evidence Lower Bound

ELR Early-Learning Regularization

EO Earth Observation

FL Focal Loss

GRID Generative Reasoning Integrated Label Noise Robust Deep Representation Learning

GradNorm Gradient Normalization for Adaptive Loss Balancing

IRL Image Representation Learning

JoCoR Joint Training with Co-Regularization

LNC High-Resolution Land Cover Mapping through Learning with Noise Correction

LSTM Long Short-Term Memory

LULC Land Use Land Cover

mAP Mean Average Precision

MS-SSIM Multi-Scale Structural Similarity Index Metric

MSE Mean Squared Error

MSL Multi Similarity Loss

MTL Multi Task Learning

MiLaN Metric Learning-Based Deep Hashing Network

NDCG Normalized Discounted Cumulative Gains

NPL N-Pair Loss

PCGrad Projecting Conﬂicting Gradients

PLASTA-MTL Plasticity-Stability preserving Multi-Task Learning

PPL Plasticity Preserving Loss

RAS Random Anchor Selection

RHDIS Relevant, Hard and Diverse Positive and Negative Image Selection

RIS Random Positive and Negative Image Selection

RNN Recurrent Neural Network

RRL Region Representation Learning Loss

RSDML Enhancing Remote Sensing Image Retrieval using a Triplet Deep Metric Learning Network

RS Remote Sensing

ResNet Residual Network

SAR Synthetic Aperture Radar

SCI-CBIR Simultaneous Remote Sensing Image Compression and Indexing for Scalable Content Based Image Retrieval

SLNIR Synthetic Label Noise Injection Ratio

SPL Stability Preserving Loss

SSHAAE Semi-Supervised Hashing Adversarial Autoencoder

STL Single-Task Learning

TNDML Deep Metric Learning Using Triplet Network

VAE Variational Auto-Encoder

VGG Very Deep Convolutional Networks

VGI Volunteered Geographic Information

To the love of my life, Kimya...

Chapter 1

Introduction

Unprecedented advances in satellite technology have resulted in regular, frequent,

and high-resolution monitoring of the Earth surface, producing fast-growing Earth

observation (EO) data archives. As an example of the exceptionally fast growth rate

of these archives, the published volume of data through the Copernicus programme

(which is the European ﬂagship satellite initiative with its Sentinel missions) during

only 2021 reached more than 7 PiB [1]. The rising operational capability of such

monitoring provides abundant information on the status of our planet. Accordingly,

EO data, especially from recent passive multispectral and active synthetic aperture

radar (SAR) instruments, plays a crucial role in overcoming the most pressing

global societal challenges, e.g., those deﬁned by the Sustainable Development Goals

[2]. The Sentinel-2 mission, for instance, has been acquiring high-resolution multispectral

images characterized by 10 to 60 m spatial resolution, 13 spectral bands and a revisit

time of 10 days since 2015, while the Sentinel-1 mission has been providing C-band

SAR images with up to 5 m spatial resolution and a revisit time of 6 days since 2014.

Due to the open EO data access policies of recent satellite missions, most of the

remote sensing (RS) image archives are publicly available to researchers. This carries

a huge potential for climate change analysis, urban area studies, forestry applications,

emergency management for disaster relief efforts, water quality assessment, crop

monitoring, etc. To extract relevant information from such huge and ever-growing

RS image archives on a large scale that can have a substantial impact on societal

challenges, data-driven approaches are a crucial prerequisite.

From the virtuous circle between the tremendous expansion of the data era and the

investigations of computer science in the last decades, machine learning, notably

deep learning (DL), emerged as the most promising breakthrough among data-driven

approaches. These advances have also enabled huge leaps in modeling and analyzing RS

images due to several advantages of DL-based methods compared to their conventional

counterparts. DL-based methods allow RS image representations with multiple levels

of abstraction to be learned automatically by training deep neural networks

(DNNs) exclusively on data [3]. By relying on a huge amount of EO data, DL-based

image representation learning (IRL) becomes capable of modeling higher-level RS

image semantics and their complex patterns beyond regional borders. Today, there

is almost a consensus that DL-based approaches are revolutionizing the way

we address challenges for IRL in RS. Thus, DL-based IRL carries a huge potential for automatic

knowledge discovery from massive EO image archives on a large scale.

Chapter 1. Introduction 2

DL-based IRL is generally achieved in a supervised way during the optimization

of a loss function on a training set based on the characteristics of a learning task

(e.g., single/multi-label classiﬁcation, semantic segmentation etc.). To this end, the

considered DNN typically includes an image encoder (i.e., a CNN backbone) and

a task head including fully connected or convolutional layers (which is branched

out from the image encoder). The loss function is selected on the basis of the char-

acteristics of the considered learning task, and thus the model parameters of the

considered DNN are automatically learned during the optimization of this function.

Most of the DL-based IRL methods in RS utilize the following learning tasks: 1) scene

classiﬁcation [4]–[14]; 2) similarity learning [15]–[27]; 3) image reconstruction [28],

[29]; 4) semantic segmentation [30], [31]; and 5) image captioning [32], [33]. Each

learning task has different objectives that lead to different optimization procedures

throughout the training of the considered DNN. Accordingly, learned image repre-

sentations have different characteristics for different learning tasks, and thus carry

different information to be utilized in the ﬁnal EO application. As an example, when

the learning task is scene classiﬁcation, RS image representations can be learned

by optimizing entropy-based loss functions. In this way, image representations are

encoded to separate pre-defined classes, which maximizes inter-class distances in the

image representation space. For the similarity learning task, on the other hand, image

representations are learned to discriminate dissimilar RS images, which minimizes

intra-class distances in the image representation space [34]. This can be achieved by

employing siamese CNNs on tuples of RS images to optimize triplet or contrastive

loss functions. If image reconstruction is chosen as the task, auto-encoder neural

networks can be used first to construct the representations and then to recover RS

images with a reconstruction loss. Once the model parameters of the DNN are learned

on a training set, they are utilized to obtain either image features or the predictions

of the task head from large-scale RS image archives for a ﬁnal EO application. As

an example, for an EO application that requires assigning land-use/land-cover class

labels to RS images, class probabilities of a given RS image obtained by the task head

can be directly used to associate it with class labels. If an EO application performs

content-based image retrieval (CBIR), which aims to search for RS images similar

to a query image based on their semantic content, image representations of an RS

image archive obtained by the image encoder can be compared to that of the query

image for ﬁnding similar images.
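
The two usage modes described above can be sketched in a few lines of Python. This is a minimal illustration that assumes image representations and class probabilities are already available as NumPy arrays. The function names `retrieve_similar` and `assign_labels`, the use of cosine similarity and the probability threshold of 0.5 are assumptions made for this example and are not taken from the methods proposed in this thesis.

```python
import numpy as np


def retrieve_similar(query_repr, archive_reprs, top_k=5):
    """Return indices of the top_k archive images most similar to the query."""
    # L2-normalise so that the dot product equals the cosine similarity.
    q = query_repr / np.linalg.norm(query_repr)
    a = archive_reprs / np.linalg.norm(archive_reprs, axis=1, keepdims=True)
    similarities = a @ q
    # Sort in descending order of similarity and keep the best top_k.
    return np.argsort(-similarities)[:top_k]


def assign_labels(class_probs, threshold=0.5):
    """Return indices of classes whose probability is at least the threshold."""
    return np.flatnonzero(np.asarray(class_probs) >= threshold)
```

In this sketch, `archive_reprs` plays the role of the image encoder output for the whole archive (one row per image), while `class_probs` plays the role of the task head output for a single image.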

We would like to note that automatic knowledge discovery from massive EO image

archives requires employing DL-based IRL methods on a large scale. RS

image scene classification and CBIR are among the most prominent emerging solutions

in this regard. Accordingly, the development of IRL methods devoted to image scene

classification and CBIR problems has attracted great attention in the RS community. Most

of these methods assume that each training image is annotated by a single (broad

category) label, which is associated to the most signiﬁcant content of the image.

However, RS images typically contain areas with a high variety of semantically

complex content that must be reﬂected by more than one class annotation through

multiple class labels (multi-labels). Thus, DL-based IRL methods that properly

exploit training images annotated by multi-labels have recently been found very promising

for RS images in the framework of image scene-classiﬁcation and CBIR.

To employ DL-based IRL methods for scene classiﬁcation problems based on images

annotated by multi-labels, attention-based DNNs have attracted great attention

in RS, e.g., class-wise attention-based recurrent neural network [35], attention-aware

label relational reasoning network [36], encoder-decoder based deep attention neu-

ral network [37]. The attention strategies proposed in [35], [36] and [37] identify

informative areas of images through an attention map based on the feature maps

of convolutional layers. These strategies are effective for very high resolution aerial

images; however, they can be insufficient for accurately describing the complex content

of satellite RS images with high spatial resolution (e.g., Sentinel-2 and Landsat

multi-spectral images). Results obtained on very high resolution aerial images

with only RGB bands show the success of these strategies for the description of the

spatial image content. A direct adaptation of these methods for high dimensional RS

images may lead to an incomplete representation of the spectral information content.

These issues are critical particularly for images with several spectral bands with

varying spatial resolutions acquired by the new generation satellites (e.g., Sentinel-2).

Thus, methods that can efﬁciently and effectively describe the spatial and spectral

information content of high dimensional RS images are needed in the framework of

multi-label RS image scene classiﬁcation.

In the context of DL-based IRL for CBIR problems, recent years have witnessed

increasing attention to deep metric learning (DML) based methods that aim at

learning a representation space (in which similar images are located close to each

other). Such methods are mostly trained using a triplet loss function defined over three

images: i) an anchor image; ii) a positive image that is similar to the anchor; and

iii) a negative image that is dissimilar to the anchor [38]. A difﬁcult task in DML is to

construct the set of triplets. A simple strategy is to deﬁne triplets from an existing

training set of labeled images. In [27] a strategy is applied in a way that: i) an anchor

is randomly chosen from a mini-batch of training images; and then ii) one positive

image that has the same class label as the anchor is randomly chosen, while selecting

one negative image that has a different class label. For each anchor image, there can

be several positive and negative images. Thus, random selection does not guarantee

the selection of the most representative and informative images to the anchor and

can result in the construction of so-called trivial triplets. It is worth noting that one

can also exploit all the images in the mini-batch to construct triplets, as suggested

in [15]. However, this choice signiﬁcantly increases the total number of triplets and

thus the computational complexity of the training phase of the CBIR system [39], [40].

To overcome the limitation of random selection, DML methods that evaluate the

hardness of images during the sampling process have been introduced in the computer vision (CV) literature.

Most of the triplet sampling methods in CV rely on single-label image annotations

to decide which images are positive or negative for a given anchor image. From

the DML perspective, the selection of triplets from training images annotated by

multi-labels is more complex than that from training images labeled by single-labels.

To achieve accurate DML in multi-label RS CBIR, methods that accurately select a set

of triplets from multi-label training images are needed.
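
To make the notion of a trivial triplet concrete, the following Python sketch shows a standard margin-based triplet loss together with a random triplet selection of the kind described above for single-label images. It is a simplified illustration under stated assumptions: squared Euclidean distances, a margin of 1.0, single-label annotations, and a mini-batch in which every class occurs at least twice and at least two classes are present. The function names are hypothetical and the code is not the sampling method proposed in this thesis.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-based triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # The loss is zero once the negative is farther than the positive by the margin.
    return float(max(0.0, d_pos - d_neg + margin))


def random_triplet(labels, rng):
    """Randomly pick (anchor, positive, negative) indices from single-label data.

    Assumes each class occurs at least twice and at least two classes exist.
    """
    labels = np.asarray(labels)
    anchor = rng.integers(len(labels))
    same = np.flatnonzero(labels == labels[anchor])
    same = same[same != anchor]  # the positive must differ from the anchor
    different = np.flatnonzero(labels != labels[anchor])
    return anchor, rng.choice(same), rng.choice(different)
```

A triplet whose loss is already zero contributes no gradient, which is why random selection can waste training effort on trivial triplets.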

For large-scale CBIR, fast and accurate indexing methods that allow approximate

nearest neighbor search are fundamental. In this perspective, hashing-based indexing

has recently attracted attention to solve the large-scale approximate nearest neighbor

search problems for CBIR due to its highly efficient (in terms of both storage and

speed) and accurate search capability within huge image archives. DL-based hashing

methods map high-dimensional image representations into compact binary hash

codes while simultaneously optimizing IRL and hash code learning. Then, CBIR

can be achieved by calculating the Hamming distances with simple bit-wise XOR

operations [41]. Several DL-based hashing methods are presented in RS [20], [21], [27],

[42]–[46], which are potentially effective for CBIR in RS. It is noted that in massive EO

archives RS images are usually stored in compressed format to reduce their storage

sizes [47]. Thus, image decoding (i.e., decompression) is required before applying any

DL-based hashing method. This is computationally demanding and impractical in

the case of large-scale CBIR problems. However, there is no hashing-based indexing

method in RS that can be applied in the compressed domain efﬁciently and effectively.

Accordingly, to achieve scalable CBIR in massive RS image archives, DL-based IRL

methods that can jointly characterize RS image representations and hash codes while

effectively compressing them are needed.
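
The retrieval step based on Hamming distances and bit-wise XOR operations can be sketched as follows. This is a minimal Python illustration that assumes binary hash codes are already available as arrays of 0/1 values. The function names `pack_hash_codes` and `hamming_distances` are hypothetical, and the sketch covers only the search step, not the learning of hash codes or any compression.

```python
import numpy as np


def pack_hash_codes(bits):
    """Pack an (n_images, n_bits) array of 0/1 values into bytes (8 bits per byte)."""
    return np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)


def hamming_distances(query_code, archive_codes):
    """Hamming distance between one packed query code and all packed archive codes."""
    # XOR leaves a 1 exactly at the bit positions where the two codes differ.
    diff = np.bitwise_xor(archive_codes, query_code)
    # Counting the set bits per row gives the Hamming distance.
    return np.unpackbits(diff, axis=1).sum(axis=1)
```

Because the codes are stored as packed bytes, an archive of n images with b-bit codes occupies roughly n·b/8 bytes, which is the source of the storage efficiency mentioned above.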

Most of the DL-based IRL methods require a huge amount of annotated RS images

during training to adjust the model parameters of the considered DNN and reach a

high performance. The availability and quality of such data determine the feasibility

of these methods. The process of collecting, preparing, and annotating RS images

on a large-scale to create sufﬁciently large high-quality archives to drive DL-based

studies is time consuming, complex, and costly in operational scenarios. Therefore,

most researchers rely on existing benchmark archives to employ and develop DL-

based methods. However, there are only a few publicly available benchmark archives

in RS. Most of the existing archives feature a relatively small volume of images,

which is a limitation for DL-based studies due to the above-mentioned reasons. To

overcome this problem, a common approach is to exploit DNN models, which are

pre-trained on publicly available general purpose CV datasets. However, this is not

a viable approach in RS due to the differences in image characteristics in CV and

RS. As an example, Sentinel-2 images have 13 spectral bands associated with varying

and lower spatial resolutions with respect to CV images. In addition, RS benchmark

archives mostly contain single-label image annotations, i.e., each image is annotated

by a single high-level land-use category label. However, as discussed above, the content of RS

images must be described by more than one class annotation through low-level class

labels (i.e., multi-labels). Thus, a benchmark archive consisting of images annotated

with multi-labels is required. This lack of large-scale publicly available benchmark

archives of RS images with multi-labels prevents the widespread adoption of DL

models in RS applications, even though raw data and potential applications do

exist. In addition, most of the existing publicly available benchmark archives contain

single-modal RS images (e.g., multispectral or SAR). However, multi-modal images

associated with the same geographical area allow for rich characterization of RS

images and thus improve image representation learning when jointly considered [48].

Thus, a large-scale benchmark archive consisting of multi-modal RS images annotated

with multi-labels is needed for DL-based IRL methods in RS.

In addition to the above-mentioned alternatives for obtaining a high quantity of annotated

training images, publicly available thematic maps (e.g., the CORINE Land

Cover inventory [49]), automatic labeling procedures, or volunteered geographic

information (VGI) as crowdsourced data can also be used in RS. These strategies

provide RS image annotations at zero cost. However, the considered thematic map or

VGI source can be outdated with respect to RS images due to possible changes on the

ground; or there can be annotation errors. Thus, these strategies increase the risk of

including noisy labels in training data. Learning RS image representations with noisy

labels may result in overﬁtting of the considered DNN to noisy labels and lack of its

generalization capability, and thus inaccurate characterization of RS images during

both training and inference [50], [51]. To address this problem, several methods,

mostly in the CV community, have been presented to improve the robustness of IRL when

training data includes noisy labels. All these methods are potentially effective for

DL-based IRL under noisy labels in RS. However, most of them are dependent on the

type of: i) label noise present in training data; ii) image annotation; iii) loss function

(e.g., cross-entropy, focal loss etc.); iv) DNN architecture; or v) learning task. Some

methods also require the availability of a subset of the training set, which includes

clean labels, or require the computationally demanding noise correction strategies

prior to training. Thus, they may not be directly integrated into different scenarios

associated with IRL in RS. Accordingly, DL-based IRL approaches that allow learning

RS image representations under noisy labels independently of the IRL scenario being

considered are required.

In RS, it is common to employ a single learning task for DL-based IRL. However, using

a single learning task may not be sufﬁcient to describe the complex content of RS

images. To address this issue, multiple learning tasks can be jointly utilized for IRL.

When IRL is achieved based on multiple tasks, the resulting representation space can

better characterize the complex semantic content of RS images. Accordingly, a few

DL-based multi-task learning (MTL) methods have been recently introduced in RS to

learn image representations through the joint optimization of multiple loss functions,

each of which is associated with a learning task [52]–[57]. Due to the complexity of

the MTL problem, it is common that: i) tasks may compete or even distract each other

during training; ii) one of the tasks may dominate the whole learning procedure; or

iii) each task may be characterized less accurately than in single-task

learning [58]. These problems undermine the effectiveness of the whole representation

learning procedure [59]. These issues occur due to the stability-plasticity constraint

of MTL [60]. MTL methods need to be sensitive to new information learned from

each task, so that each task can contribute to further improving the

image characterization. This condition is known as plasticity. During the learning

process of a new task, new information encoded in the considered DNN should

not radically disrupt what is already characterized based on the other tasks. This

condition is known as stability. The MTL formulation of the existing methods (which

is based on joint optimization) is limited in controlling the learning of each task. Thus, it

does not allow controlling the plasticity and stability of the whole learning procedure. It

is also worth noting that, with this formulation, the whole learning procedure is sensitive

to the proper selection of the loss function weight for each task, which generally requires a

grid search (which is computationally demanding) [61]. Thus, MTL methods that

can effectively combine multiple learning tasks without the need for selection of

loss weights while considering the stability-plasticity problem are needed in RS to

accurately apply IRL.
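
The cost of selecting loss weights by grid search can be illustrated with a short Python sketch. It assumes that joint optimization takes the common form of a weighted sum of task-specific losses, which is an assumption made for illustration and not a formulation stated in this section. The function names `joint_loss` and `weight_grid` and the candidate weight values are hypothetical.

```python
import itertools


def joint_loss(task_losses, weights):
    """Weighted sum of task-specific loss values (assumed joint MTL objective)."""
    return sum(w * loss for w, loss in zip(weights, task_losses))


def weight_grid(num_tasks, candidates):
    """All weight combinations a grid search would have to train and evaluate."""
    return list(itertools.product(candidates, repeat=num_tasks))
```

With c candidate values per weight and T tasks, the grid contains c^T combinations, each of which requires a full training run. For example, three candidate values and four tasks already give 81 runs, which is why avoiding explicit loss weight selection is attractive.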

1.1 Objectives and Novel Contributions of the Thesis

The overall aim of this thesis is to develop advanced DL-based IRL methods for

information discovery from massive RS image archives and to construct a large-

scale RS image archive for benchmarking DL-based IRL methods in RS. For the

development of novel methodologies, particular attention is devoted to scene

classification and CBIR problems in RS due to their importance for information

discovery from massive RS image archives, while advanced methods for label-noise

robust and multi-task IRL are proposed independently from the learning task being

considered. To address the main challenges highlighted in the previous section, the

rest of this thesis is divided into six main chapters. In the following, the objectives

and contributions of these chapters are brieﬂy explained.

In Chapter 2, we present a large-scale RS image benchmark archive, aiming to address

the limitations of existing benchmark archives, and thus to provide a high quantity of

annotated training RS images suitable for DL-based IRL methods in RS. To this end,

we introduce BigEarthNet as the ﬁrst large-scale multi-modal multi-label benchmark

archive in RS that contains 590,326 pairs of Sentinel-2 and Sentinel-1 image patches

acquired over 10 European countries. Each pair in BigEarthNet is annotated with

multi-labels from the CORINE Land Cover (CLC) database of the year 2018. In

this chapter, we also introduce an alternative class-nomenclature since some CLC

classes can be challenging to describe accurately by only considering (single-date)

BigEarthNet image patches. An experimental analysis on: i) the comparison among

the strategies of IRL directly from RS images of BigEarthNet and transfer learning

from DNNs trained on computer vision images; and ii) several well-known CNN

architectures shows the effectiveness of BigEarthNet for scene classiﬁcation and CBIR

problems in RS. It is worth noting that we make all the data and the well-known DL

models trained on BigEarthNet publicly available at https://bigearth.net, offering

an important resource to support studies on DL-based IRL in RS.

In Chapter 3, we introduce a novel DL-based IRL approach that aims at accurately

describing complex spatial and spectral content of RS images in the framework of

the multi-label classiﬁcation of high-dimensional high-spatial resolution RS images.

The capability of the proposed approach is investigated in three consecutive steps:

1) spatial and spectral characterization of image local areas; 2) deﬁnition of a multi-

attention driven global descriptor; and 3) classiﬁcation of RS image scenes with

multi-labels. In the ﬁrst step, we present a novel branch-wise CNN architecture

(denoted as

-Branch CNN) that efﬁciently describes the complex content of local

areas of each image by different CNN branches specialized according to the spatial

resolutions of image bands. In the second step, we present a novel multi-attention

strategy in the framework of RNNs that: i) accurately identiﬁes importance levels

(i.e., scores) for different local areas; and then ii) deﬁnes a global descriptor for each

image based on these scores. In this chapter, extensive experiments are performed

to analyze the effectiveness of the proposed approach in terms of the sensitivity

analysis and comparison among the existing approaches. We make the code of

the proposed approach publicly available at

https://gitlab.tubit.tu-berlin.de/rsim/MAML-RSIC.

In Chapter 4, we present a novel image triplet sampling method for DL-based IRL

of RS images through the characterization of image similarities, which forms the

foundation for CBIR in RS. The proposed method aims at selecting a small set of the

most representative and informative triplets of multi-label training images based

on two consecutive steps. In the ﬁrst step, a small number of diverse anchors is

selected based on a simple but efﬁcient iterative algorithm. In the second step,

relevant, hard and diverse positive and negative images with respect to each anchor

are chosen based on a novel strategy. The effectiveness of the proposed method

is theoretically and experimentally investigated in terms of: i) the computational

complexity of the training phase with respect to the CBIR performance; and ii)

the learning efficiency via the convergence speed of the considered DNNs. In addition,

an overview of the existing triplet selection methods and the detailed literature

review on CBIR in RS are provided. It is worth mentioning that we make the code

of the proposed method publicly available at

https://git.tu-berlin.de/rsim/image-retrieval-from-triplets.

In Chapter 5, we propose a novel approach devoted to simultaneous RS image

compression and indexing for scalable CBIR. The proposed approach aims to: i)

jointly characterize representations and hash codes of RS images in the learning-based

compression domain; and thus ii) eliminate the need to decode RS

images prior to CBIR. This is achieved by two main steps: i) deep learning-based

compression; and ii) deep hashing-based indexing. The ﬁrst step applies image

feature extraction and image reconstruction based on a pair of encoder and decoder

DNNs, while a probabilistic entropy model is employed to optimize the length of

the compressed bitstreams. The second step employs pairwise, bit-balancing and

classiﬁcation loss functions for the generation of hash codes based on image features

characterized by the ﬁrst step. To effectively characterize image features for both

image indexing and compression, we propose a novel multi-stage learning procedure

for the training of the proposed approach, allowing to automatically weight different

loss functions considered in both steps. For the first time in RS, the proposed approach

simultaneously applies RS image compression and indexing, and thus does not

require RS image decoding prior to CBIR, which can save a significant amount of time

for operational applications. Through extensive experiments for the sensitivity

analysis of the proposed approach and its comparison with standard approaches,

the effectiveness of simultaneous image compression and indexing for large-scale

knowledge discovery on RS image archives is investigated. We make the code of the

proposed approach available at https://git.tu-berlin.de/rsim/SCI-CBIR.

In Chapter 6, we introduce a novel generative reasoning integrated label noise robust

deep representation learning approach. The proposed approach aims at modeling the

complementary characteristics of discriminative and generative reasoning for IRL on

training RS images associated with noisy labels. To this end, for the ﬁrst time in RS,

we integrate generative reasoning into discriminative reasoning through a variational

autoencoder for supervised IRL under noisy labels, which leads to accurate

RS image representations while preventing interference of noisy labels during

training. Unlike the existing label noise robust methods, the proposed approach does

not depend on the type of annotation, label noise, DNN architecture, loss function

or learning task. It also does not require a clean subset (training samples with clean

labels) of a training set or require a computationally demanding noise correction

strategy prior to training. Thus, our approach can be directly utilized for various

scenarios for IRL in RS. In this chapter, extensive experimental analysis is given on

two IRL scenarios, where training RS images are annotated with: 1) scene-level noisy

multi-labels; and 2) pixel-level noisy labels. Under these scenarios, we consider three

learning tasks with the corresponding loss functions and DNN architectures. We

would like to note that we will make the code of the proposed approach publicly

available at https://git.tu-berlin.de/rsim/GRID.

In Chapter 7, we explore DL-based IRL when multiple learning tasks are jointly

utilized and introduce a novel plasticity-stability preserving multi-task learning

approach that aims to preserve: 1) the plasticity for each task; and 2) the stability

in between learning consecutive tasks independently from the number of tasks

and the type of tasks. To this end, we introduce novel plasticity preserving and

stability preserving loss functions. The plasticity preserving loss (PPL) function

enforces an image representation space to be sensitive to new information learned

with each task during training. The stability preserving loss (SPL) function protects

the image representation space from being radically disrupted by each task during training. To

effectively apply these two loss functions, we also propose a sequential optimization

algorithm that adaptively adjusts the interactions between task-specific learning

procedures, and thus ensures plasticity and stability conditions for all the tasks. In

this chapter, analysis through extensive experiments for the sensitivity analysis of

the proposed approach and its comparison with state-of-the-art methods is provided

when different combinations of four learning tasks are utilized for IRL. As an open

source contribution, we make the code of the proposed approach publicly available

at https://git.tu-berlin.de/rsim/PLASTA-MTL.

1.2 List of Publications

During the PhD period of the author, the contributions of this thesis have been

published as journal articles or presented at scientiﬁc conferences, as it is a common

practice in computer science. The studies in subsection 1.2.1 list the contributions

of this thesis as publications. The work described in this thesis has also inspired

additional studies (which are listed in subsection 1.2.2) that are not discussed in this

thesis.

1.2.1 Contributions of the Thesis

Journal Articles

•

G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-

label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946,

2020. DOI:10.1109/ACCESS.2020.2995805.

•

G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M.

Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-modal

multi-label benchmark archive for remote sensing image classiﬁcation and

retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3, pp. 174–

180, 2021. DOI:10.1109/MGRS.2021.3089174.

•

G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representative

triplet selection for multilabel remote sensing image retrieval,” IEEE Transactions

on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI:10.1109/TGRS.2021.3124326.

•

G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning

for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.

•

G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression

and indexing for scalable content-based retrieval in remote sensing,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022. DOI:10.1109/TGRS.2022.3204914.

•

G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust

deep image representation learning,” IEEE Transactions on Image Processing,

2023. DOI:10.1109/TIP.2023.3293776.

Book Chapters

• G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and retrieval in large remote sensing archives,” in Deep Learning for the Earth Sciences: A comprehensive approach to remote sensing, climate science and geosciences, Hoboken, NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.

Conference Papers

• G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale benchmark archive for remote sensing image understanding,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.

• G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-label remote sensing image classification,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI:10.1109/IGARSS.2019.8898188.

• G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse triplet sampling method for multi-label remote sensing image retrieval,” in Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, 2022, pp. 5–8. DOI:10.1109/M2GARSS52314.2022.9839759.

• G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to jointly compress and index remote sensing images for efficient content-based retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 251–254. DOI:10.1109/IGARSS46834.2022.9884146.

• G. Sumbul and B. Demir, “Label noise robust image representation learning based on supervised variational autoencoders in remote sensing,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2023.

1.2.2 Additional Contributions

Journal Articles

• A. Preethy Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Remote-sensing image scene classification with deep neural networks in JPEG 2000 compressed domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 3458–3472, 2021. DOI:10.1109/TGRS.2020.3007523.

• G. Sumbul, S. Nayak, and B. Demir, “SD-RSIC: Summarization-driven deep remote sensing image captioning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 8, pp. 6922–6934, 2021. DOI:10.1109/TGRS.2020.3031111.

Conference Papers

• A. P. Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Approximating JPEG 2000 wavelet representation through deep neural networks for remote sensing image scene classification,” in Proceedings of the Image and Signal Processing for Remote Sensing Conference, vol. 11155, 2019, 111550S. DOI:10.1117/12.2534643.

• K. Zhang, G. Sumbul, and B. Demir, “An approach to super-resolution of Sentinel-2 images based on generative adversarial networks,” in Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, 2020, pp. 69–72. DOI:10.1109/M2GARSS47143.2020.9105165.

• H. Yessou, G. Sumbul, and B. Demir, “A comparative study of deep learning loss functions for multi-label remote sensing image classification,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2020, pp. 1349–1352. DOI:10.1109/IGARSS39084.2020.9323583.

• G. Sumbul and B. Demir, “A novel graph-theoretic deep representation learning method for multi-label remote sensing image retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2021, pp. 266–269. DOI:10.1109/IGARSS47720.2021.9554466.

• G. Sumbul, M. Müller, and B. Demir, “A novel self-supervised cross-modal image retrieval method in remote sensing,” in Proceedings of the IEEE International Conference on Image Processing, 2022, pp. 2426–2430. DOI:10.1109/ICIP46576.2022.9897475.

• A. Zell, G. Sumbul, and B. Demir, “Deep metric learning-based semi-supervised regression with alternate learning,” in Proceedings of the IEEE International Conference on Image Processing, 2022, pp. 2411–2415. DOI:10.1109/ICIP46576.2022.9897939.

• B. Büyüktas, G. Sumbul, and B. Demir, “Learning across decentralized multi-modal remote sensing archives with federated learning,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2023.

• J. Henkel, G. Hoxha, G. Sumbul, L. Möllenbrok, and B. Demir, “Annotation cost efficient active learning for remote sensing image retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2023.

1.3 Structure of the Thesis

The rest of this thesis is structured as follows:

Chapter 2 introduces BigEarthNet, which is a large-scale multi-modal multi-label RS

image archive, for benchmarking DL-based IRL methods in RS.

Chapter 3 presents our DL-based IRL approach for multi-label classiﬁcation of high-

dimensional high-spatial resolution RS images.

Chapter 4 introduces our image triplet sampling method for DL-based IRL of RS

images through the characterization of image similarities in a metric space.

Chapter 5 presents our approach devoted to simultaneous RS image compression

and indexing for scalable CBIR.

Chapter 6 presents our generative reasoning integrated label noise robust deep

representation learning approach for IRL on training images with noisy labels.

Chapter 7 introduces our plasticity-stability preserving multi-task learning approach

for DL-based IRL when multiple learning tasks are jointly utilized.

Chapter 8 concludes this thesis with a summary and a discussion of future research directions.

Chapter 2

BigEarthNet: A Large Scale Benchmark Archive for Remote Sensing Image Representation Learning

DL-based IRL methods in RS generally require a large number of annotated training RS images to accurately learn the model parameters of the considered DNN. To fulfill this requirement, this chapter presents the multi-modal multi-label BigEarthNet benchmark archive made up of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches acquired over 10 different European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland). Each pair of patches in BigEarthNet is annotated with multi-labels provided by the CORINE Land Cover (CLC) map of 2018 based on its thematically most detailed Level-3 class nomenclature. Some CLC classes are difficult to describe accurately by considering only (single-date) BigEarthNet image patches. To address this problem, we also introduce in this chapter an alternative class-nomenclature as an evolution of the original CLC labels. This is achieved by interpreting and arranging the CLC Level-3 nomenclature based on the properties of BigEarthNet images in a new nomenclature of 19 classes. In our experiments, we show the potential of BigEarthNet for multi-modal multi-label CBIR and scene-classification problems by considering several state-of-the-art DL models. We also demonstrate that the DL models trained from scratch on BigEarthNet outperform those pre-trained on ImageNet, especially in relation to some complex classes, including agriculture and other vegetated and natural environments. We make all the data and the DL models publicly available at https://bigearth.net, offering an important resource to support studies on DL-based IRL in RS. This chapter is mainly based on the following publications:

• G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M. Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-modal multi-label benchmark archive for remote sensing image classification and retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3, pp. 174–180, 2021. DOI:10.1109/MGRS.2021.3089174.

• G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale benchmark archive for remote sensing image understanding,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.

2.1 Introduction

Most DL-based RS image representation learning methods require a large number of annotated images during training to accurately optimize all parameters and reach high performance. The availability and quality of such data determine the feasibility of many DL models. There are several benchmark archives made publicly available for different RS applications. To the best of our knowledge, most of the existing publicly available benchmark archives for image scene classification and retrieval problems contain: 1) single-modal RS images (e.g., multispectral or SAR); and 2) single-label image annotations (i.e., each image is annotated by a single label that is associated with the most significant content of the considered image) with a small number of annotated images. However, multi-modal images associated with the same geographical area allow for rich characterization of RS images when jointly considered [48]. In addition, RS images usually contain areas with a high variety of semantically complex content that must be described by multiple class labels (multi-labels).

Thus, a benchmark archive consisting of multi-modal RS images annotated with multi-labels is needed. However, annotating RS images with multi-labels at a large scale to drive DL studies is time consuming, complex, and costly in operational scenarios. To overcome this problem, a common approach is to exploit DL models with proven architectures, which are pre-trained on publicly available general purpose datasets in the computer vision (CV) community. However, we argue that this is not a proper approach in RS, because of the differences in image characteristics in CV and RS. For example, Sentinel-2 multispectral images have 13 spectral bands associated with varying and lower spatial resolutions compared to the CV images. To overcome these issues, in this chapter, we introduce BigEarthNet as a large-scale multi-modal multi-label benchmark RS image archive that contains 590,326 pairs of Sentinel-2 and Sentinel-1 image patches. Each pair of patches in BigEarthNet is annotated with multi-labels provided by the CORINE Land Cover (CLC) map of 2018 (CLC 2018) [49]. The CLC nomenclature includes land cover and land use classes grouped in a three-level hierarchy, and for the BigEarthNet image patches, the most thematically detailed Level-3 class nomenclature is considered. We would like to note that some CLC classes are difficult to identify by exploiting only (single-date) images, because: i) land use concepts associated with some classes (e.g., Dump sites, Sport and leisure facilities) may not be visible from space or fully recognizable with the spatial resolution of Sentinel images; and ii) RS time series, which BigEarthNet does not include, may be required to describe and discriminate some classes (e.g., Non-irrigated arable land, Permanently irrigated land). To this end, we also introduce an alternative nomenclature for images in BigEarthNet as an evolution of the original CLC labels. The rest of the chapter is organized as follows. We first review the existing benchmark RS image archives in Section 2.2, and then introduce BigEarthNet and the alternative class-nomenclature in Section 2.3. Section 2.4 describes the experimental design, while Section 2.5 provides the experimental results. Section 2.6 draws the conclusion of this chapter.

TABLE 2.1: A LIST OF EXISTING RS IMAGE ARCHIVES

Dataset Name | Image Type | Annotation Type | Number of Images | Publication Year
UC Merced [84] | Aerial RGB | Single Label | 2,100 | 2010
UC Merced [93] | Aerial RGB | Multi Label | 2,100 | 2018
WHU-RS19 [85] | Aerial RGB | Single Label | 1,005 | 2013
RSSCN7 [86] | Aerial RGB | Single Label | 2,800 | 2015
SIRI-WHU [87] | Aerial RGB | Single Label | 2,400 | 2016
RSC11 [94] | Aerial RGB | Single Label | 1,232 | 2016
AID [88] | Aerial RGB | Single Label | 10,000 | 2017
NWPU-RESISC45 [89] | Aerial RGB | Single Label | 31,500 | 2017
RSI-CB [90] | Aerial RGB | Single Label | 36,707 | 2017
PatternNet [92] | Aerial RGB | Single Label | 30,400 | 2018
EuroSat [91] | Satellite Multispectral | Single Label | 27,000 | 2019
DFC15 [35] | Aerial RGB | Multi Label | 3,342 | 2019

2.2 Limitations of Existing Archives

Most of the existing benchmark archives in RS (UC Merced Land Use Dataset [84], WHU-RS19 [85], RSSCN7 [86], SIRI-WHU [87], AID [88], NWPU-RESISC45 [89], RSI-CB [90], EuroSat [91] and PatternNet [92]) contain a small number of single-modal RS images (e.g., multispectral or SAR) annotated with single category labels. Table 2.1 presents the list of the existing archives. These archives have become popular for the implementation, evaluation and validation of algorithms in the context of image classification, search and retrieval tasks. However, the RS community encounters critical limitations when using these archives to apply DL models. One of the most critical limitations is that the number of annotated images included in the existing archives is very small. In this respect, they are insufficient to train modern deep neural networks to reach a high generalization ability, as the models may overfit dramatically when using small training sets. In detail, such networks have a large number of parameters, which cannot be reliably learned from the limited number of images in the existing archives, and this prevents the accurate characterization of high-level features in RS images.

It is worth mentioning that annotating RS images at a large scale to drive DL studies is time consuming, complex, and costly in operational scenarios. To overcome this problem, a common approach is to exploit DL models with proven architectures (such as ResNet [95] or VGG [96]), which are pre-trained on publicly available general purpose datasets in the CV community (e.g., ImageNet [97]). The existing model is then fine-tuned on a small set of RS images to calibrate the final layers. This strategy is known as transfer learning. There are several versions of the above-mentioned models that have been pre-trained on large-scale datasets in CV. However, we argue that this is not a proper approach in RS, because of the differences in image characteristics in CV and RS. For example, Sentinel-2 multispectral images have 13 spectral bands associated with varying and lower spatial resolutions compared to the CV images. The high spectral resolution of the data can allow accurate characterization of the complex semantic content in Sentinel-2 images if it is efficiently exploited. In addition, the semantic content present in CV and RS images is significantly different, and thus the respective semantic classes differ from each other. Accordingly, fine-tuning pre-trained models for RS images may lead to weak discrimination ability for land-cover classes in RS. Thus, fine-tuning may not be generally applicable to close this semantic gap.

FIGURE 2.1: An example of BigEarthNet image pairs and their multi-labels. Label sets of the four example pairs: 1) Urban fabric, Arable land, Pastures, Complex cultivation patterns; 2) Urban fabric, Industrial or commercial units, Land principally occupied by agriculture, Mixed forest, Marine waters; 3) Urban fabric, Arable land, Land principally occupied by agriculture; 4) Urban fabric, Permanent crops, Complex cultivation patterns, Land principally occupied by agriculture, Broad-leaved forest, Moors, heathland and sclerophyllous vegetation.

Another limitation of existing archives is that they contain images annotated by single high-level category labels, which are related to the most significant content of the image. However, RS images generally contain multiple classes so that they can simultaneously be associated with different land-cover class labels (i.e., multi-labels). Hence, a benchmark archive in RS consisting of images annotated with multi-labels is required. Although the benchmark archives presented in [35], [93] contain images with multi-labels, their sample sizes are too small to be efficiently utilized for DL models.

The last limitation of existing RS image archives is that, since researchers generally do not have free access to satellite data together with their annotations, most of the benchmark archives contain only aerial images with RGB image bands as single-modal data. Although the benchmark archive proposed in [91] includes annotated satellite images, it suffers from the above-mentioned limitation on the number of images, and it only includes multispectral Sentinel-2 images. It is worth noting that multi-modal images associated with the same geographical area allow for rich characterization of RS images and thus improve image representation learning when jointly considered [48]. The lack of sufficient annotated multi-modal satellite images prevents DL-based methods from being conveniently employed for the complete understanding of the huge amount of freely accessible satellite data (e.g., Sentinel-1, Sentinel-2).

To overcome these issues, as the first large-scale multi-modal multi-label benchmark archive in RS, we introduce BigEarthNet, which contains 590,326 pairs of Sentinel-2 and Sentinel-1 image patches. Fig. 2.1 shows an example of the BigEarthNet image pairs and their multi-labels, and the archive is explained in detail in the following sections.

2.3 BigEarthNet: A Large-Scale Benchmark Archive

To overcome the limitations of existing archives, we introduce BigEarthNet (also called BigEarthNet-MM), which is the first large-scale multi-modal benchmark archive in RS. BigEarthNet contains 590,326 pairs of Sentinel-1 and Sentinel-2 image patches acquired over 10 different European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland). To construct the Sentinel-2 patches of BigEarthNet, 125 Sentinel-2 tiles associated with less than 1% of cloud cover and acquired between June 2017 and May 2018 were considered. All tiles were atmospherically corrected by employing Sentinel-2's Level 2A product generation and formatting tool (sen2cor) provided by the European Space Agency due to its proven success in the literature. After the atmospheric correction, the 10th band of each image patch is no longer available, as it is the cirrus band (which is omitted in the Level 2A output for its lack of surface information). Then, the tiles were divided into 590,326 non-overlapping image patches, each of which is a section of: 1) 120 × 120 pixels for 10m bands; 2) 60 × 60 pixels for 20m bands; and 3) 20 × 20 pixels for 60m bands. One important goal during the tile selection process was to represent all chosen geographic locations with images acquired in different seasons. Due to the difficulty of finding tiles with a low cloud cover percentage in the relatively narrow time period, this was not possible at every considered location. Accordingly, the following respective numbers of patches for autumn, winter, spring, and summer have been considered: 143557, 72877, 175937, and 126913. For the quality check of patches, visual inspection was also employed, which led to the identification of 70,987 Sentinel-2 image patches that are fully covered by seasonal snow, cloud, and cloud shadow¹. An example of those cases is shown in Fig. 2.2.

FIGURE 2.2: An example of the Sentinel-2 image patches of BigEarthNet that are fully covered by seasonal snow, cloud and cloud shadow.
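The three patch sizes quoted above follow from dividing one common ground footprint by the pixel spacing of each band group. The short sketch below reproduces that arithmetic. The 1200 m footprint is an assumption inferred here from 120 pixels at 10 m (it is not quoted in the text), and the function name `patch_size_px` is purely illustrative.

```python
# Illustrative arithmetic only: the 1200 m footprint is inferred from
# 120 pixels x 10 m and is an assumption, not a value quoted in the text.
FOOTPRINT_M = 1200


def patch_size_px(resolution_m: int, footprint_m: int = FOOTPRINT_M) -> int:
    """Number of pixels along one side of a patch at the given resolution."""
    if footprint_m % resolution_m != 0:
        raise ValueError("footprint must be a multiple of the resolution")
    return footprint_m // resolution_m


# Sentinel-2 band groups by spatial resolution (metres per pixel).
sizes = {res: patch_size_px(res) for res in (10, 20, 60)}
for res, side in sizes.items():
    print(f"{res} m bands: {side} x {side} pixels")
```

Under this assumption, every band of a patch covers the same ground area, which is why the coarser bands yield smaller pixel grids.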

¹ The lists are available at http://bigearth.net/#downloads.

To construct the Sentinel-1 patches of BigEarthNet, 325 Sentinel-1 Ground Range Detected (GRD) products acquired between June 2017 and May 2018 that jointly cover the area of all original 125 Sentinel-2 tiles with close temporal proximity were selected and processed. The selected scenes provide dual-polarized information channels (VV and VH) and are based on the interferometric wide swath (IW) mode, which is the main acquisition mode over land. All scenes were pre-processed by using the Sentinel-1 toolbox (S1TBX) and the graph processing framework (GPF) of ESA's Sentinel Application Platform (SNAP). This includes the application of precise orbit files, border and thermal noise removal, radiometric calibration, and geometric correction (i.e., Range Doppler terrain correction). Depending on the spatial extent of the scene, either the SRTM 30 (for scenes below 60° latitude) or the ASTER DEM (for scenes above 60° latitude, where no SRTM 30 exists) was employed in the geometric correction to project images from slant range to ground range. Finally, the backscatter coefficient was converted to a decibel (dB) scale. It is worth noting that, since the selection of the speckle filter is considered to be application dependent, no speckle filtering was applied in our pre-processing workflow in order to preserve the full resolution. This approach is also recommended by the Product Family Specification for SAR of the CEOS Analysis Ready Data for Land (CARD4L) framework². Based on the pre-processed Sentinel-1 scenes, for each Sentinel-2 patch, a corresponding Sentinel-1 patch with a close timestamp was extracted. In addition, each Sentinel-1 patch inherited the annotations of the corresponding Sentinel-2 patch. The resulting Sentinel-1 image patches have a spatial resolution of 10m.
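The last step of the Sentinel-1 workflow, the conversion to a decibel scale, can be sketched as follows. This is a generic illustration of the standard relation dB = 10 · log10(σ⁰) for a linear-scale backscatter coefficient; it is not the SNAP operator used in the pre-processing workflow. The function name `to_db` and the `floor` guard against taking the logarithm of zero are assumptions made for this example.

```python
import math


def to_db(sigma0_linear: float, floor: float = 1e-10) -> float:
    """Convert a linear-scale backscatter coefficient to decibels.

    `floor` is a hypothetical guard against log10(0); it is not taken
    from the thesis or from SNAP.
    """
    return 10.0 * math.log10(max(sigma0_linear, floor))


# A 2x2 toy patch of linear backscatter values (made-up numbers).
patch = [[1.0, 0.1], [0.01, 0.5]]
patch_db = [[to_db(value) for value in row] for row in patch]
print(patch_db)
```

The logarithmic scale compresses the large dynamic range of SAR backscatter, which is the usual motivation for applying it before feeding the VV and VH channels to a network.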

Each pair (which is made up of Sentinel-1 and Sentinel-2 image patches acquired in the same geographical area) in BigEarthNet is associated with one or more class labels (i.e., multi-labels) extracted from the CORINE land cover map of 2018. CORINE land cover (CLC) is a pioneering initiative launched in the 1980s to produce harmonized land use land cover (LULC) maps in vector format for the member states of the European Union. According to the validation report of the CLC, the accuracy is around 85% [98]. Nowadays, CLC covers 39 countries from Europe and has been produced for five reference years: 1990, 2000, 2006, 2012, and 2018. The last of these was produced with data of 2017-2018, which matches the time frame of the images included in BigEarthNet. The motivation for embracing such a large-scale mapping endeavor was to meet the demand for spatially explicit and harmonized information on land for a variety of purposes, such as environmental management and decision making. The limited state of the art of 1980s technology and the large spectrum of potential uses of the maps led to the definition of a coarse spatial resolution and a nomenclature with some broad class definitions, mixing land cover and land use concepts. These definitions are implemented for map production by visual interpretation of RS images and additional data in most countries. Additional data may include very high spatial resolution imagery and official spatial data sets like land registers, often to infer the land use. The same technical specifications were preserved in map updating for historical consistency. Thus, the five CLC maps produced have a minimum mapping unit of 25 ha and a minimum mapping width of 100 m, and provide information on land according to a hierarchical nomenclature of 44 classes at the most detailed level (Level-3). The image patches in BigEarthNet are representative of 43 CLC classes. When CLC maps are considered as labeling sources for training machine learning methods to automatically analyse RS images, modified versions of the CLC nomenclature (which better fit the purpose of the considered application) are commonly preferred. One of the main reasons is that RS systems directly observe the land cover rather than the land use. The CLC land-use based labels may not be fully recognizable in RS images unless the images have a very high spatial resolution. As an example, in [99] CLC is used as a basis to collect training data for supervised RS image classification, but classes such as Discontinuous urban fabric and Sport and leisure facilities that depend

mainly on land use were removed. A deep revision of the CLC program is currently under consideration following the concept of the EIONET Action Group on Land monitoring in Europe (EAGLE) [100].

² https://ceos.org/ard/

TABLE 2.2: THE LIST OF CLASSES WITHIN CLC AND PROPOSED CLASS NOMENCLATURES AND THEIR ASSOCIATED NUMBERS OF IMAGE PAIRS. THESE NUMBERS ARE OBTAINED AFTER ELIMINATING SENTINEL-2 IMAGE PATCHES THAT ARE FULLY COVERED BY SEASONAL SNOW, CLOUD, AND CLOUD SHADOW.

CLC Class-Nomenclature | Number of Image Pairs (CLC) | 19 Classes Nomenclature | Number of Image Pairs: Total | Training | Validation | Test
Continuous urban fabric | 10,766 | Urban fabric | 74,891 | 38,783 | 18,180 | 17,928
Discontinuous urban fabric | 65,894 | Urban fabric | " | " | " | "
Industrial or commercial units | 11,865 | Industrial or commercial units | 11,865 | 6,182 | 2,875 | 2,808
Road and rail networks and associated land | 3,269 | removed | - | - | - | -
Port areas | 453 | removed | - | - | - | -
Airports | 820 | removed | - | - | - | -
Mineral extraction sites | 4,225 | removed | - | - | - | -
Dump sites | 822 | removed | - | - | - | -
Construction sites | 1,081 | removed | - | - | - | -
Green urban areas | 1,651 | removed | - | - | - | -
Sport and leisure facilities | 4,983 | removed | - | - | - | -
Non-irrigated arable land | 183,987 | Arable land | 194,148 | 100,394 | 46,604 | 47,150
Permanently irrigated land | 13,571 | Arable land | " | " | " | "
Rice fields | 3,793 | Arable land | " | " | " | "
Vineyards | 9,524 | Permanent crops | 29,350 | 15,862 | 6,676 | 6,812
Fruit trees and berry plantations | 4,672 | Permanent crops | " | " | " | "
Olive groves | 12,503 | Permanent crops | " | " | " | "
Annual crops associated with permanent crops | 7,019 | Permanent crops | " | " | " | "
Pastures | 98,997 | Pastures | 98,997 | 50,981 | 23,846 | 24,170
Complex cultivation patterns | 104,203 | Complex cultivation patterns | 104,203 | 53,534 | 25,031 | 25,638
Land principally occupied by agriculture, with significant areas of natural vegetation | 130,637 | Land principally occupied by agriculture, with significant areas of natural vegetation | 130,637 | 67,260 | 31,325 | 32,052
Agro-forestry areas | 30,649 | Agro-forestry areas | 30,649 | 15,790 | 7,598 | 7,261
Broad-leaved forest | 141,300 | Broad-leaved forest | 141,300 | 73,411 | 33,759 | 34,130
Coniferous forest | 164,775 | Coniferous forest | 164,775 | 86,569 | 38,674 | 39,532
Mixed forest | 176,567 | Mixed forest | 176,567 | 91,930 | 41,996 | 42,641
Natural grassland | 11,141 | Natural grassland and sparsely vegetated areas | 12,022 | 6,663 | 2,560 | 2,799
Sparsely vegetated areas | 1,202 | Natural grassland and sparsely vegetated areas | " | " | " | "
Moors and heathland | 5,073 | Moors, heathland and sclerophyllous vegetation | 16,267 | 8,438 | 3,970 | 3,859
Sclerophyllous vegetation | 11,241 | Moors, heathland and sclerophyllous vegetation | " | " | " | "
Transitional woodland-shrub | 148,950 | Transitional woodland-shrub | 148,950 | 77,593 | 35,146 | 36,211
Beaches, dunes, sands | 1,536 | Beaches, dunes, sands | 1,536 | 1,197 | 118 | 221
Bare rock | 2,894 | removed | - | - | - | -
Burnt areas | 304 | removed | - | - | - | -
Inland marshes | 5,516 | Inland wetlands | 22,100 | 11,620 | 5,131 | 5,349
Peatbogs | 16,667 | Inland wetlands | " | " | " | "
Salt marshes | 1,339 | Coastal wetlands | 1,566 | 1,037 | 219 | 310
Salines | 424 | Coastal wetlands | " | " | " | "
Intertidal flats | 962 | removed | - | - | - | -
Water courses | 9,792 | Inland waters | 67,277 | 35,349 | 15,751 | 16,177
Water bodies | 58,009 | Inland waters | " | " | " | "
Coastal lagoons | 1,495 | Marine waters | 74,877 | 39,114 | 17,740 | 18,023
Estuaries | 1,064 | Marine waters | " | " | " | "
Sea and ocean | 72,522 | Marine waters | " | " | " | "

To do more justice to the properties of BigEarthNet image pairs, we introduce a new class-nomenclature by modifying the multi-labels extracted from the CLC 2018. To this end, the CLC Level-3 nomenclature is interpreted and arranged in a new nomenclature of 19 classes (see Table 2.2). Ten classes of the original CLC nomenclature are maintained in the new nomenclature, 22 classes are grouped into 9 new classes, and 11 classes are removed. The classes maintained are semantically homogeneous and largely related to land cover, such as Broad-leaved forest and Beaches, dunes, sands. Furthermore, CLC classes that cannot be identified by using only single-date BigEarthNet images are removed, such as Burnt areas. Complex classes (which are often removed when undertaking image classification) are maintained, such as Complex cultivation patterns and Land principally occupied by agriculture, with significant areas of natural vegetation. The goal is to investigate the ability of DL models to learn from spatial patterns that express semantic classes. Classes are grouped when sharing similar land cover types and spectral patterns. For example, Moors and heathland and Sclerophyllous vegetation are grouped in a single class, and a new class, Arable land, groups similar crops that require dense time series (which are not available in BigEarthNet) for their discrimination (e.g., irrigated and non-irrigated crops). Classes that strongly depend on land use or need additional data for their discrimination are removed. For example, the class Airports essentially relates to land use, and Intertidal flats appear in RS images either with or without water depending on the image acquisition time and hence require appropriate time series for their classification. The number of labels associated with each image pair varies between 1 and 12, while 96.80% of image pairs are not associated with more than 5 labels. Only 23 image pairs are annotated with more than 9 labels.
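The relabelling described above amounts to a many-to-one mapping from CLC Level-3 labels to the new nomenclature, followed by de-duplication. The sketch below illustrates the idea on a subset of the classes in Table 2.2, assuming that labels are plain strings. The names `CLC_TO_19`, `REMOVED` and `to_19_classes` are hypothetical and are not taken from the BigEarthNet tools.

```python
# Partial, illustrative mapping based on Table 2.2 (not the complete table).
CLC_TO_19 = {
    "Continuous urban fabric": "Urban fabric",
    "Discontinuous urban fabric": "Urban fabric",
    "Non-irrigated arable land": "Arable land",
    "Permanently irrigated land": "Arable land",
    "Rice fields": "Arable land",
    "Moors and heathland": "Moors, heathland and sclerophyllous vegetation",
    "Sclerophyllous vegetation": "Moors, heathland and sclerophyllous vegetation",
    "Broad-leaved forest": "Broad-leaved forest",  # maintained as is
}
# Examples of CLC classes that have no counterpart in the new nomenclature.
REMOVED = {"Airports", "Burnt areas", "Intertidal flats"}


def to_19_classes(clc_labels):
    """Map a list of CLC Level-3 labels to sorted, de-duplicated new labels."""
    new_labels = set()
    for label in clc_labels:
        if label in REMOVED:
            continue  # removed classes contribute no label
        new_labels.add(CLC_TO_19[label])  # a KeyError flags an unmapped class
    return sorted(new_labels)


print(to_19_classes(["Rice fields", "Non-irrigated arable land", "Airports"]))
print(to_19_classes(["Burnt areas"]))
```

In this sketch, a pair whose CLC labels all belong to removed classes ends up with an empty label list, which is one way in which pairs without any label in the new nomenclature can arise.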

2.4 Experimental Design

The experiments were carried out in the context of content-based multi-modal multi-label RS image retrieval and classification. To achieve multi-modal learning, we stacked the VV and VH bands of Sentinel-1 image patches, and the Sentinel-2 bands associated with 10m and 20m spatial resolution into one volume for each pair in BigEarthNet. To this end, we initially applied cubic interpolation to the 20m bands of Sentinel-2 image patches. In the experiments, we did not use the Sentinel-2 image bands associated with 60m spatial resolution (bands 1 and 9). This is due to the fact that these bands are mainly used for cloud screening, atmospheric correction, and cirrus detection in RS applications and do not embody a significant amount of information for the characterization of semantic content of RS images. In the experiments, we considered the VGG model [96] and the ResNet model [95] with various numbers of layers (VGG16, VGG19, ResNet50, ResNet101, ResNet152). To fairly compare all models, we utilized the Adam optimizer [101] with an initial learning rate of 10⁻³ to minimize the sigmoid cross-entropy loss. Except for the learning rate, we employed the same parameter values given in [95], [96]. The batch size is set to 256 for ResNet152 and to 500 for all other models used in the experiments. We applied training from scratch for 100 epochs, while the final layers of the pre-trained models were fine-tuned separately on each modality for 10 epochs. For all the models, we added a fully connected layer that includes 19 neurons at the end of the network for the classification. For image retrieval, we extracted image features from the considered models and applied similarity matching of the features based on the χ²-distance measure. We performed various experiments to analyze the effectiveness of: i) learning from BigEarthNet directly (through training from scratch) instead of using the pre-trained models on ImageNet; and ii) state-of-the-art CNN models

trained and evaluated on BigEarthNet. To use the pre-trained models on ImageNet,

Chapter 2. BigEarthNet: A Large Scale Benchmark Archive 20

TABLE 2.3: CLASS-BASED $F_2$ SCORES (%) OBTAINED WHEN: I) TRANSFER LEARNING FROM IMAGENET AND II) DIRECT LEARNING FROM BIGEARTHNET ARE USED FOR MULTI-MODAL MULTI-LABEL IMAGE CLASSIFICATION.

Class                                             Transfer Learning from ImageNet   Learning from BigEarthNet
Urban fabric                                      56.27                             71.99
Industrial or commercial units                    30.98                             43.21
Arable land                                       80.05                             83.62
Permanent crops                                   4.32                              55.52
Pastures                                          50.98                             74.77
Complex cultivation patterns                      36.29                             62.03
Land principally occupied by agriculture, with    30.36                             60.63
  significant areas of natural vegetation
Agro-forestry areas                               2.13                              71.87
Broad-leaved forest                               42.83                             75.39
Coniferous forest                                 75.47                             86.32
Mixed forest                                      72.19                             81.31
Natural grassland and sparsely vegetated areas    14.11                             43.88
Moors, heathland and sclerophyllous vegetation    5.29                              59.91
Transitional woodland-shrub                       41.23                             64.21
Beaches, dunes, sands                             43.67                             63.39
Inland wetlands                                   8.20                              57.81
Coastal wetlands                                  4.79                              42.23
Inland waters                                     63.23                             82.10
Marine waters                                     93.99                             97.20
Average                                           39.81                             67.23

we used the late fusion of separately ﬁne-tuned models on Sentinel-1 and Sentinel-2

patches. In the experiments, we did not use the Sentinel-2 patches that are fully

covered by seasonal snow, cloud, and cloud shadow. After the arrangement of the
new class nomenclature, 57 of the 590,326 pairs are not associated with any
LULC labels, so these pairs are not used in the experiments. We divided the remaining
dataset into: i) a training set of 269,695 pairs of patches; ii) a validation set of 123,723
pairs of patches; and iii) a test set of 125,866 pairs of patches.
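The retrieval step described earlier in this section (similarity matching of image features based on the chi-square distance) can be illustrated with a minimal sketch. The exact form of the distance is not given in the text, so the symmetric variant with a factor of 1/2 and the small constant `eps` are assumptions, as are the function names and the toy feature vectors.

```python
def chi2_distance(a, b, eps=1e-10):
    """Chi-square distance between two non-negative feature vectors.

    Assumption: the symmetric form 0.5 * sum((a - b)^2 / (a + b)) is used;
    eps only guards against division by zero.
    """
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


def retrieve(query_feature, archive_features, top_k=3):
    """Return the indices of the top_k archive features closest to the query."""
    order = sorted(
        range(len(archive_features)),
        key=lambda idx: chi2_distance(query_feature, archive_features[idx]),
    )
    return order[:top_k]


# Toy archive of three 2-D features (hypothetical values).
archive = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
ranking = retrieve([0.8, 0.2], archive, top_k=3)
```

In this toy case the third archive feature is ranked first because it is closest to the query under the chi-square distance.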

We performed our experiments on a cluster of 4 NVIDIA Tesla V100 GPUs. The

results of multi-modal multi-label image classification are reported in terms of four
performance metrics: 1) Hamming loss (HL); 2) one-error (OE); 3) recall (R); and 4)
$F_2$-score ($F_2$).
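The four metrics can be sketched as follows under their usual multi-label definitions, which the text does not spell out and which are therefore assumptions here: Hamming loss is the fraction of wrongly predicted label entries, one-error is the fraction of images whose top-scored label is not a true label, and recall and the $F_2$-score are averaged over samples. Function names and toy values are illustrative only.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label entries that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / total


def one_error(y_true, scores):
    """Fraction of samples whose highest-scored label is not a true label."""
    misses = 0
    for yt, s in zip(y_true, scores):
        top = max(range(len(s)), key=lambda j: s[j])
        if yt[top] == 0:
            misses += 1
    return misses / len(y_true)


def sample_averaged_recall_and_fbeta(y_true, y_pred, beta=2.0):
    """Sample-averaged recall and F-beta score (beta=2 gives the F2-score)."""
    b2 = beta * beta
    recalls, fbetas = [], []
    for yt, yp in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        denom = (1 + b2) * tp + b2 * fn + fp
        fbetas.append((1 + b2) * tp / denom if denom else 0.0)
    n = len(y_true)
    return sum(recalls) / n, sum(fbetas) / n


# Toy example: two images, three labels, predictions thresholded at 0.5.
y_true = [[1, 0, 1], [0, 1, 0]]
scores = [[0.9, 0.2, 0.4], [0.6, 0.3, 0.1]]
y_pred = [[int(s >= 0.5) for s in row] for row in scores]

hl = hamming_loss(y_true, y_pred)
oe = one_error(y_true, scores)
recall, f2 = sample_averaged_recall_and_fbeta(y_true, y_pred)
```

Because the $F_2$-score weights recall more heavily than precision, missed labels are penalized more strongly than spurious ones.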

2.5 Experimental Results

2.5.1 Comparison among the Strategies of Learning Directly from BigEarthNet and Transfer Learning from ImageNet

In the first set of experiments, we compare the effectiveness of learning directly from
BigEarthNet with that of transfer learning from ImageNet. To this end, the transfer
learning strategy is applied by using the ResNet50 model pre-trained on
ImageNet, while the direct learning strategy is employed by using the ResNet50 trained
from scratch on BigEarthNet. Table 2.3 shows the class-based $F_2$ classification scores
(also known as macro-averaged $F_2$ scores). By analyzing the table, one can see that
learning directly from BigEarthNet achieves a higher score for every class than
the transfer learning strategy. As an example, learning directly from BigEarthNet
provides scores that are more than 12 and 25 percentage points higher for the classes Industrial or commercial
units and Complex cultivation patterns, respectively, compared to the transfer learning
strategy. The difference in performance between these learning strategies is more
evident for more complex LULC classes. As an example, learning directly from
BigEarthNet improves the $F_2$ scores by more than 54 and 69 percentage points for the classes Moors,
heathland and sclerophyllous vegetation and Agro-forestry areas, respectively.

[Figure 2.3 (image not reproduced): a query image pair and the 1st, 5th and 100th retrieved image pairs, each with its multi-labels from the 19-class nomenclature, for transfer learning from ImageNet and for learning directly from BigEarthNet-MM.]

FIGURE 2.3: An example of a query pair from the BigEarthNet archive and retrieved image
pairs obtained by using: 1) direct learning from BigEarthNet; and 2) transfer learning from
ImageNet in the framework of content-based multi-modal multi-label image retrieval.

In the context of image retrieval, Fig. 2.3 shows an example of a query pair and
the pairs of images retrieved by these strategies. By assessing the figure, one can
observe that when learning is achieved directly from BigEarthNet, semantically
more similar pairs of images are retrieved, containing the Urban fabric and Arable land
classes present in the query. Learning directly from BigEarthNet leads to the retrieval of
a pair similar to the query even at the 100th retrieval order. However, using the transfer
learning strategy results in retrieval of pairs that contain Urban fabric and Arable land
classes which are not present in the query pair. One can observe this behavior even
at the 5th retrieved pair.

The success of learning directly from BigEarthNet is mainly due to the
facts that: 1) transfer learning from ImageNet limits the accurate characterization of
the spectral content of RS images; 2) fine-tuning the model pre-trained on ImageNet
by using RS images may not be sufficient to eliminate the semantic gap since the

category labels present in ImageNet are different from the land-cover class labels

present in BigEarthNet; and 3) the pre-trained model was trained for a single-label

image classiﬁcation scenario, and thus limits the accurate characterization of the

multiple land cover classes present in BigEarthNet.


TABLE 2.4: OVERALL MULTI-MODAL MULTI-LABEL CLASSIFICATION RESULTS UNDER DIFFERENT METRICS AND DL MODELS FOR BIGEARTHNET.

Model        HL       OE (%)   R (%)    F2 (%)
VGG16        0.078    7.35     76.97    76.18
VGG19        0.080    8.12     76.17    75.35
ResNet50     0.074    5.93     80.05    78.73
ResNet101    0.074    6.46     78.85    77.88
ResNet152    0.073    6.42     78.13    77.46

2.5.2 Comparison of State-of-the-Art CNN Models

In the second set of experiments, we compare the effectiveness of the VGG and the

ResNet models in the framework of multi-modal multi-label classiﬁcation. Table 2.4

shows the overall classiﬁcation results under different metrics (which are the sample-

averaged scores). By analyzing the table, one can observe that the ResNet models
provide the best scores in all metrics. As an example, ResNet50 achieves more
than 2% higher recall and $F_2$ scores compared to the VGG models. This improvement
is due to the residual connections of the ResNet models and their increased depth in
terms of the number of layers compared to the VGG models. Increasing the depth of
the considered models does not significantly affect the performance, i.e., similar
scores are obtained in all the metrics under different depth values of the same model.

2.6 Conclusion

In this chapter, we have presented the BigEarthNet benchmark archive that contains

590,326 pairs of Sentinel-1 and Sentinel-2 image patches with a new CLC-based

class nomenclature to do more justice to the properties of the considered images.

BigEarthNet makes a signiﬁcant advancement for the use of DL in RS, opening up

promising directions to support research studies in the framework of multi-modal

multi-label RS image scene classification and retrieval. BigEarthNet is suitable to
assess DL-based methods for: i) learning from class-imbalanced multi-modal data
(since the LULC classes are not equally represented in BigEarthNet); ii) transfer
learning (since BigEarthNet currently contains only pairs of images from a small number
of European countries); and iii) unsupervised, self-supervised and
semi-supervised multi-modal learning for information discovery from big data archives.

We would like to note that Sentinel-1 image patches of BigEarthNet (denoted as

BigEarthNet-S1 hereafter) and Sentinel-2 image patches of BigEarthNet (denoted

as BigEarthNet-S2 hereafter) can be also separately employed for single-modal RS

image understanding problems.

It is worth noting that BigEarthNet has limitations for the RS applications that

require time-series data to accurately describe LULC classes, such as Non-irrigated

arable land and Permanently irrigated land. We would also like to note that some Sentinel-1

image patches can be contaminated by artefacts caused by either well-known Radio-

Frequency-Interference [102] or other dataset related issues, which are independent

from the pre-processing steps applied while constructing BigEarthNet. As a ﬁnal

remark, we would like to point out that due to the use of labels from the CLC


map, the BigEarthNet archive can be extended to a larger scale within Europe with

zero-annotation cost. As a future development of this work, we plan to enrich the

BigEarthNet archive by extending it to the whole of Europe.

Chapter 3

A Deep Multi-Attention Driven Approach for Multi-Label Remote Sensing Image Classification

DL-based IRL methods have become popular in the framework of RS image
scene classification problems. Most of the existing methods assume that training
images are annotated by single labels; however, RS images typically contain multiple
classes and thus can simultaneously be associated with multi-labels. Despite the
success of existing methods in describing the information content of very high
resolution aerial images with RGB bands, any direct adaptation for high-dimensional
high-spatial resolution RS images falls short of accurately modeling the spectral and
spatial information content. To address this problem, this chapter presents a novel

approach in the framework of multi-label classiﬁcation of high dimensional RS

images. The proposed approach is based on three main steps. The ﬁrst step describes

the complex spatial and spectral content of image local areas by a novel $K$-Branch
CNN that includes spatial resolution specific CNN branches. The second step initially

characterizes the importance scores of different local areas of each image and then

deﬁnes a global descriptor for each image based on these scores. This is achieved

by a novel multi-attention strategy that utilizes the bidirectional long short-term

memory networks. The ﬁnal step achieves the classiﬁcation of RS image scenes with

multi-labels. Experiments carried out on BigEarthNet show the effectiveness of the

proposed approach in terms of multi-label classiﬁcation accuracy compared to the

state-of-the-art approaches. The code of the proposed approach is publicly available
at https://gitlab.tubit.tu-berlin.de/rsim/MAML-RSIC. This chapter is mainly
based on the following publications:

• G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946, 2020. DOI: 10.1109/ACCESS.2020.2995805.

• G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-label remote sensing image classification,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI: 10.1109/IGARSS.2019.8898188.

Chapter 3. A Deep Multi-Attention Driven Approach 25

3.1 Introduction

In recent years, DL-based IRL has attracted the attention of RS researchers for the

development of RS image scene classiﬁcation methods, which aim at automatically

assigning class labels to each image scene in an RS archive. As an example, in [103] a

gradient boosting random convolutional network is proposed as an ensemble frame-

work to combine several deep neural networks for RS image scene classiﬁcation

problems. In [104] feature learning strategies deﬁned based on different training

procedures for convolutional neural networks (CNNs) are analyzed. In [105] a region

attention network, which assigns attention scores to candidate regions for the expected
object locations, is introduced to learn the alignment of RS image scenes. To this
end, different image sources are used together for the identification of fine-grained

categories. In [106] a semi-supervised approach based on a generative adversarial

network is proposed for cases in which the amount of annotated training data is

insufﬁcient. In [107] an intermediate feature aggregation method that progressively

combines the different level features of CNNs is proposed. In [108] a scale-free CNN

that transfers the fully connected layers in a pre-trained CNN model to convolutional

layers and then uses a general average pooling layer after the ﬁnal convolutional

layer is introduced. The above-mentioned DL based approaches in RS assume that

each training image is annotated by a single (broad category) label, which is associated
with the most significant content of the image. However, this assumption may not
be appropriate for complex scene classification applications where RS image scenes
contain multiple land-cover classes and are thus simultaneously associated with different
class labels (i.e., multi-labels).

To train DL models with training images annotated by multi-labels, a few DL-based
multi-label scene classification methods have recently been introduced in RS. In [109]

a radial basis function neural network is applied on the CNN features of aerial

images as a multi-label classiﬁer. In [110] a structured support vector machine that

models the spatial contiguity is utilized based on the CNN features of the aerial

images in the framework of multi-label classiﬁcation. In these approaches, CNNs are

used as conventional transfer learning approaches, for which pre-trained models on

publicly available general purpose computer vision (CV) datasets (e.g., ImageNet)

act as ﬁxed feature extractors without changing the model parameters. However,

this approach can reduce the multi-label scene classiﬁcation accuracy because of

the differences in image characteristics in CV and RS. In [111] a data augmentation

strategy is introduced to avoid using a pre-trained network for an end-to-end training

of a shallow CNN. In this approach, to adapt the standard CNN architecture to
multi-label learning, the softmax function of the classification layer is replaced with a
sigmoid function. The direct use of standard CNNs that are actually designed for the

images annotated by single-labels is a common approach in multi-label classiﬁcation

problems. However, it may lead to inaccurate identiﬁcation of the multiple classes

present in images. To overcome this limitation, integration of sequential neural

network approaches into CNN architectures is introduced in RS. In [35] a class-

wise attention-based recurrent neural network (RNN) is introduced to sequentially

model the co-occurrence relationship of multiple classes. In this approach, class

predictions are obtained one after another in the RNN sequence and each prediction

is based on the decisions made until the corresponding class is reached. In [36]


an attention-aware label relational reasoning network is proposed to: i) localize

discriminative regions of aerial images; and ii) characterize the label relations present

in the images based on the localized feature maps. In [37] an encoder-decoder

neural network is introduced to characterize the aerial image features. In detail, a

squeeze excitation layer is used for modeling the channel-wise interdependencies of

the feature maps in the encoder, whereas an RNN-based decoder is exploited as an
adaptive spatial attention mechanism. The attention strategies proposed in [35], [36]
and [37] identify informative areas of images through an attention map based on
the feature maps of convolutional layers. These strategies are effective for very high
resolution aerial images; however, they can be insufficient for accurately describing
the complex content of satellite RS images with high spatial resolution (e.g., Sentinel-2
and Landsat multispectral images). Results obtained on very high resolution

aerial images with only RGB bands show the success of these strategies for the

description of the spatial image content. A direct adaptation of these methods for

high dimensional RS images may lead to an incomplete representation of the spectral

information content. These issues are critical particularly for images with several

spectral bands with varying spatial resolutions acquired by the new generation

satellites (e.g., Sentinel-2). Thus, methods that can efﬁciently and effectively describe

the spatial and spectral information content of high dimensional RS images are

needed in the framework of multi-label RS image scene classiﬁcation.

To address this problem, we propose a DL based approach that aims at accurately

describing complex spatial and spectral content of RS images in the framework of

multi-label RS image scene classiﬁcation. To this end, the proposed approach is

based on three main steps: 1) spatial and spectral characterization of image local

areas; 2) deﬁnition of a multi-attention driven global descriptor; and 3) classiﬁcation

of RS image scenes with multi-labels. The proposed approach assumes that RS

image bands can be associated with varying spatial resolutions and a set of training

images annotated with multi-labels (based on land-cover land-use classes present

in the images) is available. In the first step, we introduce a novel branch-wise CNN
architecture (called the $K$-Branch CNN) that efficiently describes the complex
content of local areas of each image by different CNN branches specialized according
to the spatial resolutions of image bands. In the second step, we present a novel

multi-attention strategy in the framework of RNNs that: i) accurately identiﬁes

importance levels (i.e., scores) for different local areas; and then ii) deﬁnes a global

descriptor for each image based on these scores. In the third step, multi-labels are

automatically assigned to each RS image represented by the global descriptors. The

main novelty of the proposed approach consists in the design and development of:

i) the $K$-Branch CNN to efficiently model the complex information content of RS
images for which the spectral bands can be associated with varying spatial resolutions;

and ii) the multi-attention strategy that deﬁnes a global image descriptor based on

the extraction and exploitation of importance scores of image local areas. In order

to evaluate the performance of the proposed approach, several experiments are

carried out on BigEarthNet-S2. The experimental results show the success of the
proposed approach compared to the conventional DL-based methods in RS, which
consider all the image bands as a single volume (after applying an interpolation
method to the lower spatial resolution bands) and define a global descriptor by
neglecting the importance scores of different local areas. The rest of the chapter is organized as
follows. Section 3.2 introduces the proposed approach, while Section 3.3 explains
the design of experiments. Section 3.4 provides the experimental results. Section 3.5
draws the conclusion of this chapter.

[Figure 3.1 (image not reproduced): three consecutive blocks: spatial and spectral content characterization of local areas; definition of a multi-attention driven global descriptor; classification of RS image scenes with multi-labels.]

FIGURE 3.1: Block diagram of the proposed approach for multi-label RS image scene classification.

3.2 Proposed Approach

Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an archive that consists of $M$ images, where $x_i$ is the $i$th image.
We assume that a set $\mathcal{T} \subset \mathcal{X}$ of labeled images is initially available. Each image in
$\mathcal{T}$ is associated with multi-labels from a label set $\mathcal{L} = \{l_1, \ldots, l_S\}$, where $|\mathcal{L}| = S$.
Label information of $x_i \in \mathcal{T}$ is defined by a binary vector $y_i \in \{0, 1\}^S$, where each
element of $y_i$ indicates the presence or absence of label $l_s \in \mathcal{L}$ in a sequence. We also
assume that spectral bands of each image $x_i$ can be associated with $K$ different spatial
resolutions, resulting in different pixel sizes. We aim to learn $F(x^*; \theta) = g(f(x^*; \theta))$
that maps a new image $x^*$ to multi-labels, where $f(\cdot)$ generates classification scores
for each label, $g(\cdot)$ produces $y^*$ as a predicted label set, and $\theta$ is the given
set of model parameters. We propose a multi-label RS image scene classification
approach made up of three main steps: 1) spatial and spectral characterization of
image local areas by a novel $K$-Branch CNN; 2) definition of a multi-attention driven
global descriptor with a novel multi-attention strategy; and 3) classification of RS
image scenes with multi-labels. Fig. 3.1 presents the block diagram of the proposed
approach and each step is explained in the following sub-sections.

3.2.1 Spatial and Spectral Characterization of Local Areas

To efficiently characterize the spatial and spectral content of image local areas, each
RS image is initially divided into $R$ non-overlapping $w \times w$ sized local areas. Let $\rho_i^r$
be the $r$th local area of $x_i$. Then, for each local area, we define different sets of image
bands based on their spatial resolutions. Let $\rho_{i,k}^r$ be the $k$th subset of the $r$th local area
for the corresponding spatial resolution, where $k \in \{1, 2, \ldots, K\}$ and $r \in \{1, 2, \ldots, R\}$.
To accurately describe the local areas with varying spatial resolutions, we introduce the
$K$-Branch CNN that utilizes separate CNNs, each of which is designed to describe
the local areas of image bands with different spatial resolutions. Thus, the number
of CNN branches is selected as the total number of different spatial resolutions.
If all spectral bands are associated with the same spatial resolution, the proposed
$K$-Branch CNN turns into a single branch CNN (i.e., $K = 1$). Each $\rho_{i,k}^r$ is fed into
a different branch of the $K$-Branch CNN. Let $\phi_k$ be the $k$th branch that provides local
descriptors associated with the $k$th spatial resolution by applying convolutional layers
and a fully connected (FC) layer. Different local descriptors for all sets of image
bands are first characterized and then concatenated into one vector for one local area.
To effectively combine information from different branches, all concatenated feature
vectors are fed into a new FC layer to produce the local descriptors $\psi_{i,r}$. This step is
illustrated in Fig. 3.2.

[Figure 3.2 (image not reproduced): division of an image into non-overlapping local areas, grouping of the bands for each spatial resolution, and one CNN per band group.]

FIGURE 3.2: The proposed $K$-Branch CNN introduced in the first step of the proposed
approach. One local area is highlighted as an example to feed into the corresponding CNN.

The proposed $K$-Branch CNN describes the complex information content of image
local areas through specific branches associated with different spatial resolutions. In
this way, a unique CNN is used for the image bands with the same spatial resolution,
unlike the traditional CNN-based methods in RS (which consider all the image
bands as a single volume after applying interpolation to the low spatial resolution
bands). On the one hand, this approach leads to an accurate characterization of the
content of high dimensional RS images. On the other hand, due to modeling the local
areas, it requires a smaller number of model parameters to be estimated. Thus, the
computational complexity of the training phase is reduced, while the risk of over-fitting
on training data with low generalization capability is reduced (since smaller neural
networks have less tendency for over-fitting).
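The data preparation implied by this step (dividing each band into non-overlapping local areas and grouping the bands by spatial resolution, so that each group can be fed to its own branch) can be sketched as below. This is only an illustration under stated assumptions: bands are square 2-D lists, the local areas form a regular grid, and the helper names `split_into_local_areas` and `group_local_areas` are hypothetical. The CNN branches themselves are not modeled.

```python
def split_into_local_areas(band, grid):
    """Split a square 2-D band (list of rows) into grid*grid non-overlapping tiles.

    Tiles are returned in row-major order, so tile r corresponds to local area r.
    """
    size = len(band)
    if size % grid != 0:
        raise ValueError("band size must be divisible by the grid size")
    w = size // grid  # side length of one local area at this resolution
    tiles = []
    for gy in range(grid):
        for gx in range(grid):
            rows = band[gy * w:(gy + 1) * w]
            tiles.append([row[gx * w:(gx + 1) * w] for row in rows])
    return tiles


def group_local_areas(bands_by_resolution, grid):
    """Return areas[r][k]: the tiles of local area r for the k-th resolution group.

    bands_by_resolution holds one list of bands per spatial resolution (K lists).
    """
    n_areas = grid * grid
    areas = [[] for _ in range(n_areas)]
    for bands in bands_by_resolution:
        tiles_per_band = [split_into_local_areas(band, grid) for band in bands]
        for r in range(n_areas):
            areas[r].append([tiles[r] for tiles in tiles_per_band])
    return areas


# Toy image: one 4x4 band at a finer resolution, one 2x2 band at a coarser one.
high = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
low = [[0, 1], [2, 3]]
areas = group_local_areas([[high], [low]], grid=2)
```

Each entry `areas[r][k]` would be the input of the $k$th branch for the $r$th local area; the tile size differs per resolution while the number of local areas stays the same.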

3.2.2 Definition of a Multi-Attention Driven Global Descriptor

After obtaining the local descriptors $\{\psi_{i,r}\}_{r=1}^{R}$ in the first step, a global descriptor
can be defined by simply stacking all local descriptors. In this way, local descriptors
equally contribute to the definition of a global descriptor. However, local areas of an
RS image can be subject to different levels (i.e., scores) of importance to represent the
semantic content of the image. Accordingly, this step aims at accurately extracting
and exploiting importance levels of local areas of each image, while defining a global
image descriptor. To this end, we introduce a novel multi-attention strategy that is
defined based on long short-term memory (LSTM) networks [112].

[Figure 3.3 (image not reproduced): a single LSTM cell with its sigmoid and tanh operations, followed by two LSTM cells in a sequence.]

FIGURE 3.3: Single LSTM cell with its inputs, gates and cell state followed by two LSTM cells
in a sequence. Without loss of generality, a particular sequence of the LSTM network (which
starts with the first local area and ends with the last local area) is chosen in the figure.

An LSTM network contains sequentially ordered LSTM nodes (i.e., cells). Each
cell includes an input gate ($i_r$), a forget gate ($f_r$), an output gate ($o_r$) and a cell state ($c_r$). Cell

state characterizes the knowledge of observed inputs until the corresponding cell.

Different gates control how the cell state should behave according to different aims.

Forget gate decides which portion of the current cell state value should be forgotten.

Input gate controls which portion of the input should be read by cell state. Output

gate decides which portion of the cell state should be produced as the output of the

new cell state. The reader is referred to [113] for the detailed explanation. In the

proposed approach, each LSTM cell takes the descriptor of the $r$th local area ($\psi_{i,r}$) from
the $K$-Branch CNN as input and employs the aforementioned operations as follows:

$$
\begin{aligned}
f_r &= \delta(W_{f,r}\psi_{i,r} + U_{f,r}h_\tau + b_{f,r}) \\
i_r &= \delta(W_{i,r}\psi_{i,r} + U_{i,r}h_\tau + b_{i,r}) \\
o_r &= \delta(W_{o,r}\psi_{i,r} + U_{o,r}h_\tau + b_{o,r}) \\
c_r &= f_r \odot c_\tau + i_r \odot \tanh(W_{c,r}\psi_{i,r} + U_{c,r}h_\tau + b_{c,r})
\end{aligned}
\tag{3.1}
$$

where $\tanh$ and $\delta$ are the hyperbolic tangent and sigmoid functions, $W_{\cdot,r}$, $U_{\cdot,r}$ and $b_{\cdot,r}$ are
the weight and bias parameters, and the subscript $r$ refers to the parameters of
the LSTM cell associated with the $r$th local area. All operations of one LSTM cell are
illustrated in Fig. 3.3. Each LSTM cell produces one preliminary attention score given
the sequence, $h_{r|\tau}$, based on the cell state and the gates as follows:

$$
h_r = h_{r|\tau} = o_r \odot \tanh(c_r). \tag{3.2}
$$
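The cell operations in (3.1) and (3.2) can be sketched in plain Python for the case where each cell outputs a single preliminary attention score, i.e., the hidden and cell states are scalars. For brevity, the sketch shares one set of parameters across all cells of a sequence, which is an assumption (the equations index the parameters by $r$); the function names, the zero initial state and the toy parameter values are also illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_cell(psi, h_prev, c_prev, params):
    """One LSTM step with scalar states, following (3.1) and (3.2).

    params maps each of 'f', 'i', 'o', 'c' to a tuple (W, U, b), where W is a
    list with the same length as the local descriptor psi and U, b are floats.
    """
    def pre_activation(name):
        W, U, b = params[name]
        return sum(w * x for w, x in zip(W, psi)) + U * h_prev + b

    f = sigmoid(pre_activation("f"))  # forget gate
    i = sigmoid(pre_activation("i"))  # input gate
    o = sigmoid(pre_activation("o"))  # output gate
    c = f * c_prev + i * math.tanh(pre_activation("c"))  # new cell state
    h = o * math.tanh(c)  # preliminary attention score
    return h, c


def preliminary_scores(descriptors, params, reverse=False):
    """Run the cells over the local areas in forward or backward order."""
    order = range(len(descriptors))
    if reverse:
        order = reversed(order)
    h, c = 0.0, 0.0  # assumed initial state
    scores = [0.0] * len(descriptors)
    for r in order:
        h, c = lstm_cell(descriptors[r], h, c, params)
        scores[r] = h
    return scores


# Toy parameters: only the candidate cell state depends on the descriptor.
toy_params = {
    "f": ([0.0], 0.0, 0.0),
    "i": ([0.0], 0.0, 0.0),
    "o": ([0.0], 0.0, 0.0),
    "c": ([1.0], 0.0, 0.0),
}
toy_descriptors = [[1.0], [-1.0]]
forward = preliminary_scores(toy_descriptors, toy_params)
backward = preliminary_scores(toy_descriptors, toy_params, reverse=True)
```

The forward and backward passes give different scores for the same local area because each score depends on the areas visited before it in the chosen order.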

We utilize two LSTM networks in a bidirectional manner to consider the different
orders of local areas and thus all LSTM cells are placed in two different sequences
with different parameters. Each cell of the first LSTM network produces the
preliminary attention score of one local area concerning the knowledge acquired from
the attention scores of previous local areas (i.e., previous cells). Thus, $\tau$ becomes
$r-1$ in (3.1). The second LSTM network employs the same idea by considering
the subsequent local areas and thus $\tau$ becomes $r+1$ in (3.1). In the context of
bidirectional LSTM networks, forward and backward sequences can be combined by
using the concatenation, the summation or the multiplication operations [114], [115].
The concatenation operation is widely used in the literature. However,
it requires a fully connected layer for the reduction of a vector into a single value,
which can significantly increase the computational complexity of the whole approach.
When the multiplication operation is used, the resulting value can be dominated by one
of the sequences, if the preliminary attention score is a negative value. Accordingly,
we select the summation operation for combining the sequences. To this end, after
obtaining two preliminary attention scores from the different orders, we compute the
final attention score of the $r$th local area $\alpha_{i,r}$ as follows:

$$
\alpha_{i,r} = \delta\left(\frac{h_{r|r-1} + h_{r|r+1}}{2}\right). \tag{3.3}
$$

This produces an attention score for the $r$th local area within the range of $[0, 1]$. For
the beginning of passes ($r = 1$ or $r = R$), $\tau$ refers to an initial state of the nodes.
Each attention score shows the importance level of the considered local area for the
complete characterization of the whole image content. Accordingly, multi-attention
scores $\{\alpha_{i,r}\}_{r=1}^{R}$ for the $i$th image $x_i$ show the different importance levels of the image
local areas. The proposed multi-attention strategy is illustrated in Fig. 3.4.

[Figure 3.4 (image not reproduced): two sequences of LSTM cells processing the local areas in opposite orders, whose outputs are combined and passed through a sigmoid function.]

FIGURE 3.4: Proposed multi-attention strategy with bidirectional LSTM networks for the
second step of the proposed approach.

Let $\Omega_i$ be the multi-attention driven global descriptor of $x_i$. After obtaining the
multi-attention scores, the global descriptor $\Omega_i$ is defined by the concatenation of
local descriptors weighted by attention scores as follows:

$$
\Omega_i = [\alpha_{i,1}\psi_{i,1}^\top, \ldots, \alpha_{i,R}\psi_{i,R}^\top]^\top. \tag{3.4}
$$

Due to this step, the proposed approach extracts and exploits the importance scores
of local areas of each image instead of equally considering them.

[Figure 3.5 (image not reproduced): (a) $K$-Branch CNN; (b) LSTM networks and concatenation of weighted local descriptors; (c) predicted multi-labels, e.g., Coniferous forest, Mixed forest, Transitional woodland.]

FIGURE 3.5: Detailed illustration of the three main steps of the proposed approach: (a) spatial
and spectral characterization of local areas; (b) definition of a multi-attention driven global
descriptor; (c) RS image scene classification with multi-labels.
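The combination of the two preliminary scores in (3.3) and the weighted concatenation in (3.4) can be sketched as follows. The preliminary scores are taken as given inputs here, and the function names and toy values are illustrative assumptions.

```python
import math


def attention_scores(h_forward, h_backward):
    """Final attention scores as in (3.3): sigmoid of the mean of both passes."""
    return [
        1.0 / (1.0 + math.exp(-(hf + hb) / 2.0))
        for hf, hb in zip(h_forward, h_backward)
    ]


def global_descriptor(local_descriptors, alphas):
    """Global descriptor as in (3.4): concatenation of attention-weighted local descriptors."""
    omega = []
    for alpha, psi in zip(alphas, local_descriptors):
        omega.extend(alpha * value for value in psi)
    return omega


# Toy example with R = 2 local areas and 2-D local descriptors.
alphas = attention_scores([0.0, 0.0], [0.0, 0.0])
omega = global_descriptor([[1.0, 2.0], [3.0, 4.0]], alphas)
```

With equal attention scores the result is a uniformly scaled stacking of the local descriptors; unequal scores change the relative contribution of each local area.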

3.2.3 Classification of RS Image Scenes with Multi-Labels

This step aims to classify RS images into multi-labels by using the multi-attention
driven global descriptor $\Omega_i$ obtained in the second step of the proposed approach.
To this end, we employ an FC layer $f(\cdot)$ as a classifier that generates class scores
$z_{l_j}$ for each class label $l_j$ in the sequence based on the global descriptor $\Omega_i$. Then, we
obtain the class posterior probability of $l_j$ for the image $x_i$ with the sigmoid function
as: $P(l_j|x_i) = 1/(1 + e^{-z_{l_j}})$. After characterizing the class posterior probabilities, we
define the overall loss of the approach as the cross entropy loss throughout all labels
and images as follows:

$$
-\sum_{x_i \in \mathcal{T}} \sum_{j=1}^{S} \Big( [l_j \in y_i] \log(P(l_j|x_i)) + (1 - [l_j \in y_i]) \log(1 - P(l_j|x_i)) \Big) \tag{3.5}
$$

where $[l_j \in y_i]$ is the Iverson bracket, which equals 1 if $l_j$ is one of the true
multi-labels of $x_i$, and 0 otherwise. After end-to-end training of the entire neural
network by minimizing the cross-entropy loss, the parameters $\theta$ of the function $F$ (i.e.,
model parameters of the approach) can be learned. Accordingly, our model becomes
capable of producing the posterior probabilities of multi-labels to be assigned to a
new RS image scene $x^*$. Then, the proposed approach predicts the multi-labels by
thresholding the probability values. Each step of the proposed approach is illustrated
in Fig. 3.5.
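The classification step can be illustrated with a short sketch of the sigmoid posteriors, the cross-entropy loss over all labels and images, and the thresholding. The threshold value of 0.5, the small constant `eps` for numerical safety, and the function names are assumptions; the text does not specify them.

```python
import math


def posteriors(class_scores):
    """Class posterior probabilities from class scores via the sigmoid function."""
    return [1.0 / (1.0 + math.exp(-z)) for z in class_scores]


def cross_entropy_loss(batch_scores, batch_labels, eps=1e-12):
    """Cross-entropy loss summed over all images and labels.

    batch_labels holds binary vectors; eps avoids log(0) and is an assumption.
    """
    loss = 0.0
    for scores, labels in zip(batch_scores, batch_labels):
        for p, y in zip(posteriors(scores), labels):
            loss -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return loss


def predict_multi_labels(class_scores, threshold=0.5):
    """Binary multi-label prediction by thresholding the posteriors."""
    return [int(p >= threshold) for p in posteriors(class_scores)]


# Toy example: one image, two labels.
loss = cross_entropy_loss([[0.0, 0.0]], [[1, 0]])
labels = predict_multi_labels([2.0, -2.0])
```

Since each label has its own sigmoid output, several labels can be assigned to the same image, in contrast to a softmax layer that distributes a single unit of probability across the classes.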


3.3 Dataset Description and Experimental Design

3.3.1 Dataset Description

We conducted all experiments on BigEarthNet-S2, while we utilized multi-labels
based on the CLC class nomenclature. In the experiments, 70,987 image patches
that are fully covered by seasonal snow, cloud and cloud shadow were not used.
To the best of our knowledge, BigEarthNet-S2 is the only archive in RS that includes

Sentinel-2 multispectral images, each of which is annotated with multi-labels. Thus,

we could only use it in the experiments in this chapter. The other benchmark archives,

e.g., DFC15 [35] and UC-Merced archives [93], consist of a very small number of RS

images that are annotated with multi-labels and contain only RGB bands. Thus, they

are not adequate to evaluate the proposed approach and are not considered in this

chapter.

The number of image patches associated with each BigEarthNet-S2 class varies sig-

niﬁcantly in the archive. To construct training set (which is used for training the

considered neural networks), validation set (which is used for selecting hyperparam-

eters) and test set (which is used for accuracy assessment), one could apply random

sampling. However, when images with multi-labels are considered, this approach

has a risk that randomly selected images may not represent all classes present in

the whole archive. There are also other approaches to divide a dataset into train,

validation and test sets, however they are also designed for images annotated by

single-labels and thus not suitable for multi-label applications [116]. To this end, we

develop an algorithm to represent each BigEarthNet-S2 class with a sufﬁcient number

of images in training, validation and test sets based on the label frequencies. The

algorithm starts by including all images in the training set. Let $c_{l_m} \in \mathbb{N}$ be the
number of images associated to the label $l_m$ in the training set, where $m \in \{1, \ldots, S\}$,
and thus we define the frequency $\gamma_{l_m}$ of the label $l_m$ in the training set as follows:

$$\gamma_{l_m} = \frac{c_{l_m}}{\sum_{m=1}^{S} c_{l_m}}. \tag{3.6}$$

Then, we deﬁne the cost of moving an image and its set of multi-labels from the

training set to either validation or test set as follows:

$$C_{x_i, y_i} = -\sum_{m=1}^{S} \frac{\gamma^{*}_{l_m} - \frac{1}{S}}{\sqrt{\gamma_{l_m}}} \tag{3.7}$$

where $\gamma^{*}_{l_m}$ indicates the new frequency of the label $l_m$ after images are moved from
the training set to the validation or test sets. The algorithm first sorts the label list in

decreasing order based on the number of images associated to each class. Then, from

the sorted list, the images with the decreasing cost values associated to each class are

randomly selected and moved either to the validation set or to the test set. Since the

algorithm starts to operate on the images associated to the majority classes, most of

the images will be moved from the training set at the beginning. However, the cost

value will reach the stationary point when it operates on the images associated to

the minority classes. Application of this algorithm to the BigEarthNet-S2 results in a

validation set of 198,762 images, a test set of 203,269 images, and a training set of

117,308 images. The algorithm is summarized in Algorithm 1.


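Algorithm 1 Our algorithm for the selection of training, validation and test sets.

Input: $\mathcal{X} = \{x_1, \ldots, x_M\}$, $\mathcal{L} = \{l_1, \ldots, l_S\}$, $\mathcal{Y} = \{y_1, \ldots, y_M\}$
Assumption: $\mathcal{L}$ is sorted in decreasing order based on the number of images associated to each class.

function LABELFREQ($\mathcal{T}$, $l_m$)
    $c_{l_m} \leftarrow |\{(x_i, y_i) \mid (x_i, y_i) \in \mathcal{T},\ y_{i,m} = 1\}|$
    $\gamma_{l_m} \leftarrow c_{l_m} / (\sum_{m=1}^{S} c_{l_m})$
    return $\gamma_{l_m}$
end function

function COST($\mathcal{T}$, $(x_i, y_i)$, $S$, $\Gamma_{\mathcal{L}}$)
    sum $\leftarrow 0$
    for $m \leftarrow 1$ to $S$ do
        $\gamma_{l_m} \leftarrow \Gamma_{\mathcal{L}}[m]$
        $\gamma^{*}_{l_m} \leftarrow$ LABELFREQ($\mathcal{T} - (x_i, y_i)$, $l_m$)
        sum $\leftarrow$ sum $- (\gamma^{*}_{l_m} - \frac{1}{S}) / \sqrt{\gamma_{l_m}}$
    end for
    return sum
end function

$\mathcal{T} \leftarrow \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i \in \mathcal{Y}\}$  ▷ Initial training set.
$\mathcal{V} \leftarrow \emptyset$  ▷ Initial validation set.
$\mathcal{E} \leftarrow \emptyset$  ▷ Initial test set.
$S \leftarrow |\mathcal{L}|$
$\Gamma_{\mathcal{L}} \leftarrow \bigcup_{m=1}^{S}$ LABELFREQ($\mathcal{T}$, $l_m$)  ▷ Initial frequencies.
state $\leftarrow$ COST($\mathcal{T}$, $\emptyset$, $S$, $\Gamma_{\mathcal{L}}$)
for $m \leftarrow 1$ to $S$ do
    for all $i$ such that $y_{i,m} = 1$ do
        if COST($\mathcal{T}$, $(x_i, y_i)$, $S$, $\Gamma_{\mathcal{L}}$) $<$ state then
            $\mathcal{T} \leftarrow \mathcal{T} - (x_i, y_i)$
            $(\mathcal{V} \leftarrow \mathcal{V} + (x_i, y_i)) \oplus (\mathcal{E} \leftarrow \mathcal{E} + (x_i, y_i))$
            state $\leftarrow$ COST($\mathcal{T}$, $(x_i, y_i)$, $S$, $\Gamma_{\mathcal{L}}$)
        end if
    end for
end for
return $\mathcal{T}$, $\mathcal{V}$, $\mathcal{E}$  ▷ Resulting sets.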
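As an illustration, the frequency (3.6), the cost (3.7) and the greedy loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch and not the implementation used in the experiments. The names `cost` and `split_dataset` are hypothetical, labels are assumed to be multi-hot lists, every class is assumed to occur at least once, and the rule for choosing between the validation and test sets (simple alternation here) is an assumption, since the algorithm only states that an image goes to one of the two.

```python
from math import sqrt, inf

def cost(counts, candidate, init_freqs):
    """Cost (3.7) of the training set once `candidate` is removed (None: remove nothing)."""
    num_classes = len(counts)
    if candidate is None:
        new_counts = list(counts)
    else:
        new_counts = [c - y for c, y in zip(counts, candidate)]
    total = sum(new_counts)
    if total == 0:
        return inf  # assumption: never empty the training set completely
    return -sum((new_counts[m] / total - 1.0 / num_classes) / sqrt(init_freqs[m])
                for m in range(num_classes))

def split_dataset(labels):
    """Greedy split of multi-hot label vectors into train/validation/test index lists."""
    num_classes = len(labels[0])
    counts = [sum(y[m] for y in labels) for m in range(num_classes)]
    init_freqs = [c / sum(counts) for c in counts]  # initial frequencies, as in (3.6)
    # Visit classes from the most to the least frequent one.
    class_order = sorted(range(num_classes), key=lambda m: -counts[m])
    in_train = [True] * len(labels)
    val, test = [], []
    state = cost(counts, None, init_freqs)
    for m in class_order:
        for i, y in enumerate(labels):
            if not in_train[i] or y[m] != 1:
                continue
            new_cost = cost(counts, y, init_freqs)
            if new_cost < state:  # moving the image lowers the cost
                counts = [c - v for c, v in zip(counts, y)]
                in_train[i] = False
                # Assumption: alternate between the validation and test sets.
                (val if len(val) <= len(test) else test).append(i)
                state = new_cost
    train = [i for i, keep in enumerate(in_train) if keep]
    return train, val, test
```

Keeping per-class counts and updating them incrementally avoids rescanning the training set for every candidate image, which matters for an archive with hundreds of thousands of patches.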

3.3.2 Experimental Design

After the selection of training, validation and test sets, we divided each image into

non-overlapping local areas. Then, we employed a three-branch CNN (i.e., $K = 3$ for the
$K$-Branch CNN) due to the three different spatial resolutions of Sentinel-2.

Accordingly, for each local area, we split the bands into three subsets. Then, we

stacked bands of each subset to obtain a single volume for each CNN branch. In

detail, the bands 2 to 4 and 8 (which have 10m spatial resolution) were fed into the

ﬁrst branch, while the bands 5 to 7, 8A, 11 and 12 (which have 20m spatial resolution)

were fed into the second branch and the third branch takes as input the remaining

bands 1 and 9 (which have 60m spatial resolution). We selected the number of local


areas and all other hyperparameters with respect to the classiﬁcation performance

on the validation set. The local area size $w \times w$ for the 10m bands was tested within the range
$[18, 60]$ with a step size of 6. It is worth noting that, for the sizes that do not
evenly divide the image size ($120 \times 120$ for 10m bands, $60 \times 60$ for 20m bands,
$20 \times 20$ for 60m bands), we applied zero padding to the image borders. Although the

same number of convolutional layers was used for all branches, the number of ﬁlters,

the exploitation of pooling strategy and the ﬁlter sizes vary among branches. It is

worth noting that the number of convolutional layers in all branches can be increased

to a large extent to achieve deeper models. However, this would also increase the

number of model parameters and thus the computational complexity. Accordingly,

three convolutional layers were used for all branches. For the first branch, 32 filters
with the size of $5 \times 5$, 32 filters with the size of $5 \times 5$ and 64 filters with the size
of $3 \times 3$ were selected. For the second branch, the same number of filters was
used, while $3 \times 3$ filters were employed in each layer. For the third branch, 32 filters
with the size of $2 \times 2$ were used in each layer. We utilized the stride of 1 and zero

padding in all convolutional layers to preserve the spatial dimensionality and not to

lose information. In addition, max-pooling was utilized in the ﬁrst two branches to

provide partial translation invariance [117], which was not used in the last branch to

avoid further decreasing the spatial resolution. For the LSTM networks, we used a

128 dimensional memory.
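The division of a band into non-overlapping $w \times w$ local areas with zero padding can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the experiments. The function name `split_into_local_areas` is hypothetical, and padding only the bottom and right borders is an assumption, since the text states only that zero padding is applied to the image borders.

```python
def split_into_local_areas(band, w):
    """Zero-pad a 2-D band (list of rows) so that w divides both sides, then tile it."""
    height, width = len(band), len(band[0])
    # Number of rows/columns to add so that both sides become multiples of w.
    pad_h, pad_w = (-height) % w, (-width) % w
    padded = [row + [0] * pad_w for row in band]
    padded += [[0] * (width + pad_w) for _ in range(pad_h)]
    areas = []
    for top in range(0, height + pad_h, w):
        for left in range(0, width + pad_w, w):
            # Each local area is a w x w block of the padded band.
            areas.append([row[left:left + w] for row in padded[top:top + w]])
    return areas
```

For a $120 \times 120$ band, `w = 30` gives 16 local areas without padding, while `w = 18` pads the band to $126 \times 126$ and gives 49 local areas. The matching sizes for the 20m and 60m bands ($9 \times 9$ on $60 \times 60$, $3 \times 3$ on $20 \times 20$) yield the same number of local areas, so the three branches stay aligned.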

We jointly trained all CNN branches, FC layers and LSTM networks (i.e., an end-to-end
learning of all steps was applied simultaneously). We used the Adam
method [101] of Stochastic Gradient Descent with the initial learning rate of $10^{-3}$ to
decrease the sigmoid cross entropy loss, which aims at maximizing the log-likelihood

of the multi-labels in the training set. For the initialization of neural network weights,

we utilized the Xavier method [118] to keep the variance of weights similar among all

layers. We selected the 2

−5

L2-regularization weight to layer-wise regularize the

weights. A dropout probability of 20% was chosen for Dropout regularization [119]
to avoid the over-fitting of the proposed approach on the training set. In addition,
we utilized Batch Normalization [120] to decrease the effect of different spectral
band statistics.
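For reference, the sigmoid cross entropy loss for one image can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, the inputs are assumed to be the raw class scores (logits) before the sigmoid, and averaging over the classes rather than summing is an assumption.

```python
from math import exp, log

def sigmoid_cross_entropy(logits, targets):
    """Mean over classes of -[y*log(s(z)) + (1-y)*log(1-s(z))], with s the sigmoid."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
        total += max(z, 0.0) - z * y + log(1.0 + exp(-abs(z)))
    return total / len(logits)
```

The rewritten form is algebraically equal to the definition in the docstring but never exponentiates a large positive number, so it does not overflow for confident predictions.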

In the experiments, we compared the proposed approach with: 1) the Very Deep

Convolutional Networks (i.e., VGG networks) [121]; 2) the Deep Residual Nets (i.e.,

ResNet networks) [95]; and 3) the Class-Wise Attention-Based Convolutional and

Bidirectional LSTM Network [35] (denoted as CA-LSTM). For the VGG networks, we

selected 16 layers (VGG16) and 19 layers (VGG19) versions. At the similar depths

to the VGG networks, we selected 18 layers (ResNet18) and 34 layers (ResNet34)

versions of the ResNet networks. These are widely used CNNs for the image clas-

siﬁcation problems in the CV literature. We used the same parameters presented

in [121] and [95] for the VGG networks and the ResNet networks, respectively, except

only the considered learning rates. CA-LSTM is one of the few DL based approaches

proposed for the multi-label RS image scene classiﬁcation task. For the CA-LSTM,

we used the same feature extraction module (which is ResNet50 [95]), same LSTM

network (bidirectional LSTM network with 2048 dimensional memory) and same

parameters presented in [35] except the learning rate.


We also evaluated the different steps of the proposed approach. To assess the effectiveness
of the first step of the proposed approach (that is the $K$-Branch CNN),

we compared it with different single branch CNN approaches. To this end, we

initially applied cubic interpolation to 20m and 60m bands and stacked all bands

into one volume. Then, three different approaches are considered as follows: 1) a

single branch CNN that considers all the image bands as input and operates on

the whole images (denoted as SiB-CNN); 2) a single branch CNN that considers

all the image bands as input and operates on the local areas of images (denoted

as L-SiB-CNN); and 3) a single branch CNN that considers only RGB image bands

as input and operates on the whole images (denoted as SiB-CNN$_{RGB}$). For these
approaches, the architecture of the first branch of the proposed $K$-Branch CNN is

used. To evaluate the effectiveness of the second step of the proposed approach

(that is the multi-attention strategy), we compared the results with those obtained

without using the multi-attention strategy (i.e., only the ﬁrst step is used). For all

the experiments, we used the same training procedure from scratch with the same

number of epochs, learning rate and the number of mini-batches to compare different

approaches under the same setting. We performed our experiments on a cluster of 4

NVIDIA Tesla V100 GPUs.

Performance evaluation of any multi-label classification approach requires analyzing
several factors rather than only counting the number of correct predictions and
thus needs a much more complex analysis than the single-label case [122].
Accordingly, we utilized different classification-based and ranking-based metrics
with varying characteristics to evaluate the performance of the proposed

approach. Classiﬁcation-based metrics consider the list of predicted classes, whereas

ranking-based metrics focus on the ordered list of probabilities for all classes.

Under the category of classification-based metrics, results of experiments were provided
in terms of three performance metrics: 1) Recall ($R$); 2) $F_2$-Score ($F_2$); and 3)
Hamming loss ($HL$). Classification-based metrics can be calculated by: i) giving
equal importance to each sample of the test set (sample averaging); ii) giving equal
importance to each class (macro averaging); and iii) comparing the overall test set
with the ground reference (micro averaging) without giving importance to
either each sample or each class.

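Let $TP_{ij}$, $FP_{ij}$, $FN_{ij}$ and $TN_{ij}$ indicate the conditions of true positive, false positive,
false negative and true negative, respectively, for the $i$th image and the $j$th label,
where each of them takes 0 or 1 and $TP_{ij} + FP_{ij} + FN_{ij} + TN_{ij} = 1$ holds. The recall $R$
is expressed by different averaging methods as follows:

$$R_{smpl} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{S} TP_{ij}}{\sum_{j=1}^{S} (TP_{ij} + FN_{ij})} \tag{3.8}$$

$$R_{macr} = \frac{1}{S} \sum_{j=1}^{S} \frac{\sum_{i=1}^{M} TP_{ij}}{\sum_{i=1}^{M} (TP_{ij} + FN_{ij})} \tag{3.9}$$

$$R_{micr} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{S} TP_{ij}}{\sum_{i=1}^{M} \sum_{j=1}^{S} (TP_{ij} + FN_{ij})}. \tag{3.10}$$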
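The three averaging schemes in (3.8)-(3.10) can be illustrated with a short Python sketch. It assumes multi-hot ground reference and prediction lists; the function names are hypothetical, and returning 0 for an empty denominator (an image without labels or a class without positives) is an assumption that the equations do not cover.

```python
def _safe_div(num, den):
    # Assumption: a term with an empty denominator contributes 0.
    return num / den if den else 0.0

def recall_scores(y_true, y_pred):
    """Return (sample-, macro-, micro-averaged) recall for multi-hot label lists."""
    num_images, num_classes = len(y_true), len(y_true[0])
    tp = [[t * p for t, p in zip(yt, yp)] for yt, yp in zip(y_true, y_pred)]
    fn = [[t * (1 - p) for t, p in zip(yt, yp)] for yt, yp in zip(y_true, y_pred)]
    # Sample averaging: one recall value per image, then the mean over images.
    r_smpl = sum(_safe_div(sum(tp[i]), sum(tp[i]) + sum(fn[i]))
                 for i in range(num_images)) / num_images
    # Macro averaging: one recall value per class, then the mean over classes.
    r_macr = 0.0
    for j in range(num_classes):
        tp_j = sum(tp[i][j] for i in range(num_images))
        fn_j = sum(fn[i][j] for i in range(num_images))
        r_macr += _safe_div(tp_j, tp_j + fn_j)
    r_macr /= num_classes
    # Micro averaging: pool all image-label pairs before dividing.
    tp_all = sum(map(sum, tp))
    fn_all = sum(map(sum, fn))
    r_micr = _safe_div(tp_all, tp_all + fn_all)
    return r_smpl, r_macr, r_micr
```

With ground reference `[[1, 1, 0], [1, 0, 0]]` and predictions `[[1, 0, 0], [1, 0, 1]]`, the three values differ (0.75, 1/3 and 2/3), which shows why macro averaging penalizes missed minority classes more strongly than the other two schemes.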


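The $F_2$-Score is the weighted harmonic mean of the correct prediction rates among
the considered ground reference and the multi-label predictions. Thus, it is expressed
by different averaging techniques as follows [123]:

$$F_2^{smpl} = \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j=1}^{S} 5\,TP_{ij}}{\sum_{j=1}^{S} (5\,TP_{ij} + 4\,FN_{ij} + FP_{ij})} \tag{3.11}$$

$$F_2^{macr} = \frac{1}{S} \sum_{j=1}^{S} \frac{\sum_{i=1}^{M} 5\,TP_{ij}}{\sum_{i=1}^{M} (5\,TP_{ij} + 4\,FN_{ij} + FP_{ij})} \tag{3.12}$$

$$F_2^{micr} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{S} 5\,TP_{ij}}{\sum_{i=1}^{M} \sum_{j=1}^{S} (5\,TP_{ij} + 4\,FN_{ij} + FP_{ij})}. \tag{3.13}$$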
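The coefficients 5 and 4 in the $F_2$-Score follow from the general $F_\beta$ score with $\beta = 2$. The short derivation below is given for a single set of counts $TP$, $FP$ and $FN$ and assumes the standard definitions of precision and recall; the symbols $\pi$ (precision) and $\rho$ (recall) are introduced here only for illustration.

```latex
\begin{align*}
\pi &= \frac{TP}{TP + FP}, \qquad \rho = \frac{TP}{TP + FN}, \\
F_\beta &= \frac{(1 + \beta^2)\,\pi\,\rho}{\beta^2 \pi + \rho}
         = \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP}, \\
F_2 &= \frac{5\,TP}{5\,TP + 4\,FN + FP} \qquad (\beta = 2).
\end{align*}
```

Since $\beta = 2$ weights false negatives four times as heavily as false positives, the $F_2$-Score favors recall over precision.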

The Hamming loss is the average Hamming distance between the ground reference

labels and predicted multi-labels. Thus, it is deﬁned as follows [124]:

$$HL = \frac{1}{MS} \sum_{i=1}^{M} \sum_{j=1}^{S} [\, l_j \in y_i \oplus l_j \in y_i^{*} \,] \tag{3.14}$$

where $\oplus$ is the XOR logical operation.
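A minimal Python sketch of the Hamming loss is given below. It assumes multi-hot lists for the ground reference and the predictions and a normalization by the number of image-label pairs; the function name is hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of image-label pairs where ground reference and prediction disagree."""
    num_images, num_classes = len(y_true), len(y_true[0])
    # t ^ p is the XOR of two 0/1 values: 1 exactly when they differ.
    mismatches = sum(t ^ p
                     for yt, yp in zip(y_true, y_pred)
                     for t, p in zip(yt, yp))
    return mismatches / (num_images * num_classes)
```

For ground reference `[[1, 1, 0], [1, 0, 0]]` and predictions `[[1, 0, 0], [1, 0, 1]]`, two of the six image-label pairs disagree, so the loss is 1/3.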

Under the category of ranking-based metrics, results of experiments are provided
in terms of four performance evaluation metrics: 1) Ranking loss ($RL$); 2) One error
($OE$); 3) Coverage ($COV$); and 4) Label ranking average precision ($LRAP$). All the
ranking-based metrics are defined with respect to the ranking of the $j$th label in the
class probabilities result of a multi-label classification approach for the $i$th image
that is defined as $\mathrm{rank}_{ij} = |\{k : P(l_k | x_i) \geq P(l_j | x_i)\}|$. Unlike the classification-based
metrics, ranking-based metrics are calculated only by giving equal importance to
each sample of the test set.

Accordingly, ranking loss is the rate of wrongly ordered label pairs (i.e., the proba-

bility of a label, which is irrelevant to the image, is higher than a ground reference

label), and thus expressed as follows [125]:

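$$RL = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{|y_i| (S - |y_i|)} \sum_{l_j \in y_i} \sum_{l_k \notin y_i} [\, \mathrm{rank}_{ik} \leq \mathrm{rank}_{ij} \,]. \tag{3.15}$$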

The one error is the rate of test images whose predicted label having the highest

ranking is not in the ground reference and thus deﬁned as follows [122]:

$$OE = \frac{1}{M} \sum_{i=1}^{M} [\, \arg\min_{l_j} \mathrm{rank}_{ij} \notin y_i \,]. \tag{3.16}$$

The coverage calculates the average number of labels required to be included in the

prediction list of a multi-label classiﬁer such that all ground reference labels will be

predicted. Accordingly, it is deﬁned as follows [125]:

$$COV = \frac{1}{M} \sum_{i=1}^{M} \max_{l_j \in y_i} \mathrm{rank}_{ij}. \tag{3.17}$$
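The ranking loss, one error and coverage can be sketched together in Python, since all three rely on the same per-image ranks. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the label with the highest ranking is taken to be the one with the smallest rank value (i.e., the highest probability), and images without any positive or without any negative label contribute 0 to the ranking loss.

```python
def label_ranks(probs):
    """rank_j = number of labels whose probability is at least that of label j."""
    return [sum(1 for q in probs if q >= p) for p in probs]

def ranking_metrics(y_true, probs):
    """Return (ranking loss, one error, coverage) averaged over the images."""
    num_images, num_classes = len(y_true), len(y_true[0])
    rl = oe = cov = 0.0
    for y, p in zip(y_true, probs):
        r = label_ranks(p)
        pos = [j for j in range(num_classes) if y[j] == 1]
        neg = [k for k in range(num_classes) if y[k] == 0]
        if pos and neg:
            # Pairs where an irrelevant label is ranked at or above a relevant one.
            wrong = sum(1 for j in pos for k in neg if r[k] <= r[j])
            rl += wrong / (len(pos) * len(neg))
        top = min(range(num_classes), key=lambda j: r[j])  # best-ranked label
        oe += 0.0 if y[top] == 1 else 1.0
        cov += max(r[j] for j in pos) if pos else 0.0
    return rl / num_images, oe / num_images, cov / num_images
```

For ground reference `[[1, 1, 0], [1, 0, 0]]` and probabilities `[[0.9, 0.2, 0.6], [0.3, 0.8, 0.1]]`, the sketch returns a ranking loss of 0.5, a one error of 0.5 and a coverage of 2.5.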


TABLE 3.1: MULTI-LABEL CLASSIFICATION ACCURACIES AND THE NUMBER OF REQUIRED MODEL PARAMETERS (NP) WHEN USING LOCAL AREAS WITH DIFFERENT SIZES FOR THE PROPOSED APPROACH.

Local area size (w×w) is given per band resolution. Classification-based metrics (R, F2, HL) are in %.

| 10m | 20m | 60m | R macr | R smpl | R micr | F2 macr | F2 smpl | F2 micr | HL | RL (%) | OE (%) | COV | LRAP (%) | NP (×10^6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18×18 | 9×9 | 3×3 | 51.0 | 68.1 | 61.7 | 46.9 | 68.9 | 64.2 | 4.1 | 2.7 | 6.5 | 5.8 | 85.3 | 0.71 |
| 24×24 | 12×12 | 4×4 | 50.0 | 66.2 | 59.2 | 46.0 | 67.4 | 62.2 | 4.1 | 2.8 | 6.8 | 5.9 | 85.1 | 0.93 |
| 30×30 | 15×15 | 5×5 | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 | 1.13 |
| 36×36 | 18×18 | 6×6 | 54.4 | 70.7 | 65.0 | 47.8 | 71.1 | 66.5 | 4.1 | 2.6 | 6.2 | 5.7 | 85.7 | 1.70 |
| 42×42 | 21×21 | 7×7 | 53.3 | 70.7 | 64.8 | 46.2 | 71.0 | 66.6 | 4.1 | 2.6 | 6.5 | 5.8 | 85.6 | 2.03 |
| 48×48 | 24×24 | 8×8 | 54.6 | 72.5 | 67.1 | 48.0 | 72.2 | 68.2 | 4.1 | 2.6 | 6.5 | 5.8 | 85.3 | 2.81 |
| 54×54 | 27×27 | 9×9 | 54.2 | 72.3 | 67.1 | 48.1 | 72.2 | 68.4 | 4.1 | 2.6 | 6.3 | 5.7 | 85.6 | 3.29 |
| 60×60 | 30×30 | 10×10 | 54.1 | 72.4 | 66.8 | 46.7 | 71.8 | 67.6 | 4.3 | 2.9 | 6.9 | 6.1 | 84.4 | 4.26 |

For each ground reference label, the label ranking average precision calculates the

rate of higher-ranked ground reference labels. This is expressed as follows [122]:

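$$LRAP = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{|y_i|} \sum_{l_j \in y_i} \frac{|\{l_k : \mathrm{rank}_{ik} \leq \mathrm{rank}_{ij},\ l_k \in y_i\}|}{\mathrm{rank}_{ij}}. \tag{3.18}$$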

It is worth noting that, for any multi-label classifier, $LRAP$ provides scores strictly
greater than 0 unlike the other metrics [122]. Thus, small differences in the score of
this metric can be more informative compared to other metrics (e.g., recall). Smaller
values of the Hamming loss, ranking loss, one error and coverage indicate better
performance of an approach, whereas higher values of the recall, $F_2$-Score and the
label ranking average precision are associated to better performance.
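The label ranking average precision can be sketched in Python as follows. The sketch assumes that the per-image sum is averaged over the ground reference labels of that image, which keeps the score in the interval (0, 1]; the function name is hypothetical, and images without ground reference labels are assumed to contribute 0.

```python
def label_ranking_average_precision(y_true, probs):
    """Mean over images of the average precision at each ground reference label."""
    total = 0.0
    for y, p in zip(y_true, probs):
        # rank_j = number of labels whose probability is at least that of label j.
        r = [sum(1 for q in p if q >= pj) for pj in p]
        pos = [j for j, t in enumerate(y) if t == 1]
        if not pos:
            continue  # assumption: images without labels contribute 0
        per_label = [sum(1 for k in pos if r[k] <= r[j]) / r[j] for j in pos]
        total += sum(per_label) / len(pos)
    return total / len(y_true)
```

Each term is at least $1 / \mathrm{rank}_{ij} > 0$ because a ground reference label always counts itself, which is why the score cannot reach 0. For ground reference `[[1, 1, 0], [1, 0, 0]]` and probabilities `[[0.9, 0.2, 0.6], [0.3, 0.8, 0.1]]`, the per-image values are 5/6 and 1/2, giving a score of 2/3.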

3.4 Experimental Results

We carried out different kinds of experiments in order to: 1) perform a sensitivity

analysis with respect to different parameter settings and strategies; and 2) compare

the effectiveness of the proposed approach with the widely used deep CNNs and

one recent multi-label RS image scene classiﬁcation approach [35].

3.4.1 Sensitivity Analysis of the Proposed Approach

In this section, we performed the sensitivity analysis of the proposed approach under

different parameter settings and strategies.

In the ﬁrst set of trials, we analyzed the effect of utilizing local areas with different

sizes in terms of the multi-label classiﬁcation accuracy and computational complexity.

Table 3.1 shows the results with the required number of parameters under different

sizes of local areas. By analyzing the table, one can see that the reduction of computational
complexity highly depends on the local area size $w \times w$. This is due to the fact
that enlarging the local areas increases the number of parameters required to learn.


TABLE 3.2: RESULTS OBTAINED BY THE SIB-CNN_RGB, THE SIB-CNN, THE L-SIB-CNN AND THE PROPOSED K-BRANCH CNN.

Classification-based metrics (R, F2, HL) are in %.

| Method | R macr | R smpl | R micr | F2 macr | F2 smpl | F2 micr | HL | RL (%) | OE (%) | COV | LRAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SiB-CNN_RGB | 33.6 | 53.7 | 45.6 | 35.1 | 56.0 | 49.8 | 4.8 | 4.1 | 13.5 | 7.1 | 80.0 |
| SiB-CNN | 39.1 | 60.5 | 52.8 | 40.9 | 62.4 | 56.7 | 4.4 | 3.4 | 9.6 | 6.5 | 83.0 |
| L-SiB-CNN | 44.0 | 65.7 | 58.8 | 41.2 | 66.2 | 62.9 | 4.1 | 2.8 | 7.4 | 5.9 | 84.8 |
| Proposed K-Branch CNN | 46.8 | 64.7 | 57.7 | 44.6 | 66.3 | 61.0 | 4.1 | 2.6 | 6.3 | 5.7 | 85.4 |

As an example, using $18 \times 18$ sized local areas reduces the number of parameters by
a factor of six (0.71 versus 4.26 million) compared to the case for which $60 \times 60$ sized local areas
are used. From Table 3.1 one can also observe that the accuracies obtained by
different sizes of local areas are similar to each other under most of the metrics. As
an example, using $60 \times 60$ sized local areas provides almost the same $F_2^{macr}$ score
compared to the case in which $30 \times 30$ sized local areas are considered. In a few cases, there are
noticeable differences in the results associated to metrics. As an example, using
$48 \times 48$ sized local areas results in more than 7% higher $R_{micr}$ compared to using
$24 \times 24$ sized local areas. This is due to the fact that a smaller window size may
reduce the capability of describing the spatial information content. All these results
show that the selection of local area size in a proper range does not significantly affect
the classification accuracy of the proposed approach, but considerably changes
the computational complexity. Accordingly, for the rest of the experiments we used
$30 \times 30$ sized local areas for 10m resolution bands since it provides the best values in
ranking-based metrics and Hamming loss with a significantly reduced number of
parameters (that is less than a half of those required for $48 \times 48$, $54 \times 54$ and $60 \times 60$
local area sizes).

In the second set of trials, we analyzed the effect of the ﬁrst step of the proposed

approach on the multi-label classification accuracy. To this end, we compare the
results of the $K$-Branch CNN (which is introduced in the first step) with those obtained
by the SiB-CNN$_{RGB}$ (which is a single branch CNN that considers RGB bands),
SiB-CNN (which is a single branch CNN that considers all bands) and L-SiB-CNN
(which is a single branch CNN that considers all bands and operates on the image
local areas). Table 3.2 shows the multi-label classification accuracies under different
metrics. From this table, one can observe that the proposed $K$-Branch CNN
provides the best scores under most of the metrics. As an example, the proposed
$K$-Branch CNN provides more than 9%, almost 4% and more than 3% higher $F_2^{macr}$
scores compared to the SiB-CNN$_{RGB}$, SiB-CNN and L-SiB-CNN, respectively. In
greater detail, the SiB-CNN provides more than 6% higher $F_2^{smpl}$ score by achieving a
reduction of about 4% in one error compared to the SiB-CNN$_{RGB}$. This shows that
using spectral bands associated to 20m and 60m spatial resolutions improves the
multi-label classification accuracy. Moreover, the L-SiB-CNN provides almost 4%
higher $F_2^{smpl}$ score by achieving a reduction of more than 9% in coverage compared to
the SiB-CNN. This indicates that exploiting local areas of images also improves the
multi-label classification accuracy. In addition, the proposed $K$-Branch CNN leads to
a reduction of about 7% in Hamming loss and more than 7% higher $R_{macr}$ compared


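TABLE 3.3: MULTI-LABEL CLASSIFICATION ACCURACIES OBTAINED BY USING DIFFERENT STEPS OF THE PROPOSED APPROACH.

The first three columns mark which steps of the proposed approach are used. Classification-based metrics (R, F2, HL) are in %.

| 1st | 2nd | 3rd | R macr | R smpl | R micr | F2 macr | F2 smpl | F2 micr | HL | RL (%) | OE (%) | COV | LRAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✓ | 46.8 | 64.7 | 57.7 | 44.6 | 66.3 | 61.0 | 4.1 | 2.6 | 6.3 | 5.7 | 85.4 |
| ✓ | ✓ | ✓ | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 |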

TABLE 3.4: RESULTS OBTAINED BY THE RESNET18, RESNET34, VGG16, VGG19, CA-LSTM AND THE PROPOSED APPROACH TOGETHER WITH THE NUMBER OF REQUIRED MODEL PARAMETERS (NP).

Classification-based metrics (R, F2, HL) are in %.

| Method | R macr | R smpl | R micr | F2 macr | F2 smpl | F2 micr | HL | RL (%) | OE (%) | COV | LRAP (%) | NP (×10^6) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet18 [95] | 36.1 | 59.9 | 52.6 | 34.8 | 61.1 | 55.5 | 4.9 | 8.1 | 12.8 | 11.4 | 75.5 | 11.2 |
| ResNet34 [95] | 37.0 | 64.4 | 57.8 | 35.7 | 64.6 | 59.8 | 4.9 | 6.3 | 12.6 | 9.6 | 77.6 | 21.3 |
| VGG16 [121] | 37.8 | 62.7 | 55.6 | 36.1 | 64.2 | 58.7 | 4.5 | 3.3 | 10.1 | 6.3 | 82.4 | 134.4 |
| VGG19 [121] | 41.5 | 61.5 | 54.2 | 38.1 | 63.0 | 57.4 | 4.6 | 3.5 | 11.1 | 6.4 | 81.5 | 139.8 |
| CA-LSTM [35] | 43.5 | 64.8 | 58.5 | 40.4 | 65.5 | 60.7 | 4.7 | 3.7 | 9.9 | 6.8 | 81.5 | 33.5 |
| Proposed Approach | 52.8 | 68.5 | 62.3 | 47.2 | 69.4 | 64.7 | 4.0 | 2.6 | 5.8 | 5.7 | 85.9 | 1.1 |

to the SiB-CNN. All these results show that the $K$-Branch CNN much more accurately

characterizes the spectral content of RS images by utilizing all spectral bands with

different spatial resolutions in branch-wise CNN architecture compared to single

branch CNN approaches (which require to apply interpolation to lower resolution

bands).

In the third set of trials, we evaluated the effect of the second step of the proposed

approach. To this end, we compared the results of proposed approach with those

obtained by neglecting the multi-attention strategy (i.e., only the ﬁrst step is used).

When the second step is neglected, global descriptors are obtained by the concatenation
of local descriptors without weighting by attention scores. Table 3.3 shows
the multi-label classification accuracies under different metrics. From this table, one
can observe that the use of the multi-attention strategy improves the
classification accuracy under most of the metrics, while ranking loss and coverage
remain unchanged. As an example, the improvements are
6% in $R_{macr}$ and more than 3% in $F_2^{smpl}$ score. This shows the effect of modeling the

importance scores of image local areas for the characterization of a global descriptor.

3.4.2 Comparison Among the Existing Approaches

In the fourth set of trials, we compared the effectiveness of the proposed approach

with the ResNet architectures at the depths of 18 and 34 (ResNet18 and ResNet34),

the VGG architectures at the depth 16 and 19 (VGG16 and VGG19) and the CA-LSTM

(which is a recent multi-label RS scene classiﬁcation approach). Table 3.4 shows

the multi-label classiﬁcation results of these methods under different metrics. By

analyzing the table, one can observe that our proposed approach leads to the highest

accuracies with the lowest number of parameters. As an example, the proposed


[Figure 3.6. Columns: RS Images, Multi-Labels, ResNet18, ResNet34, VGG16, VGG19, CA-LSTM, Proposed Approach.]

FIGURE 3.6: An example of the BigEarthNet-S2 images with the true multi-labels and
the multi-labels assigned by the ResNet18, ResNet34, VGG16, VGG19, CA-LSTM and the
proposed approach.

approach provides 15% higher $R_{macr}$, more than 5% higher $F_2^{smpl}$ score and a reduction
of more than 21% in ranking loss compared to the VGG16 (which is one of the well
known CNNs for image classification problems). Moreover, the proposed approach
requires more than two orders of magnitude fewer parameters than the VGG16.
Even with the deeper architecture (VGG19),
the VGG approach is not capable of increasing the classification accuracy (it
provides worse results than the VGG16 under all metrics except $R_{macr}$ and $F_2^{macr}$)
and requires the highest number of parameters to learn. As an
example, the VGG16 leads to a reduction of about 6% in ranking loss compared to the VGG19. This shows
that increasing the depth of a CNN is not sufficient to obtain accurate multi-label
RS classification results. In addition, the proposed approach leads to more than 11%
higher $F_2^{macr}$ score and more than 8% higher $LRAP$ score with a number of
parameters that is more than an order of magnitude lower compared to the ResNet34
(which is one of the most popular CNNs due to the integration of residual connections
with convolutional layers). The proposed approach provides better metric values
(e.g., more than 9% higher $R_{macr}$, almost 7% higher $F_2^{macr}$ score and a
reduction of about 30% in ranking loss) also compared to the CA-LSTM. This success
has been achieved with a number of parameters that is reduced by more than
an order of magnitude. All these results clearly show that the proposed approach

reduces the needs for very deep CNNs to achieve a high classiﬁcation accuracy.

This is an important advantage, since reducing the number of model parameters to

achieve promising performance is as important as the classiﬁcation accuracy for DL

based approaches. Figure 3.6 shows an example of BigEarthNet images with the

true multi-labels and the multi-labels assigned by the ResNet18, ResNet34, VGG19,

VGG16, CA-LSTM and the proposed approach. By analyzing the ﬁgure, one can see

that our proposed approach accurately predicts all classes without predicting any

wrong ones. Unlike the proposed approach, the VGG16 and VGG19 predict several

unrelated classes. As an example, both of the approaches predict broad-leaved forest

and mixed-forest classes for the ﬁrst image, although this image does not contain these


classes. ResNet18 and ResNet34 are able to accurately predict only some of the multi-

labels. As an example, for the image in the center, the ResNet networks correctly

predict pastures and land principally occupied by agriculture classes, however coniferous

forest and transitional woodland/shrub classes are not predicted and thus missed. These

results prove that the VGG and ResNet networks are less accurate in the prediction

of all classes present in the images with respect to the proposed approach. From the

ﬁgure, one can see that the CA-LSTM provides accurate results for the top image

without any wrong classiﬁcation. However, for more complex images, this approach

is not capable of identifying some classes. As an example, for the image in the center,

the CA-LSTM wrongly predicts mixed forest and natural grassland classes instead

of coniferous forest and transitional woodland/shrub classes. However, these classes

are accurately predicted by the proposed approach. As another example, for the

bottom image, the CA-LSTM does not provide any correct prediction, whereas the

proposed approach correctly predicts all the classes. These results, again, prove that

the proposed approach more accurately describes the complex spatial and spectral

content of RS images compared to the CA-LSTM.
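The figures quoted in the comparison with Table 3.4 appear to mix two conventions: gains in recall and $F_2$-Score are differences in percentage points, while reductions in ranking loss are relative to the baseline value. The short Python sketch below reproduces the VGG16 comparison under that reading; the reading itself is inferred from the table values, and the function names are hypothetical.

```python
def percentage_point_gain(ours, baseline):
    """Absolute difference between two scores that are already given in %."""
    return ours - baseline

def relative_reduction(ours, baseline):
    """Reduction of a loss-type metric, in % of the baseline value."""
    return 100.0 * (baseline - ours) / baseline

# Values copied from Table 3.4 (NP is in millions of parameters).
vgg16 = {"R_macr": 37.8, "F2_smpl": 64.2, "RL": 3.3, "NP": 134.4}
proposed = {"R_macr": 52.8, "F2_smpl": 69.4, "RL": 2.6, "NP": 1.1}

r_macr_gain = percentage_point_gain(proposed["R_macr"], vgg16["R_macr"])     # about 15 points
f2_smpl_gain = percentage_point_gain(proposed["F2_smpl"], vgg16["F2_smpl"])  # about 5.2 points
rl_reduction = relative_reduction(proposed["RL"], vgg16["RL"])               # about 21 %
param_ratio = vgg16["NP"] / proposed["NP"]                                   # about 122 times
```

Under this reading, a parameter ratio above 100 corresponds to the stated reduction of more than two orders of magnitude.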

3.5 Conclusion

In this chapter, we have introduced a novel DL based approach for multi-label remote

sensing image scene classiﬁcation. The proposed approach is made up of three

main steps. The ﬁrst step achieves spatial and spectral characterization of image

local areas by a novel $K$-Branch CNN, which includes spatial resolution specific

CNN branches. The second step initially estimates the multiple attention scores

to identify the importance levels (i.e., scores) of different image local areas. This

is achieved by the novel bidirectional LSTM-based multi-attention strategy. Then,

each image is represented by a global descriptor deﬁned on the basis of the attention

scores. In the third step, images modeled by the multi-attention driven global

descriptors are classiﬁed and multi-label predictions are obtained. Experimental

results obtained on the BigEarthNet (which is a large-scale Sentinel-2 benchmark

archive) demonstrate that the proposed approach signiﬁcantly improves the multi-

label scene classiﬁcation accuracy compared to the well known deep CNNs and

the state-of-the-art attention driven multi-label RS image classiﬁcation approach.

Moreover, the proposed approach provides a computationally more efﬁcient solution

for multi-label classiﬁcation problems due to the signiﬁcant reduction in the number

of model parameters. Decreasing the model complexity reduces the risk of over-

ﬁtting (which also contributes to the improvement in the classiﬁcation accuracy). All

the results conﬁrm that the proposed approach is much more suitable to be used

within the operational RS scene classiﬁcation scenarios, where the images contain

highly complex spatial and spectral information content. The main reasons for the

success of the proposed approach are summarized as follows:

• Due to the proposed $K$-Branch CNN (which includes a specialized branch in
terms of the DL techniques utilized throughout layers for the set of image
bands with the same spatial resolution), the proposed approach significantly
improves the characterization of complex spatial and spectral content of high-dimensional
RS images with high-spatial resolution. Moreover, $K$-Branch CNN
leads to a significant reduction on the computational complexity of the entire
approach by reducing the number of model parameters.

• Due to the proposed multi-attention strategy (which efficiently exploits the
bidirectional LSTM sequences on the local descriptors of each RS image to
estimate the multi-attention scores), the proposed approach accurately extracts
and exploits the importance levels of image local areas which are then used to
define the global descriptors.

It is worth noting that although in our experiments we have used the Sentinel-2

multispectral images (which include 13 bands associated to three different spatial

resolutions), the proposed approach can be used with any multispectral RS image.

This can be achieved by selecting: i) the number $K$ of branches as the total number of
different spatial resolutions associated to the considered RS image bands; and ii) the
proper values of the hyperparameters for each branch in the $K$-Branch CNN. If all
the image bands are associated to the same spatial resolution value, the $K$-Branch
CNN turns into a single branch CNN (i.e., $K = 1$). It is also important to note that

when RS image bands with varying spatial resolutions are considered, the most

straightforward way is to apply interpolation to the lower spatial resolution bands

and then to use a single-branch CNN. However, the experimental results show that

the use of interpolation may lead to a loss on the scene classiﬁcation accuracy.
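The choice of $K$ as the number of distinct spatial resolutions can be illustrated with a short Python sketch that groups bands by resolution. The band identifiers (`"B02"` and so on) and the function name are hypothetical; the resolutions follow the band grouping described in the experimental design of this chapter.

```python
from collections import defaultdict

# Band-to-resolution mapping (in metres) for the Sentinel-2 bands used in this chapter.
SENTINEL2_BAND_RESOLUTION_M = {
    "B02": 10, "B03": 10, "B04": 10, "B08": 10,
    "B05": 20, "B06": 20, "B07": 20, "B8A": 20, "B11": 20, "B12": 20,
    "B01": 60, "B09": 60,
}

def group_bands_by_resolution(band_resolution):
    """Return {resolution: [bands]}; the number of groups is the number of branches K."""
    groups = defaultdict(list)
    for band, resolution in band_resolution.items():
        groups[resolution].append(band)
    return dict(sorted(groups.items()))

branches = group_bands_by_resolution(SENTINEL2_BAND_RESOLUTION_M)
K = len(branches)  # one CNN branch per distinct spatial resolution
```

For a sensor whose bands all share one resolution, the same function returns a single group, which corresponds to the single branch case $K = 1$.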

As a ﬁnal remark, it is worth noting that to deﬁne the local areas of each image,

we simply divide images into non-overlapping blocks. As a future work, we plan

to apply a strategy for an adaptive deﬁnition of local areas based on the semantic

content of RS images that can further improve the classiﬁcation accuracy. Moreover,

we also plan to develop a data summarization strategy [126] instead of stacking local

descriptors in the second step of the proposed approach.

Chapter 4

Remote Sensing Image Similarity

Learning Through Informative and

Representative Triplets for Multi-Label

Image Retrieval

Learning the similarity between RS images forms the foundation for CBIR. Recently,

deep metric learning approaches that map the semantic similarity of images into an
embedding (metric) space have become very popular in RS. A common approach
for learning the metric space relies on the selection of triplets of similar (positive)
and dissimilar (negative) images to a reference image called an anchor. Choosing

triplets is a difﬁcult task particularly for multi-label RS CBIR, where each training

image is annotated by multiple class labels. To address this problem, in this chapter,

we propose a novel triplet sampling method in the framework of DNNs deﬁned for

multi-label RS CBIR problems. The proposed method selects a small set of the most

representative and informative triplets based on two main steps. In the ﬁrst step, a

set of anchors that are diverse to each other in the embedding space is selected from

the current mini-batch using an iterative algorithm. In the second step, different

sets of positive and negative images are chosen for each anchor by evaluating the

relevancy, hardness and diversity of the images among each other based on a novel

strategy. Experimental results obtained on two multi-label benchmark archives

show that the selection of the most informative and representative triplets in the

context of DNNs results in: i) reducing the computational complexity of the training

phase of the DNNs without any signiﬁcant loss on the performance; and ii) an

increase in learning speed since informative triplets allow fast convergence. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/image-retrieval-from-triplets. This chapter is mainly based on the following publications:

• G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representative triplet selection for multilabel remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI: 10.1109/TGRS.2021.3124326.

• G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and retrieval in large remote sensing archives,” in Deep Learning for the Earth Sciences: A

Chapter 4. Informative & Representative Triplet Selection for CBIR 44

comprehensive approach to remote sensing, climate science and geosciences, Hoboken,

NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.

• G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse triplet sampling method for multi-label remote sensing image retrieval,” in Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, 2022, pp. 5–8. DOI: 10.1109/M2GARSS52314.2022.9839759.

4.1 Introduction

One of the most rapidly emerging applications in RS is the accurate retrieval of RS images

from fast-growing archives. Thus, the development of content-based image retrieval

(CBIR) methods, which aim to search for RS images similar to a query image based

on their semantic content, has recently attracted great attention. The performance

of any CBIR system relies on its capability to learn discriminative and robust image

representations to describe the complex semantic content of RS images.

Conventional CBIR systems exploit hand-crafted features to describe the content of

images. As an example, Wang and Newsam present a retrieval system employing

the well-known scale-invariant feature transform (SIFT) to extract bag-of-visual-

words representations of image features [127]. Aptoula introduces the use of bag-

of-morphological-words representations for local texture descriptors [128]. In [129],

a comparative analysis of local binary patterns (LBP) that capture local patterns

between neighboring pixels is presented. Chaudhuri et al. present a method that

represents image content by a graph, where the graph nodes describe the image

region properties and the edges represent the spatial relationships among the regions

[93]. Binary hash codes obtained through kernel-based hashing methods are found

effective for describing RS images in [130]. After extracting the image features, the most similar images with respect to a query image can be found by performing the $k$-nearest neighbor ($k$-nn) search algorithm. In the case of graph-based image representations, graph comparison methods such as the inexact graph matching approach proposed by Chaudhuri et al. [131] can be used. The images represented by binary hash codes can be searched and retrieved by using the computationally efficient Hamming distance [130].

The above-mentioned CBIR systems cannot simultaneously optimize feature learning

and image retrieval, and thus result in a limited capability to represent the high-

level semantic content of RS images. This issue leads to insufﬁcient search and

retrieval performance. To overcome this problem, CBIR systems based on DNNs

have been recently presented in RS. As an example, Li et al. propose a method

that fuses deep features and hand-crafted features [132]. This method exploits four

convolutional neural networks (CNNs) to extract features at different steps and

with different coarse levels. Then, these deep features are fused with traditional

image descriptors such as LBPs and SIFT to be used in the retrieval process. A

convolutional autoencoder is used by Tang et al. to obtain deep bag-of-words image

descriptors in [28]. To this end, a reconstruction loss function that minimizes the

error between the input and the extracted descriptors is considered. Imbriaco et al. extract local convolutional features and aggregate them into a global descriptor, where the deep features are extracted through a pre-trained model without any fine-tuning [9]. Boualleg and Farrah address the semantic gap between low-level features and high-level perception of semantic similarity in [133]. This is achieved by using a CNN to detect semantic concepts and a relevance feedback strategy to ensure that CBIR results match with a query image. Sabahi et al. address the above-mentioned semantic gap by employing a recurrent neural network to model the human visual memory [134].

FIGURE 4.1: An example of three triplets consisting of images from BigEarthNet-S2. Each triplet given in different rows consists of an anchor (in blue frame), a positive image (in green frame), and a negative image (in red frame). The associated multi-labels are given below the respective images. (Multi-labels shown in the figure, in reading order: Arable land, Pastures, Coniferous forest | Arable land, Pastures | Mixed forest, Inland wetlands | Arable land, Pastures, Complex cultivation patterns | Arable land, Pastures, Complex cultivation patterns | Broad-leaved forest, Coniferous forest | Pastures, Coniferous forest | Arable land, Pastures, Coniferous forest | Marine waters.)

In recent years, deep metric learning (DML) based methods that aim at learning a

feature space (in which similar images are close to each other) have attracted attention

in RS. Current DML models are mostly trained using a triplet loss function defined over three images: i) an anchor image; ii) a positive image that is similar to the anchor;

and iii) a negative image that is dissimilar to the anchor [38]. An example of triplets

constructed from BigEarthNet-S2 can be seen in Fig. 4.1. A difﬁcult task in DML is to

construct the set of triplets. A simple strategy is to deﬁne triplets from an existing

training set of labeled images. Roy et al. apply a strategy that: i) randomly selects

an anchor from a mini-batch of training images; and then ii) randomly chooses one

positive image that has the same class label as the anchor, while selecting one negative

image that has a different class label [27]. Similarly, Lai et al. select triplets randomly

based on the class labels of training images to train an end-to-end model for hashing

[135]. For each anchor image, there can be several positive and negative images.

Thus, random selection does not guarantee the selection of the most representative

and informative images to the anchor and can result in the construction of so-called

trivial triplets (see Section 4.2 for details). We would like to note that one can also

exploit all the images in the mini-batch to construct triplets, as suggested in [15].

However, this choice signiﬁcantly increases the total number of triplets and thus the

computational complexity of the training phase of the retrieval system [39], [40].

To overcome the limitation of random selection, DML methods that evaluate the hardness of images during the sampling process have been introduced in the computer vision (CV) literature (see Section 4.2 for details). To our knowledge, most

of the triplet sampling methods in CV assume that each image is annotated by a

single label associated with the most signiﬁcant content of the considered image and

thus rely on single-label image annotations to decide which images are positive or

negative for a given anchor image. However, RS images typically consist of multiple

classes and thus can simultaneously be associated with different class labels (i.e.,

multi-labels). From the DML perspective, the selection of triplets from training

images annotated by multi-labels is more complex than that from training images

labeled by single-labels. To achieve accurate DML in multi-label RS CBIR, methods

that accurately select a set of triplets from multi-label training images are needed.

To address this problem, we propose a novel triplet sampling method in the framework of DML designed for multi-label RS CBIR problems. Unlike the existing triplet sampling methods, the proposed method aims to select a small set of triplets from each mini-batch of multi-label training images. To this end, the proposed method consists of two consecutive steps. In the first step, a small number of diverse anchors is selected based on a simple but efficient iterative algorithm. In the second step, relevant, hard and diverse positive and negative images with respect to each anchor are chosen based on a novel strategy. Then, the triplets are constructed from the selected anchors and their respective positive and negative images. Based on these consecutive steps, the proposed method constructs a small number of the most informative and representative triplets to drive DML, resulting in an accurate CBIR and also in a reduced training complexity for the considered DNN. It is worth noting that the proposed triplet sampling method is independent of the considered DNN architecture, and therefore can be used within any DNN presented in the literature. In the experiments, different DNN architectures are considered, while the $k$-nn algorithm is used for the retrieval process after the characterization of the image descriptors through the considered method. Experiments carried out on two multi-label RS benchmark archives demonstrate the effectiveness of the proposed method.

The rest of the chapter is organized as follows: Section 4.2 presents the related

works on triplet sampling. Section 4.3 introduces the proposed method. Section

4.4 describes the considered datasets and the experimental setup, while Section 4.5

provides the experimental results. Section 4.6 concludes the chapter.

4.2 Related Works

FIGURE 4.2: An abstract representation of triplet selection and the progress of the feature space update (panels: before and after the update). Blue arrows indicate reducing distances for updating the embedding, while red arrows indicate increasing the distances. $X_a$ marks a chosen anchor, $P_1$, $P_2$ and $P_3$ are positive images, and $N_1$, $N_2$ and $N_3$ are negative images in different triplets. The triplet $(X_a, P_1, N_1)$ is trivial because it already satisfies the margin, and thus the corresponding distances are not updated. The triplet $(X_a, P_2, N_2)$ leads to a relatively small error and the images are pushed and pulled a little. The triplet $(X_a, P_3, N_3)$ violates the margin greatly and causes a significant error. $P_3$ is a positive image, but very far from the anchor, so it is considered a hard positive image. Correspondingly, $N_3$ is a hard negative image.

The development of DML methods that aim to learn a metric space (in which semantically similar images are close to each other) is important for an accurate CBIR. It has been shown that the triplet-based DML methods perform considerably well for

the CBIR tasks [27], [136]. The triplet-based DML methods use triplets of images to

learn a metric space by means of the triplet loss [38]. The optimization objective is to

minimize the feature distance between the anchor and its positive sample (i.e., image)

while maximizing the feature distance between the anchor and the negative sample.

The goal is to ensure that the positive sample is closer to the anchor than the negative

sample by at least a margin. During the training of a triplet-based DML method, for

the triplets that consist of a positive image inside the margin and the negative image

outside the margin, a zero value triplet loss is obtained, leading to small gradient

values and slow convergence. For the triplets that consist of a positive image visually

less similar to the anchor (i.e., outside the margin) and a negative image visually more

similar to the anchor (i.e., inside the margin), a high triplet loss value is obtained.

High loss values lead to large gradient values, and thus the parameters of the model

are updated. When a positive image is far from the margin, it is called a hard positive image. A negative image is called a hard negative if it is inside the margin and

very close to the anchor. If the distance between the anchor and positive image of a

triplet is higher than the distance between the anchor and negative image, the triplet

is considered as a hard triplet. In Fig. 4.2, an abstract representation of the triplet

selection and the feature space update is demonstrated. The images $P_1$, $P_2$ and $P_3$ are the positive images for the anchor $X_a$ in different triplets, while images $N_1$, $N_2$ and $N_3$ are the negative images for the anchor $X_a$. After updating the embedding (metric) space using the selected triplets, $P_2$ and $P_3$ are pulled closer to the anchor $X_a$, while $N_2$ and $N_3$ are pushed away from the anchor $X_a$ towards the outside of the margin. The positive image $P_1$ is inside the margin and the negative image $N_1$ is outside the margin, and thus the triplet $(X_a, P_1, N_1)$ is a trivial triplet. The positive image $P_3$ is a hard positive image for the anchor $X_a$, since it is outside the margin and far from the anchor image. The negative image $N_3$ is a hard negative image, as it is very close to the anchor. The triplet $(X_a, P_3, N_3)$ is a hard triplet, and causes a high loss value to update the parameters of the model. Since the trivial triplets are not sufficiently informative and lead to slow convergence, the use of hard triplets has been considered to overcome this problem.
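The distinction between trivial and hard triplets can be illustrated with a minimal Python sketch. It assumes Euclidean distances and the standard hinge form of the triplet loss; the function names and the label "margin-violating" for the intermediate case are hypothetical and only serve this illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(D(a, p) - D(a, n) + margin, 0)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


def describe_triplet(anchor, positive, negative, margin=0.2):
    """Classify a triplet following the definitions given in the text."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    if d_ap + margin <= d_an:
        return "trivial"  # margin already satisfied, zero loss
    if d_ap > d_an:
        return "hard"  # positive is farther from the anchor than the negative
    return "margin-violating"  # hypothetical name for the in-between case


anchor = (0.0, 0.0)
print(describe_triplet(anchor, (0.1, 0.0), (1.0, 0.0)))  # trivial triplet
print(describe_triplet(anchor, (1.0, 0.0), (0.1, 0.0)))  # hard triplet
```

Only the non-trivial triplets produce a non-zero loss and therefore contribute to the parameter update.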

Most of the methods in RS do not consider the hardness of the images in the selected triplets and exploit the random triplet selection strategy as mentioned in the introduction [15], [19], [27]. Unlike in RS, in the CV community the use of triplets is more widespread and the importance of hardness has been widely studied [137]–[140]. As an example, Xuan et al. propose a triplet selection strategy that selects the closest positive sample (easy positive) and the closest negative (hard negative) for each anchor [137].

Yuan et al. propose a hard-aware deeply cascaded (HDC) embedding method [140].

For each anchor and a selected positive sample, HDC selects the negative samples at

multiple hardness levels to construct different triplets. Hardness levels are deﬁned

based on the distances in the embedding space. Yang et al. investigate the importance of hard positive images by combining a positive image with all negative image pairs in the batch [40]. Then, the positive images are weighted and hard positives are preferred. Ge et al. propose a hard triplet selection method that constructs a

class-level hierarchical tree of image features for the whole dataset, where visually

similar classes are merged recursively [139]. Then, the selection of the triplets is

done based on a distance computed between an anchor image and different pairs

of image classes through the hierarchical tree. In addition to the methods that aim

to select triplets, there are also several works that focus on reformulating the triplet

loss function to emphasize the effect of hard triplets [141]–[143]. As an example,

Zhang et al. adapt the focal loss that is initially deﬁned for classiﬁcation problems

and propose an extended version for triplets as an alternative to the triplet loss [142].

This loss function ensures that more importance is given to hard triplets than easier

ones, and thus the model can learn from the most informative triplets and converge

faster. Kim et al. develop an adapted version of the triplet loss for pose estimation [141]. This loss function preserves the distance ratios from the label space in the embedding space. In [143], the multi-similarity loss function is proposed to reformulate the triplet loss with a weighting strategy. By using the weighting strategy,

this loss function considers the relative similarity of all positive and all negative

samples in a mini-batch. In [144], the multi-class N-pair loss function is proposed

to generalize the triplet loss function for multiple negative images associated with

an anchor. In detail, for each anchor image, one positive image and several negative

images are selected as hard negatives from different negative classes. In [19], the

dual-anchor triplet loss function is introduced as an extension of the triplet loss. In

addition to the objectives of the triplet loss, this loss function also aims at increasing

the distance between the positive and negative images for a given anchor. Wang et al.

extend the concept of triplets to the whole mini-batch, where all available images are

ﬁrst sorted and then divided into a positive set and a negative set [145]. Afterward,

an extension of the triplet loss is used to force a margin between the two sets by

using all the images. This loss function employs a weighting strategy to increase the

importance of the hard negative images. In [146], it is shown that when an accurate

sampling strategy is considered, deep learning (DL) models with different modiﬁed

loss functions provide similar accuracies. This indicates that triplet selection is as important as the loss function in the framework of DML. Most of the triplet-based methods in CV assume that a single label is associated with each image. However, RS images typically consist of multiple classes and are associated with multi-labels, which makes selecting triplets more complex than in the single-label scenario.

FIGURE 4.3: A block scheme of the proposed triplet sampling method to drive the training phase of a DNN for multi-label CBIR problems. (Blocks shown: image embedding extraction; the proposed informative and representative triplet selection method, consisting of diverse anchor selection (DAS) and relevant, hard and diverse positive and negative image selection (RHDIS); triplet loss calculation; and DNN model updating.)

4.3 Proposed Method

4.3.1 Problem Formulation

Let $\mathcal{X} = \{X_1, \ldots, X_M\}$ be an archive consisting of $M$ images, where $X_j$ is the $j$-th image in the archive. We assume that a training set $\mathcal{X}_T \subset \mathcal{X}$ is available. Each image is annotated with a set of class labels, which describe the content of the image. Let $\mathcal{L} = \{1, 2, \ldots, |\mathcal{L}|\}$ be the set of all possible class labels. Each image $X_j \in \mathcal{X}_T$ is associated with a multi-label vector $L_j = \{l_1, \ldots, l_{|\mathcal{L}|}\}$, where $l_i = 1$ if the class label $i \in \mathcal{L}$ is associated with the image $X_j$, and $l_i = 0$ otherwise. Each training image is annotated with at least one class label.
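As a small illustration of this notation, the following Python sketch encodes the set of class labels of one image as a binary multi-label vector. The function name and the class list are hypothetical; the label names are borrowed from the UCMerced examples in Fig. 4.4, and their order is arbitrary.

```python
def encode_multi_label(image_labels, class_names):
    """Return the binary multi-label vector of one image.

    Entry i is 1 if class_names[i] is associated with the image, else 0.
    """
    unknown = set(image_labels) - set(class_names)
    if unknown:
        raise ValueError(f"unknown class labels: {sorted(unknown)}")
    if not image_labels:
        raise ValueError("each training image needs at least one class label")
    return [1 if name in image_labels else 0 for name in class_names]


# Hypothetical class list (a subset of classes, in arbitrary order).
CLASS_NAMES = ["buildings", "cars", "grass", "pavement", "sand", "sea", "trees"]

vector = encode_multi_label({"sand", "sea"}, CLASS_NAMES)
print(vector)
```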

We propose a novel triplet sampling method in the framework of DL-based multi-label CBIR. The proposed method aims: i) to select a small set of informative as well as representative triplets from each training mini-batch $\mathcal{B}$; and ii) to accurately describe the complex semantic content of RS images. To this end, it consists of two consecutive steps: 1) selection of anchors that are diverse to each other in the feature space; 2) selection of positive and negative images with respect to each selected anchor. To achieve the latter step, we jointly evaluate the relevancy, hardness and diversity of the images during the selection (see Fig. 4.3). The proposed method

is independent of the considered DL model and can be used with any DL model

designed for CBIR problems. In the following subsections, the two steps of the

proposed method are described in detail.

4.3.2 Diverse Anchor Selection (DAS)

The first step of the proposed method aims to find a small set of the most representative anchors. As mentioned before, all samples (i.e., images) in the mini-batch could be selected as anchors. However, such an approach results in a large and redundant set of triplets and increases the computational complexity of the training. In detail, the complexity of the training grows cubically if all possible triplets are exploited [142]. Selecting a small set of anchors can significantly reduce the computational complexity of the training. To this end, we introduce a simple but efficient diverse anchor selection (DAS) strategy. The DAS strategy aims to select diverse anchors from the mini-batch that, when included in the set of triplets, can improve the retrieval performance. To this end, it employs an iterative algorithm to evaluate the diversity in the feature space among the samples from the mini-batch.

The algorithm starts with an empty set $\mathcal{A} = \emptyset$. The first anchor is selected randomly from the current mini-batch $\mathcal{B}$ and added into $\mathcal{A}$. At each iteration, a new anchor that is associated with the highest distance from all already selected anchors is selected from $\mathcal{B}$. In detail, at the $h$-th iteration, the $h$-th anchor image $X_h$ is selected as:

$$X_h = \underset{X_b \in \mathcal{B} \setminus \mathcal{A}}{\arg\max} \; \max_{X_a \in \mathcal{A}} D(X_b, X_a), \tag{4.1}$$

where $D(\cdot, \cdot)$ is the feature distance measure, defined as the Euclidean distance between two images in the feature space. It is worth noting that the Euclidean distances are normalized based on min-max normalization. The steps are iterated until the desired number of anchors is selected. Due to the selection of anchors that are as distant as possible to each other in the feature space, the diversity among the selected anchors with respect to their correlation in the feature space is maximized. This results in selecting a representative set of anchors, forming the basis for the positive and negative image selection step.
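The iterative anchor selection can be sketched in a few lines of Python. This is a minimal illustration that follows Eq. (4.1) as written (an outer argmax over the remaining images and an inner max over the already selected anchors), not the released implementation. The function names, the `first_index` argument (which fixes the otherwise random first anchor for reproducibility) and the `seed` argument are assumptions of this sketch.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalized_distances(features):
    """Pairwise Euclidean distances, min-max normalized to [0, 1]."""
    dist = [[euclidean(u, v) for v in features] for u in features]
    flat = [d for row in dist for d in row]
    low, high = min(flat), max(flat)
    if high == low:  # all features identical: no diversity to measure
        return [[0.0] * len(features) for _ in features]
    return [[(d - low) / (high - low) for d in row] for row in dist]


def diverse_anchor_selection(features, num_anchors, first_index=None, seed=0):
    """Return the mini-batch indices of the selected anchors."""
    n = len(features)
    if n == 0 or num_anchors <= 0:
        return []
    dist = normalized_distances(features)
    if first_index is None:
        # The first anchor is drawn at random from the mini-batch.
        first_index = random.Random(seed).randrange(n)
    anchors = [first_index]
    while len(anchors) < min(num_anchors, n):
        candidates = [b for b in range(n) if b not in anchors]
        # Eq. (4.1): candidate with the largest distance to a selected anchor.
        best = max(candidates, key=lambda b: max(dist[b][a] for a in anchors))
        anchors.append(best)
    return anchors


features = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(diverse_anchor_selection(features, num_anchors=3, first_index=0))
```

Since the min-max normalization is monotonic, it does not change which image is selected in this step; it matters when the distances are later combined with the class label similarities.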

4.3.3 Relevant, Hard and Diverse Positive-Negative Image Selection (RHDIS)

The second step of the proposed method aims to select positive and negative images

for each anchor that are informative (i.e., relevant and hard) and representative (i.e.,

diverse to each other in the feature space). This is achieved by a novel relevant, hard

and diverse positive and negative image selection strategy (RHDIS). The relevancy

of an image to an anchor is deﬁned based on its multi-label similarity with respect

to the considered anchor. In detail, a positive image can be associated with high

relevancy to an anchor if their class label similarity is high and vice versa. A negative

image can be relevant to an anchor if its class label similarity is small and vice versa.

The hardness of an image is associated with its distance to the considered anchor in

the feature space. In detail, a positive image can be hard if its distance to the anchor

in the embedding space is high, whereas a negative image can be considered hard if

its distance to the anchor is small.

The proposed RHDIS strategy initially evaluates the informativeness (i.e., relevancy and hardness) of the images to select the candidates for positive and negative images related to each anchor image. Then, the representative (diverse) ones among the most informative positive and negative images are selected to construct the triplets. To this end, for each image $X_b$ in the mini-batch $\mathcal{B}$, informativeness scores $I_p(X_a, X_b)$ (which shows if $X_b$ is a candidate positive image) and $I_n(X_a, X_b)$ (which shows if $X_b$ is a candidate negative image) with respect to anchor $X_a$ are initially computed as:

$$I_p(X_a, X_b) = \beta \times S(X_a, X_b) + (1 - \beta) \times D(X_a, X_b), \tag{4.2}$$

$$I_n(X_a, X_b) = \beta \times [1 - S(X_a, X_b)] + (1 - \beta) \times [1 - D(X_a, X_b)], \tag{4.3}$$

where $S(X_a, X_b)$ shows the class label similarity between the images $X_a$ and $X_b$. $S(X_a, X_b) \in [0, 1]$ is calculated based on the soft pair-wise similarity measure (i.e., the distance between the multi-label vectors $L_a$ and $L_b$) [147]. If $S(X_a, X_b)$ is high, $X_b$ can be considered as a relevant positive image, whereas if $[1 - S(X_a, X_b)]$ is high, $X_b$ can be considered as a relevant negative image. $D(X_a, X_b)$ is the distance between $X_a$ and $X_b$ in the embedding space and measures the hardness of images as mentioned before. If both $D(X_a, X_b)$ and $S(X_a, X_b)$ are high, the image $X_b$ can be considered as a relevant and hard positive image. If both $[1 - S(X_a, X_b)]$ and $[1 - D(X_a, X_b)]$ are high, the image $X_b$ can be considered as a relevant and hard negative image. $\beta \in [0, 1]$ is the weighting parameter and can be adjusted to give more importance to either the relevancy or the hardness of the image.
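The two informativeness scores can be sketched as follows. The functions take the label similarity and the normalized embedding distance as inputs, both assumed to lie in [0, 1]. Since the soft pair-wise similarity measure of [147] is not reproduced here, the sketch uses the Jaccard overlap of the binary multi-label vectors as a stand-in; this choice, like all function names, is an assumption of the example.

```python
def jaccard_similarity(labels_a, labels_b):
    """Stand-in for S: overlap of two binary multi-label vectors, in [0, 1].

    Assumption: the thesis uses the soft pair-wise similarity of [147];
    Jaccard overlap is used here only to keep the example self-contained.
    """
    inter = sum(1 for x, y in zip(labels_a, labels_b) if x and y)
    union = sum(1 for x, y in zip(labels_a, labels_b) if x or y)
    return inter / union if union else 0.0


def positive_informativeness(similarity, distance, beta=0.5):
    """Eq. (4.2): high for label-similar (relevant) and distant (hard) images."""
    return beta * similarity + (1.0 - beta) * distance


def negative_informativeness(similarity, distance, beta=0.5):
    """Eq. (4.3): high for label-dissimilar (relevant) and close (hard) images."""
    return beta * (1.0 - similarity) + (1.0 - beta) * (1.0 - distance)


s = jaccard_similarity([1, 1, 0], [1, 0, 0])  # one shared label out of two
d = 0.8  # hypothetical normalized embedding distance
print(positive_informativeness(s, d), negative_informativeness(s, d))
```

With these definitions the two scores of a given image pair always sum to one, so an image that is a strong positive candidate for an anchor is a weak negative candidate, and vice versa.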

To construct a set $\mathcal{P}_{X_a} = \{P_1, \ldots, P_C\}$ of positive images for an anchor $X_a$, the image in the mini-batch associated with the highest $I_p$ score with respect to $X_a$ is chosen as the first positive image. Then, the next images are iteratively selected. We apply an iterative approach similar to the DAS introduced in the first step to select the most representative images. At the $t$-th iteration, the $t$-th positive image $P_t$ is selected as:

$$P_t = \underset{X_b \in \mathcal{B} \setminus \mathcal{P}_{X_a}}{\arg\max} \Big[ \gamma \times I_p(X_a, X_b) + (1 - \gamma) \times \max_{P_c \in \mathcal{P}_{X_a}} D(X_b, P_c) \Big]. \tag{4.4}$$

This process is repeated until the desired number of positive images is selected. The parameter $\gamma \in [0, 1]$ controls the influence of the diversity term.

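To construct a set $\mathcal{N}_{X_a} = \{N_1, \ldots, N_C\}$ of negative images for each anchor $X_a$, the image with the highest $I_n$ score in the mini-batch with regard to $X_a$ is selected as the first negative image. Afterward, the subsequent negative images are iteratively selected. At the $t$-th iteration, the $t$-th negative image $N_t$ is selected as:

$$N_t = \underset{X_b \in \mathcal{B} \setminus \mathcal{N}_{X_a}}{\arg\max} \Big[ \gamma \times I_n(X_a, X_b) + (1 - \gamma) \times \max_{N_c \in \mathcal{N}_{X_a}} D(X_b, N_c) \Big]. \tag{4.5}$$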
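Since Eqs. (4.4) and (4.5) share the same structure, both selections can be sketched with one greedy routine that is called once with the $I_p$ scores and once with the $I_n$ scores of an anchor. The following Python sketch is a simplified illustration: the function name and the data layout (a dictionary of scores and a precomputed normalized distance matrix) are assumptions, and it is left to the caller to decide which mini-batch images are passed as candidates (e.g., excluding the anchor itself).

```python
def select_informative_and_diverse(candidates, info, dist, count, gamma=0.1):
    """Greedy selection following the structure of Eqs. (4.4) and (4.5).

    candidates: mini-batch indices that may be selected
    info:       dict mapping index -> informativeness score w.r.t. the anchor
    dist:       normalized pairwise distance matrix of the mini-batch
    count:      desired number of selected images
    gamma:      weight of the informativeness term (1 - gamma weights diversity)
    """
    remaining = list(candidates)
    if not remaining or count <= 0:
        return []
    # The first image is the one with the highest informativeness score.
    first = max(remaining, key=lambda b: info[b])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < count:
        best = max(
            remaining,
            key=lambda b: gamma * info[b]
            + (1.0 - gamma) * max(dist[b][c] for c in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy mini-batch: index 0 is the anchor, 1 to 3 are candidates.
dist = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 1.0],
    [0.0, 0.1, 0.0, 0.5],
    [0.0, 1.0, 0.5, 0.0],
]
info = {1: 0.9, 2: 0.8, 3: 0.2}
print(select_informative_and_diverse([1, 2, 3], info, dist, count=2, gamma=0.5))
```

In this toy example, image 2 is more informative than image 3 but lies very close to the already selected image 1, so the diversity term favors image 3 unless gamma is set close to one.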

This selection strategy ensures that the selected positive and negative images for each anchor are informative (i.e., hard and relevant) and representative (i.e., diverse among each other in the feature space). After selecting the final set of triplets from the mini-batch $\mathcal{B}$, the triplet loss function is calculated as:

$$L = \sum_{\substack{\forall X_a \in \mathcal{A} \\ \forall P_t \in \mathcal{P}_{X_a} \\ \forall N_t \in \mathcal{N}_{X_a}}} \max\big([D(X_a, P_t) - D(X_a, N_t) + \alpha], 0\big), \tag{4.6}$$

where $\alpha$ is a margin enforced between positive and negative images for an anchor image. After an end-to-end training of the whole neural network by minimizing the triplet loss and learning the network parameters, the descriptors (i.e., features) of the images in $\mathcal{X} \setminus \mathcal{X}_T$ are obtained. Then, the $k$ most semantically similar images with regard to a given query image $X_q \in \mathcal{X}$ are selected by comparing their descriptors based on the $k$-nn algorithm.
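The loss of Eq. (4.6) and the subsequent retrieval step can be sketched as follows. The sketch assumes that the sum runs over all combinations of the selected positive and negative images of each anchor, and that retrieval is a brute-force $k$-nn search with the Euclidean distance. Function names and the data layout are hypothetical, and no gradient computation is included.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_triplet_loss(features, triplet_sets, margin=0.2):
    """Sum of hinge terms over the selected triplets, as in Eq. (4.6).

    triplet_sets maps an anchor index to (positive indices, negative indices).
    Assumption: every positive is combined with every negative of the anchor.
    """
    total = 0.0
    for a, (positives, negatives) in triplet_sets.items():
        for p in positives:
            for n in negatives:
                d_ap = euclidean(features[a], features[p])
                d_an = euclidean(features[a], features[n])
                total += max(d_ap - d_an + margin, 0.0)
    return total


def knn_retrieve(query, archive, k):
    """Indices of the k archive descriptors closest to the query descriptor."""
    order = sorted(range(len(archive)), key=lambda i: euclidean(query, archive[i]))
    return order[:k]


features = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(batch_triplet_loss(features, {0: ([1], [2])}))  # margin satisfied
print(knn_retrieve((0.0, 0.0), features, k=2))
```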

FIGURE 4.4: An example of images from the UCMerced Land Use archive and the multi-

labels associated with them: (a) sand, sea (b) airplane, cars, grass, pavement (c) bare-soil, buildings,

grass (d) buildings, cars, pavement, trees.

4.4 Dataset Description and Experimental Design

4.4.1 Dataset Description

To evaluate the proposed method, we conducted experiments on two different multi-

label RS archives: BigEarthNet-S2 and UC Merced Land Use (UCMerced) archive [84].

In the experiments, for BigEarthNet-S2, we considered the images acquired over Ireland in the summer of 2017 (denoted as IRS-BigEarthNet). IRS-BigEarthNet contains 15,894 Sentinel-2 images, each of which is made up of 120 × 120 pixels for 10 meter bands, 60 × 60 pixels for 20 meter bands and 20 × 20 pixels for 60 meter bands. In the experiments, we excluded the 60 meter bands and applied bicubic interpolation to the 20 meter bands, which results in 10 bands, each of which has a size of 120 × 120 pixels. The class labels of the images were used based on the 19 class nomenclature. Images with snow cover, cloud cover and cloud shadows were excluded from training and evaluation.

The UCMerced archive consists of 2100 images selected from aerial orthoimagery with a spatial resolution of 30 cm. Each image has a size of 256 × 256 pixels. The

images are annotated with multi-labels by Chaudhuri et al. [93]. There are 17 classes

in total, with at least one and a maximum of seven class labels per image. Fig.

4.4 shows an example of images from this archive along with their multi-label

annotations.

The two benchmark archives differ greatly in size, complexity and characteristics.

This allows us to demonstrate the general applicability and success of the proposed

triplet sampling method in different scenarios. We randomly split UCMerced images

into 60% for training, 20% for validation and 20% for testing. For IRS-BigEarthNet,

the ofﬁcially provided splits into training, validation and evaluation sets were used.

During the training step, all triplets were sampled from the training set. Query

images were taken from the validation set, while image retrieval was applied to the

evaluation set.
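The random 60%/20%/20% split of the UCMerced images can be sketched as follows. The function name, the seed and the rounding of the subset sizes are assumptions of this illustration, since the exact procedure used in the experiments is not specified.

```python
import random


def split_indices(num_images, train=0.6, val=0.2, seed=0):
    """Randomly split image indices into training, validation and test sets."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    n_train = int(round(train * num_images))
    n_val = int(round(val * num_images))
    train_set = indices[:n_train]
    val_set = indices[n_train:n_train + n_val]
    test_set = indices[n_train + n_val:]  # the remaining images
    return train_set, val_set, test_set


train_set, val_set, test_set = split_indices(2100)
print(len(train_set), len(val_set), len(test_set))
```

For the 2100 UCMerced images this yields 1260 training, 420 validation and 420 test images.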

4.4.2 Experimental Design

In the experiments, different CNN architectures were considered as backbones, while

an additional fully connected layer was added to produce image embeddings. The

resulting CNNs were trained for image retrieval by means of the triplet loss. It is

worth noting that our method does not depend on a speciﬁc DL model architecture.

In our experiments, we evaluated three different CNN architectures: i) a shallow

convolutional neural network (S-CNN); ii) DenseNet-121 [148]; and iii) ResNet-50

[95]. S-CNN consists of three convolutional layers with 32, 32 and 64 filters having 5 × 5, 5 × 5 and 3 × 3 filter sizes, respectively. We added one fully connected (FC)

layer and one classiﬁcation layer to the output of last convolutional layer, while

zero padding for convolution operations and max-pooling between layers were used.

The last two architectures are well-known deep models, while the ﬁrst architecture

is an explicitly shallow model. All models were used without pre-training. The

size of mini-batch for IRS-BigEarthNet and UCMerced was selected as 300 and 100,

respectively. The training was performed for 100 epochs with the Adam optimizer,

using an initial learning rate of 0.001 (which was exponentially decayed every 5

epochs by 5%). The margin parameter $\alpha$ of the triplet loss was set to 0.2. The values of $\beta$ and $\gamma$ were set to 0.5 and 0.1, respectively, based on a grid search strategy.

All the experiments were conducted on NVIDIA Tesla V100 GPUs with 32 GBs of

memory. The results were provided in terms of the following evaluation metrics: accuracy, precision, recall and $F_1$ score [93]. These values were the average of the

values obtained by retrieving the 30 and 10 most similar images for IRS-BigEarthNet

and UCMerced, respectively.
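The learning rate schedule described above can be written down explicitly. The following sketch assumes a staircase interpretation, in which the rate is multiplied by 0.95 once every 5 epochs; the function name and this interpretation are assumptions.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.05, step=5):
    """Staircase exponential decay: reduce the rate by `decay` every `step` epochs."""
    return initial_lr * (1.0 - decay) ** (epoch // step)


schedule = [learning_rate(epoch) for epoch in range(100)]
print(schedule[0], schedule[5], schedule[99])
```

Under this assumption the rate is reduced 19 times over the 100 training epochs, i.e., the final rate is $0.001 \times 0.95^{19}$.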

We carried out different kinds of experiments in order to: 1) perform a sensitivity

analysis with respect to different network architectures and embedding sizes; 2)

conduct an ablation study of the proposed triplet sampling method; 3) compare our

method with different triplet sampling methods; and 4) compare our method with

state-of-the-art DML-based methods. To perform the ablation study, we compared the

proposed diverse anchor selection (DAS) strategy (see Section 4.3.2 for the details)

with two frequently used anchor selection strategies that are:

•

Batch anchor selection (BAS): This strategy selects each image in the mini-batch

as an anchor once and can be considered an upper bound strategy for the triplet

selection. This strategy does not miss any information provided by speciﬁc

triplets. However, it leads to a very high number of ﬁnal triplets that can be

redundant.

• Random anchor selection (RAS): This strategy selects a fixed number of anchors

from the mini-batch without any prior assumption. It is simple, but there is

no guarantee that the randomly chosen anchors provide a good basis for the

triplets.
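The two baseline anchor selection strategies above can be sketched as follows. This is a minimal illustration on image indices only; the function names, the `ratio` parameter and the use of a seeded random generator are assumptions made for the example.

```python
import random


def batch_anchor_selection(batch_size):
    """BAS: every image index in the mini-batch serves as an anchor once."""
    return list(range(batch_size))


def random_anchor_selection(batch_size, ratio=0.1, seed=None):
    """RAS: draw a fixed number of distinct anchors uniformly at random."""
    num_anchors = max(1, round(batch_size * ratio))
    return sorted(random.Random(seed).sample(range(batch_size), num_anchors))


bas = batch_anchor_selection(100)
ras = random_anchor_selection(100, ratio=0.1, seed=0)
print(len(bas), len(ras))
```

For a mini-batch of 100 images, BAS yields 100 anchors, whereas RAS with a ratio of 0.1 yields 10 anchors chosen without regard to their content.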

In the experiments, 10% of all possible anchors from the mini-batch were chosen for

the RAS and the proposed DAS strategies. We compared the proposed relevant, hard

and diverse positive-negative image selection (RHDIS) strategy (see Section 4.3.3 for

the details) with two baselines that are:

• Batch positive and negative image selection (BIS): This strategy uses all images

in the mini-batch. Each image is used once as a positive and once as a negative

image. It covers all possible triplets, leading to a very high number of final

triplets.

• Random positive and negative image selection (RIS): This strategy randomly

selects sets of positive and negative images and combines all of them into

triplets. Many of the resulting triplets may be trivial, but it requires no prior

knowledge and provides a lower bound baseline.
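To give a sense of how quickly the number of triplets grows when all images of a mini-batch are used, the following sketch computes a simple upper bound. It assumes that the positive and the negative are two distinct images that both differ from the anchor; label constraints can only lower this count. The function name is hypothetical.

```python
def max_triplets(num_anchors, batch_size):
    """Upper bound on (anchor, positive, negative) triplets in one mini-batch.

    Assumption: the positive and the negative are two distinct images, both
    different from the anchor; label constraints can only lower this count.
    """
    return num_anchors * (batch_size - 1) * (batch_size - 2)


for batch_size in (100, 300):
    all_anchors = max_triplets(batch_size, batch_size)        # every image is an anchor
    few_anchors = max_triplets(batch_size // 10, batch_size)  # 10% of the images are anchors
    print(batch_size, all_anchors, few_anchors)
```

With 100 images per mini-batch, the bound is 970,200 triplets when every image is an anchor and 97,020 when only 10 anchors are kept, which is why anchor selection and positive-negative selection are both worth restricting.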

In the experiments, we also assessed the effectiveness of the joint use of the above-mentioned

strategies with the proposed DAS and RHDIS for the selection of anchors as

well as positive and negative images. This is important as the anchor selection step

is independent of the positive and negative image selection step, and thus

the proposed selection strategies can be combined with other well-known strategies.

In the experiments, we also compared the proposed DAS-RHDIS method with two

triplet sampling methods: 1) the deep metric learning using triplet network, which

uses RAS for anchor selection and RIS for positive and negative image selection

(denoted as TNDML) [149]; and 2) enhancing remote sensing image retrieval using a

triplet deep metric learning network, which employs BAS for the anchor selection

and BIS for positive and negative image selection (denoted as RSDML) [15]. We also

compared the proposed DAS-RHDIS method with state-of-the-art DML methods for

CBIR: 1) the content-based medical image retrieval (CBMIR) system, which utilizes

a pair-wise similarity loss function to force all positive images to be close, while

separating all the negative images with a ﬁxed distance [150]; 2) the multi-similarity

loss with general pair weighting for deep metric learning (denoted as MSL) [143]; 3)

the dual-anchor triplet loss (denoted as DATL) proposed in [19]; and 4) the improved

deep metric learning with multi-class N-pair loss objective (denoted as NPL) [144].

For all the methods, we used the same CNN architecture and training setup as in our

method.

4.5 Experimental Results

4.5.1 Sensitivity Analysis of the Proposed Method

In this sub-section, we present the results of the sensitivity analysis for the proposed

triplet sampling method (denoted as DAS-RHDIS) in terms of different DL model

architectures and different embedding sizes. To analyze the proposed DAS-RHDIS

method in the framework of different DL models designed for multi-label RS CBIR,

we selected the CNN architectures of: i) S-CNN; ii) DenseNet-121; and iii) ResNet-50.

The embedding size for each architecture was set to 256. In Table 4.1, the results are

shown for the UCMerced archive. By assessing the table, one can observe that all

the considered DL model architectures provide a high performance. As an example,

although S-CNN is an explicitly shallow architecture, it achieves an F1 score of more than 50%,

as do DenseNet-121 and ResNet-50. This shows that the proposed DAS-RHDIS

method is architecture-independent. One can also see from the table that the best

scores under all metrics were obtained when ResNet-50 was utilized. As an example,

ResNet-50 provides almost 9% higher precision and 8.5% higher recall compared

to DenseNet-121. When compared with S-CNN, ResNet-50 leads to at least 14%

higher F1 score and accuracy. These results show that a proper selection of a DL

model architecture can improve performance. For the rest of the experiments, we

provided the results obtained with ResNet-50 due to its proven success.

TABLE 4.1: THE PERFORMANCE OF DIFFERENT DL MODEL ARCHITECTURES FOR THE
UCMERCED ARCHIVE.

Architecture   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
S-CNN          40.5           48.9            51.9         50.3
DenseNet-121   45.5           54.4            58.0         56.1
ResNet-50      54.5           63.3            66.5         64.8

TABLE 4.2: THE EFFECT OF VARYING EMBEDDING SIZES ON THE RETRIEVAL PERFORMANCE
FOR THE UCMERCED ARCHIVE.

Embedding Size   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
256              54.5           63.3            66.5         64.8
512              56.2           64.6            69.0         66.7
1024             56.8           65.3            70.0         67.5
2048             50.3           58.4            62.8         60.5

In Table 4.2, the results obtained by using different embedding sizes are shown for

the UCMerced archive. We evaluated the effect of the embedding sizes 256, 512, 1024

and 2048 used in the proposed DAS-RHDIS method. From the table, one can see that

the highest scores under all metrics are obtained when the embedding size is 1024.

Further increasing the embedding size to 2048 degrades the performance.

As an example, the proposed method with the embedding size of 1024 provides

a 7% higher F1 score compared to that of 2048. This is in line with the works in

literature, which demonstrate that beyond a certain size, adding any new embedding

dimension may not improve the performance [151]–[153]. By analyzing the table, one

can also observe that the lowest performance is obtained when the embedding size is

256. In this case, the F1 score is reduced by almost 3% compared to the embedding

size of 1024. Accordingly, for the rest of the experiments, we set the embedding size

to 1024. These results were also conﬁrmed through experiments obtained by using

the IRS-BigEarthNet archive (not reported due to space constraints).

4.5.2 Ablation Study

In this sub-section, we performed an ablation study to analyze the effectiveness of

the proposed DAS and RHDIS strategies. To demonstrate the effectiveness of the

proposed DAS strategy, we compare it with RAS and BAS strategies. Table 4.3 shows

the results associated with the different anchor strategies for the UCMerced archive

when the proposed RHDIS strategy is used for positive and negative image selection.

By analyzing the table, one can observe that the proposed DAS strategy provides the

highest scores under all the metrics compared to RAS and BAS. As an example, the

proposed DAS strategy provides more than 7% higher accuracy compared to RAS

under the same number of anchors (which is set to 10 in the experiments) when the

positive and negative selection strategy is set to the proposed RHDIS. In addition, the

proposed DAS strategy leads to almost 4% higher recall with a smaller number of

anchors compared to BAS. It is worth noting that BAS uses all the possible anchors

from the mini-batch (i.e., 100 anchors). This shows the success of the proposed

DAS strategy in selecting diverse and representative anchors compared to random

sampling and batch selection strategies.

TABLE 4.3: RESULTS OBTAINED BY THE DIFFERENT ANCHOR SELECTION STRATEGIES (RAS,
BAS AND PROPOSED DAS) UNDER DIFFERENT METRICS FOR THE UCMERCED ARCHIVE
WHEN PROPOSED RHDIS IS USED FOR POSITIVE AND NEGATIVE IMAGE SELECTION.

Anchor Selection Strategy   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
RAS                         49.2           58.1            61.9         60.0
BAS                         53.5           62.0            66.5         64.2
Proposed DAS                56.8           65.3            70.0         67.5

TABLE 4.4: RESULTS OBTAINED BY THE DIFFERENT POSITIVE AND NEGATIVE IMAGE SELECTION
STRATEGIES (RIS, BIS AND PROPOSED RHDIS) UNDER DIFFERENT METRICS FOR THE
UCMERCED ARCHIVE WHEN PROPOSED DAS IS USED FOR ANCHOR SELECTION.

Positive and Negative Selection Strategy   Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
RIS                                        48.6           57.4            60.1         58.7
BIS                                        48.9           57.6            61.4         59.4
Proposed RHDIS                             56.8           65.3            70.0         67.5

In order to demonstrate the effectiveness of the proposed RHDIS strategy, we compare

it with RIS and BIS strategies. Table 4.4 shows the results associated with the

different positive and negative image selection strategies for the UCMerced archive

when the proposed DAS strategy is used for anchor selection. From the table, one

can see that the proposed RHDIS strategy achieves the highest performance under all

metrics compared to RIS and BIS. As an example, the recall of the proposed RHDIS

strategy is more than 8% higher compared to that of BIS when the anchor selection

strategy is set to the proposed DAS. It is worth noting that BIS exploits all positive and

negative images in the batch, while RHDIS relies on a much smaller number of

triplets to achieve this result. The performance of RIS is lower than RHDIS and BIS

under each metric when the anchor selection strategy is set to the proposed DAS. For

example, the recall obtained by RIS is about 10% lower than that of the proposed RHDIS

under the same number of triplets. This shows the effectiveness of the proposed

RHDIS selection strategy in selecting relevant, hard and diverse positive-negative images

compared to random sampling and batch selection strategies for a given set of

anchors. These results were also conﬁrmed through experiments obtained by using

the IRS-BigEarthNet archive.

4.5.3 Comparison with Different Triplet Sampling Methods

In this sub-section, we evaluate the effectiveness of the proposed DAS-RHDIS method

compared to different triplet selection methods, namely TNDML [149] and

RSDML [15]. Table 4.5 shows the corresponding image retrieval performances on the

IRS-BigEarthNet and the UCMerced archives. By analyzing the table, one can see

that the proposed DAS-RHDIS method leads to the highest scores under all metrics

for both archives. For example, DAS-RHDIS outperforms TNDML by 4% in precision

and more than 3% in accuracy for the IRS-BigEarthNet archive, and by more than 13% in

F1 score and more than 14% in recall for the UCMerced archive. The proposed DAS-RHDIS

method provides about 2% higher and 8% higher F1 scores compared to the RSDML

method for IRS-BigEarthNet and UCMerced, respectively. These results demonstrate

the success of the proposed DAS-RHDIS method compared to other triplet sampling

methods.

TABLE 4.5: THE PERFORMANCE OF DIFFERENT TRIPLET SELECTION METHODS FOR THE
IRS-BIGEARTHNET AND UCMERCED ARCHIVES.

Archive           Method               Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
IRS-BigEarthNet   TNDML [149]          59.3           73.7            73.8         73.8
IRS-BigEarthNet   RSDML [15]           60.2           75.4            73.9         74.6
IRS-BigEarthNet   Proposed DAS-RHDIS   62.7           77.7            75.7         76.7
UCMerced          TNDML [149]          44.0           52.6            55.8         54.2
UCMerced          RSDML [15]           48.4           56.3            61.9         59.0
UCMerced          Proposed DAS-RHDIS   56.8           65.3            70.0         67.5

Fig. 4.5 shows an example of images retrieved from IRS-BigEarthNet by TNDML,

RSDML and the proposed DAS-RHDIS when the query image contains Arable land,

Pastures and Complex cultivation patterns. The retrieval order of images is given below

the query image. By analyzing the figure, one can observe that the classes of Pastures

and Arable land are very prominent in all retrieved images by RSDML and DAS-

RHDIS, while TNDML provides similar images to the query only at the retrieved

orders of 5 and 10. When DAS-RHDIS is compared with RSDML, the proposed

method retrieves semantically more similar images. One of the reasons is that the

RSDML relies only on the class label similarity, while the proposed DAS-RHDIS

method: i) extracts and exploits the semantic content of the images; and ii) considers

the diversity and hardness of images during triplet selection. We observed similar

behavior for the UCMerced archive. Fig. 4.6 shows an example of images retrieved

from UCMerced. The query image for this example only contains the Field class.

Most of the images retrieved by the proposed method (except the 20th image) belong

to the same class as the query (see Fig. 4.6-d). However, only a small number of

images retrieved by the TNDML and the RSDML methods contain the Field class

(see Fig. 4.6-b and 4.6-c).

During the learning of a metric space by using the triplet loss, a small subset of the

available triplets carries the information needed to learn an accurate representation

for image retrieval. The proposed DAS-RHDIS identiﬁes these triplets and only

learns from a subset of selected informative and representative samples, reducing the

number of training triplets. Fig. 4.7 shows the performance of TNDML, RSDML and

the proposed DAS-RHDIS method in terms of the number of accumulated training

triplets under the same number of epochs (which is set to 100 in the experiments) for

the UCMerced archive. The horizontal axis shows the number of triplets in a logarithmic

scale, while the vertical axis shows the corresponding F1 scores. The performance

is associated with the numbers of triplets, which are utilized by the considered triplet

selection method. The annotation points indicate the number of triplets needed for

the considered method to reach at least 90% of its final performance. From the figure,

one can observe that even after the last training epoch of the proposed DAS-RHDIS

method, the total number of triplets is significantly smaller than that used in the first epoch of

the RSDML method. During training, RSDML selects more triplets at each epoch

compared to the other two methods. This is because RSDML selects all the possible

triplets from a mini-batch, the number of which grows cubically with the mini-batch size. The final

F1 score of our proposed method is more than 8% higher than that of RSDML with a significantly

smaller total number of triplets. One can also see from the figure that TNDML (which

uses random triplet selection) under the same number of triplets as our method

leads to a significant performance drop. The F1 score obtained by TNDML is 13%

lower than the F1 score obtained by the proposed DAS-RHDIS method. These results

show the effectiveness of our method in selecting a subset of informative triplets during

training, resulting in faster convergence and a performance gain in the retrieval.

FIGURE 4.5: An image retrieval example: (a) query image; (b) images retrieved by TNDML;
(c) images retrieved by RSDML; and (d) images retrieved by the proposed DAS-RHDIS
(IRS-BigEarthNet archive). Retrieval ranks shown: 2nd, 5th, 10th, 15th and 20th.

FIGURE 4.6: An image retrieval example: (a) query image; (b) images retrieved by TNDML;
(c) images retrieved by RSDML; and (d) images retrieved by the proposed DAS-RHDIS
(UCMerced archive). Retrieval ranks shown: 2nd, 5th, 10th, 15th and 20th.

4.5.4 Comparison with the State-of-the-Art DML Approaches

In this sub-section, we assessed the effectiveness of the proposed DAS-RHDIS method

compared to the state-of-the-art deep metric learning approaches, namely CBMIR [150],

MSL [143], DATL [19] and NPL [144]. Table 4.6 shows the results under

different metrics for the IRS-BigEarthNet and UCMerced archives. By analyzing

the table, one can see that the proposed DAS-RHDIS method leads to the highest

scores under all metrics for both archives. As an example, the proposed DAS-RHDIS

method provides 2% higher and 8% higher accuracy compared to the DATL method

for IRS-BigEarthNet and UCMerced, respectively. The table also shows that the

CBMIR and the MSL methods obtain the lowest scores in most of the metrics. For

example, CBMIR provides more than 4% lower and 14% lower precision than the

proposed DAS-RHDIS for IRS-BigEarthNet and UCMerced, respectively. Since the

loss function in CBMIR forces a ﬁxed distance for all images, it is more restrictive

compared to the triplet-based DML losses. This can lead to learning a metric space

in which the similarity between the images is not properly characterized [146].

When compared with the MSL method, DAS-RHDIS achieves 7% higher recall and

more than 4% higher accuracy for the IRS-BigEarthNet archive, and more than 7% higher

precision and 8% higher F1 score for the UCMerced archive. Despite the proven

success of the MSL method for single-label images, we observed that its full capacity

cannot be exploited for multi-label images. Since the MSL method

considers all the possible negatives and positives and their relative feature distances

among each other, its performance is very sensitive to the proper deﬁnition of the

positive and the negative sets for a given anchor. However, the evident distinction

of these sets is difﬁcult to achieve for multi-label images. When compared with the

NPL method, the proposed DAS-RHDIS method provides 2% higher and 7% higher

F1 scores for IRS-BigEarthNet and UCMerced, respectively. It is worth noting that

NPL obtains relatively closer results to the proposed DAS-RHDIS due to its negative

mining strategy. NPL uses an extension of the triplet loss, which selects multiple

negative images from different negative classes for each anchor and positive image.

This negative mining strategy allows NPL to include class-based diversity among

the negative samples. However, in NPL, the hardness and diversity in the positive

samples are not considered, resulting in the selection of trivial triplets. This can affect

its performance for the retrieval task. The proposed DAS-RHDIS identifies informative

and representative triplets by relying on the relevancy, hardness and diversity of

images. This allows us to reach more effective image retrieval performance compared

to the other methods.

FIGURE 4.7: F1 scores obtained by different triplet sampling strategies and the number of
accumulated triplets during the training (the UCMerced archive). Horizontal axis: number of
accumulated triplets (logarithmic scale, 10^4 to 10^8); vertical axis: F1 score (0.3 to 0.7);
curves: TNDML, RSDML and proposed DAS-RHDIS; annotation points: ∼690×10^5 triplets and
∼4×10^5 triplets.

TABLE 4.6: THE PERFORMANCE OF DIFFERENT DEEP METRIC LEARNING METHODS FOR THE
IRS-BIGEARTHNET AND UCMERCED ARCHIVES.

Archive           Method               Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
IRS-BigEarthNet   CBMIR [150]          59.6           73.2            74.6         73.9
IRS-BigEarthNet   MSL [143]            57.9           75.0            68.7         71.7
IRS-BigEarthNet   DATL [19]            60.6           75.3            74.0         74.7
IRS-BigEarthNet   NPL [144]            60.8           76.5            72.6         74.5
IRS-BigEarthNet   Proposed DAS-RHDIS   62.7           77.7            75.7         76.7
UCMerced          CBMIR [150]          42.0           50.9            53.0         51.9
UCMerced          MSL [143]            46.6           58.1            61.0         59.5
UCMerced          DATL [19]            48.7           57.2            60.7         58.9
UCMerced          NPL [144]            51.8           61.5            58.7         60.1
UCMerced          Proposed DAS-RHDIS   56.8           65.3            70.0         67.5

4.6 Conclusion

This chapter introduces a novel method to select a set of informative and represen-

tative triplets from multi-label training images to achieve deep metric learning for

multi-label CBIR problems in RS. The proposed triplet sampling method is defined

as a two-step procedure and applied to each training mini-batch of a DL-based

retrieval system. In the first step, diverse anchor images are selected based on a

simple but efficient iterative algorithm. Then, in the second step, sets of positive and

negative images for each anchor are selected based on relevancy, hardness and diversity

of the positive and negative images. Finally, the triplets are constructed from

the selected anchors and their respective positive and negative images. Through the

above-mentioned steps, the proposed method results in selecting a compact subset of

informative and representative triplets, which enables accurate and efﬁcient learning

of DL models for multi-label CBIR in RS. Experimental results obtained on two multi-

label RS benchmark archives under different DL architectures show the effectiveness

of the proposed method in CBIR problems. In detail, the results have demonstrated

that most of the available triplets do not contribute to the learning progress and can

be safely discarded. Focusing on a small informative and representative subset is

sufficient for achieving performance comparable to the case in which

all possible triplets are used. It is worth noting that the proposed triplet sampling

method does not rely on a speciﬁc DL architecture and can be adapted to any metric

learning method.
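The first step of the procedure summarized above (selecting a small set of diverse anchors from a mini-batch) can be illustrated with a generic stand-in. The sketch below uses greedy farthest-point sampling on embedding vectors; this is a well-known diversity heuristic chosen only for illustration and is not the DAS algorithm of Section 4.3.2, whose details are not reproduced here. The function name and the choice of the first image as the starting anchor are assumptions.

```python
import math


def farthest_point_anchors(embeddings, num_anchors):
    """Greedy farthest-point sampling (a stand-in, NOT the thesis's DAS algorithm).

    Repeatedly picks the image whose embedding is farthest from all anchors
    selected so far, which spreads the anchors over the embedding space.
    """
    selected = [0]  # arbitrary starting anchor (assumption)
    # Distance from every image to its nearest already-selected anchor.
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(num_anchors, len(embeddings)):
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected


# Five toy 2-D embeddings forming three well-separated groups (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 0.1), (0.0, 5.0)]
print(farthest_point_anchors(points, 3))
```

In this toy case, one anchor is taken from each group, whereas random sampling could easily pick two near-duplicate images.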

As a ﬁnal remark, we would like to point out that the proposed method currently

relies on the class labels to select positive and negative images for each anchor. As

future work, we plan to develop an unsupervised strategy that can select informative

positive and negative images without requiring any land-use land-cover class label.

Chapter 5

Towards Simultaneous Image

Compression and Indexing for Scalable

Content-Based Retrieval in Remote

Sensing

Due to the rapid growth of RS image archives, images are usually stored in a compressed

format to reduce their storage sizes. Thus, most of the existing CBIR

systems require fully decoding images (i.e., decompression), which is computationally

demanding for large-scale archives. To address this issue, in this chapter, we introduce

a novel approach devoted to simultaneous RS image compression and indexing

for scalable content-based image retrieval (denoted as SCI-CBIR). The proposed

SCI-CBIR eliminates the need to decode RS images prior to image search and

retrieval. To this end, it includes two main steps: i) deep learning-based compression;

and ii) deep hashing-based indexing. The ﬁrst step effectively compresses RS images

by employing a pair of deep encoder and decoder neural networks and an entropy

model. The second step produces hash codes with a high discrimination capability

for RS images by employing pairwise, bit-balancing and classiﬁcation loss functions.

For the training of the SCI-CBIR approach, we also introduce a novel multi-stage

learning procedure with automatic loss weighting techniques to characterize RS

image representations that are appropriate for both RS image indexing and compression.

The proposed learning procedure enables automatically weighting the different

loss functions considered for the proposed approach instead of relying on a computationally

demanding grid search. Experimental results show the effectiveness of the proposed

approach when compared to widely used approaches in RS. The code of the proposed

approach is available at https://git.tu-berlin.de/rsim/SCI-CBIR. This chapter

is mainly based on the following publications:

• G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression

and indexing for scalable content-based retrieval in remote sensing,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022. DOI:

10.1109/TGRS.2022.3204914.

• G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to

jointly compress and index remote sensing images for efficient content-based

retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing

Symposium, 2022, pp. 251–254. DOI: 10.1109/IGARSS46834.2022.9884146.

5.1 Introduction

For large-scale CBIR, fast and accurate indexing methods that allow approximate

nearest neighbor search are fundamental. In this perspective, hashing-based indexing

has recently attracted attention for solving large-scale approximate nearest

neighbor search problems in RS CBIR due to its efficient (in terms of

both storage and speed) and accurate search capability within huge image archives.

Hashing methods map high-dimensional image features into compact binary hash

codes [154]. Then, image retrieval can be achieved by calculating the Hamming

distances with simple bit-wise XOR operations [41]. Several hashing methods have been

presented in RS [20], [21], [27], [42]–[46], [130], [155], [156]. The traditional hashing

methods extract hand-crafted image features and map them into low-dimensional

binary codes by using hashing functions [130], [155], [156]. In these methods, image

feature extraction and hash code generation are separately applied. Thus, they are

not capable of simultaneously optimizing feature learning and hash code learning,

which limits the capability of the generated hash codes to represent the high-level

semantic content of RS images. Recently, several deep hashing-based indexing

methods have been introduced in RS to address this issue. As an example, in [21], a deep

hashing neural network (DHNN) is introduced to learn high-level semantic features

and compact hash codes in an end-to-end manner. To improve the training stability

of deep neural networks (DNNs) while learning hash codes, DHNN generates the

continuous approximations of hash codes during training while exploiting a quantization

loss to push the approximated hash codes towards the discrete values. In

greater detail, the likelihood pairwise loss is utilized in DHNN to preserve the

similarity of images in their hash codes. However, the pairwise loss can lead similar

images to cluster together in a small portion of the Hamming space, which prevents

the generation of discriminative hash codes. To avoid this problem, in [42], a deep hashing

convolutional neural network (DHCNN) is introduced to employ image labels for

learning more discriminative hash codes. To this end, DHCNN learns to predict

image labels together with generating hash codes by jointly optimizing cross-entropy

loss with pairwise and quantization losses. Despite the success of pairwise loss in

these methods, triplet loss has been found more effective than the pairwise loss by

introducing a margin threshold between the similar and dissimilar images. Accordingly,

in [27], a metric learning-based deep hashing network (MiLaN) is introduced

to combine quantization loss with triplet loss. In addition, MiLaN also employs

bit-balancing loss for maximizing code variance and information by forcing each bit

to have an equal chance of being 0 or 1. Unlike the above-mentioned methods, which

utilize pre-trained convolutional neural networks (CNNs), in [43], a semi-supervised

hashing adversarial autoencoder (SSHAAE) is introduced to employ an adversarial

autoencoder network for generating the discriminative and similarity preserved hash

codes with low quantization errors by end-to-end training. In addition to losses used

by DHCNN, SSHAAE also employs bit-balancing and reconstruction losses. In [44],

a generative adversarial network is exploited for hash code learning, where losses

similar to those of DHCNN are used for the generator and a sigmoid function is used for the discriminator

to determine whether the generated codes are true codes that comply with the

bit-balancing rule. In [45], a meta-hashing algorithm is introduced to increase the

generalization capability of DNNs utilized for hash code generation under a small

number of training samples. To this end, this algorithm employs few-shot meta

learning for hash code generation by dividing a learning objective into multiple

sub-tasks and using all training samples multiple times. In [46], an asymmetric

hash code learning (AHCL) method is proposed to increase the training efﬁciency

of DNNs for hash code learning. To this end, AHCL learns a deep hashing function

only for query images, while hash codes of archive images are obtained from query

hash codes based on class label similarity.
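The retrieval step mentioned at the beginning of this section (Hamming distances computed with bit-wise XOR operations) can be sketched as follows. Hash codes are stored as Python integers for simplicity; the function names, the archive dictionary and the 8-bit codes are hypothetical and serve only as an illustration.

```python
def hamming_distance(code_a, code_b):
    """Hamming distance between two hash codes stored as Python integers."""
    # XOR leaves a 1 exactly where the two codes differ; count those bits.
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code, archive_codes, top_k):
    """Rank archive images by Hamming distance to the query hash code."""
    ranked = sorted(
        archive_codes,
        key=lambda name: hamming_distance(query_code, archive_codes[name]),
    )
    return ranked[:top_k]


# Hypothetical 8-bit hash codes of three archive images.
archive = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
print(retrieve(0b10110110, archive, top_k=2))
```

Because each comparison reduces to one XOR and one bit count, the search cost per archive image is independent of the dimensionality of the original image features.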

The above-mentioned hashing-based indexing methods are potentially effective for

RS CBIR. However, RS images are usually stored in a compressed format in archives to reduce

their storage sizes [47]. Thus, image decoding (i.e., decompression) is required before

applying any hashing method. This is computationally demanding and impractical

in the case of large-scale CBIR problems. To the best of our knowledge, there is no

hashing-based indexing method in RS that can be applied in the compressed domain

efﬁciently and effectively. To address this issue, in this chapter we introduce a novel

approach devoted to simultaneous RS image compression and indexing for scalable

content-based image retrieval (denoted as SCI-CBIR). Unlike the existing CBIR

approaches in RS, the proposed approach simultaneously indexes RS images with

hash codes while effectively compressing them. To this end, the proposed SCI-CBIR

is made up of two main steps: i) deep learning-based compression; and ii) deep

hashing-based indexing. The ﬁrst step applies image feature extraction and image

reconstruction based on a pair of encoder and decoder DNNs, while a probabilistic

entropy model is employed to optimize the length of the compressed bitstreams. The

second step employs pairwise, bit-balancing and classiﬁcation loss functions for the

generation of hash codes based on image features characterized by the ﬁrst step. To

effectively characterize image features for both image indexing and compression,

we propose a novel multi-stage learning procedure for the training of the proposed

SCI-CBIR approach, which allows automatically weighting the different loss functions considered

in both steps. Please note that the aim of this study is not to introduce a new

compression or hashing algorithm, but to propose a novel approach that simultaneously

indexes and compresses RS images. With the proposed approach, unlike the existing

CBIR approaches in RS, the need to decompress images prior to indexing

is fully eliminated. The main contributions of this work are summarized as

follows:

• For the first time in RS, the proposed SCI-CBIR approach simultaneously applies

RS image compression and indexing, and thus does not require RS image decoding

prior to CBIR, which can save a significant amount of time for operational

applications.

• The proposed multi-stage learning procedure automatically weights all the

considered loss functions, which allows one to: i) learn appropriate RS image representations

for both image compression and indexing; ii) eliminate computationally

demanding grid search; and iii) automatically achieve different rate-distortion

trade-off points.

• The proposed SCI-CBIR approach is independent of the image compression

and indexing methods being selected, and can operate with any DNN-based

method.
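Two of the loss terms discussed in this section, the quantization loss (pushing continuous codes towards discrete values) and the bit-balancing loss (forcing each bit to have an equal chance of being 0 or 1), can be sketched in a generic form. The sketch assumes continuous code values in [-1, 1] whose sign gives the bit; the exact loss definitions used in DHNN, MiLaN or the proposed SCI-CBIR may differ, and the function names are hypothetical.

```python
def quantization_loss(codes):
    """Mean squared gap between each continuous code value and the nearest of -1/+1."""
    values = [v for code in codes for v in code]
    return sum((abs(v) - 1.0) ** 2 for v in values) / len(values)


def bit_balancing_loss(codes):
    """Mean squared per-bit average over the batch; zero when every bit is balanced."""
    num_bits = len(codes[0])
    bit_means = [sum(code[b] for code in codes) / len(codes) for b in range(num_bits)]
    return sum(m ** 2 for m in bit_means) / num_bits


# A balanced, fully binarized batch versus an unbalanced, non-binarized one.
good_batch = [[1.0, -1.0], [-1.0, 1.0]]
poor_batch = [[0.5, 0.5], [0.5, 0.5]]
print(quantization_loss(good_batch), bit_balancing_loss(good_batch))
print(quantization_loss(poor_batch), bit_balancing_loss(poor_batch))
```

Both terms vanish for the first batch, while the second batch is penalized by each term, which is the behavior the two losses are meant to encourage.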

The rest of this chapter is organized as follows: Section 5.2 presents the related works

on RS image compression and RS CBIR in the compressed domain. Section 5.3 introduces

the proposed SCI-CBIR approach. Section 5.4 describes the considered RS image

archives and the experimental setup, while Section 5.5 provides the experimental

results. Section 5.6 concludes the chapter.

5.2 Related Works

In this section, we survey the existing methods for RS image compression and RS

CBIR in the compressed domain. Traditional RS image compression methods are categorized

into three groups: i) prediction-based methods, which predict each spectral

band based on the other bands and encode the prediction residuals into bitstreams

(e.g., the CCSDS-123 multi- and hyperspectral image compression standard [157]); ii)

vector quantization methods, which independently reduce the clusters of image

pixels with similar characteristics by grouping them together (e.g., mean-normalized

vector quantization [158]); and iii) transform-based methods, which map RS images

to transform domain (e.g., Karhunen-Loève transform [159], discrete cosine

transform [160], discrete wavelet transform [161], etc.) representations, and thus

reduce the correlation among image pixels. Although prediction-based compression

methods apply lossless compression and embody a low computational complexity,

their compression ratio is generally low, which makes them infeasible for large-scale

RS archives. Vector quantization methods provide a higher compression ratio than

the prediction-based methods. However, training these methods and generating

required codebooks can be computationally demanding. Transform-based methods

generally provide a high compression ratio and speed of computation, and thus

are widely used for RS image compression on operational archives. Among several

transform-based methods, JPEG 2000 [162] has become very popular in RS due to its multiresolution paradigm, scalability and high compression ratio. The JPEG 2000 algorithm is widely used to compress RS images acquired by most of the recent satellites (such as Sentinel-2 [163]).

Recent studies on learning-based compression show that deep learning (DL) based

compression methods preserve the perceptual quality of images at lower bit rates

compared to traditional methods such as JPEG2000 [164]. DL-based image compression methods usually consist of a pair of encoder and decoder DNNs for feature extraction and image reconstruction, and an entropy model for bit-rate optimization. According to the type of the DNN, recent DL-based image compression methods can be divided into one-time feed-forward and multi-stage recurrent compression methods [165]. One-time feed-forward DNNs (e.g., CNNs) encode and decode an image only once, and thus need to be trained multiple times for different bit-rates. In contrast, multi-stage recurrent DNNs (e.g., recurrent neural networks) apply image encoding iteratively, and the number of iterations determines the bit-rate, so that a variable range of bit-rates is covered within a single training. In RS, few DL-based image compression methods have been proposed in the framework of standard CNN-based image compression, where a piecewise linear approximation to the occurrences of pixel values is used as an entropy model. In [166], a residual network

framework is introduced to adapt the standard CNN-based compression for multi-

spectral RS images by characterizing RS image representations with residual blocks

and a weighted feature channel module. In [164], spectral–spatial feature partitioned

extraction is integrated into the standard CNN-based compression to characterize

spatial and spectral content of RS images in a parallel fashion. In [167], polydirec-

tional CNNs are introduced in the standard CNN-based compression to separately

extract the spectral and spatial RS image features for preventing the dominance of

either spatial or spectral content. In the computer vision community, generalized divisive normalization [168], residual blocks [169], attention modules [170] and non-local networks [171] have been employed in the context of CNN-based compression to further reduce the spatial redundancy when characterizing image latents, which results in a lower bit-rate for entropy encoding. To further improve the compression ratio,

hyperpriors [172], autoregressive context models [173] and discretized Gaussian

mixture likelihoods [170] are incorporated into the entropy model for more accurate

bit-rate optimization. The reader is referred to [165] for recent advances on DL-based

image compression.

To the best of our knowledge, there is only one study in RS that is devoted to applying CBIR in the compressed domain [174]. To reduce the time required for fully decoding images, a coarse-to-fine progressive RS image description and retrieval system in the partially decoded JPEG 2000 compressed domain is proposed in [174]. In that system, the code-blocks associated only with the coarse wavelet resolution are initially decoded. Then, the images most irrelevant to the query image are discarded based on the similarities computed on the coarse resolution wavelet features of the query and archive images. The processes of code-block decoding and elimination of the irrelevant images are iterated until the codestreams associated with the highest wavelet resolution are decoded. Finally, the images most similar to the query are chosen. Although that system significantly reduces the retrieval time compared to those that require full decoding, it still requires a partial decompression, which may take significant time for operational CBIR applications.

As mentioned above, DL-based image compression methods are much more successful at preserving the perceptual quality of images at lower bit-rate values compared to JPEG2000 [164]. To the best of our knowledge, our SCI-CBIR approach is the first study in the framework of scalable CBIR in the DL-based compressed domain in RS.

5.3 Proposed SCI-CBIR Approach

Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an RS image archive that includes $M$ non-compressed images, where $x_i$ is the $i$th image in the archive. We assume that a training set $\mathcal{T} \subset \mathcal{X}$ is available, where $\forall x_i \in \mathcal{T}$ is associated with a set of class labels $l_i \in \{0,1\}^C$ and $C$ is the number of classes.

[Figure 5.1 (diagram). First step, DL-based compression: image encoding, entropy modelling and compression decoding. Second step, deep hashing-based indexing: index decoding, hash code generation and class prediction.]

FIGURE 5.1: Illustration of the proposed SCI-CBIR approach.

The proposed SCI-CBIR approach aims to achieve accurate CBIR in a scalable way without any need for decompression of RS images prior to CBIR. Accordingly, SCI-CBIR simultaneously: i) compresses each image $x_i \in \mathcal{X}$ into a bitstream; and ii) indexes each image through a $q$-bit hash code $b_i$ (which is stored in a hash table for scalable CBIR). This is achieved in two steps: i) DL-based compression;

and ii) deep hashing-based indexing. For the training of SCI-CBIR, we introduce a

multi-stage learning procedure to automatically deﬁne different loss weights and rate-

distortion trade-off points. Fig. 5.1 shows an illustration of the proposed SCI-CBIR

approach, which is explained in detail in the following subsections.
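To make the role of the hash table concrete, the following is a minimal sketch of hash-code indexing and lookup, written under simplifying assumptions. The function names (`build_hash_table`, `retrieve`), the 4-bit codes and the image identifiers are hypothetical and only serve as an illustration; they are not part of the SCI-CBIR implementation. For brevity the lookup scans all buckets, whereas an operational system would probe only the buckets within the chosen Hamming radius.

```python
from collections import defaultdict


def code_to_key(bits):
    """Turn a hash code with entries in {-1, +1} into a hashable bucket key."""
    return tuple(1 if b > 0 else 0 for b in bits)


def build_hash_table(codes):
    """Map every distinct hash code (bucket) to the ids of the images sharing it."""
    table = defaultdict(list)
    for image_id, bits in codes.items():
        table[code_to_key(bits)].append(image_id)
    return table


def hamming(key_a, key_b):
    """Number of bit positions in which two bucket keys differ."""
    return sum(a != b for a, b in zip(key_a, key_b))


def retrieve(table, query_bits, radius=0):
    """Return image ids from buckets within `radius` of the query, nearest first."""
    query_key = code_to_key(query_bits)
    hits = []
    for key, ids in table.items():
        distance = hamming(query_key, key)
        if distance <= radius:
            hits.extend((distance, image_id) for image_id in ids)
    return [image_id for _, image_id in sorted(hits)]


# Hypothetical 4-bit codes for a toy archive.
archive_codes = {
    "img_a": [1, -1, 1, 1],
    "img_b": [1, -1, 1, 1],
    "img_c": [1, -1, -1, 1],
    "img_d": [-1, 1, -1, -1],
}
table = build_hash_table(archive_codes)
same_bucket = retrieve(table, [1, -1, 1, 1], radius=0)
near = retrieve(table, [1, -1, 1, 1], radius=1)
```

Images with identical codes fall into the same bucket, so a query with radius 0 needs no distance computation on image content at all.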

5.3.1 First Step: DL-Based Compression

The DL-based compression step of the proposed SCI-CBIR approach aims to compress

each RS image to a minimum length bitstream, which is efﬁciently stored and utilized

for reconstructing the image with a minimum amount of distortion. Following the recent advances in DL-based image compression, this step employs a pair of encoder and decoder DNNs for learning to reconstruct RS images and an entropy model for reducing the length of bitstreams (i.e., bit-rate optimization). Accordingly, this

step includes three main blocks: i) image encoding; ii) compression decoding; and iii)

entropy modelling.

Let $f: \mathcal{X} \mapsto \mathcal{Y}$ be an image encoder that maps the image $x_i$ to its latent $y_i$, where $\mathcal{Y}$ is the set of all latents for $\mathcal{X}$. The first block of this step transforms $x_i$ into its quantized latent representation $\hat{y}_i$ as follows:

$$y_i = f(x_i; \theta_f); \qquad \hat{y}_i = Q(y_i), \tag{5.1}$$

where $Q(a) = \lfloor a \rceil$ is a rounding function that converts $a$ into its nearest integer (i.e., quantization) and $\theta_f$ is the encoder parameters. During training, $Q(a)$ is replaced by $\mathcal{U}(a - \tfrac{1}{2}, a + \tfrac{1}{2})$, where $\mathcal{U}$ is a uniform distribution. Let $g: \mathcal{Y} \mapsto \hat{\mathcal{X}}$ be a decoder that maps the latent $\hat{y}_i$ into the reconstructed image $\hat{x}_i$, where $\hat{\mathcal{X}}$ is the set of reconstructed images. The second block of this step reconstructs $x_i$ from its quantized representation $\hat{y}_i$ as follows:

$$\hat{x}_i = g(\hat{y}_i; \theta_g) = g(Q(f(x_i; \theta_f)); \theta_g), \tag{5.2}$$

where $\theta_g$ is the decoder parameters. The third block of this step estimates the required number of bits to encode $\hat{y}_i$, which is defined according to the mutual information between $x_i$ and $\hat{x}_i$. Since the actual distribution of image latents $p_{\hat{y}}$ is unknown, its inference is intractable. Accordingly, the entropy modelling block estimates $p_{\hat{y}}$ with an entropy model $q_{\hat{y}|\theta_e}$, where $\theta_e$ is the entropy model parameters. This block also employs an arithmetic coding algorithm, which consists of an arithmetic encoder and an arithmetic decoder, for generating compressed bitstreams from quantized representations.
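The difference between the rounding function used at inference time and its uniform-noise replacement used during training can be illustrated with a short sketch. This is an illustration under assumptions: latents are represented as plain Python lists, ties at exactly one half are rounded upwards (the text does not specify a tie rule), and the function names are hypothetical.

```python
import math
import random


def quantize(latent):
    """Q(a): round each latent element to its nearest integer (used at inference)."""
    # floor(a + 0.5) is used instead of round() to avoid banker's rounding on ties.
    return [math.floor(a + 0.5) for a in latent]


def noisy_proxy(latent, rng):
    """Training-time replacement of Q(a): one sample from U(a - 1/2, a + 1/2) per element."""
    return [a + rng.uniform(-0.5, 0.5) for a in latent]


rng = random.Random(0)          # fixed seed so that the sketch is repeatable
y = [0.2, -1.7, 3.49, 2.5]      # toy latent values
y_hat = quantize(y)             # integers, not differentiable with respect to y
y_tilde = noisy_proxy(y, rng)   # stays within half a unit of y and keeps gradients usable
```

The noisy proxy has the same range of error as rounding, which is why it is a common differentiable stand-in for quantization during training.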

To achieve minimum image compression distortion at a minimum length of bitstream, the image compression objective $L_C$ is defined according to a rate-distortion optimization problem [175] as follows:

$$L_C = L_R + \lambda L_D, \qquad L_C = \mathbb{E}_{x_i \sim p_x}\left[-\log\left(q_{\hat{y}_i}(\hat{y}_i)\right)\right] + \lambda\, \mathbb{E}_{x_i \sim p_x}\left[d(x_i, \hat{x}_i)\right], \tag{5.3}$$

where $p_x$ is approximated over the images of $\mathcal{T}$. The rate term $L_R$ is the cross entropy between the entropy model $q_{\hat{y}}$ and the marginal distribution of quantized image latents $\mathbb{E}_{x_i \sim p_x} p_{\hat{y}|x}$. $d$ is the distortion metric, for which we utilize the multiscale structural similarity index (MS-SSIM) [176] as $d(x_i, \hat{x}_i) = 1 - \text{MS-SSIM}(x_i, \hat{x}_i)$. $\lambda$ controls the rate-distortion trade-off points.
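As an illustration of how the two terms of (5.3) interact, the sketch below evaluates the objective for a single toy latent. It rests on several assumptions that are not taken from the thesis: the entropy model is a simple zero-mean discretized Gaussian rather than the hyper-prior based Gaussian mixture model of [170], the logarithm is taken in base 2 so that the rate is expressed in bits, and the distortion value is passed in as a number instead of being computed with MS-SSIM.

```python
import math


def gaussian_cdf(value, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((value - mean) / (scale * math.sqrt(2.0))))


def likelihood(y_hat, mean=0.0, scale=1.0):
    """Probability mass that a discretized Gaussian assigns to the integer y_hat."""
    mass = gaussian_cdf(y_hat + 0.5, mean, scale) - gaussian_cdf(y_hat - 0.5, mean, scale)
    return max(mass, 1e-9)  # lower bound keeps the logarithm finite


def rate_bits(latent_hat, mean=0.0, scale=1.0):
    """Rate term L_R for one image: -sum(log2 q(y_hat)), an estimate of the bitstream length."""
    return -sum(math.log2(likelihood(v, mean, scale)) for v in latent_hat)


def compression_loss(latent_hat, distortion, lam):
    """L_C = L_R + lambda * L_D for one image."""
    return rate_bits(latent_hat) + lam * distortion


latent_hat = [0, -2, 3, 3]
distortion = 0.05  # stands in for 1 - MS-SSIM(x, x_hat)
loss_low_lambda = compression_loss(latent_hat, distortion, lam=1.0)
loss_high_lambda = compression_loss(latent_hat, distortion, lam=100.0)
```

Latent values that the entropy model considers unlikely cost more bits, so a larger $\lambda$ shifts the optimum towards lower distortion at the expense of a longer bitstream.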

It is worth noting that the proposed SCI-CBIR approach is independent of the image compression method utilized in the first step, as long as it employs a pair of encoder and decoder DNNs. Recent studies on DL-based image compression have focused on enhancing the capacity of the considered entropy model for an accurate estimation of $p_{\hat{y}}$, while operating image reconstruction based on encoder-decoder architectures. In this chapter, for the entropy modelling block of this step, we consider the context-adaptive hyper-prior based Gaussian mixture model introduced in [170] as the entropy model due to its proven success in spatial redundancy reduction. In this entropy model, the probability estimation of quantized image latents is conditioned on a hyper-prior (which is defined by a factorized density model) and an autoregressive context model to capture the spatial dependencies among the elements of quantized latents. The reader is referred to [170] for the details of this entropy model.

5.3.2 Second Step: Deep Hashing-Based Indexing

The deep hashing-based indexing step of the proposed SCI-CBIR approach aims

to map the latent representation of each RS image (which is characterized in the

ﬁrst step) into its discriminative hash code, which preserves the semantic image

content. Then, hash codes are indexed in a hash table for all RS images in the archive,

where semantically similar images are in the same hash bucket. To this end, this step includes three main blocks: i) index decoding; ii) hash code generation; and iii) class prediction. Let $t: \mathcal{Y} \xrightarrow{\theta_t} \mathcal{E}$ be a decoder that maps the latent $y_i$ into the corresponding image embedding for indexing $e_i$ associated with the image $x_i$ (i.e., $t(y_i) = e_i$), where $\theta_t$ is the decoder parameters. The index decoding block employs $t$ for characterizing image embeddings by extracting and decoding semantically informative features specific to indexing based on the latent representations of images. Accordingly, $t$ is composed of the attention layer introduced in [170] followed by convolutional layers. Let $b: \mathcal{E} \xrightarrow{\theta_b} \{-1,1\}^q$ be a binarizer that maps the image embedding $e_i$ into the binary hash code $b_i$, where $\theta_b$ is the binarizer parameters. Let $k: \mathcal{E} \xrightarrow{\theta_k} \{0,1\}^C$ be a classifier that maps the image embedding $e_i$ into the class prediction $\hat{l}_i$, where $\theta_k$ is the classifier parameters. Once the image embedding $e_i$ is characterized for $x_i$, the class prediction and hash code generation blocks operate $k$ and $b$ on $e_i$ to generate the corresponding class prediction and hash code, respectively.

To characterize discriminative hash codes that preserve the semantic similarity of images, we employ the soft pairwise loss (SPL) [147] $L_P$, the bit-balancing loss [177] $L_B$ and a classification loss $L_N$. SPL considers the rank difference of semantic pairwise similarities of images. To this end, image pairs are grouped into pairs with hard similarity and pairs with soft similarity. An image pair shares either no common labels or all its labels for hard similarity, while an image pair shares some of its labels for soft similarity. Let $J = \{(x_i, x_j) \mid x_i \in \mathcal{T}, x_j \in \mathcal{T}, i \neq j\}$ be the set of all image pairs in $\mathcal{T}$. The SPL function is defined as follows:

$$\begin{aligned} L_P &= \sum_{(x_i,x_j)\in J} m_{ij}\log\left(1+e^{s^h_{ij}}\right) - s^h_{ij}s^o_{ij} + \gamma(1-m_{ij})\left(\tfrac{1}{2}(s^h_{ij}+q) - s^o_{ij}q\right)^2, \\ s^o_{ij} &= \frac{\langle l_i, l_j\rangle}{\|l_i\|_2\|l_j\|_2}, \qquad s^h_{ij} = \langle b_i, b_j\rangle, \\ m_{ij} &= \begin{cases} 1, & s^o_{ij}\in\{0,1\} \\ 0, & 0 < s^o_{ij} < 1 \end{cases} \end{aligned} \tag{5.4}$$

where $s^o_{ij}$ and $s^h_{ij}$ are the pairwise similarities between $x_i$ and $x_j$ and their hash codes, respectively. $m_{ij}$ defines whether $(x_i, x_j)$ is associated with soft similarity ($m_{ij} = 0$) or hard similarity ($m_{ij} = 1$). $\gamma$ is a weighting parameter between different types of similarities. For balancing the distribution of hash code bits by maximizing their variance, we adapt the bit-balancing loss [177] for image pairs as follows:

$$L_B = \sum_{(x_i,x_j)\in J} \|(b_i^T \mathbf{1})\|_2^2 + \|(b_j^T \mathbf{1})\|_2^2, \tag{5.5}$$

where $\mathbf{1}$ is a vector with all elements 1. $L_B$ enforces the hash codes to contain equal numbers of $-1$ and $1$. To further enhance the discriminative capability of hash codes, we formulate the classification loss over image pairs as follows:

$$L_N = \sum_{(x_i,x_j)\in J} \|\hat{l}_i - l_i\|_2^2 + \|\hat{l}_j - l_j\|_2^2. \tag{5.6}$$
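The per-pair quantities that enter (5.4)-(5.6) can be sketched in a few lines. The sketch below is an illustration under assumptions: label vectors are multi-hot lists with at least one active label, hash codes are lists with entries in $\{-1, 1\}$, and all function names are hypothetical. A small tolerance is used when testing $s^o_{ij} \in \{0, 1\}$ because the cosine similarity is computed in floating point.

```python
import math


def label_similarity(l_i, l_j):
    """s^o_ij: cosine similarity between two multi-hot label vectors."""
    inner = sum(a * b for a, b in zip(l_i, l_j))
    norms = math.sqrt(sum(a * a for a in l_i)) * math.sqrt(sum(b * b for b in l_j))
    return inner / norms


def code_similarity(b_i, b_j):
    """s^h_ij: inner product of two hash codes with entries in {-1, +1}."""
    return sum(a * b for a, b in zip(b_i, b_j))


def is_hard_pair(s_o, tol=1e-9):
    """m_ij: 1 if the pair shares no label or all labels, 0 for soft similarity."""
    return 1 if (s_o < tol or s_o > 1.0 - tol) else 0


def bit_balance(b):
    """(b^T 1)^2 for one code: zero when the code has as many +1 as -1 entries."""
    return sum(b) ** 2


def classification_error(l_hat, l):
    """Squared L2 distance between a class prediction and its label vector."""
    return sum((p - t) ** 2 for p, t in zip(l_hat, l))


labels_a, labels_b, labels_c = [1, 1, 0], [1, 0, 0], [0, 0, 1]
soft = label_similarity(labels_a, labels_b)       # some labels shared
hard_same = label_similarity(labels_a, labels_a)  # all labels shared
hard_none = label_similarity(labels_a, labels_c)  # no label shared
```

For a pair of $q$-bit codes, `code_similarity` ranges from $-q$ to $q$, which is why (5.4) rescales it by $\tfrac{1}{2}(s^h_{ij} + q)$ before comparing it with $s^o_{ij} q$.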

By considering the above-mentioned losses defined for the second step of our SCI-CBIR approach, the final hashing objective is formulated as follows:

$$L_H = w_P L_P + w_B L_B + w_N L_N, \tag{5.7}$$

where $w_P$, $w_B$ and $w_N$ are the loss weights. We note that the proposed SCI-CBIR approach

is independent from the DNN architecture utilized in this step, and thus hash codes

can be obtained through different DNN architectures.

5.3.3 Multi-Stage Learning Procedure

The objectives of both steps of the proposed approach ($L_C$ and $L_H$) enforce the image encoder $f$ to encode different information in the image latents. The first step enforces image latents to embody the maximum information required for reconstructing images, while the second step enforces image latents to preserve the most discriminative image features for hash code learning. For the training of the proposed SCI-CBIR, one could optimize the aggregation of the different losses considered for both steps in a single learning procedure, which is widely utilized for combining different objectives in DL. However, due to the different characteristics of the image compression and indexing tasks, this learning procedure may lead to: i) competition between the learning objectives of the image compression and indexing tasks; ii) the dominance of one of the objectives; and iii) limited characterization of each task compared to learning each objective separately. In addition, in this case, either multiple instances of the considered DNN need to be trained with different $\lambda$ values or recurrent models need to be integrated to achieve a variable range of rate-distortion trade-off points [165]. Accordingly, to overcome these limitations, we propose a multi-stage learning procedure for the training of our SCI-CBIR that aims to: i) learn RS image latents compatible with both RS image compression and indexing; ii) automatically weight different losses; and iii) automatically achieve different rate-distortion trade-off points for compression without applying a computationally demanding grid search of $\lambda$. To this end, the proposed learning procedure is made up of three consecutive stages: i) learning reconstruction; ii) bit-rate optimization; and iii) learning hash codes.

Learning Reconstruction: The compression objective in (5.3) involves a conflict between the bit-rate and distortion terms: decreasing the bit-rate term increases the distortion term, and vice versa (i.e., the rate-distortion trade-off). Accordingly, to achieve an effective reconstruction capability without being affected by the rate-distortion trade-off, in the first stage, only the distortion loss $L_D$ is optimized until its convergence with a learning rate $\eta_1$, which is gradually decreased based on the value of $L_D$.

Bit-Rate Optimization: To accurately achieve different rate-distortion points, in the second stage, the optimization of $L_D$ is continued together with that of the bit-rate loss $L_R$ with a learning rate $\eta_2$. Most of the existing DL-based image compression methods require multiple trainings with different $\lambda$ values to achieve different trade-off points for (5.3). Unlike them, in this stage, we reformulate (5.3) as a multi-objective optimization problem and employ the multiple-gradient descent algorithm (MGDA) [178] for automatically achieving the set of optimal trade-off points as the set of Pareto optimal solutions. Let $g_D = \nabla_{\theta_C} L_D$ and $g_R = \nabla_{\theta_C} L_R$ be the gradient vectors of $L_D$ and $L_R$, respectively, over the parameters $\theta_C = \theta_f \cup \theta_g \cup \theta_e$. The gradient descent direction for a Pareto optimal solution (which leads to an optimal trade-off point) is obtained by optimizing the following problem:

$$\min \left\{ \|u\|_2^2 \;\middle|\; u = w_D g_D + w_R g_R,\; w_D + w_R = 1,\; w_D \geq 0,\; w_R \geq 0 \right\}, \tag{5.8}$$

where $w_D$ and $w_R$ are estimated as follows:

$$w_R = 1 - w_D, \qquad w_D = \begin{cases} 1, & g_D^T g_R \geq g_D^T g_D \\ 0, & g_D^T g_R \geq g_R^T g_R \\ \dfrac{(g_R - g_D)^T g_R}{\|g_R - g_D\|_2^2}, & \text{otherwise} \end{cases} \tag{5.9}$$

After obtaining $u$ by solving (5.8), the parameters $\theta_C$ are updated as follows:

$$\theta_C = \theta_C - \eta_2 u = \theta_C - \eta_2 (w_D g_D + w_R g_R). \tag{5.10}$$

Since the distortion loss has converged in the learning reconstruction stage, $w_D \approx 1$ and $w_R \approx 0$ at the beginning of this stage. As this stage continues, $w_D$ decreases until the first Pareto solution is found by (5.9). Then, $\eta_2$ is increased to gradually reach another Pareto solution. Thus, by adjusting the learning rate itself, this stage allows obtaining the set of optimal rate-distortion trade-off points without operating multiple trainings and applying a computationally demanding grid search of $\lambda$.
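The closed-form weights in (5.9) and the update in (5.10) can be sketched for two gradient vectors as follows. This is a minimal illustration, assuming that the gradients are given as flat Python lists of equal length; the function names are hypothetical, and a practical implementation would operate on the flattened parameter gradients of the DNN.

```python
def dot(a, b):
    """Inner product of two flat gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def mgda_weights(g_d, g_r):
    """Closed-form (w_D, w_R) of (5.9): the minimum-norm convex combination of g_D and g_R."""
    dd, rr, dr = dot(g_d, g_d), dot(g_r, g_r), dot(g_d, g_r)
    if dr >= dd:
        return 1.0, 0.0
    if dr >= rr:
        return 0.0, 1.0
    diff = [r - d for d, r in zip(g_d, g_r)]
    w_d = dot(diff, g_r) / dot(diff, diff)
    return w_d, 1.0 - w_d


def descent_direction(g_d, g_r):
    """u = w_D * g_D + w_R * g_R, the common descent direction of (5.8)."""
    w_d, w_r = mgda_weights(g_d, g_r)
    return [w_d * d + w_r * r for d, r in zip(g_d, g_r)]


def update(theta, g_d, g_r, learning_rate):
    """One parameter update of (5.10): theta <- theta - eta_2 * u."""
    u = descent_direction(g_d, g_r)
    return [t - learning_rate * ui for t, ui in zip(theta, u)]


weights_orthogonal = mgda_weights([1.0, 0.0], [0.0, 1.0])  # equally conflicting objectives
weights_converged = mgda_weights([0.0, 0.0], [0.3, -0.2])  # distortion gradient has vanished
theta_next = update([1.0, 1.0], [1.0, 0.0], [0.0, 1.0], learning_rate=0.1)
```

When the distortion gradient has vanished, the first case of (5.9) applies and $w_D = 1$, which agrees with the behaviour expected at the beginning of the second stage.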

Learning Hash Codes: The last stage involves optimizing all the losses associated with both steps of our approach to learn RS image latents compatible with both RS image indexing and compression. To this end, this stage employs two learning rates $\eta_3^C$ and $\eta_3^H$ for the losses of the first and second steps, respectively. It is worth noting that since the losses $L_D$ and $L_R$ are optimized in the first two stages, we keep $\eta_3^C < \eta_3^H$ to prevent the domination of image compression over image indexing. Since the different rate-distortion points are achieved in the second stage, the overall objective is written for a rate-distortion point as follows:

$$L = L_C + L_H, \qquad L_C = w_D L_D + w_R L_R, \qquad L_H = w_P L_P + w_B L_B + w_N L_N, \tag{5.11}$$

where $w_D$ and $w_R$ are estimated for the specific rate-distortion point in the previous stage. To automatically find the weights $w_P$, $w_B$ and $w_N$ instead of applying a time-demanding grid search, we utilize automatic loss weighting techniques. Accordingly, the update rules for the SCI-CBIR parameters are written as follows:

$$\theta_C = \theta_C - \eta_3^C \left(\nabla_{\theta_C} L_C + \nabla_{\theta_C} L_H\right), \qquad \theta_H = \theta_H - \eta_3^H \nabla_{\theta_H} L_H, \tag{5.12}$$

where $\theta_H = \theta_t \cup \theta_k \cup \theta_b$.

5.4 Dataset Description and Experimental Design

5.4.1 Dataset Description

To evaluate the proposed approach, experiments were conducted on: i) BigEarthNet-S2; and ii) MLRSNet [179]. In the experiments, we considered a subset of BigEarthNet-S2 acquired over Serbia in the summer season that includes 14,832 images, each of which is made up of 120 × 120 pixels for 10m bands, 60 × 60 pixels for 20m bands and 20 × 20 pixels for 60m bands. Cubic interpolation was applied to the 20m and 60m bands, which leads to 120 × 120 pixels for each band. The 19-class nomenclature of BigEarthNet-S2 was exploited. MLRSNet is a multi-label RS image archive that contains 109,161 images selected from aerial orthoimagery with varying spatial resolutions from 10m to 0.1m. For the experiments, we randomly selected a subset of MLRSNet that consists of 15,302 images, each of which has the size of 256 × 256 pixels. Each image is annotated with multi-labels from 60 classes. We divided BigEarthNet-S2 and MLRSNet into training, validation and test sets with the ratios of 50%, 25%, 25% and 50%, 10%, 40%, respectively. To apply CBIR, we selected queries from the validation set, while images were retrieved from the test set of each archive.

5.4.2 Experimental Design

For the first step of the proposed approach, we utilize the auto-encoder DNN architecture presented in [170]. The indexing decoder within the second step of the proposed SCI-CBIR includes the attention layer from [170] followed by two convolutional layers, each of which includes 512 hidden units with the ReLU activation function, while their filter sizes are 5 × 5 and 3 × 3. The class prediction and hash code generation blocks of the second step include single convolutional layers with the filter size of 1 × 1. We tested different activation functions for the hash code generation block among the sigmoid, tanh, softsign [180] and greedy hash [181] functions. The parameter $\gamma$ was set to 0.1, while the hash code length $q$ was varied as $q \in \{16, 32, 64\}$. The mini-batch size was selected as 32 for both archives. While training the second step, horizontal and vertical flipping were randomly applied to the training set. We trained the proposed approach by using the stochastic gradient descent algorithm.

As discussed in Section 5.3.3, the training of the proposed approach is divided into three stages. In the first stage, the first step of the proposed approach was optimized for the distortion loss only and $\eta_1$ was updated according to the MS-SSIM value averaged on the validation set $V$ as follows:

$$\eta_1 = \begin{cases} 10^{-4}, & \text{MS-SSIM}(V, \hat{V}) < 24 \\ 5 \times 10^{-5}, & 24 \leq \text{MS-SSIM}(V, \hat{V}) \leq 29 \\ 10^{-5}, & \text{MS-SSIM}(V, \hat{V}) > 29 \end{cases} \tag{5.13}$$

The second stage starts when the distortion loss value reaches its convergence. The learning rate $\eta_2$ was set to $10^{-5}$ at the beginning of the stage. After the first Pareto point was obtained, $\eta_2$ was increased to $9 \times 10^{-5}$. In the third stage, the second step of the proposed approach was jointly trained with the first step, while the learning rate $\eta_3^H$ was set to $10^{-4}$. $\eta_3^C$ was varied as $\eta_3^C \in \{0, 10^{-8}, 10^{-4}\}$, while the automatic loss weighting technique was varied among projecting conflicting gradients (PCGrad) [182], dynamic weight average (DWA) [183] and equal weighting. All the experiments were conducted on NVIDIA Tesla V100 GPUs. Experimental results are provided in terms of MS-SSIM and bit-rate (bpp) for compression performances, while precision (P (%)), recall (R (%)), mean average precision (MAP (%)) and retrieval time were used for comparing retrieval performances. It is worth noting that we mapped MS-SSIM values into the decibel (dB) scale as suggested in [170]. The retrieval metrics P, R and MAP were averaged on the 15 most similar images.
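The learning rate schedule in (5.13) can be written as a small function of the validation MS-SSIM in dB. The sketch below also includes a mapping of MS-SSIM to the dB scale; the formula $-10 \log_{10}(1 - \text{MS-SSIM})$ is an assumption based on common practice in learned image compression and is not spelled out in this chapter. The function names are hypothetical.

```python
import math


def ms_ssim_to_db(ms_ssim):
    """Map an MS-SSIM value in [0, 1) to decibels (assumed: -10 * log10(1 - MS-SSIM))."""
    return -10.0 * math.log10(1.0 - ms_ssim)


def stage_one_learning_rate(ms_ssim_db):
    """Learning rate eta_1 of (5.13), selected from the validation MS-SSIM in dB."""
    if ms_ssim_db < 24:
        return 1e-4
    if ms_ssim_db <= 29:
        return 5e-5
    return 1e-5


# A validation MS-SSIM of 0.99 corresponds to roughly 20 dB under the assumed mapping.
validation_db = ms_ssim_to_db(0.99)
eta_1 = stage_one_learning_rate(validation_db)
```

Under this mapping, the thresholds of 24 dB and 29 dB correspond to MS-SSIM values of approximately 0.996 and 0.9987, respectively.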

We conducted experiments to: 1) perform a sensitivity analysis; and 2) compare

the proposed SCI-CBIR approach with standard approaches. In detail, we compare

the results of the ﬁrst step of SCI-CBIR with those obtained by applying image

compression with a recurrent neural network (denoted as IC-RNN) [184] and JPEG

2000 [162]. We compare the results of the second step of SCI-CBIR with those

obtained by the second step of our approach trained on fully decompressed data

(denoted as SI-CBIR). We compare the results of proposed SCI-CBIR trained by using

our multi-stage learning procedure with those trained by using standard learning

procedure. For IC-RNN, we utilized MS-SSIM as the distortion measure and updated the learning rate using (5.13). It was trained with 6 RNN iterations for 280 epochs. For SI-CBIR, we trained the second step of our approach together with the image encoder of the first step by using the same hyper-parameters and the loss functions $L_P$, $L_B$ and $L_N$. SI-CBIR is not capable of simultaneous compression and indexing, and thus requires decoding prior to indexing. For the standard learning procedure, we jointly trained all the losses required for compression and indexing in a single learning procedure. For the loss weights, we varied the weight of the distortion loss and kept the rest equal to control the rate-distortion trade-off.

5.5 Experimental Results

5.5.1 Sensitivity Analysis of the Proposed SCI-CBIR Approach

In this sub-section, the results of the sensitivity analysis for the proposed SCI-CBIR approach are presented in terms of: i) different values of the learning rate $\eta_3^C$; ii) the effectiveness of the attention layer applied in the second step; iii) different activation functions of the hash code generation block within the second step; iv) different automatic loss weighting techniques applied in the third stage of our multi-stage learning procedure; and v) different values of the hash code length $q$. It is worth noting that during the sensitivity analysis, we set default values for the following hyper-parameters: i) $q = 64$; and ii) the bpp value as 0.63 and 0.33 on BigEarthNet-S2 and MLRSNet, respectively, for the first two stages of our learning procedure. We also set PCGrad as the default automatic loss weighting technique and Greedy hash as the default activation function.

In the first set of trials, we analyzed the effect of the learning rate $\eta_3^C$ (which is utilized in the third stage of the proposed multi-stage learning procedure). Table 5.1 shows the corresponding results for the BigEarthNet-S2 archive when different values of $\eta_3^C$ are used and the first two stages of our learning procedure are achieved at different bpp values. By analyzing the table, one can observe that using a higher value of $\eta_3^C$ as $10^{-4}$ leads to a significant reduction in compression performance while providing the highest retrieval scores. One can see from the table that when the first step of our approach is not optimized ($\eta_3^C = 0$), our approach generally achieves the lowest retrieval scores compared to using $\eta_3^C > 0$. However, when $\eta_3^C$ is set to a small value higher than zero ($\eta_3^C = 10^{-8}$), the proposed SCI-CBIR approach achieves comparable compression and retrieval performances. This shows that our approach is capable of simultaneously learning image representations for both indexing and compression in the third stage of our multi-stage learning procedure when $\eta_3^C$ is properly set. Accordingly, we set $\eta_3^C$ to $10^{-8}$ for the rest of the experiments. We observed a similar effect of $\eta_3^C$ for the MLRSNet archive.

TABLE 5.1: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT VALUES OF $\eta_3^C$ WHEN THE FIRST TWO STAGES OF OUR LEARNING PROCEDURE ARE ACHIEVED AT DIFFERENT BIT-RATES (BIGEARTHNET-S2 ARCHIVE)

$\eta_3^C = 0$                      | $\eta_3^C = 10^{-8}$                | $\eta_3^C = 10^{-4}$
MS-SSIM  bpp   P     R     MAP   | MS-SSIM  bpp   P     R     MAP   | MS-SSIM  bpp   P     R     MAP
26.6     0.63  74.1  69.1  73.8  | 26.7     0.62  74.2  70.1  74.1  | 15.2     0.08  77.9  73.1  75.4
27.8     0.79  73.3  70.1  73.1  | 27.9     0.78  74.5  70.0  74.3  | 14.9     0.08  75.7  74.2  72.7
28.8     0.96  72.9  70.3  72.7  | 29.0     0.94  73.9  70.0  73.7  | 14.5     0.08  76.2  75.2  75.3
29.3     1.07  73.0  69.7  72.7  | 29.5     1.05  73.8  69.5  73.4  | 15.0     0.08  76.1  75.2  75.3
30.1     1.39  73.5  69.9  73.2  | 30.3     1.34  74.2  69.7  73.8  | 14.1     0.05  76.3  73.6  75.6
30.2     1.45  73.2  70.2  73.0  | 30.5     1.38  73.8  69.9  73.4  | 14.0     0.05  74.9  74.2  73.7
30.6     1.66  73.0  68.1  72.6  | 30.8     1.56  73.8  69.9  73.5  | 14.3     0.06  77.1  73.4  76.2

TABLE 5.2: RESULTS OBTAINED BY PROPOSED SCI-CBIR WITH AND WITHOUT THE ATTENTION LAYER WHEN THE FIRST TWO STAGES OF OUR LEARNING PROCEDURE ARE ACHIEVED AT DIFFERENT BIT-RATES (BIGEARTHNET-S2 ARCHIVE)

               With Attention Layer   Without Attention Layer
MS-SSIM  bpp   P     R     MAP        P     R     MAP
26.6     0.63  74.2  70.1  74.1       73.7  69.7  73.3
27.8     0.79  74.5  70.0  74.3       73.3  68.8  73.0
28.8     0.96  73.9  70.0  73.7       72.8  68.1  72.5
29.3     1.07  73.8  69.5  73.4       73.1  68.1  72.6
30.1     1.39  74.2  69.7  73.8       72.2  68.6  71.9
30.2     1.45  73.8  69.9  73.4       71.7  68.0  71.3
30.6     1.66  73.8  69.9  73.5       73.1  68.1  72.6

In the second set of experiments, we assessed the effectiveness of the attention layer, which is used in the second step of our approach. Table 5.2 shows the retrieval results obtained by the proposed SCI-CBIR approach with and without the attention layer for BigEarthNet-S2 when different bpp values are achieved in the first two stages of our learning procedure. From the table, one can see that the scores obtained with the attention layer are consistently higher than those obtained without it, independently of the bpp values. This shows the effectiveness of the attention layer, which increases the capability of our approach to accurately decode image representations for indexing in the second step, and thus to learn discriminative hash codes. Similar behaviour of the attention layer was observed for the MLRSNet archive.

In the third set of trials, we analyzed the effect of different activation functions of the hash code generation block. Table 5.3 shows the corresponding retrieval results for BigEarthNet-S2. One can observe from the table that using the Greedy hash activation function achieves the highest precision and MAP scores with a comparable recall score. This is because the Greedy hash function does not require applying a quantization loss on the discrete hash codes. Accordingly, this function minimizes the quantization error compared to other activation functions [181]. Thus, we set Greedy hash as the activation function for the rest of the experiments. We observed similar behaviour for the MLRSNet archive.

TABLE 5.3: RESULTS OBTAINED BY PROPOSED SCI-CBIR UNDER DIFFERENT ACTIVATION FUNCTIONS (THE BIGEARTHNET-S2 ARCHIVE)

Activation Function   P     R     MAP
Sigmoid               71.4  71.8  71.4
Tanh                  72.1  71.3  72.4
Softsign [180]        71.8  70.5  71.0
Greedy Hash [181]     74.2  70.1  74.1

TABLE 5.4: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT AUTOMATIC LOSS WEIGHTING TECHNIQUES (BIGEARTHNET-S2 ARCHIVE)

Automatic Loss Weighting Technique   P     R     MAP
DWA [183]                            73.3  69.7  72.9
PCGrad [182]                         74.2  70.1  74.1
Equal Weighting                      73.3  69.3  73.2

In the fourth set of experiments, we assessed the effect of different automatic loss weighting techniques (which are applied in the third stage of our learning procedure) on retrieval performance. Table 5.4 shows the corresponding retrieval performances for the BigEarthNet-S2 archive. From the table, one can observe that the proposed SCI-CBIR approach achieves the highest scores when PCGrad is chosen as the automatic loss weighting technique. When DWA and the equal weighting technique (which equally weights different losses) are used, SCI-CBIR leads to similar retrieval performances. It is worth noting that the hashing objective in (5.7) is made up of different types of losses. PCGrad is capable of projecting the gradient of a loss function onto the normal plane of the gradient of another loss function. This reduces gradient interference among different loss functions, which allows more effective optimization of the hashing objective compared to DWA. Accordingly, for the rest of the experiments, we utilized PCGrad as the automatic loss weighting technique applied in the third stage of our learning procedure. Similar behaviour of these techniques on our approach was observed for the MLRSNet archive.
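The gradient projection performed by PCGrad can be sketched as follows. This is a simplified illustration under assumptions: gradients are flat Python lists, the other losses are visited in a fixed order (the original PCGrad algorithm [182] visits them in random order), and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two flat gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), project g_i onto the normal plane of g_j."""
    inner = dot(g_i, g_j)
    if inner >= 0:
        return list(g_i)  # no conflict, the gradient is left unchanged
    scale = inner / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]


def pcgrad(gradients):
    """Project every loss gradient against the others and sum the results."""
    projected = []
    for i, g_i in enumerate(gradients):
        g = list(g_i)
        for j, g_j in enumerate(gradients):
            if i != j:
                g = project_conflicting(g, g_j)
        projected.append(g)
    return [sum(values) for values in zip(*projected)]


g_pairwise = [1.0, 0.0]       # toy gradient of one loss
g_balancing = [-1.0, 1.0]     # toy gradient of a conflicting loss
projected = project_conflicting(g_pairwise, g_balancing)
combined = pcgrad([g_pairwise, g_balancing])
```

After the projection, the modified gradient is orthogonal to the conflicting one, so following it no longer increases the other loss to first order.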

In the fifth set of trials, we analyzed the effect of the hash code length. Table 5.5 shows the corresponding retrieval performances at different values of $q$ for the BigEarthNet-S2 and MLRSNet archives. One can observe from the table that, by increasing $q$, most of the metric values monotonically increase for both archives. Accordingly, the proposed SCI-CBIR achieves the highest scores under all the metrics except recall on BigEarthNet-S2 when $q = 64$ compared to other values of $q$. As an example, the proposed SCI-CBIR with $q = 64$ achieves almost 14% higher precision and 15% higher recall compared to SCI-CBIR with $q = 16$ for the MLRSNet archive. Thus, for the rest of the experiments, we set $q = 64$.

TABLE 5.5: RESULTS OBTAINED BY PROPOSED SCI-CBIR FOR DIFFERENT VALUES OF $q$

      BigEarthNet-S2      MLRSNet
q     P     R     MAP     P     R     MAP
16    72.2  69.0  70.6    46.7  45.0  44.7
32    72.5  70.9  72.6    57.7  57.3  56.5
64    74.2  70.1  74.1    60.6  59.8  58.9

[Figure 5.2 (plots). Panels (a) and (b): MS-SSIM (dB) versus bpp curves for SCI-CBIR, IC-RNN and JPEG2000.]

FIGURE 5.2: Multi-scale structural similarity index (MS-SSIM) in dB versus bpp obtained by the proposed SCI-CBIR approach, IC-RNN and JPEG2000 for (a) BigEarthNet-S2 and (b) MLRSNet archives.

5.5.2 Comparison with Standard Approaches

In this sub-section, we compare the performance of the first and second steps of our approach and our multi-stage learning procedure with the standard approaches. Accordingly, we evaluated the effectiveness of: i) the first step compared to JPEG2000 [162] and IC-RNN [184]; ii) the second step compared to SI-CBIR; and iii) the multi-stage learning procedure compared to the standard learning procedure.

In the ﬁrst set of trials, we compare the DL-based compression step of our approach

with JPEG2000 and IC-RNN. Fig. 5.2 shows the compression results at different

bpp values for BigEarthNet-S2 and MLRSNet archives. By assessing the ﬁgure, one


FIGURE 5.3: An RS image compression example: (a) original image; reconstructed image

at 0.7 bits per pixel (bpp) by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed

SCI-CBIR approach (BigEarthNet-S2 archive).


FIGURE 5.4: An RS image compression example: (a) original image; reconstructed image at

0.3 bpp by (b) JPEG2000 [162]; (c) IC-RNN [184]; and (d) the proposed SCI-CBIR approach

(MLRSNet archive).

can observe that our SCI-CBIR approach achieves the highest MS-SSIM at each bpp value for both archives. This shows that the first step of our approach is capable of effectively decoding RS images at varying rate-distortion points, even though RS image compression and indexing are learnt simultaneously in our approach. In greater detail, the proposed SCI-CBIR approach and IC-RNN significantly outperform the JPEG2000 algorithm. This shows the effectiveness of DL-based compression compared to conventional methods for RS images. Figs. 5.3 and 5.4 show examples of reconstructed RS images after they are compressed by the proposed SCI-CBIR, IC-RNN and JPEG2000 for the BigEarthNet-S2 and MLRSNet archives, respectively. One can see from the figures that the proposed SCI-CBIR approach is as capable as IC-RNN of reconstructing images without significant loss of spatial information. When compared to JPEG2000, our approach provides higher reconstruction quality. As an example, when JPEG2000 is utilized to compress the original image given in Fig. 5.3-a at 0.7 bpp, it is not able to reconstruct the spatial details of the original image (see Fig. 5.3-b), in contrast to our approach.

In the second set of experiments, we assessed the effectiveness of the deep hashing-based indexing step of our approach compared to SI-CBIR. Fig. 5.5 shows the corresponding CBIR results for both archives when the first step of our approach was used to decode RS images at different bpp values for SI-CBIR. By analyzing the figure, one can see that the proposed SCI-CBIR approach achieves similar CBIR performance compared to SI-CBIR under different bpp values. In greater detail, one can also observe that the CBIR performance of our approach is not significantly affected by changes in the bpp values. This shows that when compression and indexing are simultaneously learnt, the proposed SCI-CBIR approach is capable of indexing RS images as accurately as SI-CBIR, which does not learn image compression during training. Figs. 5.6 and 5.7 show examples of RS images retrieved by both approaches for the BigEarthNet-S2 and MLRSNet archives, respectively. One can see from the figures that the proposed SCI-CBIR approach retrieves images that are as similar to the query images as those retrieved by SI-CBIR, independently of the bpp values. This


[Plots: MAP@15 (%) versus bpp for SCI-CBIR and SI-CBIR. Panel (a): bpp axis from 0.7 to 1.5. Panel (b): bpp axis from 0.4 to 1.]

FIGURE 5.5: MAP versus bpp obtained by the proposed SCI-CBIR approach and SI-CBIR for (a) BigEarthNet-S2 and (b) MLRSNet archives.

TABLE 5.6: RETRIEVAL TIME PER IMAGE (IN MILLISECONDS) OBTAINED BY SI-CBIR AND THE PROPOSED SCI-CBIR APPROACH

                            Time
Archive          Approach   Decoding   Indexing   Total
BigEarthNet-S2   SI-CBIR    970        149        1119
BigEarthNet-S2   SCI-CBIR   N/A        149        149
MLRSNet          SI-CBIR    5287       733        6020
MLRSNet          SCI-CBIR   N/A        733        733

is in line with our conclusion from Fig. 5.5. Table 5.6 shows the required CBIR time for both approaches. It can be seen from the table that, under similar CBIR scores, the retrieval time per image of the proposed SCI-CBIR is roughly one-eighth of that of SI-CBIR for both archives (149 ms versus 1119 ms for BigEarthNet-S2 and 733 ms versus 6020 ms for MLRSNet). This is due to the fact that the retrieval time of SI-CBIR also includes the image decoding time, which is not required for the proposed SCI-CBIR approach. In detail, since RS image compression and indexing are simultaneously learnt by our approach during training, hash codes (which are generated by our deep hashing-based indexing step) are directly utilized for CBIR without any need for decompressing RS images. Due to this, during large-scale RS image indexing, the proposed SCI-CBIR approach saves the significant amount of time required for the computationally demanding decompression of images.
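Retrieval directly on hash codes, without decoding any image, can be sketched as a Hamming-distance ranking. The snippet below is only an illustration under stated assumptions: the function name `hamming_retrieve`, the representation of hash codes as 0/1 arrays and the default `top_k=15` are hypothetical choices and do not describe the actual implementation evaluated in Table 5.6.

```python
import numpy as np

def hamming_retrieve(query_code, archive_codes, top_k=15):
    """Rank archive hash codes by Hamming distance to the query code.

    query_code: 1-D array of 0/1 bits.
    archive_codes: 2-D array with one 0/1 hash code per row.
    Returns (indices, distances) of the top_k closest archive codes.
    """
    query_code = np.asarray(query_code, dtype=np.uint8)
    archive_codes = np.asarray(archive_codes, dtype=np.uint8)
    # Hamming distance = number of differing bits for each archive code.
    distances = np.count_nonzero(archive_codes != query_code, axis=1)
    # A stable sort keeps the archive order among codes with equal distance.
    order = np.argsort(distances, kind="stable")[:top_k]
    return order, distances[order]
```

Since only bit comparisons on short codes are involved, the cost of this step does not depend on the image size or on the bit-rate at which the images are stored.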

In the third set of trials, we analyzed the effectiveness of the proposed multi-stage learning procedure by comparing it with the standard learning procedure. Table 5.7 shows the compression and retrieval results obtained by the proposed SCI-CBIR approach trained with the proposed multi-stage and standard learning procedures for the BigEarthNet-S2 archive. By assessing the table, one can see that the proposed SCI-CBIR approach with our multi-stage procedure provides higher scores of CBIR


[Image grid: row (a) shows the query image; rows (b) to (f) show the 1st, 5th, 10th and 15th retrieved images.]

FIGURE 5.6: (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.62 bpp; (d) the proposed SCI-CBIR at 0.78 bpp; (e) the proposed SCI-CBIR at 1.05 bpp; and (f) the proposed SCI-CBIR at 1.56 bpp (BigEarthNet-S2 archive).

metrics and MS-SSIM values compared to SCI-CBIR with the standard learning procedure at similar bpp values. This is due to the fact that when a single learning procedure with equal loss weights is utilized, as in the standard learning procedure, the learning objectives for indexing and compression conflict with each other independently of the rate-distortion trade-off point (which is controlled by λ in the standard learning procedure). This prevents RS image compression from being learnt accurately together with RS image indexing. Unlike the standard learning procedure, the proposed multi-stage learning procedure enables our approach to simultaneously learn both tasks in an effective way by automatically: i) weighting different loss functions; and ii) finding rate-distortion trade-off points. Similar behaviour of the proposed multi-stage learning procedure was observed for the MLRSNet archive.


[Image grid: row (a) shows the query image; rows (b) to (f) show the 1st, 5th, 10th and 15th retrieved images.]

FIGURE 5.7: (a) Query image; and images retrieved by (b) SI-CBIR; (c) the proposed SCI-CBIR at 0.33 bpp; (d) the proposed SCI-CBIR at 0.56 bpp; (e) the proposed SCI-CBIR at 0.69 bpp; and (f) the proposed SCI-CBIR at 0.85 bpp (MLRSNet archive).

5.6 Conclusion

This chapter introduces a novel approach (denoted as SCI-CBIR) to simultaneously

compress and index RS images for scalable CBIR. The SCI-CBIR approach is charac-

terized by two steps that are simultaneously applied based on a novel multi-stage

learning procedure. The ﬁrst step is the DL-based compression step, where RS images

are ﬁrst mapped into their latent representations, and then reconstructed back from

the latents by exploiting a pair of encoder and decoder DNNs. An entropy model

is utilized to generate bitstreams for a rate-distortion trade-off point. The second

step is the deep hashing-based indexing step, where hash codes of RS images are

generated from their latent representations. With the proposed multi-stage learning

procedure, all the parameters of SCI-CBIR are learnt within three consecutive stages

as: i) minimizing a distortion loss to model reconstruction; ii) ﬁnding the Pareto

optimal solutions of a multi-objective optimization problem to achieve a variable

range of bit-rates; and iii) minimizing soft pairwise, bit-balancing and classiﬁcation


TABLE 5.7: RESULTS OBTAINED BY PROPOSED SCI-CBIR TRAINED WITH OUR MULTI-STAGE LEARNING PROCEDURE AND STANDARD LEARNING PROCEDURE ASSOCIATED TO SIMILAR BIT-RATES (THE BIGEARTHNET-S2 ARCHIVE)

Our Multi-Stage Learning Procedure        Standard Learning Procedure
MS-SSIM  bpp   P     R     MAP            λ     MS-SSIM  bpp   P     R     MAP
26.7     0.62  74.2  70.1  74.1           150   22.3     0.63  70.8  67.6  70.2
27.9     0.78  74.5  70.0  74.3           200   23.0     0.71  70.8  67.8  70.3
29.0     0.94  73.9  70.0  73.7           500   26.3     0.87  70.1  68.0  70.0
29.5     1.05  73.8  69.5  73.4           700   26.9     1.08  70.6  68.2  70.0
30.3     1.34  74.2  69.7  73.8           1000  27.6     1.29  70.0  68.0  69.3
30.5     1.38  73.8  69.9  73.4           1250  27.9     1.44  70.5  67.9  70.1
30.8     1.56  73.8  69.9  73.5           1500  28.0     1.64  70.1  67.2  69.6

losses with automatic loss weighting techniques to characterize hash codes. This

allows the proposed SCI-CBIR approach to: i) obtain different bit-rates without a

need for training the considered DNN multiple times; and ii) automatically ﬁnd the

weights for the ﬁve different losses considered in both steps without any need for

computationally expensive grid search.

Experimental results obtained on two benchmark archives show that the proposed approach provides high compression performance, while resulting in high retrieval accuracy without any need for decompressing the images prior to indexing (which is required for most of the CBIR systems in RS). We underline that this is a very important advantage, particularly for large-scale CBIR, and thus the proposed approach is convenient for possible operational applications. It is worth noting that the archives used in our experiments are benchmarks. However, in many real applications, we expect CBIR to be applied to much larger archives. For large-scale CBIR, the gain in retrieval time obtained by using our approach is expected to increase significantly compared to the existing approaches. In the case of compressing and indexing very large RS image scenes, we suggest utilizing light-weight DNNs (such as Zoom-In [185] and ESPNetv2 [186]) that allow the training and inference of our approach to be performed in a computationally efficient manner.

It is worth noting that the proposed approach can be easily adapted to the CBIR

problems for which: i) images are compressed by other DL-based compression algo-

rithms; and also ii) hash codes are obtained through different DL-based architectures.

As a ﬁnal remark, we would like to point out that the development of DL-based

image compression methods is becoming a more and more important topic. In this

context, the proposed approach is very promising as it allows RS CBIR for the case

that images are compressed by using DNNs. As a future development, we plan to

study the development of DL-based 3D compression models where not only spatial

but also spectral redundancies are compressed. Moreover, we plan to explore RS

CBIR in the 3D compressed domain, which is expected to be particularly relevant for

search and retrieval from hyperspectral image archives.

Chapter 6

Generative Reasoning Integrated Label Noise Robust Deep Image Representation Learning in Remote Sensing

Most of the DL-based IRL methods require a large number of training RS images with high-quality annotations, which can be time-consuming, complex and costly to gather. To reduce labeling costs, publicly available thematic maps, automatic labeling procedures or crowdsourced data can be used. However, such approaches increase the risk of including label noise in training data. This may result in overfitting on noisy labels when discriminative reasoning is employed, as in most of the existing methods. This leads to sub-optimal learning procedures, and thus inaccurate characterization of RS images. In this chapter, we introduce a generative reasoning integrated label noise robust deep representation learning (GRID) approach. The proposed GRID approach aims to model the complementary characteristics of discriminative and generative reasoning for IRL under noisy labels. To this end, we first integrate generative reasoning into discriminative reasoning through a supervised variational autoencoder. This allows the proposed GRID approach to automatically detect training samples with noisy labels. Then, through our label noise robust hybrid representation learning strategy, GRID adjusts the whole learning procedure so that IRL of these samples is driven by generative reasoning and that of the other samples by discriminative reasoning. Our approach learns discriminative RS image representations while preventing interference of noisy labels during training, independently of the IRL method being selected. Thus, unlike the existing label noise robust methods, GRID does not depend on the type of annotation, label noise, neural network architecture, loss function or learning task, and can therefore be directly utilized for various RS image understanding problems. Experimental results show the effectiveness of the proposed GRID approach compared to the state-of-the-art methods. The code of the proposed approach will be publicly available at https://git.tu-berlin.de/rsim/GRID. This chapter is mainly based on the following publications:

• G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust deep image representation learning,” IEEE Transactions on Image Processing, 2023. DOI:10.1109/TIP.2023.3293776.


• G. Sumbul and B. Demir, “Label noise robust image representation learning based on supervised variational autoencoders in remote sensing,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2023.

6.1 Introduction

DL-based IRL of RS images is generally achieved in a supervised way during the

optimization of a loss function based on the characteristics of a learning task (e.g.,

single/multi-label classiﬁcation, semantic segmentation etc.). To effectively learn

DL model parameters, the availability of a high quantity and quality of annotated

training RS images is required. Depending on the considered learning task, annota-

tions of training RS images can be given at scene-level or pixel-level. For scene-level

annotations, each training image is annotated by either a single label, which is asso-

ciated to the most signiﬁcant content of the image, or multi-labels. In general, the

manual collection of RS image annotations by domain experts for large scale data

can be time consuming, complex and costly. To address this issue, publicly available

thematic maps (e.g., the CORINE Land Cover inventory [49]), automatic labeling

procedures and volunteered geographic information (VGI) as crowdsourced data can

be used. These strategies provide RS image annotations at zero cost. However, the considered thematic map or VGI source can be outdated with respect to the RS images due to possible changes on the ground, or it can contain annotation errors. Thus, these strategies increase the risk of including noisy labels in training data. It is worth noting that, for a scene-level single-label annotation and a pixel-level annotation, label noise occurs as an incorrect label associated with an image and a pixel, respectively. However, for a scene-level multi-label annotation, it can emerge as a missing label (i.e., a class is present in an image while the corresponding label is not assigned to that image), a wrong label (i.e., a class is not present in an image while its label is assigned to the image) or a combination of both missing and wrong labels.
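The two multi-label noise types can be made concrete with a small sketch that compares a correct multi-hot label vector with a noisy one. The function name `multilabel_noise_types` and the multi-hot encoding are assumptions introduced here for illustration only; in practice the correct labels are not available during training.

```python
import numpy as np

def multilabel_noise_types(clean, noisy):
    """Split the disagreement between two multi-hot label vectors into noise types.

    clean, noisy: 1-D arrays of 0/1 entries, one entry per class.
    Returns (missing, wrong) arrays of class indices.
    """
    clean = np.asarray(clean, dtype=bool)
    noisy = np.asarray(noisy, dtype=bool)
    missing = np.flatnonzero(clean & ~noisy)  # class present, label not assigned
    wrong = np.flatnonzero(~clean & noisy)    # class absent, label assigned
    return missing, wrong
```

An annotation that has both a non-empty `missing` and a non-empty `wrong` set corresponds to the combined case mentioned above.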

Most of the existing DL-based IRL methods in RS employ discriminative learning (i.e., discriminative reasoning) of image representations. This is based on directly modeling a posterior data distribution $p(y|x)$ by utilizing image-annotation pairs from training data. The effectiveness of discriminative reasoning has been proven compared to generative reasoning (which is based on modelling the joint data distribution $p(x, y)$) when training data is abundant [187]. However, discriminative models are more sensitive to label noise compared to generative models. Accordingly, discriminative learning of RS image representations with noisy labels may result in overfitting of the considered deep neural network (DNN) to noisy labels and a lack of generalization capability, and thus inaccurate characterization of RS images during both training and inference [50], [51].

To address this problem, several methods have been presented, mostly in the computer vision (CV) community, to improve the robustness of discriminative IRL when training data includes noisy labels. All these methods are potentially effective for DL-based IRL under noisy labels in RS. However, most of them are dependent on the type of: i) label noise present in training data; ii) image annotation; iii) loss function (e.g., cross-entropy, focal loss etc.); iv) DNN architecture; or v) learning task. Some methods also require the availability of a subset of the training set, which includes


clean labels, or require computationally demanding noise correction strategies prior to training. Thus, they may not be directly integrated into different scenarios associated with IRL in RS.

To overcome this issue, in this chapter we introduce a Generative Reasoning Integrated Label Noise Robust Deep Representation Learning (denoted as GRID hereafter) approach. The proposed GRID approach aims to model the complementary characteristics of discriminative and generative reasoning for IRL under noisy labels. To this end, for discriminative reasoning, we first employ a DNN composed of an RS image encoder (i.e., CNN backbone) and a discriminative task head for modelling the posterior distribution of labeled RS images, as in most of the supervised DL-based IRL methods in RS. Then, we integrate generative reasoning into discriminative reasoning through a supervised variational autoencoder (which includes a variational encoder, a feature decoder and a generative task head) placed after the CNN backbone for modelling the joint distribution of labeled RS images. This allows the proposed GRID approach to automatically detect training samples with noisy labels based on the loss values acquired from the discriminative and generative task heads. Then, through our label noise robust hybrid representation learning strategy, the model parameters of the considered DNN are updated through: i) generative reasoning for the samples with noisy labels; and ii) discriminative reasoning for the remaining samples in training data. Accordingly, our approach allows learning discriminative RS image representations through the CNN backbone, while preventing overfitting on noisy labels during training independently of the IRL method being selected. Thus, unlike the existing label noise robust methods, GRID does not depend on the type of annotation, label noise, DNN architecture, loss function or learning task. It also does not require a trustworthy subset of a training set or a computationally demanding noise correction strategy prior to training. Thus, our approach can be directly utilized for various scenarios for IRL in RS. In this chapter, we consider two IRL scenarios, where training RS images are annotated with: 1) scene-level noisy multi-labels; and 2) pixel-level noisy labels. Under these scenarios, we consider three learning tasks with the corresponding loss functions and DNN architectures. For different scenarios and learning tasks, we conduct experiments on a single RS application for the sake of simplicity. This application is selected as content-based image retrieval (CBIR) due to the importance of employing accurate image features for similarity matching in CBIR. We would like to note that, to the best of our knowledge, GRID is the first approach in RS that combines generative and discriminative reasoning for supervised IRL under noisy labels, which leads to accurate characterization of RS image representations while preventing interference of noisy labels during training.

The rest of the chapter is organized as follows: Section 6.2 presents the related works

on DL-based label noise robust IRL in CV and RS. Section 6.3 introduces the proposed

GRID approach. Section 6.4 describes the considered RS image data archives and the

experimental setup, while Section 6.5 provides the experimental results. Section 6.6

concludes the chapter.


6.2 Related Works

A few methods for label noise robust DL-based IRL have recently been presented in RS for image classification [188]–[192] and semantic segmentation problems [31], [193]. As an example for RS image classification problems, a noisy label distillation method is introduced in [188] to leverage the knowledge learnt through a teacher model on images with noisy labels for a student model. In this method, two convolutional neural networks (CNNs) are employed as a teacher-student framework, while a clean and trustworthy subset of a training set is assumed to be available for the student CNN. In [189], a down-weighting factor is integrated into the normalized softmax loss function to reduce the effect of wrongly classified images (which are assumed to be associated with noisy labels) on the model parameter updates. It is noted that these methods are designed for RS images associated with single labels. For RS images annotated by multi-labels, a collaborative learning framework is proposed in [192] to identify and exclude images with noisy multi-labels during training. To this end, it employs two CNNs operating collaboratively, which are forced to characterize distinct image representations and to produce similar predictions. In [191], the effects of different label noise types in multi-label RS image classification problems are investigated, while different noise robust methods are adapted from single-label to multi-label classification problems in RS. Apart from scene-level image classification, label noise robust land-cover map generation through semantic segmentation has also attracted attention in RS. As an example, in [31], an online noise correction approach is proposed to detect and correct pixel-level noisy labels via information entropy at the early stage of training, and thus to continue training with corrected labels.

It is noted that, for label noise robust IRL, the research is more extensive in CV, but mostly dedicated to single-label image classification problems. Recent research directions in the CV community are mainly concentrated on the development of: i) deep architectures [194]; ii) loss functions [195]; iii) regularization strategies [196]; and iv) sample selection and label adjustment techniques [197], while aiming to achieve learning procedures that are more robust to label noise. The methods in the first category focus on designing DNN architectures specific to training data with noisy labels. As an example, in [194], a contrastive-additive noise network is proposed to model the trustworthiness of noisy labels in the context of image classification. To this end, it includes a probabilistic latent variable model as a contrastive layer to estimate the quality of labels and an additive layer to aggregate the class predictions and noisy labels. The methods in the second category are mostly devoted to utilizing loss functions that have robust characteristics when used with noisy labels. For instance, the asymmetric loss function introduced in [198] allows dynamically decreasing the weights of negative classes in multi-labels. This decreases the effect of images with missing labels on IRL. The methods in the third category aim at regularizing the whole learning procedure to prevent overfitting on noisy labels. As an example, in [196], a regularization term is integrated into the cross-entropy loss to guide the learning process with the class predictions from an early stage of training to prevent memorization of noisy labels. The methods in the last category focus on first detecting images associated with correct labels or adjusting noisy labels, and then learning through those samples or adjusted labels. For instance, in [197], a joint training


[Diagram blocks: CNN backbone; discriminative task head with loss $\mathcal{O}_d(\mathcal{B})$; VAE encoder, latent variable sampling, feature decoder and generative task head with loss $\mathcal{O}_g(\mathcal{B})$; automatic noisy sample detection, which places a sample in the noisy set $\mathcal{W}$ if the difference between its two task-head losses is greater than aλ and in the clean set $\mathcal{C}$ otherwise. Arrows: forward pass; backward pass for all samples; backward pass for clean samples; backward pass for noisy samples.]

FIGURE 6.1: An illustration of the training of our GRID approach that jointly leverages the robustness of generative reasoning towards noisy labels and the effectiveness of discriminative reasoning on image representation learning. During the forward pass on a mini-batch $\mathcal{B}$, the loss values $\mathcal{O}_d(\mathcal{B})$, $\mathcal{O}_g(\mathcal{B})$ and the predicted labels $\hat{\mathcal{Y}}^d$, $\hat{\mathcal{Y}}^g$ are obtained through discriminative and generative reasoning for a given learning task. Then, the set $\mathcal{W}$ of training samples with noisy labels (i.e., noisy samples) and the set $\mathcal{C}$ of training samples with correct labels (i.e., clean samples) are constructed through our automatic noisy sample detection procedure based on the values of the loss function $\mathcal{L}$ associated with the learning task. During the backward pass, the model parameters except the CNN backbone parameters are updated with all samples based on $\nabla_\gamma \mathcal{O}_d(\mathcal{B})$ and $\nabla_\beta \mathcal{O}_g(\mathcal{B})$. The parameters of the CNN backbone are updated through: i) the generative task head for the noisy samples based on $\nabla_\theta \mathcal{O}_g(\mathcal{W})$; and ii) the discriminative task head for the clean samples based on $\nabla_\theta \mathcal{O}_d(\mathcal{C})$.

with co-regularization approach employs collaborative learning of two CNNs for

the selection of correct labels by an agreement strategy. For a detailed summary of

DL-based label noise robust IRL methods in CV, we refer the reader to [51].

6.3 Proposed Approach

Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an RS image archive that includes $M$ images, where $x_i$ is the $i$th image in the archive. We assume that a training set $\mathcal{T} = \{(x_1, y_1), \ldots, (x_K, y_K)\}$ that includes $K$ i.i.d. samples of random variables $x$ and $y$ is available. $x_i$ is the $i$th image and $y_i$ is the corresponding image annotation. Annotations of training images can be given at pixel-level or scene-level. An image can be annotated by a broad category label (i.e., single-label) or multi-labels. We assume that the labels in the set $\mathcal{Y}$ of training image annotations can be noisy. For a scene-level single-label or a pixel-level noisy annotation, label noise may occur as an incorrect label associated to an image or a pixel, respectively. For a scene-level multi-label noisy annotation, label noise may occur as a missing label, a wrong label or combination of both missing and wrong labels.

The proposed GRID approach aims to jointly leverage the robustness of generative

reasoning towards noisy labels and the effectiveness of discriminative reasoning on

IRL. This is achieved by ﬁrst integrating generative reasoning into discriminative

reasoning through a supervised variational autoencoder, and then characterizing

discriminative RS image representations while preventing interference of noisy labels


through our label noise robust hybrid representation learning strategy. Fig. 6.1 shows

an illustration of the proposed GRID approach. We ﬁrst provide general information

on discriminative reasoning, and then present our approach in detail in the following

subsections.

6.3.1 Basics on Discriminative Reasoning

DL-based IRL methods through discriminative reasoning aim to employ the discriminative capabilities of DNNs for the characterization of RS image features. This is achieved by maximizing the posterior distribution of labeled RS images $p(y|x)$ during training. To this end, the considered DNN typically includes an image encoder (i.e., a CNN backbone) and a discriminative task head including fully connected or convolutional layers (which is branched out from the CNN backbone). Let $\phi : \theta, \mathcal{X} \mapsto \mathcal{F}$ be any type of CNN backbone that maps the image $x_i$ into the corresponding image descriptor $f_i$, which is a sample of random variable $f$. $\theta$ is the set of CNN parameters and $\mathcal{F}$ is the set of all descriptors for $\mathcal{X}$. Let $t_d : \gamma, \mathcal{F} \mapsto \hat{\mathcal{Y}}^d$ be a discriminative task head that maps the image descriptor into the corresponding label prediction associated with the image $x_i$ [i.e., $t_d(\phi(x_i; \theta); \gamma) = \hat{y}_i^d$], where $\gamma$ is the task head parameters and $\hat{\mathcal{Y}}^d$ is the set of all predicted image labels. The CNN backbone models the global image representation space, while the overall DNN models the posterior distribution $p(y|x)$. In this formulation, the model parameters $\theta \cup \gamma$ are updated to maximize $\mathbb{E}_{\mathcal{T}}[\log p(y|x)]$. Accordingly, the objective function $\mathcal{O}_d$ associated with discriminative reasoning for a set of samples $\mathcal{S}$ is written as follows:

\begin{equation}
\mathcal{O}_d(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}\big(t_d(\phi(x_i; \theta); \gamma), y_i\big); \quad \exists \mathcal{L} \in \mathbb{L}, \tag{6.1}
\end{equation}

where $\mathbb{L}$ is the set of all loss functions, each element of which is capable of measuring how different the prediction $\hat{y}_i^d$ is from $y_i$. Accordingly, any loss function that can measure the sample-wise error can be used for $\mathcal{L}$.
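The objective in (6.1) is simply the average of a sample-wise loss over a set of samples. The following sketch illustrates this with plain Python callables standing in for the CNN backbone and the discriminative task head; the function names and the choice of binary cross-entropy as the sample-wise loss are assumptions made for illustration, since any loss that measures a sample-wise error could be plugged in.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-12):
    """One possible sample-wise loss: mean binary cross-entropy over the classes."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def discriminative_objective(samples, backbone, task_head, loss_fn):
    """Average of loss_fn(task_head(backbone(x)), y) over all (x, y) in samples, as in (6.1)."""
    losses = [loss_fn(task_head(backbone(x)), y) for x, y in samples]
    return sum(losses) / len(losses)
```

Because the objective only requires a loss value per sample, the same function can be evaluated on a whole mini-batch or on any subset of it.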

The discriminative learning of RS image representations has been found successful for many applications when the labeled training data is abundant [187]. However, learning image representations via modeling the posterior distribution of training data can be sensitive to noisy labels included in the calculation of the loss function. When the ratio of noisy labels over $\mathcal{Y}$ is significantly high, the considered DNN can suffer from overfitting on noisy labels, leading to inaccurate IRL and a lack of generalization capability [50], [51].

6.3.2 Integration of Generative Reasoning

Generative learning of image representations via modeling the joint data distribution $p(x, y)$ limits the overfitting of the considered DNN on noisy labels during training. Thus, it is proven to be more robust to noisy labels compared to discriminative learning [199]. However, learning image representations via generative reasoning may limit the accurate characterization of discriminative image descriptors, and thus may lead to inaccurate IRL. Accordingly, the proposed GRID approach aims at


effectively integrating generative reasoning into discriminative reasoning to achieve

discriminative and generative modelling of RS images in a single learning procedure.

To model $p(x, y)$, we assume that $x$ and $y$ are generated through a latent variable $z$. Each sample of the latent variable $z_i$ is generated from a prior distribution $p(z)$ while $x_i$ and $y_i$ are generated from $p(x, y|z)$. It is worth noting that the marginal likelihood over the latent variable $\int p(z) p(x, y|z) \, dz$ is intractable for DNNs since it is hard to find an analytical solution for the posterior distribution of the latent variable $p(z|x, y)$. To this end, we utilize a variational auto-encoder (VAE) introduced in [200] as a latent variable model. Accordingly, we approximate the true posterior distribution of latent variable with a variational approximate posterior $q(z|x, y)$ of a known functional form (e.g., a Gaussian distribution parameterized by the encoder of a VAE). Then, the variational lower bound on the marginal log-likelihood (i.e., evidence lower bound [ELBO]) is defined as follows:

\begin{equation}
\log p_{\beta_d}(x, y) \geq \mathbb{E}_{q_{\beta_e}(z|x, y)}[\log p_{\beta_d}(x, y|z)] - D_{KL}(q_{\beta_e}(z|x, y) \,||\, p_{\beta_d}(z)), \tag{6.2}
\end{equation}

where $D_{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler (KL) divergence [201], $\beta_e$ is the VAE encoder
parameters and $\beta_d$ is the VAE decoder parameters. It is worth noting that $q_{\beta_e}(z|x)$ is a
sufficient statistic for $q_{\beta_e}(z|x, y)$. It guarantees that $z$ generated from $x$ embodies the
same information when it is jointly generated from $x$ and $y$ [202]. Since $p_{\beta_d}(x, y|z)$
can be factorized into $p_{\beta_d}(x|z)\,p_{\beta_d}(y|z)$ (i.e., conditional independence), (6.2) can be
written as follows:

$$\log p_{\beta_d}(x, y) \geq \mathbb{E}_{q_{\beta_e}(z|x)}[\log p_{\beta_d}(x|z)] + \mathbb{E}_{q_{\beta_e}(z|x)}[\log p_{\beta_d}(y|z)] - D_{KL}(q_{\beta_e}(z|x) \,\|\, p_{\beta_d}(z)). \tag{6.3}$$

We define the variational approximate posterior and the latent prior as multivariate
Gaussian distributions as follows:

$$z_i \sim q_{\beta_e}(z|x_i) = \mathcal{N}(z \,|\, \mu_i, \sigma_i^2 I), \tag{6.4}$$

$$p_{\beta_d}(z) = \mathcal{N}(z \,|\, 0, I). \tag{6.5}$$

Since the image descriptor $f_i$ is the representative of $x_i$, we define the variational
generative process based on $f_i$ rather than $x_i$. Let the VAE encoder (with the parameters
$\beta_e$) map the image descriptor $f_i$ into the parameters $\mu_i$ and $\sigma_i$ of the $q_{\beta_e}$
distribution for $x_i$. To prevent the interference of the DNN training with the stochastic
sampling of $z_i$, we utilize the reparameterization trick introduced in [200] to generate
$z_i$ as follows:

$$z_i = \mu_i + \sigma_i \cdot \epsilon_i; \quad \epsilon_i \sim \mathcal{N}(0, I). \tag{6.6}$$
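As an illustration of (6.6), the sampling step can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in the experiments: the function name `reparameterize`, the batch size of 4 and the use of NumPy are assumptions made only for this example (the latent dimension of 128 follows the experimental design).

```python
import numpy as np


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), as in (6.6).

    The randomness is isolated in eps, so z stays a deterministic
    (and differentiable) function of mu and sigma.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


# Hypothetical example: a batch of 4 samples with a 128-dimensional latent.
rng = np.random.default_rng(0)
mu = np.zeros((4, 128))
sigma = np.ones((4, 128))
z = reparameterize(mu, sigma, rng)
```

With `sigma` set to zero the function returns `mu` unchanged, which is a quick sanity check that the noise enters only through the scaled `eps` term.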

Let $t_g: \mathcal{Z} \mapsto \hat{\mathcal{Y}}^g$ be a generative task head that maps the latent into the
corresponding label prediction associated with the image $x_i$ [i.e., $t_g(z_i; \beta_t) = \hat{y}^g_i$], where
$\beta_t$ is the task head parameters and $\hat{\mathcal{Y}}^g$ is the set of all predicted image labels.
The generative task head is chosen as the duplicate of the discriminative task head, but they
are associated to different model parameters (i.e., $\beta_t \neq \gamma$). Let $r: \mathcal{Z} \mapsto \mathcal{F}$ be a
feature decoder that maps the latent into the reconstructed image descriptor $\hat{f}_i$ for $x_i$,
where $\beta_r$ is the feature decoder parameters. $t_g$ models $p_{\beta_d}(y|z)$, while $r$ models
$p_{\beta_d}(x|z)$. Accordingly, $t_g$ and $r$ together form the VAE decoder (i.e., $\beta_d = \beta_t \cup \beta_r$).

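To accurately model $p(x, y)$, the VAE parameters $\beta = \beta_e \cup \beta_d$ can be learned by
maximizing the ELBO defined in (6.3). To this end, we define: i) the first term of
the ELBO based on the mean squared error loss function $\mathcal{L}_{MSE}$; ii) the second term of the
ELBO based on the loss function $\mathcal{L}$ considered for discriminative reasoning; and iii) the
third term of the ELBO based on the known functional forms of $q_{\beta_e}(z|x_i)$ and $p_{\beta_d}(z)$.
Accordingly, the objective function $\mathcal{O}_g$ associated with generative reasoning for a set
of samples $\mathcal{S}$ is written as follows:

$$\mathcal{O}_g(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}_{MSE}(r(z_i; \beta_r), f_i) + \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \mathcal{L}(t_g(z_i; \beta_t), y_i) - \frac{1}{2|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \sum_{j} \left(1 + \log(\sigma_{i,j}^2) - \mu_{i,j}^2 - \sigma_{i,j}^2\right), \tag{6.7}$$

where $\mu_{i,j}$ and $\sigma_{i,j}$ are the $j$th elements of the vectors $\mu_i$ and $\sigma_i$, respectively, while
the inner sum runs over their length. For the derivation of the KL divergence term in the ELBO,
the reader is referred to [200].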
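The three terms of the generative objective can be made concrete with a small NumPy sketch. It is only a sketch under stated assumptions: the function names are hypothetical, the task loss values are passed in precomputed (so any $\mathcal{L}$ can be plugged in), the MSE is averaged over the descriptor dimensions, and the KL term uses the standard closed form between a diagonal Gaussian and $\mathcal{N}(0, I)$ from [200].

```python
import numpy as np


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2 I) || N(0, I)) for each sample (row)."""
    return -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2, axis=1)


def generative_objective(f_rec, f, task_loss, mu, sigma):
    """Average of reconstruction, task and KL terms over a set of samples.

    f_rec, f  : (N, D) reconstructed and original image descriptors
    task_loss : (N,) loss values of the generative task head
    mu, sigma : (N, J) parameters of the approximate posterior
    """
    mse = np.mean((f_rec - f) ** 2, axis=1)  # assumption: mean over D
    kl = kl_to_standard_normal(mu, sigma)
    return float(np.mean(mse + task_loss + kl))


# Hypothetical toy values: perfect reconstruction, constant task loss.
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = np.ones((2, 2))
f = np.zeros((2, 3))
f_rec = np.zeros((2, 3))
task_loss = np.array([0.5, 0.5])
value = generative_objective(f_rec, f, task_loss, mu, sigma)
```

In this toy case the first sample matches the prior exactly (zero KL), while the second one pays a KL penalty of 1 for its shifted mean, so the objective is the average of 0.5 and 1.5.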

It is worth noting that the proposed integration of generative reasoning into
discriminative reasoning does not depend on the selection of the loss function $\mathcal{L}$ and
discriminative task head, and thus can be applied to most of the supervised DL-based
IRL methods. It also does not require an additional CNN backbone as image
encoder since we define the variational generative process based on the image
descriptor $f_i$. Thus, the considered VAE is directly branched out from the CNN backbone
to learn RS image representations based on generative and discriminative reasoning together.

6.3.3 Label Noise Robust Hybrid Representation Learning

The proposed GRID approach aims to jointly model the posterior and joint distributions
of annotated RS images in a single learning procedure, while achieving label
noise robust IRL. To this end, we introduce a label noise robust hybrid representation
learning strategy to model RS images through: i) generative reasoning for the
training samples with noisy labels; and ii) discriminative reasoning for the remaining
samples in the training data. For the sake of simplicity, we refer to training samples
with noisy labels as noisy samples, and those with correct labels as clean samples
hereafter. It is noted that generative reasoning is less annotation dependent compared
to discriminative reasoning due to modelling the joint distribution $p(x, y)$ in a
probabilistic generative process. Thus, for discriminative reasoning, the loss value
differences between noisy samples and clean samples are higher compared to generative
reasoning. The proposed integration of generative reasoning into discriminative
reasoning allows noisy samples to be automatically detected based on the values of
$\mathcal{L}$ incurred through generative and discriminative reasoning. Accordingly, we decide
whether a training sample is noisy or clean based on the loss values acquired from
discriminative and generative task heads. To this end, we define our automatic noisy
sample detection procedure as follows. A training sample is considered as noisy if it
leads to a significantly higher loss value from the discriminative task head compared
to the generative task head. For a given mini-batch $\mathcal{B}$, we first sort the differences of
normalized loss values acquired from discriminative and generative task heads. This
can be defined as a non-increasing sequence $\mathcal{A}$ as follows:

$$\mathcal{A} = (a_k)_{k=1}^{|\mathcal{B}|}, \quad a_k \geq a_{k+1}, \quad a_k \in \{\mathcal{L}(\hat{y}^d_i, y_i) - \mathcal{L}(\hat{y}^g_i, y_i)\}_{(x_i, y_i) \in \mathcal{B}} \;\; \forall k, \tag{6.8}$$

where loss values are normalized based on the min-max scaling strategy. Then, we
divide $\mathcal{B}$ into the set $\mathcal{W}$ of noisy samples and the set $\mathcal{C}$ of clean samples, where
$\mathcal{B} = \mathcal{W} \cup \mathcal{C}$; $\mathcal{W} \cap \mathcal{C} = \emptyset$, as follows:

$$\mathcal{W} = \{(x_i, y_i) \,|\, (x_i, y_i) \in \mathcal{B} \wedge \mathcal{L}(\hat{y}^d_i, y_i) - \mathcal{L}(\hat{y}^g_i, y_i) > a_\lambda\} \tag{6.9}$$

$$\mathcal{C} = \{(x_i, y_i) \,|\, (x_i, y_i) \in \mathcal{B} \wedge \mathcal{L}(\hat{y}^d_i, y_i) - \mathcal{L}(\hat{y}^g_i, y_i) \leq a_\lambda\}. \tag{6.10}$$

$\mathcal{W}$ includes the samples from $\mathcal{B}$ associated with the $\lambda \in \{1, 2, \ldots, |\mathcal{B}|\}$ largest
elements of $\mathcal{A}$ (i.e., the $\lambda$ highest loss value differences), while $\mathcal{C}$ includes the rest of the
samples from $\mathcal{B}$. $\lambda$ is a hyper-parameter of the proposed GRID approach.
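The detection procedure can be illustrated with a short NumPy sketch. It follows the textual description (the $\lambda$ samples with the largest differences of min-max normalized loss values are marked as noisy); the function names, the per-head normalization over the mini-batch and the toy loss values are assumptions of this example, and ties are broken by sample order.

```python
import numpy as np


def min_max(values):
    """Min-max scale a 1-D array to [0, 1]; constant arrays map to zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)


def detect_noisy(loss_d, loss_g, k_percent):
    """Return a boolean mask that marks the lambda = k|B|/100 samples with the
    largest (discriminative - generative) normalized loss differences."""
    diff = min_max(loss_d) - min_max(loss_g)
    lam = int(round(k_percent * len(diff) / 100))
    order = np.argsort(-diff, kind="stable")  # largest differences first
    noisy = np.zeros(len(diff), dtype=bool)
    noisy[order[:lam]] = True
    return noisy


# Hypothetical per-sample losses of a mini-batch with five samples.
loss_d = np.array([0.1, 2.0, 0.2, 1.5, 0.3])  # discriminative task head
loss_g = np.array([0.2, 0.5, 0.3, 0.4, 1.0])  # generative task head
noisy = detect_noisy(loss_d, loss_g, k_percent=40)
```

Here the second and fourth samples have a much higher discriminative loss than generative loss, so they form the noisy set, and the remaining three samples form the clean set.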

To learn the model parameters associated with discriminative and generative reasoning,
one could directly apply optimization to jointly minimize $\mathcal{O}_g(\mathcal{B})$ and $\mathcal{O}_d(\mathcal{B})$. This
leads to optimization of the objectives for all samples in $\mathcal{B}$ based on both generative
and discriminative reasoning. When it is applied to the parameters $\theta$ of the CNN
backbone, it can limit the effectiveness of generative reasoning for noisy
samples and that of discriminative reasoning for clean samples due to interference of
different learning characteristics. Accordingly, the model parameters $\theta$ are updated
based on whether a sample is assigned to $\mathcal{W}$. The update rule for $\theta$
is written as follows:

$$\theta \leftarrow \theta - \eta \nabla_\theta \frac{|\mathcal{W}|\,\mathcal{O}_g(\mathcal{W}) + |\mathcal{C}|\,\mathcal{O}_d(\mathcal{C})}{|\mathcal{B}|}, \tag{6.11}$$

where $\eta$ is the learning rate. It is noted that we define the variational generative
process based on the image descriptors. Accordingly, for the backbone parameters,
the first and the third terms of the ELBO are assumed to be 0 (see (6.3)). Then, the
update rule can be written based on only $\mathcal{L}$ as follows:

$$\theta \leftarrow \theta - \eta \nabla_\theta \frac{1}{|\mathcal{B}|} \Big[ \sum_{(x_i, y_i) \in \mathcal{W}} \mathcal{L}(\hat{y}^g_i, y_i) + \sum_{(x_i, y_i) \in \mathcal{C}} \mathcal{L}(\hat{y}^d_i, y_i) \Big]. \tag{6.12}$$

Based on this update rule, the CNN backbone parameters are updated only to minimize
$\mathcal{L}$, whose values are obtained from the generative task head for noisy samples and
the discriminative task head for clean samples. Accordingly, RS image representations
are learned based on: i) the generative reasoning for noisy training samples; and
ii) discriminative reasoning for clean samples. However, for the remaining model
parameters, it is important to maintain the characteristics of discriminative and
generative reasoning throughout the training. Accordingly, discriminative task head
parameters $\gamma$ are updated based on $\mathcal{O}_d(\mathcal{B})$, while the VAE parameters $\beta$ are updated
based on $\mathcal{O}_g(\mathcal{B})$ as follows:

$$\gamma \leftarrow \gamma - \eta \nabla_\gamma \mathcal{O}_d(\mathcal{B})$$
$$\beta \leftarrow \beta - \eta \nabla_\beta \mathcal{O}_g(\mathcal{B}). \tag{6.13}$$
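The quantity that drives the backbone update in (6.12) can be sketched as follows. This is a minimal NumPy illustration rather than a training loop: the function name `backbone_loss`, the toy loss values and the noisy mask are assumptions of this example, and in practice the value would be computed inside an automatic differentiation framework so that its gradient reaches only the backbone parameters.

```python
import numpy as np


def backbone_loss(loss_d, loss_g, noisy):
    """Scalar minimized by the backbone in (6.12): the generative-head loss
    for noisy samples and the discriminative-head loss for clean samples,
    averaged over the whole mini-batch."""
    per_sample = np.where(noisy, loss_g, loss_d)
    return float(per_sample.sum() / len(per_sample))


# Hypothetical per-sample losses and noisy/clean assignment for |B| = 5.
loss_d = np.array([0.1, 2.0, 0.2, 1.5, 0.3])  # discriminative task head
loss_g = np.array([0.2, 0.5, 0.3, 0.4, 1.0])  # generative task head
noisy = np.array([False, True, False, True, False])
value = backbone_loss(loss_d, loss_g, noisy)
```

The high discriminative losses of the two noisy samples (2.0 and 1.5) do not enter this value, which is how the backbone avoids fitting their labels through the discriminative head; the task heads themselves are still updated on the whole mini-batch as in (6.13).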

Due to the automatic detection of training samples associated with noisy annotations
and learning RS image representation space (characterized by the CNN backbone),
the proposed GRID approach leverages the effectiveness of both discriminative and
generative reasoning. This leads to learning RS image representations robust to
label noise without overfitting on noisy labels as in discriminative learning. It is
worth mentioning that the proposed GRID approach is independent of the DNN
architecture, loss function $\mathcal{L}$, learning task, annotation type being considered and the
type of label noise present in training data. In this chapter, we assess our approach
under two scenarios, where training RS images are annotated with: 1) scene-level
noisy multi-labels; and 2) pixel-level noisy labels. Under these scenarios, we consider
three learning tasks with the corresponding loss functions and DNN architectures
(see Section 6.4 for the details).

6.4 Dataset Description and Experimental Design

6.4.1 Dataset Description

We conducted experiments on the BigEarthNet-S2 and the DLRSD [203] RS image

archives. We employed a subset of BigEarthNet-S2 that includes images acquired

over Serbia in summer. It consists of 14,832 Sentinel-2 multispectral images. Each
image is a section of: i) 120 × 120 pixels for 10m bands; ii) 60 × 60 pixels for 20m bands;
and iii) 20 × 20 pixels for 60m bands. It is noted that bicubic interpolation is applied to
20m bands, while 60m bands are excluded from the experiments. For the experiments,

we utilized the 19 class nomenclature of BigEarthNet-S2. We also extracted the CLC

land cover map of each image for the selection of

(which requires the availability of

land-cover maps during training). The DLRSD archive includes 2,100 aerial images.
Each image has the size of 256 × 256 pixels with a spatial resolution of 30 cm, and
is annotated with both multi-labels and pixel-level labels, where the class nomenclature
is defined in [93]. For the experiments, these archives were divided into training,
validation and test sets with the ratios of 70%, 10% and 20% for DLRSD and 52%, 24%
and 24% for BigEarthNet-S2.

6.4.2 Experimental Design

To conduct experiments, we considered two different scenarios, where training

images are annotated with: 1) scene-level noisy multi-labels; and 2) pixel-level noisy

labels. For these scenarios, we tested our approach under three learning tasks with

their corresponding loss functions and DNN architectures that are explained in detail

in the following.


In the first scenario, IRL is achieved based on supervised multi-label RS image
classification. For this scenario, the binary cross entropy (BCE) loss function was chosen
as the loss function $\mathcal{L}$ of the proposed GRID approach. Accordingly, each of the
generative and discriminative task heads includes an FC layer as a classifier that produces
multi-label class probabilities. The proposed approach applied to this scenario is denoted
as GRID (BCE) hereafter.

In the second scenario, IRL is achieved by: 1) semantic segmentation for land

cover map generation based on pixel-wise cross entropy loss function (denoted

as GRID (PCE) hereafter); and 2) multi-label co-occurrence prediction based on

region representation learning (RRL) loss function introduced in [79] (denoted as

GRID (RRL) hereafter). For GRID (PCE), each of the generative and discriminative

task heads consists of three transposed convolutional layers with the ﬁlters of 64, 32

and the number of considered classes. For GRID (RRL), each of the generative and

discriminative task heads includes an FC layer that produces the prediction for graph

driven region-based representations. The reader is referred to [79] for the details.

For both scenarios, we employed the DenseNet-121 architecture [148] as the CNN
backbone, and utilized the latent dimension of 128 for the VAE encoder. The feature
decoder of VAE employs an FC layer with the hidden unit size of image descriptor
dimension (which is 1024 for DenseNet-121) for GRID (BCE) and GRID (RRL), while
the FC layer is replaced with a convolutional layer with the kernel size of 1 × 1 for
GRID (PCE). The parameter $\lambda = k|\mathcal{B}|/100$ was varied as $k \in \{10, 20, \ldots, 90\}$, where $k$
shows the percentage of each mini-batch that is identified as the set of noisy samples
(denoted as $\lambda_{k\%}$ $\forall k$ hereafter).

To assess the robustness of our approach to label noise for both scenarios, we applied
synthetic label noise injection to the training sets in the range of [10%, 60%] with
the step size of 10%. In particular, for scene-level annotations, a set of class labels
is randomly chosen from the training label set $\mathcal{Y}$ based on a synthetic label noise
injection ratio (SLNIR). Then, each selected class label is randomly changed into one
of the other class labels, which are not associated with the corresponding image. This
ensures that both missing and wrong labels are considered as noisy annotations. For
pixel-level annotations, $\mathcal{Y}$ is converted to the set of unique class labels associated
with each image prior to random selection based on SLNIR. Then, changed classes
are reflected to all relevant pixel labels.
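One possible reading of the scene-level injection procedure is sketched below in NumPy. It is an interpretation under stated assumptions, not the exact protocol of the experiments: labels are stored as a binary image-by-class matrix, SLNIR is applied to the pool of all assigned (image, class) pairs, and each selected label is replaced by a class that was originally absent from the same image. The function name and the toy label matrix are hypothetical.

```python
import numpy as np


def inject_multilabel_noise(labels, slnir, rng):
    """Replace a fraction `slnir` of the assigned labels with wrong ones.

    labels : (N, C) binary matrix of scene-level multi-labels
    Each selected label is removed (missing label) and a class that was not
    associated with the image is added instead (wrong label).
    """
    noisy = labels.copy()
    assigned = np.argwhere(labels == 1)  # all (image, class) pairs
    n_change = int(round(slnir * len(assigned)))
    picked = assigned[rng.choice(len(assigned), size=n_change, replace=False)]
    for i, c in picked:
        # classes absent from the original labels and not yet added
        absent = np.flatnonzero((labels[i] == 0) & (noisy[i] == 0))
        if absent.size == 0:
            continue  # no class left to swap into; keep this label
        noisy[i, c] = 0
        noisy[i, rng.choice(absent)] = 1
    return noisy


# Hypothetical toy archive: 4 images, 6 classes, 2 labels per image.
labels = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
])
rng = np.random.default_rng(0)
noisy_labels = inject_multilabel_noise(labels, slnir=0.5, rng=rng)
```

With this reading, the number of labels per image is preserved, and every changed label produces one missing and one wrong annotation.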

We conducted experiments related to all scenarios and learning tasks on a single RS

application for the sake of simplicity. This application was selected as content-based

image retrieval (CBIR) since learning accurate image features is of great importance

for similarity matching in CBIR. To apply CBIR after learning RS image represen-

tations, for each archive, we employed the training set for selecting query images,

while images were retrieved from the test set. We performed the hyper-parameter

selection of our approach on the validation set in the context of CBIR. We trained our

approach for 100 epochs by using the Adam gradient descent optimization algorithm
with the initial learning rate of $10^{-3}$ and the mini-batch size of 128. After RS IRL is
achieved by the proposed approach, we obtained the features of query and archive
images from the last layer of the backbone. Then, to apply CBIR, similarity matching
of these features was performed by using the $\chi^2$-distance measure. CBIR results
are provided in terms of normalized discounted cumulative gains (NDCG), which
was averaged on the 20 and 30 most similar images for DLRSD and BigEarthNet-S2,
respectively.

TABLE 6.1: RESULTS (%) OBTAINED BY THE PROPOSED GRID (BCE) APPROACH FOR
DIFFERENT VALUES OF λ AND SLNIR (%) (DLRSD ARCHIVE)

SLNIR  λ10%  λ20%  λ30%  λ40%  λ50%  λ60%  λ70%  λ80%  λ90%
0      66.4  65.9  64.5  64.4  64.5  57.9  59.3  60.4  58.5
10     63.1  64.2  62.9  63.6  60.7  60.2  56.2  60.2  57.8
20     61.6  60.2  64.5  62.5  59.9  60.8  58.1  55.7  54.3
30     57.0  56.0  56.8  57.4  57.5  59.2  54.9  58.9  55.0
40     51.9  54.6  57.5  54.7  56.1  55.5  54.6  50.0  55.7
50     51.6  51.1  51.1  54.0  53.3  54.3  50.8  47.8  52.7
60     49.3  48.7  47.9  50.7  51.4  51.2  48.2  49.0  47.2
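The retrieval and evaluation steps described above can be sketched in NumPy as follows. Several details are assumptions of this example: the $\chi^2$-distance is taken in the common form $\frac{1}{2}\sum_j (a_j - b_j)^2 / (a_j + b_j)$ for non-negative features, the relevance of a retrieved image is taken as the number of labels it shares with the query, and NDCG is computed with a $\log_2$ discount. The function names and toy vectors are hypothetical.

```python
import numpy as np


def chi2_distance(query, archive, eps=1e-10):
    """Chi-square distance between one query and each archive feature row."""
    num = (query[None, :] - archive) ** 2
    den = query[None, :] + archive + eps  # eps avoids division by zero
    return 0.5 * np.sum(num / den, axis=1)


def ndcg_at_k(relevance_ranked, k):
    """NDCG over the k top-ranked images, given relevance in ranked order."""
    rel = np.asarray(relevance_ranked, dtype=float)
    top = rel[:k]
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    dcg = np.sum(top * discounts)
    ideal = np.sort(rel)[::-1][:k]  # best possible ordering
    idcg = np.sum(ideal * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0


# Hypothetical features and multi-labels for one query and three archive images.
query = np.array([1.0, 0.0, 2.0])
archive = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]])
query_labels = np.array([1, 1, 0])
archive_labels = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])

order = np.argsort(chi2_distance(query, archive))  # most similar first
relevance = (archive_labels[order] * query_labels).sum(axis=1)
score = ndcg_at_k(relevance, k=3)
```

In this toy case the ranking by distance coincides with the ranking by shared labels, so the NDCG reaches its maximum value of 1.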

For the two above-mentioned scenarios, we carried out experiments to: 1) perform a
sensitivity analysis; 2) conduct an ablation study; and 3) compare our approach with
the state-of-the-art methods in the framework of CBIR. Under the first scenario, we
compared our GRID (BCE) approach with: 1) the early-learning regularization (ELR)
framework [196]; 2) the joint training with co-regularization (JoCoR) approach [197];
and RS IRL with multi-label classification by using 3) focal loss (denoted as FL) [195];
4) asymmetric loss (denoted as ASL) [198]; and 5) the standard binary cross entropy
(BCE) loss. It is worth noting that ELR, JoCoR and FL were originally introduced for
single-label classification problems. By following [191], we adapted them to multi-label
classification. Under the second scenario, we compared our GRID (PCE) approach
with: 1) the high-resolution land cover mapping through learning with noise
correction method [31] (denoted as LNC); and 2) RS IRL with semantic segmentation
by the standard pixel-wise cross-entropy loss (PCE). For the second scenario, we also
compared our GRID (RRL) approach with RS IRL with multi-label co-occurrence
prediction based on RRL loss [79] (denoted as RRL). For each comparison with our
approach, we used the same CNN backbone and the same task heads.

6.5 Experimental Results

6.5.1 Sensitivity Analysis of the Proposed Approach

In this sub-section, we present the results of the sensitivity analysis of the proposed
approach under scene-level noisy labels (i.e., first scenario) and pixel-level noisy
labels (i.e., second scenario) in terms of different values of the $\lambda$ hyper-parameter at
different values of SLNIR. We also assessed the effectiveness of our automatic noisy
sample detection procedure for both scenarios in terms of the noisy sample detection
accuracy.

1st Scenario (Scene-Level Noisy Labels): Tables 6.1 and 6.2 show the results of
GRID (BCE) for the DLRSD and BigEarthNet-S2 archives, respectively. One can see
from Table 6.1 that when the level of training label noise increases, our approach
generally achieves higher scores by detecting more training samples as noisy with
higher values of $\lambda$ for the DLRSD archive. However, as can be seen from Table 6.2,
our GRID (BCE) approach achieves the highest scores when 20% of each mini-batch
is identified as noisy ($\lambda_{20\%}$) for most of the SLNIR values under BigEarthNet-S2. It is
worth noting that BigEarthNet-S2 includes a higher number of RS images compared
to DLRSD, and thus there is a lower risk of overfitting to noisy labels. Accordingly,
when the training set is sufficiently large, as in BigEarthNet-S2, our
approach is capable of achieving a high performance with lower values of $\lambda$ even
under a high label noise rate. However, when the rate of label noise in a training set
is high for a small dataset like DLRSD, our approach requires increasing the effect
of generative reasoning through detecting a higher number of noisy samples (i.e., a
high value of $\lambda$) for more accurate IRL. By considering that there is not a single $\lambda$
value that provides the highest scores under all SLNIR values for DLRSD, we set it
based on the results on BigEarthNet-S2. Accordingly, for the rest of the experiments,
we set $\lambda$ of GRID (BCE) to $\lambda_{20\%}$.

TABLE 6.2: RESULTS (%) OBTAINED BY THE PROPOSED GRID (BCE) APPROACH FOR
DIFFERENT VALUES OF λ AND SLNIR (%) (BIGEARTHNET-S2 ARCHIVE)

SLNIR  λ10%  λ20%  λ30%  λ40%  λ50%  λ60%  λ70%  λ80%  λ90%
0      67.6  67.3  66.6  66.4  65.3  63.7  63.7  61.9  59.9
10     66.4  67.9  67.2  66.4  65.1  64.7  63.2  62.8  62.5
20     65.4  65.9  65.5  65.1  64.0  61.4  61.6  61.9  61.3
30     64.9  65.2  64.6  63.7  61.5  60.6  62.0  58.6  59.2
40     63.8  64.4  63.3  62.3  60.1  59.8  60.0  59.3  59.9
50     63.1  62.1  62.4  62.0  58.3  60.9  59.5  57.8  56.3
60     61.5  62.0  61.8  61.2  59.4  58.7  58.5  58.9  58.6

FIGURE 6.2: Noisy sample detection accuracy of the proposed GRID (BCE) approach versus
epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set
as equal to the SLNIR value (DLRSD archive). Each panel plots Accuracy (%) against Epoch
(0 to 100); legend: λ10%, λ20%, λ30%, λ40%, λ50%, λ60%, Random Selection.

We would like to note that if the value of $\lambda$ is high, there is a risk of detecting training
samples with correct labels (i.e., clean samples) as noisy samples. To analyze the
effectiveness of our automatic noisy sample detection procedure, Figures 6.2 and 6.3
show the noisy sample detection accuracy when $k$ for $\lambda_{k\%}$ is set as equal to the SLNIR
value (e.g., $\lambda_{20\%}$ for SLNIR = 20%) for DLRSD and BigEarthNet-S2, respectively,
under the first scenario. One can observe from the figures that our approach detects
noisy samples more accurately than random selection under each SLNIR value. This
shows the effectiveness of our automatic noisy sample detection procedure in the
proposed approach. It can also be seen from the figures that after a certain number of
training epochs, noisy sample detection accuracy starts to decrease for most of the
SLNIR values. This is due to the fact that as the proposed approach combines generative
and discriminative reasoning during training, the image representation space encoded by
the CNN backbone starts to become robust to noisy samples. Then, for our approach,
detecting noisy samples based on the image features from the backbone becomes
increasingly hard as training continues. This leads to a decrease in noisy sample
detection accuracy after a certain number of epochs. In greater detail, our approach
trained on BigEarthNet-S2 provides higher detection accuracy compared to that on
DLRSD, especially for higher SLNIR values. This is due to the higher number of training
samples in BigEarthNet-S2 compared to DLRSD, which allows our approach to learn
model parameters and to detect noisy samples more accurately.

FIGURE 6.3: Noisy sample detection accuracy of the proposed GRID (BCE) approach versus
epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set
as equal to the SLNIR value (BigEarthNet-S2 archive). Each panel plots Accuracy (%) against
Epoch (0 to 100); legend: λ10%, λ20%, λ30%, λ40%, λ50%, λ60%, Random Selection.

TABLE 6.3: RESULTS (%) OBTAINED BY THE PROPOSED GRID (PCE) APPROACH FOR
DIFFERENT VALUES OF λ AND SLNIR (%) (DLRSD ARCHIVE)

SLNIR  λ10%  λ20%  λ30%  λ40%  λ50%  λ60%  λ70%  λ80%  λ90%
0      65.8  63.5  63.3  62.2  59.5  63.3  62.5  61.1  61.5
10     62.7  61.2  61.6  58.1  60.1  57.2  59.1  60.0  59.2
20     59.7  59.8  59.5  58.3  55.0  57.9  56.1  55.6  57.4
30     59.0  57.1  56.5  57.5  53.6  53.0  55.5  56.1  55.0
40     56.1  55.9  56.5  56.2  55.1  53.4  55.0  54.0  52.6
50     53.9  52.1  52.2  54.4  50.5  52.1  49.7  51.4  51.9
60     49.4  49.1  46.7  44.4  47.5  47.4  48.2  50.8  47.1

2nd Scenario (Pixel-Level Noisy Labels): Tables 6.3 and 6.4 show the results of
GRID (PCE) for the DLRSD and BigEarthNet-S2 archives, respectively. By assessing
the tables, one can observe that as the SLNIR value increases, the proposed
approach achieves higher scores with higher values of $\lambda$ for DLRSD. However,
for BigEarthNet-S2, the proposed GRID (PCE) approach achieves the highest
scores when $\lambda$ is set to $\lambda_{10\%}$ for most of the SLNIR values. This is in line with our
conclusion from the first scenario. In greater detail, for most of the SLNIR values,
GRID (PCE) achieves higher scores with lower values of $\lambda$ compared to
GRID (BCE) for both archives. This is due to the fact that the semantic segmentation
task of the 2nd scenario is more complex than the multi-label image classification task.
Accordingly, our GRID (PCE) approach requires increasing the effect of discriminative
reasoning over generative reasoning compared to GRID (BCE) to overcome the
complexity of the semantic segmentation task. This can be achieved by decreasing
the value of $\lambda$, as can be seen from the results. For the rest of the experiments, we
set $\lambda$ of GRID (PCE) to $\lambda_{10\%}$ based on the BigEarthNet-S2 results, similar to the first
scenario.

TABLE 6.4: RESULTS (%) OBTAINED BY THE PROPOSED GRID (PCE) APPROACH FOR
DIFFERENT VALUES OF λ AND SLNIR (%) (BIGEARTHNET-S2 ARCHIVE)

SLNIR  λ10%  λ20%  λ30%  λ40%  λ50%  λ60%  λ70%  λ80%  λ90%
0      64.7  64.7  64.6  64.1  63.8  62.5  61.0  61.5  61.2
10     63.7  63.4  62.9  62.6  60.7  60.2  60.7  61.0  58.8
20     62.1  60.7  61.8  60.4  60.3  55.8  55.8  55.6  61.2
30     61.8  62.0  60.2  60.6  59.2  54.2  60.4  60.2  57.8
40     60.6  60.0  59.9  58.7  57.8  53.8  55.0  53.9  53.5
50     59.4  59.1  58.3  57.0  52.8  53.9  55.7  56.8  52.6
60     59.8  58.6  52.0  54.5  53.2  53.5  52.7  54.2  54.5

FIGURE 6.4: Noisy sample detection accuracy of the proposed GRID (PCE) approach versus
epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set
as equal to the SLNIR value (DLRSD archive). Each panel plots Accuracy (%) against Epoch
(0 to 100); legend: λ10%, λ20%, λ30%, λ40%, λ50%, λ60%, Random Selection.

Figures 6.4 and 6.5 show the noisy sample detection accuracy of the proposed
GRID (PCE) approach for DLRSD and BigEarthNet-S2, respectively, when $k$ for $\lambda_{k\%}$
is set as equal to the SLNIR value under the second scenario. One can see from the
figures that GRID (PCE) is capable of detecting noisy samples with higher accuracy
than random selection under each SLNIR value for both archives. In particular,
the proposed approach achieves higher detection accuracy on BigEarthNet-S2 than
on DLRSD. These observations follow our conclusion from the first scenario. This shows
that our approach is capable of accurately detecting noisy samples independently of the
considered loss function, learning task, DNN and training sample annotation type. In
greater detail, unlike the first scenario, after a certain number of training epochs the noisy
sample detection accuracy of GRID (PCE) becomes non-decreasing for some SLNIR
values. This is due to the relative complexity of the semantic segmentation task compared
to the multi-label image classification task, which may require more training epochs for our
approach, especially under high SLNIR values. Since the label noise rate of a training set
is assumed to be unknown for our approach, we avoided over-parameterization of
hyper-parameters such as the number of training epochs. It is noted that the results for
the sensitivity analysis of the 2nd scenario were also confirmed through experiments
for our GRID (RRL) approach on both archives (not reported for space constraints).

FIGURE 6.5: Noisy sample detection accuracy of the proposed GRID (PCE) approach versus
epoch when SLNIR is (a) 10%, (b) 20%, (c) 30%, (d) 40%, (e) 50%, (f) 60%; and $k$ for $\lambda_{k\%}$ is set
as equal to the SLNIR value (BigEarthNet-S2 archive). Each panel plots Accuracy (%) against
Epoch (0 to 100); legend: λ10%, λ20%, λ30%, λ40%, λ50%, λ60%, Random Selection.

6.5.2 Ablation Study of the Proposed Approach

In this sub-section, we present an ablation study of our approach to analyze the

effectiveness of our label noise robust hybrid representation learning compared to

using: i) only discriminative reasoning; ii) only generative reasoning; and iii) their

standard joint learning under both first and second scenarios. For the standard joint
learning of discriminative and generative reasoning, we jointly minimize $\mathcal{O}_g$ and $\mathcal{O}_d$
for all the training samples without the detection of noisy samples. This leads to
optimization of all the model parameters based on both generative and discriminative
reasoning of noisy samples and clean samples together. Figure 6.6 shows the
results of using: i) discriminative reasoning; ii) generative reasoning; iii) the standard
joint learning of discriminative and generative reasoning; and iv) our label noise
robust hybrid representation learning strategy under different SLNIR values for both
archives. By assessing the figure, one can observe that our label noise robust hybrid
representation learning strategy provides the highest scores for most of the SLNIR
values independently of the considered scenarios. This shows that our approach is
capable of: i) accurately combining generative and discriminative reasoning independently
of the considered loss function, learning task and type of annotation; and ii)
effectively adjusting the whole learning procedure accordingly for label noise robust
IRL. In greater detail, generative reasoning achieves lower scores than discriminative
reasoning under most of the SLNIR values and considered scenarios.
However, its performance is less affected by the increase in label noise rate compared
to discriminative reasoning. This shows the capability of generative reasoning to
allow robust learning of image representations under label noise. One can see from
the figure that the standard joint learning provides lower scores compared to using
only discriminative reasoning for most of the SLNIR values under both scenarios.
Learning image representations based on discriminative and generative reasoning
on all the training samples may not be accurately achieved due to interference of
different learning characteristics. However, when the complementary characteristics
of discriminative and generative reasoning are modeled based on our hybrid
representation learning strategy, the proposed approach is capable of overcoming this
limitation. This shows the importance of the label noise robust hybrid representation
learning strategy in our approach.

FIGURE 6.6: Results obtained by using: 1) discriminative reasoning; 2) generative reasoning;
3) their standard joint learning; and 4) our label noise robust hybrid representation
learning strategy for different values of SLNIR when RS IRL is achieved by: i) multi-label
classification on (a) DLRSD and (b) BigEarthNet-S2; ii) semantic segmentation on (c) DLRSD
and (d) BigEarthNet-S2; and iii) multi-label co-occurrence prediction on (e) DLRSD and (f)
BigEarthNet-S2. Each panel plots NDCG (%) against SLNIR (%) (0 to 60); legend:
Discriminative reasoning, Generative reasoning, The standard joint learning, Our hybrid
representation learning.


TABLE 6.5: RESULTS (%) OBTAINED BY BCE, ELR [196], FL [195], ASL [198], JOCOR [197]

AND THE PROPOSED GRID (BCE) APPROACH UNDER DIFFERENT VALUES OF SLNIR (%)

(DLRSD ARCHIVE)

SLNIR BCE ELR FL ASL JOCOR GRID (BCE)

0 62.7 63.8 62.8 63.5 61.7 67.2

10 60.3 62.0 58.7 57.2 61.7 64.2

20 55.9 59.2 55.3 51.4 59.3 62.5

30 55.6 54.4 52.6 50.6 55.5 62.2

40 50.1 50.9 48.6 46.3 51.8 55.7

50 48.3 50.0 46.4 43.7 49.1 53.8

60 47.1 47.2 46.1 43.6 47.9 50.6

TABLE 6.6: RESULTS (%) OBTAINED BY BCE, ELR [196], FL [195], ASL [198], JOCOR [197]

AND THE PROPOSED GRID (BCE) APPROACH UNDER DIFFERENT VALUES OF SLNIR (%)

(BIGEARTHNET-S2 ARCHIVE)

SLNIR BCE ELR FL ASL JOCOR GRID (BCE)

0 67.6 68.9 66.2 65.6 66.6 68.2

10 65.7 66.6 64.1 63.6 66.4 66.7

20 63.3 64.2 63.2 62.7 64.5 65.7

30 62.6 63.1 61.6 62.3 62.9 63.4

40 61.6 61.9 61.3 61.7 62.1 63.3

50 60.2 60.4 59.8 60.4 61.1 61.6

60 59.6 59.8 59.9 60.1 60.2 60.3

6.5.3 Comparison Among the State-of-the-Art Methods

In this sub-section, we analyze the effectiveness of the proposed approach compared

to different state-of-the-art methods for both scenarios under different values of

SLNIR.

1st Scenario (Scene-Level Noisy Labels): We compared our GRID (BCE) approach

with BCE, ELR [196], FL [195], ASL [198] and JoCoR [197] for the ﬁrst scenario.

Tables 6.5 and 6.6 show the corresponding results for DLRSD and BigEarthNet-S2

archives, respectively. By analyzing the tables, one can observe that the proposed

GRID (BCE) approach leads to the highest scores for almost all the SLNIR values

on both DLRSD and BigEarthNet-S2 archives. For example, our approach outperforms

ELR by almost 8% in NDCG score when SLNIR = 30% for DLRSD. In addition, it

provides more than 3% higher NDCG score compared to ASL when SLNIR = 10%

for BigEarthNet-S2. As the SLNIR value increases, the reduction in the NDCG scores is

higher for DLRSD compared to BigEarthNet-S2. This is due to the small number of

images present in DLRSD, which leads to overfitting on noisy labels more easily than

in BigEarthNet-S2. However, even under high SLNIR values for DLRSD, our approach

achieves results comparable to those of other methods under smaller SLNIR values. As

an example, our approach under SLNIR = 40% achieves performance similar to that of

BCE under SLNIR = 30%. These results demonstrate the success of the proposed

GRID (BCE) approach compared to other methods when the training images are

annotated with scene-level noisy multi-labels.
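
The margins quoted for the first scenario can be recomputed directly from the table entries. The following Python sketch is illustrative only: the dictionary layout and the helper names (`margin`, `drop`) are assumptions introduced here, and only the scores needed for the quoted comparisons are copied from Tables 6.5 and 6.6.

```python
# NDCG scores (%) copied from Tables 6.5 (DLRSD) and 6.6 (BigEarthNet-S2).
# Only the methods needed for the comparisons in the text are included.
# Outer keys are method names, inner keys are SLNIR values (%).
DLRSD = {
    "BCE":  {0: 62.7, 10: 60.3, 20: 55.9, 30: 55.6, 40: 50.1, 50: 48.3, 60: 47.1},
    "ELR":  {0: 63.8, 10: 62.0, 20: 59.2, 30: 54.4, 40: 50.9, 50: 50.0, 60: 47.2},
    "GRID": {0: 67.2, 10: 64.2, 20: 62.5, 30: 62.2, 40: 55.7, 50: 53.8, 60: 50.6},
}
BIGEARTHNET_S2 = {
    "ASL":  {0: 65.6, 10: 63.6, 20: 62.7, 30: 62.3, 40: 61.7, 50: 60.4, 60: 60.1},
    "GRID": {0: 68.2, 10: 66.7, 20: 65.7, 30: 63.4, 40: 63.3, 50: 61.6, 60: 60.3},
}


def margin(table, method_a, method_b, slnir):
    """Score of method_a minus score of method_b at one SLNIR value."""
    return round(table[method_a][slnir] - table[method_b][slnir], 1)


def drop(table, method, low=0, high=60):
    """Decrease in score when SLNIR grows from `low` to `high`."""
    return round(table[method][low] - table[method][high], 1)


print("GRID vs. ELR, DLRSD, SLNIR = 30%:", margin(DLRSD, "GRID", "ELR", 30))
print("GRID vs. ASL, BigEarthNet-S2, SLNIR = 10%:",
      margin(BIGEARTHNET_S2, "GRID", "ASL", 10))
print("GRID drop from 0% to 60%, DLRSD:", drop(DLRSD, "GRID"))
print("GRID drop from 0% to 60%, BigEarthNet-S2:", drop(BIGEARTHNET_S2, "GRID"))
```

The two `drop` values make the stronger degradation on DLRSD visible in one number per archive, which is the behaviour attributed above to the small size of DLRSD.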

2nd Scenario (Pixel-Level Noisy Labels): We compared our GRID (PCE) approach

with PCE and LNC [31], while GRID (RRL) was compared with RRL [79] for the

second scenario. Tables 6.7 and 6.8 show the corresponding results for DLRSD

TABLE 6.7: RESULTS (%) OBTAINED BY PCE, LNC [31], RRL [79] AND THE PROPOSED
GRID (PCE) AND GRID (RRL) APPROACHES UNDER DIFFERENT VALUES OF SLNIR (%)
(DLRSD ARCHIVE)

SLNIR   PCE    LNC    GRID (PCE)   RRL    GRID (RRL)
0       65.0   62.5   64.0         57.5   58.1
10      60.0   60.9   62.1         52.2   52.7
20      59.1   60.8   61.1         51.8   54.9
30      56.8   57.5   57.7         48.8   52.9
40      55.8   55.2   56.1         43.6   53.3
50      53.0   53.6   52.6         44.5   51.8
60      48.3   48.2   48.3         45.1   47.7

TABLE 6.8: RESULTS (%) OBTAINED BY PCE, LNC [31], RRL [79] AND THE PROPOSED
GRID (PCE) AND GRID (RRL) APPROACHES UNDER DIFFERENT VALUES OF SLNIR (%)
(BIGEARTHNET-S2 ARCHIVE)

SLNIR   PCE    LNC    GRID (PCE)   RRL    GRID (RRL)
0       63.5   62.5   64.9         62.4   63.8
10      61.8   61.7   61.8         60.1   62.5
20      61.2   61.3   61.8         58.8   62.1
30      61.1   61.2   61.5         59.3   61.1
40      59.9   60.0   60.0         58.5   61.2
50      58.9   58.8   59.0         57.4   60.5
60      58.2   58.3   58.6         57.5   59.9

and BigEarthNet-S2 archives, respectively. One can see from the tables that both

GRID (PCE) and GRID (RRL) achieve the highest scores compared to other methods

under most of the SLNIR values. As an example, when SLNIR = 10% for DLRSD,

GRID (PCE) achieves more than 1% higher NDCG score compared to LNC, which

is specifically designed for pixel-wise label noise robust semantic segmentation of

RS images. Even when synthetic pixel-wise label noise is not injected into the training

sets (SLNIR = 0%), both GRID (PCE) and GRID (RRL) are capable of providing the

highest scores for the BigEarthNet-S2 archive. This is due to the fact that even if

SLNIR = 0%, our approach learns RS image representations robust to label noise

already present in the original training sets. In greater detail, only when SLNIR

equals 50% and 0% for DLRSD is our GRID (PCE) approach outperformed by

LNC and PCE, respectively, with a 1% difference in NDCG scores. However, this is

specific to the DLRSD archive and does not hold for the BigEarthNet-S2 archive. These results

show the success of our approach compared to other methods when the training

samples are annotated with pixel-level noisy labels. This is in line with our conclusion

from the ﬁrst scenario.

It is worth noting that under two scenarios, we tested our approach with three dif-

ferent loss functions (BCE, PCE and RRL), three different learning tasks (multi-label

image classiﬁcation, semantic segmentation and multi-label co-occurrence prediction)

with the corresponding DNN architectures and two different annotation types (scene-

level and pixel-level) compared to state-of-the-art methods. The results show that

the proposed approach is capable of accurately learning RS image representations

under label noise independently from the considered DNN architecture, loss function,

learning task and annotation type. This is due to the capability of our approach to

simultaneously leverage the robustness of generative reasoning to noisy labels and

the effectiveness of discriminative reasoning for IRL.

6.6 Conclusion

In this chapter, we have introduced a novel generative reasoning integrated label

noise robust deep representation learning (GRID) approach to model the comple-

mentary characteristics of discriminative and generative reasoning for IRL under

noisy labels. To achieve this, the proposed GRID approach ﬁrst integrates generative

reasoning into discriminative reasoning through a supervised VAE as the probabilis-

tic generative process. Due to this integration, both generative and discriminative

reasoning share the same CNN backbone, which allows us to: 1) model the posterior and

joint distributions of annotated RS images in a single learning procedure; 2) auto-

matically detect training samples with noisy labels based on the loss values acquired

from discriminative and generative task heads. This is achieved by the label noise

robust hybrid representation learning strategy (which models RS images through

generative reasoning for the training samples with noisy labels and discriminative

reasoning for the remaining samples in the training data) in our approach. In this

way, the proposed GRID approach learns discriminative RS image representations

through the CNN backbone while preventing interference of noisy labels during

training.

It is worth noting that our approach is independent from the type of DNN architec-

ture, loss function, learning task, annotation being considered, label noise present in

training data, and can operate with any DL-based IRL method. In addition, GRID

does not require the availability of a clean subset of a training set. In this chapter,

we consider two different scenarios, where training samples are annotated with: 1)

scene-level noisy multi-labels; and 2) pixel-level noisy labels. Experimental analysis

conducted on two RS image archives shows the effectiveness of our approach for

these scenarios. In particular, the success of our approach is shown under three learn-

ing tasks with the corresponding loss functions and DNN architectures at different

synthetic label noise injection rates while considering both wrong and missing labels.

This shows that the proposed approach accurately learns discriminative RS image

representations, while ensuring the robustness of the whole learning procedure towards

noisy labels independently from the IRL method being considered. We underline

that this is a very important advantage for operational RS applications, which deal

with noisy annotations and require different IRL scenarios.

We would like to point out that our automatic noisy sample detection procedure is

controlled by a hyper-parameter. Its selection may depend on the level

of noisy labels in a training set, which is unknown most of the time in operational

scenarios. Accordingly, as a future development of this work, we plan to investigate

strategies for automatically detecting the level of noise in training data, and then

integrating it into our automatic noisy sample detection procedure for the proposed

approach.

Chapter 7

Plasticity-Stability Preserving

Multi-Task Image Representation

Learning in Remote Sensing

DL-based multi-task learning (MTL) methods have recently attracted attention for

IRL in RS. For a given set of tasks (e.g., scene classiﬁcation, semantic segmentation,

image reconstruction, etc.), existing MTL methods employ a joint optimization al-

gorithm on the direct aggregation of task-speciﬁc loss functions. Such an approach

may provide limited performance when: i) tasks compete or even distract each other;

ii) one of the tasks dominates the whole learning procedure; or iii) characterization

of each task is under-performed compared to single-task learning. This is mainly

due to the lack of: i) plasticity condition (which is associated with sensitivity to new

information); or ii) stability condition (which is associated with protection from radical

disruptions by new information) of the whole learning procedure. To avoid this issue,

in this chapter, we propose a novel plasticity-stability preserving multi-task learning

(PLASTA-MTL) approach to ensure the plasticity and the stability conditions of

the whole learning procedure independently from the number and type of tasks. This is

achieved by deﬁning two novel loss functions. The ﬁrst loss function is the plasticity

preserving loss (PPL) function that aims to enforce the global image representation

space to be sensitive to new information learned with each task. This is achieved by

minimizing the difference of gradient magnitudes for the global representation and

task-speciﬁc embedding spaces. The second loss function is the stability preserving

loss (SPL) function that aims to protect the global representation space from being radically

disrupted by a new task. This is achieved by minimizing the angular distances be-

tween the task gradients over global representation space. To effectively employ the

proposed loss functions, we also introduce a novel sequential optimization algorithm.

Experimental results show the effectiveness of the proposed approach compared to

the state-of-the-art MTL methods. This chapter is mainly based on the following

publication:

• G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning

for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.

Chapter 7. Plasticity-Stability Preserving Multi-Task IRL in Remote Sensing 103

7.1 Introduction

As highlighted in the previous chapter, in DL-based IRL methods, image representa-

tions are automatically learned during the optimization of an objective function based

on the characteristics of a learning task. Most of the existing methods in RS utilize

the following learning tasks: 1) scene classiﬁcation [4]–[14]; 2) similarity learning

[15]–[27]; 3) image reconstruction [28], [29]; and 4) semantic segmentation [30], [79].

Each task has a different objective, which leads to a different optimization procedure

throughout the training of the considered deep neural network (DNN). Accordingly,

learned image representations have different characteristics for different tasks, and

thus carry different information to be utilized in the ﬁnal application. As an example,

when the task is scene classiﬁcation, RS image representations can be learned with

convolutional neural networks (CNNs) by optimizing entropy-based loss functions.

In this way, image representations are encoded to separate pre-defined classes, which

maximizes inter-class distances in the image representation space. For the similarity

learning task, on the other hand, image representations are learned to discriminate

dissimilar RS images, which minimizes intra-class distances in the image representation

space [34]. This can be achieved by employing siamese CNNs on tuples of RS images

to optimize triplet or contrastive loss functions. If the task is chosen as the image

reconstruction, auto-encoder neural networks can be used ﬁrst to construct the rep-

resentations and then to recover RS images with reconstruction loss. In this way,

resulting image representations are robust to noise in RS images [204]. In RS, it is

common to use the above-mentioned tasks in the framework of single-task learning

(STL).

However, using a single task may not be sufﬁcient to describe the complex content

of RS images. To address this issue, multiple tasks can be jointly utilized for the

image representation learning. When image representation learning is achieved

based on multiple tasks, resulting latent space can better represent the complex

semantic content of RS images. Accordingly, a few DL based multi-task learning (MTL)

methods have been recently introduced in RS. As an example, in [52], RS image

similarity learning based on triplet loss is combined with the scene classiﬁcation

task. In this method, task-speciﬁc heads are combined with the CNN backbone

shared by two tasks, while the joint optimization of task-speciﬁc loss functions is

employed by minimizing the summation of them. In this way, MTL is regarded as a

joint optimization problem based on the aggregation of task-speciﬁc loss functions.

This is followed by most of the MTL methods in RS.

Due to the complexity of the MTL problem, it is common that: i) tasks may compete or

even distract each other during training; ii) one of the tasks may dominate the whole

learning procedure; or iii) characterization of each task can be under-performed

compared to STL [58]. These problems undermine the effectiveness of the whole

representation learning procedure [59]. These issues occur due to the stability-plasticity

constraint of MTL [60]. MTL methods need to be sensitive to new information

learned from each task, which allows each task to contribute to further improving

the image characterization. This condition is known as plasticity [60].

If there is a lack of plasticity in response to new information, the image

representation space will be only slightly affected while learning a new task, and thus

will barely reflect the different characteristics of representations learned via different

tasks. If the considered DNN suffers from the lack of plasticity condition, information

speciﬁc to each task will be only encoded in the corresponding task-speciﬁc head.

The possible drawbacks of this issue are twofold. First, only the general features of

RS images can be encoded in the CNN backbone, and thus image features extracted

from the considered DNN will have lower discrimination capability compared to

STL. Second, one of the tasks can dominate the global image representation space. In

this case, all tasks except the one, which dominates the image representation space

learned via the backbone, will not signiﬁcantly affect the image features. For MTL,

during the learning process of a new task, new information encoded in the considered

DNN should not radically disrupt what is already characterized based on the other

tasks. This condition is known as stability. When there is a lack of stability condition

in response to new information captured via a new task, there is a risk that previous

information encoded by the considered DNN can be forgotten. Thus, a global image

representation space will be mainly characterized based on the characteristics of

representations learned via a single task. This risk is more evident when some of the

tasks compete with each other. In this case, since every task aims to radically change the

global image representation space compared to other tasks, tasks may distract each

other, which leads to less accurate RS image characterization for MTL compared to STL.

The MTL formulation of the existing DL based methods (which is based on joint

optimization) offers limited control over the learning of each task. Thus, it does not allow

controlling the plasticity and stability of the whole learning procedure. It is also worth noting

that, in the above-mentioned MTL formulation, the whole learning procedure is sensitive

to the proper selection of the loss function weight for each task, which generally requires a

grid search (which is computationally demanding) [61]. Thus, MTL methods that

can effectively combine multiple tasks without the need for selection of loss weights

while considering the stability-plasticity problem are needed to accurately apply RS

image representation learning.

To avoid the above-mentioned problems, we propose, for the first time, a novel PLAsticity-STAbility

preserving Multi-Task Learning (PLASTA-MTL) approach. The PLASTA-MTL

approach aims to preserve: 1) the plasticity for each task; and 2) the stability

between learning consecutive tasks for the whole learning procedure, independently

from the number of tasks and the type of tasks. To this end, we introduce novel plas-

ticity preserving and stability preserving loss functions. The plasticity preserving loss

(PPL) function enforces the global image representation space (which is shared by all

the tasks) to be sensitive to new information learned with each task during training.

This is achieved by minimizing the gradient magnitude differences between global

image representation and task-speciﬁc embedding spaces. The stability preserving

loss (SPL) function protects the image representation space from being radically disrupted by

each task during training. This is achieved by minimizing the angular distances

between task gradients over the global image representation space. To effectively apply

these two loss functions, unlike most of the existing MTL methods, we also

propose a sequential optimization algorithm. The proposed algorithm aims to adaptively

adjust the interactions between task-specific learning procedures, which makes it possible

to ensure plasticity and stability conditions for all the tasks. To this end, instead of

joint optimization of all loss functions, task-speciﬁc objectives together with the PPL

function are sequentially optimized. In this algorithm, the SPL function is optimized

at the end of the task sequence for all the considered tasks.

The novelty of the proposed PLASTA-MTL approach consists in: 1) the adaptive

adjustment of interactions between task-speciﬁc learning procedures by the proposed

sequential optimization algorithm; 2) the protection of image representation space

from radical disruptions caused by each task by the proposed SPL function; and

3) the sensitivity assurance of the image representation space to new information from

each task by the proposed PPL function. Due to the proposed sequential optimization

algorithm, our PLASTA-MTL approach does not need to select loss function weights

for each task. Due to its stability and plasticity preserving capabilities, our PLASTA-

MTL approach overcomes the above-mentioned MTL problems of the joint optimization

algorithm, which are mainly conﬂicts between tasks, the dominance of one of the

tasks and under-performance of tasks compared to STL. It is worth noting that the

proposed PLASTA-MTL approach is independent from the number of considered

tasks and their types. In this chapter, we consider the different combinations of four

learning tasks: 1) supervised scene classiﬁcation; 2) supervised similarity learning;

3) supervised multi-label co-occurrence prediction; and 4) unsupervised similarity

learning. For different combinations of these learning tasks, we conduct experiments

on a single RS application for the sake of simplicity. This application is selected as

content-based image retrieval (CBIR) due to the importance of employing accurate

image features for similarity matching in CBIR.

The rest of this chapter is organized as follows. Section 7.2 provides the related

works. Section 7.3 presents the proposed PLASTA-MTL approach. Section 7.4

describes the considered datasets and the experimental setup. Section 7.5 provides

the experimental results, while Section 7.6 concludes this chapter.

7.2 Related Works

In this section, we initially present the recent advances in single-task driven IRL

methods in RS based on the considered learning tasks and then survey the existing

DL based MTL methods for RS IRL.

7.2.1 Single-Task Driven Methods

In the context of DL based single-task driven IRL, an objective function is usually

selected on the basis of the characteristics of the considered learning task, and thus

image features are automatically learned during the optimization of this objective

function. We categorize the existing methods into four groups based on the tasks that

they utilize and survey example studies for each group in the following.

Scene Classification Driven Methods: The task of scene classiﬁcation aims at auto-

matically assigning single-labels or multi-labels to image scenes. In [13], land-use

class probabilities obtained by a CNN are exploited for weighting the distance be-

tween a query image and the archive images for CBIR applications, while in [6], a

distance between an image and its land-use class is used to apply re-ranking on the

order of retrieved images. In [9], aggregated deep local features are utilized for query

sensitive CBIR on RS images. To this end, vectors of locally aggregated descriptors

obtained via multiplicative and additive attention mechanisms are used to construct

a memory vector for expanded image description. In [5], fuzzy distance calculation

based on fuzzy rules is introduced for the deﬁnition of RS image similarity, while

image descriptors are extracted from a CNN. In [7], query-adaptive feature fusion

technique is introduced to employ different hierarchical image representations from

a CNN.

Similarity Learning Driven Methods: DL based similarity learning aims to auto-

matically identify image similarity based on an image representation space, where

semantically similar images are located close to each other. In [21], a twin CNN

is introduced for the prediction of pairwise image similarity during the hash code

generation of RS images. In [15], a triplet deep metric learning network (TDMLN)

is introduced for RS image similarity learning. TDMLN utilizes three CNNs with

shared model parameters, which allows learning RS image similarity through a triplet loss

function on image triplets, each of which includes an anchor, a positive and a negative

image. TDMLN aims at learning a metric space where the distance between an anchor

and its positive image is minimized and that between the anchor and its negative

image is maximized. In [17], a Siamese graph convolutional network is proposed to

employ region adjacency graph based image descriptors for the characterization of

pairwise image similarity with contrastive loss function. In [27], RS image similarity

learning based on image triplets is utilized for hash code generation of RS images. In

[16], distribution consistency loss function is proposed in the context of deep metric

learning to make use of multiple positive and negative images for each anchor image

unlike the triplet loss function. In [22], quantized deep learning to hash approach is

introduced for efﬁcient CBIR. In this approach, DNN weights and activation func-

tions are binarized while pairwise image similarity characterization is used for hash

code generation of RS images. In [24], generative adversarial network regularization

based deep metric learning method is introduced to model pairwise image similarity

while a generative adversarial network is used to mitigate the overﬁtting problem.

In [25], a global optimization algorithm is introduced to jointly employ different

metric learning based loss functions on image representations for the consistency

between the loss reduction direction and the optimization direction. In [26], weighted

Wasserstein ordinal loss function is proposed for Siamese CNNs to formulate the

image similarity learning problem as an unsupervised deep ordinal classiﬁcation

problem. In [19], dual-anchor triplet loss function is introduced to make use of more

than one anchor for each image triplet (which is achieved by considering the positive

image as the second anchor).
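
To make the triplet-based objective used by several of the above methods concrete, the following Python sketch evaluates the widely used margin-based triplet loss on toy descriptors. It is a minimal sketch under stated assumptions: the Euclidean distance, the margin value of 1.0 and the function names are choices made here for illustration, and the methods in [15], [19] and [27] may differ in these details.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D descriptors (hypothetical values, not taken from any archive).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1 to the anchor
far_negative = [3.0, 4.0]   # distance 5 to the anchor
near_negative = [1.0, 0.0]  # distance 1 to the anchor

print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

The loss vanishes once the negative image is farther from the anchor than the positive image by at least the margin, so only triplets that violate this ordering contribute to training.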

Image Reconstruction Driven Methods: DL based image reconstruction task aims

at automatically reconstructing input images based on unsupervised image representation

learning. In [28], a deep bag-of-words method is introduced. In this method,

a convolutional autoencoder (CAE) is utilized to: i) encode the RS image local areas

into a representation space; and ii) decode local descriptors to image space. A recon-

struction loss function is employed between an image local area and the CAE output,

while k-means clustering is used with bag-of-words approach to deﬁne the global

image representation. In [29], residual-dyad units (each of which is the combination of a full

preactivation block and a convolutional shortcut block) are proposed for CAEs to

avoid the diminishing feature reuse problem of conventional residual connections.

Semantic Segmentation Driven Methods: The semantic segmentation task aims to

automatically identify pixel-based class labels, which are associated with RS images.

As an example for such methods, in [30], a fully convolutional network (FCN) is

proposed to characterize local areas of multi-label RS images. The FCN is first trained

to predict land-cover maps of RS images, which are then used to characterize convolutional

descriptors of image local areas. The set of final local descriptors is utilized

for region-based RS image matching. In [79], a graph-theoretic deep representation

learning method is introduced to characterize multi-label co-occurrence relationships

associated with each RS image in an archive. To this end, a CNN is employed for

the automatic prediction of graph driven region-based image representation with a

region representation learning loss function.

7.2.2 Multi-Task Driven Methods

MTL aims at enhancing the effectiveness of image representation learning and the

prediction accuracy of each task compared to using a separate learning procedure for

each task [205]. To this end, DL based MTL problem is formulated as learning the

model parameters of a DNN with respect to multiple loss functions, each of which is

associated with a task. In RS, DL based MTL has been applied to various applications

(e.g., motion deblurring [53], building damage mapping [54], change detection [55],

road extraction [206], etc.). As an example, in the context of CBIR, a few DL based MTL

methods have been recently proposed in RS while combining two tasks: i) scene

classiﬁcation; and ii) similarity learning. In [56], a wide-context attention network is

introduced to learn the correlation of local descriptors with wide context information

by employing channel dependence-attention and spatial context-attention modules.

In [52], a center-metric learning method, which employs positive-negative center

loss function for modeling metric space, is proposed to characterize within-class

variations. In [57], a discriminative distillation network is introduced to increase the

interclass variations and to reduce the intraclass differences. In [207], a deep hashing

CNN is employed for simultaneously generating hash codes and predicting land-use

classes of RS images. All above-mentioned deep MTL methods in RS utilize a CNN

backbone (which is shared by all tasks) followed by task-speciﬁc heads, while image

representation learning is done by jointly optimizing the aggregation of task-speciﬁc

loss functions. Although the main problems of this MTL formulation are separately

addressed by automatically selecting loss weights with gradient adjustment strategies

in the computer vision domain (e.g., [205], [182], [208], [183]), these methods are still based on the

joint optimization algorithm.

7.3 Proposed Approach

Let $\mathcal{X} = \{x_1, \ldots, x_M\}$ be an archive that includes $M$ images, where $x_i$ is the
$i$th image in the archive $\mathcal{X}$. Let $\phi : \theta \times \mathcal{X} \mapsto \mathbb{R}^{\gamma}$ be any type of DNN that maps the image
$x_i$ to a $\gamma$-dimensional image descriptor $\phi(x_i; \theta)$, where $\theta$ is the set of DNN parameters.
Let $\mathcal{T} = \{T_1, \ldots, T_N\}$ be a set of $N$ tasks, where the $i$th task $T_i$ is associated with a loss
function $\mathcal{L}_{T_i}$. When image representation learning is achieved based on multiple

tasks, the objective function consists of multiple loss functions $\{\mathcal{L}_{T_i}\}_{i=1}^{N}$. In this
chapter, MTL is performed by the hard parameter sharing technique [58], which allows
characterizing a global descriptor for each image based on the multiple tasks. In this
way, the considered DNN typically includes an encoder (i.e., a CNN backbone), which
is shared by all the tasks, and task-specific heads, which are branched out from the
CNN backbone. Each task-specific head characterizes the task-specific embedding
space based on the characteristics of each task. The CNN backbone models the global
image representation space. Let $G \in \theta$ be the set of DNN parameters that is used
for defining the global image representation space. $G$ is chosen as the parameters of
the last layer of the CNN backbone shared by all the tasks. Let $E_{T_i} \in \theta$ be the
set of parameters that is used to construct the task-specific embedding for the $i$th
task $T_i$. Accordingly, after learning the DNN parameters $\theta$, $G$ is used to obtain image
representations.
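
The hard parameter sharing layout described above can be sketched with a toy model in plain Python. This is a hedged illustration rather than the architecture used in this chapter: the class name `HardSharingModel`, the bias-free linear layers and all weight values are assumptions made only to show that the image descriptor comes from the shared parameters, while each task-specific head operates on top of it.

```python
def linear(weights, x):
    """Bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class HardSharingModel:
    """Toy hard parameter sharing: one shared layer, one head per task."""

    def __init__(self, shared_weights, head_weights):
        # Stand-in for the shared backbone parameters (global representation).
        self.shared_weights = shared_weights
        # Dictionary: task name -> task-specific head parameters.
        self.head_weights = head_weights

    def describe(self, x):
        """Image descriptor obtained from the shared parameters only."""
        return linear(self.shared_weights, x)

    def forward(self, x):
        """Task-specific outputs computed on top of the shared descriptor."""
        z = self.describe(x)
        return {task: linear(w, z) for task, w in self.head_weights.items()}


# Hypothetical weights and input, chosen so the arithmetic is easy to follow.
model = HardSharingModel(
    shared_weights=[[1.0, 0.0], [0.0, 2.0]],
    head_weights={
        "classification": [[1.0, 1.0]],
        "similarity": [[1.0, 0.0], [0.0, 1.0]],
    },
)
x = [1.0, 1.0]
print(model.describe(x))
print(model.forward(x))
```

After training, only `describe` would be needed to obtain image representations, which mirrors the role of the shared parameters in the text; the heads are used solely to compute the task-specific losses.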

In the standard MTL formulation (which is based on the joint optimization algorithm), all
the model parameters $\theta$, including $G$ and $\{E_{T_i}\}_{i=1}^{N}$, are simultaneously updated based
on the gradients of the aggregated loss functions ($\nabla_{\theta} \sum_{i} \mathcal{L}_{T_i}$). This MTL formulation offers
limited control over the learning process of each task, and thus over the plasticity and stability

conditions of the whole learning procedure. This leads to the problems, which are

discussed in the ﬁrst section of this chapter. To avoid these problems by preserving

the plasticity and stability capabilities for all the considered tasks, the proposed

PLASTA-MTL approach is characterized by two novel loss functions and a novel

optimization algorithm. By the proposed plasticity preserving loss (PPL) function,

the PLASTA-MTL approach minimizes the gradient magnitude differences between

global image representation space and task-speciﬁc embedding spaces for the sensi-

tivity of the global image representation space to new information learned via each

task. By the proposed stability preserving loss (SPL) function, the PLASTA-MTL

approach minimizes angular distances between task gradients over global image

representation space to protect it from radical disruptions by each task. To accurately

apply these loss functions, the proposed optimization algorithm sequentially opti-

mizes task-speciﬁc objectives together with the PPL function. In our algorithm, the

SPL function is optimized at the end of the task sequence for all the tasks. In the

following sections, we initially explain in detail the proposed PPL and SPL functions

and then introduce the proposed sequential optimization algorithm.
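
Two quantities from the discussion above can be illustrated numerically: the gradient of the aggregated loss, which is the sum of the task gradients, and the angle between two task gradients over the shared parameters. The Python sketch below uses toy gradient vectors; the helper names and the use of the arccosine of the cosine similarity as the angular distance are assumptions of this sketch and are not necessarily the exact form of the SPL function.

```python
import math


def joint_gradient(task_grads):
    """Gradient of the summed loss: element-wise sum of the task gradients."""
    return [sum(components) for components in zip(*task_grads)]


def norm(g):
    """L2 norm of a flattened gradient vector."""
    return math.sqrt(sum(c * c for c in g))


def cosine(u, v):
    """Cosine similarity between two non-zero gradient vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def angular_distance(u, v):
    """Angle in radians between two gradient vectors (0 = same direction)."""
    c = max(-1.0, min(1.0, cosine(u, v)))  # guard against rounding errors
    return math.acos(c)


# Hypothetical task gradients over the shared parameters.
g_task1 = [10.0, 0.0]  # large gradient: this task dominates the joint update
g_task2 = [0.0, 1.0]   # small gradient, orthogonal to task 1
g_task3 = [-1.0, 0.0]  # points against task 1: a competing task

g_joint = joint_gradient([g_task1, g_task2])
print(g_joint, cosine(g_joint, g_task1), cosine(g_joint, g_task2))
print(angular_distance(g_task1, g_task2), angular_distance(g_task1, g_task3))
```

In this toy case the joint update is almost parallel to the gradient of the first task, which is the dominance problem of joint optimization, while the angle between the first and third task gradients reaches its maximum, which corresponds to competing tasks.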

7.3.1 Plasticity Preservation

The proposed PLASTA-MTL approach aims to control the level of plasticity for each

task in the context of MTL, and thus to ensure the sensitivity to new information

learned via each task. The level of plasticity for each task is controlled by the
extent to which information encoded in the task-specific embedding space is also encoded in the
global image representation space. To this end, we define the plasticity condition
for the $i$th task $T_i$ as how much change occurs in $G$ compared to that in $E_{T_i}$
while the learning of $T_i$ is based on the corresponding loss function $\mathcal{L}_{T_i}$. To measure the
change that occurs in $G$ and $E_{T_i}$ for $T_i$, we utilize the gradients of $\mathcal{L}_{T_i}$ with respect to
the global image representation and task-specific embedding parameters ($\nabla_{G} \mathcal{L}_{T_i}(\theta)$
and $\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta)$). Then, the gradient magnitude difference between global image

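representation space $G$ and task-specific embedding space $E_{T_i}$ for task $T_i$ represents
the change that occurs in $G$ and $E_{T_i}$ as follows: $\|\nabla_{G} \mathcal{L}_{T_i}\| - \|\nabla_{E_{T_i}} \mathcal{L}_{T_i}\|$. When this
difference increases throughout the learning procedure, information specific to task
$T_i$ is only encoded by the task-specific embedding space. Then, the considered DNN
suffers from the lack of plasticity condition for the global image representation space.
Accordingly, to minimize the difference in the degree of change between the global image
representation space $G$ and the task-specific embedding space $E_{T_i}$, we define the PPL
function $\mathcal{L}^{T_i}_{PPL}$ for the task $T_i$ as follows:

$$\mathcal{L}^{T_i}_{PPL} = \left| \frac{\|\nabla_{G} \mathcal{L}_{T_i}(\theta)\|}{\dim(\nabla_{G} \mathcal{L}_{T_i}(\theta))} - \frac{\|\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta)\|}{\dim(\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta))} \right|, \tag{7.1}$$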

where

dim

function gives the dimensions of the gradient vectors that are used to

normalize the gradient magnitude difference. Since each task is associated with a

separate set of task-speciﬁc embedding parameters, PPL is deﬁned for each task.

In detail, we deﬁne the PPL objective based on the gradients of a task-speciﬁc loss

function. It is worth noting that deﬁning loss functions based on the task-speciﬁc

gradients is often considered in the framework of MTL (e.g., [61], [208], [209]) to

control the effect of each task on the weight update of a DNN [58].
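To make (7.1) concrete, the following minimal Python sketch computes the PPL value from two already-computed gradient vectors. It is an illustration rather than the implementation used in this thesis: the function names `l2_norm` and `ppl_value`, the use of the Euclidean norm for the gradient magnitude, and the toy gradient values are all assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flattened gradient vector (assumed norm type)."""
    return math.sqrt(sum(v * v for v in vec))


def ppl_value(grad_g, grad_e):
    """Absolute difference of dimension-normalized gradient magnitudes, as in (7.1).

    grad_g: flattened gradient of the task loss w.r.t. the global
            image representation parameters.
    grad_e: flattened gradient of the same loss w.r.t. the
            task-specific embedding parameters.
    """
    return abs(l2_norm(grad_g) / len(grad_g) - l2_norm(grad_e) / len(grad_e))


# Toy gradients (hypothetical values, not taken from the experiments).
grad_g = [3.0, 4.0, 0.0, 0.0]  # magnitude 5 over 4 dimensions
grad_e = [6.0, 8.0]            # magnitude 10 over 2 dimensions
print(ppl_value(grad_g, grad_e))
```

A value of zero means that both spaces change on the same normalized scale. In the approach itself, the gradient of this value with respect to the global image representation parameters drives the update, which requires an automatic differentiation framework and is not shown in this sketch.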

Due to our PPL function, the proposed PLASTA-MTL approach keeps the gradient magnitudes of $G$ and $E_{T_i}$ on the same scale while modelling the task $T_i$. This leads

the task-speciﬁc information to be characterized in both global image representation

space and task-speciﬁc embedding space. Thus, the global image representation

space (which is shared by all the tasks) is enforced to be sensitive to new information

learned with each task during training. Accordingly, the proposed PLASTA-MTL

approach prevents the considered DNN from the lack of plasticity condition for

each considered task. It is worth noting that when a joint optimization algorithm is employed on the aggregation of all task-specific loss functions, applying our PPL function to all tasks can increase the complexity of the whole learning procedure. In this case, the gradient magnitude of $G$ is forced to simultaneously have the same scale as that of $E_{T_i}$ for each $i \in \{1, \dots, N\}$, which can impose conflicting constraints on the whole learning procedure.

7.3.2 Stability Preservation

The proposed PLASTA-MTL approach aims to adjust the level of stability between consecutive tasks in the context of MTL, and thus to protect the whole learning procedure from radical disruptions while learning multiple tasks. The level of stability between learning different tasks is characterized by the degree of change that occurs in the global image representation space due to a new task with respect to that of previous tasks. Accordingly, the stability condition for all the tasks $\{T_1, \dots, T_N\}$ can be defined as how much change occurs in $G$ between learning consecutive tasks based on their corresponding loss functions $\{\mathcal{L}_{T_1}, \dots, \mathcal{L}_{T_N}\}$. To this end, we define the relative change in $G$ between learning two consecutive tasks $T_i$ and $T_{i+1}$ as the angular distance between the gradients of the associated loss functions, $\nabla_G \mathcal{L}_{T_i}(\theta)$ and $\nabla_G \mathcal{L}_{T_{i+1}}(\theta)$. If this angular distance between two gradient vectors (which are associated with two consecutive tasks) becomes extremely high

throughout the learning procedure, the gradient of the latter task forces the global image representation to change in a very different direction compared to the former task. In this way, the latter task radically changes the global image representation space. This may lead to a lack of stability for the considered learning procedure. Accordingly, to minimize the angular distances between the gradients of each pair of consecutive tasks, we define the SPL function as follows:

$$
\mathcal{L}_{\mathrm{SPL}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \arccos\left( \frac{\nabla_G \mathcal{L}_{T_i}(\theta) \cdot \nabla_G \mathcal{L}_{T_{i+1}}(\theta)}{\|\nabla_G \mathcal{L}_{T_i}(\theta)\| \, \|\nabla_G \mathcal{L}_{T_{i+1}}(\theta)\|} \right), \tag{7.2}
$$

where $\arccos\left(\frac{a \cdot b}{\|a\| \|b\|}\right)$ measures the angle between the vectors $a$ and $b$. To ensure the stability condition for all the tasks $\{T_1, \dots, T_N\}$, the proposed SPL function considers the angular distances between all consecutive pairs in the task sequence.
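The SPL value can be illustrated with a minimal Python sketch that operates on a list of task gradients over the global image representation parameters, given in task-sequence order. The sketch assumes that the angles are averaged over the $N-1$ consecutive pairs and that every gradient is a non-zero flattened vector; the function names `angle_between` and `spl_value` and the toy gradients are hypothetical.

```python
import math


def angle_between(a, b):
    """Angle (in radians) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))


def spl_value(task_grads):
    """Mean angle between the gradients of consecutive tasks in the sequence."""
    pairs = list(zip(task_grads, task_grads[1:]))
    return sum(angle_between(a, b) for a, b in pairs) / len(pairs)


# Three toy task gradients over G (hypothetical values).
grads = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(spl_value(grads))  # mean of a right angle and a zero angle
```

Since only the directions of the gradients enter the angle, rescaling a task gradient (as in the third toy vector) does not change the result.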

Due to our SPL function, the proposed PLASTA-MTL approach keeps the angular distances between different task gradients minimal while learning all the tasks $\{T_1, \dots, T_N\}$. Thus, the directions of task gradients over the global image representation space are forced to be stable throughout the whole learning procedure. This prevents radical changes in the global image representation space due to learning any task. Accordingly, the proposed PLASTA-MTL approach prevents the considered DNN from lacking the stability condition for all the tasks. We would like to point out that if the conventional optimization algorithm of MTL is applied, all loss functions are optimized simultaneously. In this way, there is a single change in $G$ based on the gradient of the aggregated loss functions of all tasks. Then, it is hard to model relative changes in $G$ with respect to different tasks.

7.3.3 Sequential Optimization Algorithm

For the whole learning procedure, the proposed sequential optimization algorithm

aims to adaptively adjust the interactions between task-speciﬁc learning procedures,

and thus allows the proposed PLASTA-MTL approach to ensure plasticity and

stability conditions for all the tasks. As in most of the DL based MTL methods,

learning the parameters of the considered DNN for the tasks $\{T_i\}_{i=1}^{N}$ can be achieved based on the following empirical risk minimization formulation:

$$
\min_{\theta} \sum_{i=1}^{N} \lambda_i \mathcal{L}_{T_i}(\theta), \tag{7.3}
$$

where $\lambda_i$ is the weight parameter of the task $T_i$. In this formulation, for a given mini-batch of training images, there is one optimization procedure, where all the model

parameters are jointly updated to minimize the aggregation of all loss functions.

This formulation limits the control of plasticity and stability conditions for each task, as explained in the previous sections of this chapter. Unlike the existing MTL methods, in the proposed sequential optimization algorithm, there is one optimization procedure for each task-specific loss function together with the corresponding PPL function. At the end of the task sequence, this algorithm applies one additional optimization procedure for SPL by considering all the tasks. To this end, we first

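formulate (7.3) as a multi-level optimization problem as follows:

$$
\begin{aligned}
\min_{G,\,\theta_{T_N}} \quad & \mathcal{L}_{T_N}(G, \theta_{T_N}) \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_{N-1}}} \mathcal{L}_{T_{N-1}}(G, \theta_{T_{N-1}}) \\
& \vdots \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_1}} \mathcal{L}_{T_1}(G, \theta_{T_1}),
\end{aligned}
\tag{7.4}
$$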

where $\theta_{T_i} \in \theta$ is the set of task-specific parameters associated with the task $T_i$ (i.e., task-specific head parameters). The reader is referred to [210] for the details of the multi-level optimization formulation. For (7.4), the set of all tasks $\mathcal{T}$ is regarded as a sequence $\langle T_i \mid i \in \{1, \dots, N\} \rangle$. Accordingly, instead of jointly optimizing all the tasks, every task $T_i$ in the sequence is optimized sequentially. In this way, the global image representation space (which is defined by $G$) is always affected by the optimization of the last task in the sequence. This allows the interactions between task-specific learning procedures to be adaptively adjusted, and thus the plasticity and stability preserving capabilities of the proposed PLASTA-MTL approach to be integrated into the whole learning procedure. To this end, for each task, we minimize the corresponding PPL function $\mathcal{L}_{\mathrm{PPL}}^{T_i}$ together with the task-specific loss function $\mathcal{L}_{T_i}$ by integrating the multi-objective optimization of the two loss functions into (7.4), as follows:

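$$
\begin{aligned}
\min_{G,\,\theta_{T_N}} \quad & \left(\mathcal{L}_{T_N}(G, \theta_{T_N}),\; \mathcal{L}_{\mathrm{PPL}}^{T_N}(\nabla_G \mathcal{L}_{T_N}, \nabla_{E_{T_N}} \mathcal{L}_{T_N})\right) \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_{N-1}}} \left(\mathcal{L}_{T_{N-1}},\; \mathcal{L}_{\mathrm{PPL}}^{T_{N-1}}\right) \\
& \vdots \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_1}} \left(\mathcal{L}_{T_1},\; \mathcal{L}_{\mathrm{PPL}}^{T_1}\right).
\end{aligned}
\tag{7.5}
$$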

It is worth noting that during the optimization of $\mathcal{L}_{\mathrm{PPL}}^{T_i}$, $\nabla_{E_{T_i}} \mathcal{L}_{T_i}$ is regarded as constant. Due to this, the global image representation space (which is defined by $G$) is affected by the optimization of the last task in the sequence with the corresponding PPL function. Since the SPL function $\mathcal{L}_{\mathrm{SPL}}$ is applied for all the tasks, it is optimized at the end of the sequence, as follows:

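$$
\begin{aligned}
\min_{G} \quad & \mathcal{L}_{\mathrm{SPL}}\left(\{\nabla_G \mathcal{L}_{T_i}\}_{i=1}^{N}\right) \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_N}} \left(\mathcal{L}_{T_N},\; \mathcal{L}_{\mathrm{PPL}}^{T_N}\right) \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_{N-1}}} \left(\mathcal{L}_{T_{N-1}},\; \mathcal{L}_{\mathrm{PPL}}^{T_{N-1}}\right) \\
& \vdots \\
\text{s.t.} \quad & G \in \operatorname*{arg\,min}_{G,\,\theta_{T_1}} \left(\mathcal{L}_{T_1},\; \mathcal{L}_{\mathrm{PPL}}^{T_1}\right),
\end{aligned}
\tag{7.6}
$$

where $\nabla_G \mathcal{L}_{T_i}$ is stored in each minimization step to be utilized for the optimization of $\mathcal{L}_{\mathrm{SPL}}$.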

It is worth noting that, depending on the selection of tasks, ensuring the stability condition for the considered DNN may decrease the level of plasticity, and vice versa. In this way, the lack of one of the stability and plasticity conditions is associated with the excess of the other. As an example, if some of the considered tasks are in heavy competition during training and one of the tasks can distract the other tasks, there is a lack of stability, which is also due to an excess of plasticity. In this case, increasing the level of stability decreases the excess plasticity that causes the lack of stability. Under such conditions, the stability-plasticity constraint of a DNN is defined as a dilemma between these two capabilities of the DNN. If this dilemma is present, it can be misleading to address both stability and plasticity capabilities at the same time, since this may lead to an ineffective characterization of one of the conditions. This drawback can be more evident if preserving one of the capabilities is more important than preserving the other. Accordingly, in the proposed PLASTA-MTL approach, we aim to automatically detect which capability should be preserved if there is a need for selecting only one of them. To this end, we define the importance level of the stability condition for the considered DNN and the tasks based on the norm of the gradient of SPL. Accordingly, for a given set of tasks, we define the set of all the loss functions to be considered based on the two different levels of importance for $\mathcal{L}_{\mathrm{SPL}}$ as follows:

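$$
L: \begin{cases}
\mathcal{L}_{T_1}, \dots, \mathcal{L}_{T_N}, \mathcal{L}_{\mathrm{SPL}}, & \text{if } \|\nabla_G \mathcal{L}_{\mathrm{SPL}}\| \geq \alpha \\
\mathcal{L}_{T_1}, \mathcal{L}_{\mathrm{PPL}}^{T_1}, \dots, \mathcal{L}_{T_N}, \mathcal{L}_{\mathrm{PPL}}^{T_N}, & \text{if } \|\nabla_G \mathcal{L}_{\mathrm{SPL}}\| \leq \beta \\
\mathcal{L}_{T_1}, \mathcal{L}_{\mathrm{PPL}}^{T_1}, \dots, \mathcal{L}_{T_N}, \mathcal{L}_{\mathrm{PPL}}^{T_N}, \mathcal{L}_{\mathrm{SPL}}, & \text{otherwise}
\end{cases}
\tag{7.7}
$$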

where $\alpha$ and $\beta$ control the importance limits, with $\alpha > \beta$. If the norm of the gradient $\nabla_G \mathcal{L}_{\mathrm{SPL}}$ is significantly high (higher than $\alpha$), we assume that there is no need to apply $\mathcal{L}_{\mathrm{PPL}}$. The same applies to $\mathcal{L}_{\mathrm{SPL}}$ if the norm of the gradient $\nabla_G \mathcal{L}_{\mathrm{SPL}}$ is significantly low (lower than $\beta$). If the norm is in between $\beta$ and $\alpha$, we define this interval as the condition where the stability-plasticity constraint is no longer a dilemma, and thus both capabilities can be preserved in the proposed PLASTA-MTL approach. It is worth noting that since $\nabla_G \mathcal{L}_{\mathrm{SPL}}$ depends on the normalized gradients of consecutive task-specific loss functions (see (7.2)), it is mostly affected by which tasks are jointly considered. However, it is less affected by the considered data set since the input samples only indirectly change the gradient of the SPL function. The proposed sequential algorithm automatically decides to apply PPL, SPL or both loss functions together depending on the parameters $\alpha$ and $\beta$. Accordingly, (7.5) is used to apply only the PPL function, (7.6) is used without $\mathcal{L}_{\mathrm{PPL}}^{T_i}$ to apply only the SPL function, and (7.6) is used to apply both loss functions together. In practice, this decision can be made at the end of the first epoch of the training based on the parameters $\alpha$ and $\beta$. The

proposed sequential optimization algorithm is summarized in Algorithm 2. To better

understand the applied operations in it, Fig. 7.1 shows an illustration of the proposed

PLASTA-MTL approach training with the proposed optimization algorithm. It is

noted that, for simplicity, forward and backward passes applied in our optimization

algorithm are visualized for two tasks. For the first task, while $\nabla_\theta \mathcal{L}_{T_1}$ is propagated back (which is visualized with red arrows in (a)), $\mathcal{L}_{\mathrm{PPL}}^{T_1}$ is calculated. Then, the backward pass for $\mathcal{L}_{\mathrm{PPL}}^{T_1}$ is applied (which is illustrated with purple arrows in (a)). During the plasticity preservation for the first task, the change over the gradient vector $\nabla_G \mathcal{L}_{T_1}$ is visualized in (b). The same steps are also presented for the second task in (c) and (d). After the plasticity preservation is employed for both tasks, $\mathcal{L}_{\mathrm{SPL}}$ is calculated (see (e)). At the end, the backward pass for the SPL function is applied (which is visualized with blue arrows). During the stability preservation for both tasks, the changes over the gradient vectors of both tasks are presented in (f).

FIGURE 7.1: An illustration of the training of the proposed plasticity-stability preserving multi-task learning (PLASTA-MTL) approach when two tasks $T_1$ and $T_2$ are considered. Standard and plasticity preservation backward passes for (a) $T_1$ and (c) $T_2$ are shown, while the changes over the gradient vectors (b) $\nabla_G \mathcal{L}_{T_1}$ and (d) $\nabla_G \mathcal{L}_{T_2}$ during the plasticity preservation of these tasks are visualized. (e) The backward pass for stability preservation of all the tasks is given with (f) the illustration of changes over their gradient vectors.

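Since the proposed algorithm applies a task-specific optimization procedure for each task, unlike the joint optimization algorithm, the PLASTA-MTL approach is capable of effectively preserving plasticity and stability capabilities for each task in the context of MTL. We would like to point out that this algorithm does not require the selection of any loss function weights (which generally requires a computationally demanding grid search in the joint optimization algorithm). It is also worth noting that the proposed algorithm works independently of the number and the type of considered tasks.

Algorithm 2 The proposed sequential optimization algorithm to train the proposed PLASTA-MTL approach
Require: Mini-batch $\mathcal{B} \in \mathcal{X}$, set of tasks $\mathcal{T} = \{T_1, \dots, T_N\}$, set of model parameters $\theta$, $\alpha$, $\beta$
for $i \leftarrow 1$ to $N$ do
    Compute $\mathcal{L}_{T_i}(\theta)$
    Compute $\nabla_\theta \mathcal{L}_{T_i}(\theta)$, $\nabla_G \mathcal{L}_{T_i}(\theta)$ and $\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta)$
    Compute $\mathcal{L}_{\mathrm{PPL}}^{T_i} = \left| \frac{\|\nabla_G \mathcal{L}_{T_i}(\theta)\|}{\dim(\nabla_G \mathcal{L}_{T_i}(\theta))} - \frac{\|\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta)\|}{\dim(\nabla_{E_{T_i}} \mathcal{L}_{T_i}(\theta))} \right|$
    Compute $\nabla_G \mathcal{L}_{\mathrm{PPL}}^{T_i}$
    Update $\theta$ using $\nabla_\theta \mathcal{L}_{T_i}(\theta)$
    if $\|\nabla_G \mathcal{L}_{\mathrm{SPL}}\| < \alpha$ then
        Update $G$ using $\nabla_G \mathcal{L}_{\mathrm{PPL}}^{T_i}$
    end if
end for
Compute $\mathcal{L}_{\mathrm{SPL}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \arccos\left( \frac{\nabla_G \mathcal{L}_{T_i}(\theta) \cdot \nabla_G \mathcal{L}_{T_{i+1}}(\theta)}{\|\nabla_G \mathcal{L}_{T_i}(\theta)\| \, \|\nabla_G \mathcal{L}_{T_{i+1}}(\theta)\|} \right)$
Compute $\nabla_G \mathcal{L}_{\mathrm{SPL}}$
if $\|\nabla_G \mathcal{L}_{\mathrm{SPL}}\| > \beta$ then
    Update $G$ using $\nabla_G \mathcal{L}_{\mathrm{SPL}}$
end if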
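The control flow of the sequential optimization algorithm, including the threshold-based choice between PPL and SPL, can be traced with the following minimal Python sketch. It only lists which update steps are applied for one mini-batch and does not compute any gradients. The function name `plan_updates` and the step descriptions are hypothetical, and the norm of the SPL gradient is assumed to be given in advance (e.g., as the average value measured in the first epoch). The example values 0.33, 0.13 and 0.04 are gradient norms reported in Table 7.1, used with $\alpha = 0.3$ and $\beta = 0.1$.

```python
def plan_updates(num_tasks, spl_grad_norm, alpha, beta):
    """List the update steps of one pass over the task sequence."""
    if alpha <= beta:
        raise ValueError("alpha must be larger than beta")
    steps = []
    for i in range(1, num_tasks + 1):
        # Every task always updates the parameters with its own loss.
        steps.append(f"update theta with the gradient of L_T{i}")
        # PPL is skipped when the SPL gradient norm reaches alpha.
        if spl_grad_norm < alpha:
            steps.append(f"update G with the gradient of L_PPL for T{i}")
    # SPL is applied once, after the whole sequence, unless the norm is at most beta.
    if spl_grad_norm > beta:
        steps.append("update G with the gradient of L_SPL")
    return steps


for norm in (0.33, 0.13, 0.04):
    print(norm, plan_updates(2, norm, alpha=0.3, beta=0.1))
```

With two tasks, the first case yields only the task-specific and SPL updates, the second case yields all updates, and the third case yields only the task-specific and PPL updates.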

7.4 Dataset Description and Experimental Design

7.4.1 Dataset Description

The experiments were performed on the DLRSD [203] and the BigEarthNet-S2 benchmark archives. The DLRSD archive includes the same images as the UC Merced archive [84], which consists of 2,100 aerial images, each of which has a size of 256 × 256 pixels with a spatial resolution of 30 cm. In the DLRSD archive, the images are associated with multi-labels and pixel-based labels, where the set of class labels is defined in [93]. We utilized the Serbia subset of the BigEarthNet-S2 benchmark archive, where images were acquired during the summer season. This subset includes 14,832 Sentinel-2 images, each of which is a section of: 1) 120 × 120 pixels for 10 m bands; 2) 60 × 60 pixels for 20 m bands; and 3) 20 × 20 pixels for 60 m bands. For the experiments, we applied bicubic interpolation to the 20 m bands, excluded the 60 m bands, and utilized the 19-class nomenclature of BigEarthNet-S2. For the tasks that require the availability of land-cover maps, we extracted the CLC land-cover map of each image.

To perform experiments, we divided the DLRSD and the BigEarthNet-S2 archives

into training, validation and test sets with the ratios of 70%, 10%, 20% and 52%, 24%,

24%, respectively. To apply CBIR, the training set of the DLRSD archive and the

validation set of the BigEarthNet-S2 archive were used for selecting query images,

while images were retrieved from the test set for both archives.

7.4.2 Experimental Design

In the experiments, we utilized the DenseNet-121 CNN architecture [148] as the

MTL backbone shared by all the tasks. To perform the experiments, we utilized

the different combinations of four tasks: 1) supervised scene classiﬁcation; 2) super-

vised similarity learning; 3) supervised multi-label co-occurrence prediction; and

4) unsupervised similarity learning. For each task, we added a task-speciﬁc head

to the CNN backbone. Each task-head includes a fully connected (FC) layer that: i)

takes the global image representation from the CNN backbone; and ii) produces a

64-dimensional task-specific embedding. The supervised scene classification task (which is denoted as $T_1$) aims to automatically assign multi-labels to image scenes. To this end, the task-head of $T_1$ also includes a classification layer that produces multi-label class probabilities. For this task, the task-specific loss function $\mathcal{L}_{T_1}$ is selected as the cross-entropy loss function. For the details of this task, the reader is referred to Chapter 3. The supervised similarity learning task (which is denoted as $T_2$) aims to automatically identify image similarities. To this end, we selected a triplet loss function as the task-specific loss function $\mathcal{L}_{T_2}$. The triplet loss function directly operates on the task-specific embeddings, and requires the availability of image triplets (each of which includes anchor, positive and negative images). For this task, image triplets are selected by using the hard triplet sampling technique based on the multi-label similarities. The reader is referred to Chapter 4 for the details of the triplet loss function and the triplet sampling techniques. The supervised multi-label co-occurrence prediction task (which is denoted as $T_3$) aims to predict co-occurrence relationships of multiple classes present in an image. To this end, by following the method presented in [79], the task-head of $T_3$ also includes an FC layer that takes task-specific embeddings and produces the prediction for graph-driven region-based image representations. For this task, the region representation learning loss function [79] is selected as the task-specific loss function $\mathcal{L}_{T_3}$. It minimizes the prediction error of the task-specific head in comparison to the image graphs, which are obtained based on the image land-cover maps. The unsupervised similarity learning task (which is denoted as $T_4$) aims at learning image representations by maximizing the similarity between different views of the same image without relying on any ground truth information. To this end, by following the strategy of self-supervised contrastive learning presented in [211], we used a set of data augmentation techniques to generate different views of each training image. Then, the task-specific loss function $\mathcal{L}_{T_4}$ is selected as the contrastive loss function, which operates on the task-specific embeddings of two different augmented views of each image. It maximizes the similarity between the augmented views of the same image with respect to the rest of the images. The reader is referred to [211] for the details of the contrastive loss and the set of data augmentation techniques, which is applied to generate different views of images.
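As an illustration of how a triplet loss operates on task-specific embeddings, the following minimal Python sketch evaluates a common margin-based variant for a single triplet. It is not necessarily the formulation used for the supervised similarity learning task (the details are given in Chapter 4): the squared Euclidean distance, the margin value, and the function names `squared_distance` and `triplet_loss` are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor
    than the positive by at least the margin (hypothetical value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-dimensional embeddings (the embeddings in the experiments are 64-dimensional).
anchor, positive = [0.0, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, negative=[3.0, 0.0]))  # well-separated negative
print(triplet_loss(anchor, positive, negative=[0.0, 1.0]))  # negative as close as the positive
```

Triplets for which the loss is already zero do not contribute to the gradient, which is one reason why a hard triplet sampling technique is typically preferred.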

We trained the proposed PLASTA-MTL approach for 100 epochs. For training, we utilized the Adam variant of stochastic gradient descent with an initial learning rate of $10^{-3}$. All the experiments were performed on 4 NVIDIA Tesla V100 GPUs. After training is finished by employing the above-mentioned tasks in the context of MTL, we extracted the features of query and archive images from the last layer of the CNN backbone. To apply CBIR, we performed similarity matching of the extracted image features based on the $\chi^2$-distance measure. CBIR results are provided in terms of two evaluation metrics: 1) normalized discounted cumulative gains (NDCG); and 2) mean average precision (mAP).
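The similarity matching step can be sketched in Python as follows. The sketch ranks archive features by their $\chi^2$-distance to a query feature and returns the indices of the closest ones. It assumes non-negative feature vectors and one common form of the $\chi^2$-distance (with a factor of 1/2 and a small constant `eps` to avoid division by zero); the function names `chi2_distance` and `retrieve` and the toy features are hypothetical.

```python
def chi2_distance(p, q, eps=1e-12):
    """Chi-square distance between two non-negative feature vectors."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def retrieve(query, archive, k):
    """Indices of the k archive features closest to the query."""
    order = sorted(range(len(archive)), key=lambda j: chi2_distance(query, archive[j]))
    return order[:k]


# Toy features (hypothetical values).
query = [1.0, 0.0]
archive = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(retrieve(query, archive, k=2))
```

Here, the archive feature that is identical to the query is ranked first, followed by the feature that partially overlaps with it.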

We carried out various experiments to: 1) perform a sensitivity analysis of the

proposed PLASTA-MTL approach; and 2) compare our approach with state-of-the-

art MTL methods in the context of CBIR. For the sensitivity analysis, we assessed: i)

the effectiveness of the selection of plasticity and stability preserving capabilities; ii)

the effect of task sequence order on the proposed sequential optimization algorithm;

iii) computational complexity of the PLASTA-MTL approach; and iv) the comparison

of utilizing multiple tasks in our approach with separately employing each task (that

is based on single-task learning (STL)).

We compared the proposed approach with: 1) conventional multi-task learning (equal

weighting); 2) multi-task learning using uncertainty to weigh losses (uncertainty

weighting) [205]; 3) projecting conﬂicting gradients (PCGrad) [182]; 4) gradient

normalization for adaptive loss balancing in deep multitask networks (GradNorm)

[208]; and 5) dynamic weight average (DWA) [183]. For all the methods, we used the

same CNN backbone and task-speciﬁc heads with our approach. For the ﬁrst method,

we applied joint optimization on the summation of task-speciﬁc loss functions with

equal weights. For the other four methods, we used the same method-speciﬁc

parameters given in [205], [182], [208] and [183].

7.5 Experimental Results

We performed different kinds of experiments in order to: 1) carry out a sensitivity

analysis; and 2) compare the effectiveness of the proposed PLASTA-MTL approach

with the state-of-the-art MTL methods in the framework of CBIR.

7.5.1 Sensitivity Analysis of the Proposed Approach

In this sub-section, we performed the sensitivity analysis of the proposed PLASTA-

MTL approach in terms of: i) the effectiveness of the automatic selection of plasticity

and stability preserving capabilities; ii) the task sequence order utilized in our ap-

proach; iii) the computational complexity; and iv) the comparison with single-task

learning.

In the ﬁrst set of trials, we analyzed the effectiveness of automatically detecting the

preservation of plasticity and stability capabilities in the proposed PLASTA-MTL

approach. Table 7.1 shows the mAP scores for the DLRSD archive when different combinations of the tasks $\{T_1, \dots, T_4\}$ are utilized with the different combinations of plasticity and stability preserving capabilities in the PLASTA-MTL approach.

TABLE 7.1: MEAN AVERAGE PRECISION (MAP) SCORES OBTAINED WHEN DIFFERENT COMBINATIONS OF TASKS AND DIFFERENT CAPABILITIES OF THE PLASTA-MTL APPROACH ARE UTILIZED (THE DLRSD ARCHIVE)

Tasks            ‖∇_G L_SPL‖   mAP (%)
T1  T2  T3  T4                 L_PPL only   L_SPL only   L_PPL and L_SPL
✓   ✓   ✗   ✗    0.33          95.0         95.7         94.0
✓   ✗   ✓   ✗    0.43          96.0         97.6         96.7
✓   ✗   ✗   ✓    0.13          95.2         91.0         96.0
✗   ✓   ✓   ✗    0.18          94.8         95.2         95.5
✗   ✓   ✗   ✓    0.09          86.1         84.5         85.4
✗   ✗   ✓   ✓    0.09          95.4         94.8         93.8
✓   ✓   ✓   ✗    0.12          96.5         96.3         97.2
✓   ✓   ✗   ✓    0.04          96.7         94.7         94.8
✓   ✗   ✓   ✓    0.05          97.0         94.5         96.8
✗   ✓   ✓   ✓    0.06          95.5         93.4         95.2
✓   ✓   ✓   ✓    0.13          97.5         97.0         97.6

TABLE 7.2: MEAN AVERAGE PRECISION (MAP) SCORES WHEN THE TASKS T1, T2 AND T3 ARE UTILIZED IN DIFFERENT ORDERS FOR THE PLASTA-MTL APPROACH (THE DLRSD ARCHIVE)

Task Order        mAP (%)
T1 → T2 → T3      97.2
T1 → T3 → T2      97.0
T2 → T1 → T3      97.1
T2 → T3 → T1      96.8
T3 → T1 → T2      97.7
T3 → T2 → T1      97.5

By assessing Table 7.1, one can observe that the selection of which capabilities are preserved in our PLASTA-MTL approach is one of the most important factors affecting the overall CBIR performance. This issue becomes more evident under two scenarios. First, if some of the considered tasks are in competition during training, the preservation of both capabilities at the same time leads to the ineffective characterization of either stability or plasticity conditions. This results in lower mAP scores compared to preserving only one of the capabilities. As an example, when the considered tasks include $T_1$ and $T_2$, employing only PPL or only SPL leads to 1% and 1.7% higher mAP scores, respectively, compared to utilizing both loss functions together in the proposed PLASTA-MTL approach. This is due to the fact that learning the task $T_1$ (which is supervised scene classification) enforces the maximization of inter-class distances in the global image representation space, while learning the task $T_2$ (which is supervised similarity learning) enforces the minimization of intra-class distances. These learning characteristics can easily result in the competition of the two tasks. However, when

the considered tasks include

and

(which are not in competition during training),

preserving each capability further improves the CBIR performance. Second, when the number of considered tasks decreases, the effect of selecting one of the plasticity and stability preserving capabilities on mAP scores increases. As an example, when the considered tasks include only $T_1$ and $T_4$, the difference of mAP scores between preserving plasticity and stability capabilities is more than 4%. However, when all the tasks $\{T_1, \dots, T_4\}$ are considered, this difference is less than 1%. These two scenarios show that the accurate selection of which capabilities are preserved in our PLASTA-MTL approach is crucial for accurate CBIR performance. The proposed sequential optimization strategy automatically detects which capabilities are preserved by controlling the importance level of the stability condition, which is defined based on the norm of the gradient of the SPL function. Table 7.1 also includes the average gradient norm values, which are obtained in the first epoch of the training. By analyzing the table, one can observe that when the norm value is significantly high (e.g., $\mathcal{T} = \{T_1, T_2\}$ and $\mathcal{T} = \{T_1, T_3\}$), preserving only the stability capability in the PLASTA-MTL approach provides the highest mAP scores. When the norm value is significantly low (e.g., $\mathcal{T} = \{T_1, T_2, T_4\}$ and $\mathcal{T} = \{T_1, T_3, T_4\}$), preserving only the plasticity capability in the PLASTA-MTL approach provides the highest mAP scores. This shows the effectiveness of the automatic detection strategy of the proposed sequential optimization algorithm, which is utilized to identify which capabilities are preserved in our PLASTA-MTL approach. The average gradient norm values given in Table 7.1 show that two importance levels of the stability condition can be defined as $\alpha = 0.3$ and $\beta = 0.1$. Accordingly, we used these parameters in the proposed sequential optimization algorithm for the rest of the experiments.

In the second set of trials, we analyzed the effect of the task sequence order utilized in the proposed PLASTA-MTL approach. Table 7.2 shows the mAP scores for the DLRSD archive when the tasks $\{T_1, T_2, T_3\}$ are utilized with all the possible orders in the task sequence of our approach. By analyzing the table, one can see that when the order of the considered tasks is changed, the proposed PLASTA-MTL approach provides different mAP scores. This is because all the tasks are learned sequentially in the proposed optimization algorithm, and thus different task sequence orders lead to changes in the whole learning procedure. However, from the table, one can also observe that the differences between the mAP scores of different task orders are not significantly high. The difference between the highest mAP score (which is obtained with the task order of $T_3 \to T_1 \to T_2$) and the lowest mAP score (which is obtained with the task order of $T_2 \to T_3 \to T_1$) is less than 1%. Figure 7.2 shows the NDCG scores of the same tasks and their orders for the DLRSD archive under different numbers of retrieved images. From the figure, one can see that increasing the number of retrieved images does not change our conclusion. These results show that utilizing different task orders does not significantly affect the CBIR performance of the proposed PLASTA-MTL approach. For the rest of the experiments, we employed the numerical order of tasks (i.e., $T_1 \to T_2 \to T_3 \to T_4$) for the proposed PLASTA-MTL approach.

FIGURE 7.2: Normalized discounted cumulative gains (NDCG) versus the number of retrieved images obtained for the DLRSD archive when the tasks $T_1$, $T_2$ and $T_3$ are utilized in different orders for the PLASTA-MTL approach.

In the third set of trials, we assessed the computational complexity of the proposed PLASTA-MTL approach. To this end, in Table 7.3, we compared our approach with the equal weighting method in terms of the training time required per epoch when different combinations of the tasks $\{T_1, \dots, T_4\}$ are utilized on the DLRSD archive.

TABLE 7.3: TRAINING TIMES PER EPOCH ON THE DLRSD ARCHIVE WHEN THE DIFFERENT COMBINATIONS OF TASKS ARE UTILIZED FOR THE PROPOSED PLASTA-MTL APPROACH AND EQUAL WEIGHTING.

Tasks            Training Time per Epoch (sec)
T1  T2  T3  T4   Equal Weighting   PLASTA-MTL
✓   ✓   ✗   ✗    9.3               18.0
✓   ✗   ✓   ✗    18.3              24.5
✓   ✗   ✗   ✓    57.6              62.9
✗   ✓   ✓   ✗    17.7              24.6
✗   ✓   ✗   ✓    60.2              64.9
✗   ✗   ✓   ✓    66.4              70.9
✓   ✓   ✓   ✗    15.7              32.6
✓   ✓   ✗   ✓    57.9              73.6
✓   ✗   ✓   ✓    69.1              77.7
✗   ✓   ✓   ✓    67.8              77.4
✓   ✓   ✓   ✓    64.5              88.4

It is worth noting that the equal weighting method jointly optimizes all the loss functions without the need for any other steps that may increase the computational complexity. Accordingly, this method can be regarded as one of the MTL methods associated with the lowest computational complexity. By assessing the table, one can observe that our approach requires a higher training time per epoch compared to the equal weighting method for each task combination. This is because the sequential optimization applied in the proposed PLASTA-MTL approach requires a higher number of forward and backward passes of the considered DNN compared to the joint optimization algorithm. This becomes more evident if the same batches of training images are used for all the tasks (e.g., $\mathcal{T} = \{T_1, T_2\}$). In this condition, the equal weighting method requires one forward pass and one backward pass for each batch, while our approach requires at least two forward and backward passes depending on the number of tasks. When some of the considered tasks require different batches of training images, which leads to more than one forward pass, the computational complexity of the equal weighting method increases. However, this does not affect the computational complexity of our proposed approach. As an example, when the tasks $\{T_1, T_2\}$ are utilized, the training time per epoch of our approach is almost twice as large as that of the equal weighting method. However, when the tasks $\{T_1, T_4\}$ are utilized, the task $T_4$ requires feeding the augmented views of images into the considered DNN, which costs an additional forward pass step. In this case, the required training time per epoch of the proposed PLASTA-MTL approach is almost the same as that of the equal weighting method. It is worth noting that the overall computational complexity is also affected by the total number of epochs in addition to the training time per epoch. Accordingly, Figure 7.3 shows the minimum numbers of training epochs at which the proposed PLASTA-MTL approach and the equal weighting method reach a range of mAP scores when different numbers of tasks are considered. By analyzing the figure, one can see that our approach is able to achieve the same mAP scores with fewer training epochs compared

to the equal weighting method. As an example, when the tasks

{T1

T3}

are

considered, our approach achieves a 93% mAP score with 25 fewer training epochs
than the equal weighting method. This leads to less total training time for
our approach, although the corresponding training time per epoch is higher than
that of the equal weighting method. This effect becomes more evident when the number of
considered tasks increases. As an example, when all the tasks are utilized, the total
training time of our approach to reach a 93% mAP score is significantly less than that
of the equal weighting method. These results show that the learning efficiency of the


FIGURE 7.3: Mean Average Precision (mAP) versus the minimum number of training
epochs for the DLRSD archive under three different task combinations (panels (a), (b)
and (c)) for the PLASTA-MTL approach and the equal weighting method.

proposed PLASTA-MTL approach is significantly higher than that of the equal weighting
method. This leads to a reduction of the total training time (which is required to reach
a high CBIR performance) for the proposed PLASTA-MTL approach.
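The trade-off between the training time per epoch and the number of epochs can be made concrete with a small sketch. It assumes that the total training time is simply the time per epoch multiplied by the number of epochs, and it uses only the per-epoch times of Table 7.3 for the case of all four tasks. The function names are hypothetical and chosen for illustration.

```python
def total_time(seconds_per_epoch: float, epochs: int) -> float:
    """Total training time, assuming every epoch costs the same."""
    return seconds_per_epoch * epochs


def max_epoch_ratio(t_baseline: float, t_proposed: float) -> float:
    """Largest ratio E_proposed / E_baseline at which the proposed method
    is still not slower in total than the baseline.

    Follows from t_proposed * E_proposed <= t_baseline * E_baseline.
    """
    return t_baseline / t_proposed


# Per-epoch times from Table 7.3 when all four tasks are utilized.
T_EQUAL_WEIGHTING = 64.5  # seconds per epoch
T_PLASTA_MTL = 88.4       # seconds per epoch

ratio = max_epoch_ratio(T_EQUAL_WEIGHTING, T_PLASTA_MTL)
print(f"PLASTA-MTL is faster in total if it needs fewer than "
      f"{100 * ratio:.1f}% of the epochs of equal weighting.")
```

Under this assumption, a higher time per epoch is compensated as soon as the required number of epochs drops below the printed fraction of the baseline's epochs.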

In the fourth set of trials, we analyzed the effectiveness of the proposed PLASTA-MTL
approach compared to separately employing each task of the considered task set (i.e.,
single-task learning (STL)). For the DLRSD archive, Table 7.4 shows the
mAP scores of the PLASTA-MTL approach for the different combinations of the tasks
{T1, ..., T4} and of STL for each task. By analyzing the table, one can observe
that, for each combination, our approach provides higher mAP scores compared

to separately learning each task. As an example, when the tasks {T1, T2, T4} are
considered, the proposed PLASTA-MTL approach provides almost 2%, 15% and 14%
higher mAP scores compared to applying separate learning procedures for T1, T2 and
T4, respectively. This shows that our approach effectively combines multiple
tasks, which leads to more accurate image representation learning compared
to utilizing a single task.
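The gains quoted above can be checked directly against Table 7.4. The short sketch below takes the mAP score of the task combination T1, T2 and T4 (96.7%) and the STL scores of the same three tasks from the table, and computes the differences. The variable names are illustrative only.

```python
# mAP scores (%) copied from Table 7.4 (DLRSD archive).
stl_map = {"T1": 94.9, "T2": 81.8, "T4": 83.2}  # single-task learning
mtl_map_t1_t2_t4 = 96.7                          # PLASTA-MTL with T1, T2 and T4

# Gain of the multi-task model over each single-task model, in mAP points.
gains = {task: round(mtl_map_t1_t2_t4 - score, 1) for task, score in stl_map.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain} mAP points over STL")
```

The resulting differences of 1.8, 14.9 and 13.5 points correspond to the "almost 2%, 15% and 14%" stated in the text.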

7.5.2 Comparison with Existing Methods

In the fifth set of trials, we analyzed the effectiveness of the proposed PLASTA-MTL
approach compared to the state-of-the-art MTL methods in the context of CBIR
under various combinations of the considered four tasks. These methods are: equal


TABLE 7.4: MEAN AVERAGE PRECISION (MAP) SCORES WHEN THE DIFFERENT COMBINATIONS
OF TASKS ARE UTILIZED IN THE PLASTA-MTL APPROACH COMPARED TO SINGLE
TASK LEARNING (THE DLRSD ARCHIVE)

T1  T2  T3  T4   Method        mAP (%)
✓   ✗   ✗   ✗    STL           94.9
✗   ✓   ✗   ✗    STL           81.8
✗   ✗   ✓   ✗    STL           95.4
✗   ✗   ✗   ✓    STL           83.2
✓   ✓   ✗   ✗    PLASTA-MTL    95.7
✓   ✗   ✓   ✗    PLASTA-MTL    97.6
✓   ✗   ✗   ✓    PLASTA-MTL    96.0
✗   ✓   ✓   ✗    PLASTA-MTL    95.5
✗   ✓   ✗   ✓    PLASTA-MTL    86.1
✗   ✗   ✓   ✓    PLASTA-MTL    95.4
✓   ✓   ✓   ✗    PLASTA-MTL    97.2
✓   ✓   ✗   ✓    PLASTA-MTL    96.7
✓   ✗   ✓   ✓    PLASTA-MTL    97.0
✗   ✓   ✓   ✓    PLASTA-MTL    95.5
✓   ✓   ✓   ✓    PLASTA-MTL    97.6

FIGURE 7.4: Normalized discounted cumulative gains (NDCG) versus the number of
retrieved images obtained for the DLRSD archive under five different task combinations
(panels (a) to (e)) used in the context of multi-task learning.

weighting, uncertainty weighting [205], PCGrad [182], GradNorm [208] and DWA
[183]. Tables 7.5 and 7.6 show the corresponding mAP scores on the DLRSD and
the BigEarthNet-S2 archives, respectively. By assessing the tables, one can observe
that the proposed PLASTA-MTL approach leads to the highest mAP scores on each
task combination for both archives compared to the state-of-the-art MTL methods.
As an example, the proposed PLASTA-MTL approach outperforms PCGrad by
more than 4% for the DLRSD archive and by more than 8% for the BigEarthNet-S2
archive when the tasks {T2, T3, T4} are utilized. When all the tasks {T1, ..., T4}
are used, our approach provides almost 3% higher mAP scores for both archives
compared to GradNorm. We observed similar behaviours while comparing


TABLE 7.5: MEAN AVERAGE PRECISION (MAP) SCORES ASSOCIATED TO THE DIFFERENT
COMBINATIONS OF TASKS (THE DLRSD ARCHIVE)

T1  T2  T3  T4   Method                        mAP (%)
✓   ✗   ✗   ✓    Equal Weighting               90.1
                 Uncertainty Weighting [205]   94.4
                 PCGrad [182]                  92.7
                 GradNorm [208]                94.3
                 DWA [183]                     93.0
                 PLASTA-MTL                    96.0
✗   ✓   ✓   ✗    Equal Weighting               93.2
                 Uncertainty Weighting [205]   94.0
                 PCGrad [182]                  92.6
                 GradNorm [208]                92.9
                 DWA [183]                     92.5
                 PLASTA-MTL                    95.5
✓   ✓   ✗   ✓    Equal Weighting               91.6
                 Uncertainty Weighting [205]   95.4
                 PCGrad [182]                  92.9
                 GradNorm [208]                93.8
                 DWA [183]                     91.4
                 PLASTA-MTL                    96.7
✗   ✓   ✓   ✓    Equal Weighting               92.0
                 Uncertainty Weighting [205]   95.0
                 PCGrad [182]                  91.2
                 GradNorm [208]                91.4
                 DWA [183]                     90.9
                 PLASTA-MTL                    95.5
✓   ✓   ✓   ✓    Equal Weighting               92.6
                 Uncertainty Weighting [205]   95.8
                 PCGrad [182]                  94.9
                 GradNorm [208]                95.0
                 DWA [183]                     93.7
                 PLASTA-MTL                    97.6

the methods of equal weighting, uncertainty weighting and DWA with our approach.
This shows that the proposed PLASTA-MTL approach provides more accurate RS
image representations, which lead to more effective CBIR compared to other methods.
This is due to the plasticity and stability preserving capabilities of our approach, which
overcome the well-known problems of MTL. Figures 7.4 and 7.5 show the NDCG
scores of the considered state-of-the-art methods and our approach under different
combinations of the tasks {T1, ..., T4} and different numbers of retrieved images
for the DLRSD and the BigEarthNet-S2 archives, respectively. From the figures, one
can see that when the number of retrieved images is increased (from 1 to 50 for
DLRSD and from 1 to 100 for BigEarthNet-S2), the proposed PLASTA-MTL approach
provides the highest NDCG scores for almost all task combinations at each number of
retrieved images on both archives. For the DLRSD archive, Fig. 7.6 shows an example
of a query image and the images retrieved by these methods and our approach
when all the tasks are utilized and the query image contains the classes of buildings,


TABLE 7.6: MEAN AVERAGE PRECISION (MAP) SCORES ASSOCIATED TO THE DIFFERENT
COMBINATIONS OF TASKS (THE BIGEARTHNET-S2 ARCHIVE)

T1  T2  T3  T4   Method                        mAP (%)
✗   ✓   ✓   ✗    Equal Weighting               95.9
                 Uncertainty Weighting [205]   83.8
                 PCGrad [182]                  96.3
                 GradNorm [208]                90.4
                 DWA [183]                     94.7
                 PLASTA-MTL                    97.2
✗   ✓   ✗   ✓    Equal Weighting               87.7
                 Uncertainty Weighting [205]   92.0
                 PCGrad [182]                  92.1
                 GradNorm [208]                84.0
                 DWA [183]                     88.2
                 PLASTA-MTL                    93.4
✓   ✗   ✓   ✓    Equal Weighting               95.7
                 Uncertainty Weighting [205]   96.3
                 PCGrad [182]                  87.0
                 GradNorm [208]                92.6
                 DWA [183]                     94.7
                 PLASTA-MTL                    97.4
✗   ✓   ✓   ✓    Equal Weighting               80.4
                 Uncertainty Weighting [205]   90.7
                 PCGrad [182]                  85.5
                 GradNorm [208]                89.4
                 DWA [183]                     90.7
                 PLASTA-MTL                    93.8
✓   ✓   ✓   ✓    Equal Weighting               94.8
                 Uncertainty Weighting [205]   97.3
                 PCGrad [182]                  93.9
                 GradNorm [208]                95.0
                 DWA [183]                     95.2
                 PLASTA-MTL                    97.7

cars, grass, pavement and trees. The retrieval orders of images are given above the
figure. By assessing the figure, one can observe that the proposed PLASTA-MTL
approach leads to the retrieval of similar images at all retrieval orders (see Fig. 7.6g).
However, with the state-of-the-art MTL methods, the retrieved images contain classes
that are not present in the query image. As an example, the equal weighting and
the DWA methods lead to the retrieval of an image that includes only the field class at the

and 4

retrieval orders, respectively (see Fig. 7.6b and 7.6d). We observed the

similar behaviours of these methods for the BigEarthNet-S2 archive. We would like
to point out that these methods employ different gradient adjustment strategies for
overcoming the well-known problems of MTL. Accordingly, their success has been
proven for many MTL problems in the computer vision domain. However, since they do
not consider the stability-plasticity constraint of MTL and are still based on the
joint optimization algorithm, they are limited in their ability to solve all possible
problems of MTL under various task combinations for RS images. This leads to less accurate image


FIGURE 7.5: Normalized discounted cumulative gains (NDCG) versus the number of
retrieved images obtained for the BigEarthNet-S2 archive under five different task
combinations (panels (a) to (e)) used in the context of multi-task learning.

representations learned via these methods compared to the proposed PLASTA-MTL

approach. Accordingly, the image representations learned via our approach lead to

more effective CBIR results.
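To illustrate the NDCG metric used in Figures 7.4 and 7.5, a minimal sketch is given below. It follows the commonly used linear-gain formulation and is not necessarily identical to the definition adopted in this thesis. The graded relevance values (e.g., the number of land-cover classes shared with the query) and the normalization over the retrieved list only are assumptions made for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k retrieved images.

    relevances[i] is the graded relevance of the image at rank i + 1.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance scores of five retrieved images (assumption:
# number of classes each retrieved image shares with the query image).
retrieved = [3, 1, 2, 0, 1]
print(f"NDCG@5 = {ndcg_at_k(retrieved, 5):.3f}")
```

A ranking that places the most relevant images first yields a score of 1, while relevant images retrieved at late ranks are discounted logarithmically.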

7.6 Conclusion

In this chapter, we have proposed a novel plasticity-stability preserving multi-task
learning (PLASTA-MTL) approach for DL-based IRL. This approach is characterized
by a novel: i) plasticity preserving loss (PPL) function; ii) stability preserving loss
(SPL) function; and iii) sequential optimization algorithm. The PPL function allows
our approach to minimize the differences of gradient magnitudes for the global
representation space and each task-specific embedding space of the considered
DNN. The use of the SPL function in the proposed PLASTA-MTL approach leads
to the minimization of the angular distances between task gradients over the global image
representation space. The proposed optimization algorithm sequentially optimizes:
i) each task-specific objective with the corresponding PPL function; and ii) the SPL
function for all the considered tasks. Experiments conducted on two benchmark
archives show the effectiveness of the proposed PLASTA-MTL approach over
the state-of-the-art MTL methods in the context of CBIR. The main reasons for the
success of our approach are summarized as follows:

• Due to the proposed PPL function, the PLASTA-MTL approach enforces the
global image representation space to be sensitive to new information learned
with each task, which leads to the preservation of the plasticity condition for the
considered DNN.

• Due to the proposed SPL function, the PLASTA-MTL approach protects the
global image representation space from being radically disrupted by a new task,
which leads to the preservation of the stability condition for the considered DNN.


FIGURE 7.6: (a) Query image; and images retrieved by using (b) equal weighting;
(c) uncertainty weighting; (d) PCGrad; (e) GradNorm; (f) DWA; (g) the proposed
PLASTA-MTL approach when the tasks T1, T2, T3 and T4 are utilized for the DLRSD
archive. The retrieval orders shown above the figure are: 1st, 2nd, 3rd, 4th, 5th, 10th and 20th.

• Due to the proposed sequential optimization algorithm, the PLASTA-MTL
approach accurately characterizes: i) the plasticity condition for each task; and
ii) the stability condition between consecutive tasks.

• Due to the effective combination of multiple tasks, independently of the
number and type of tasks, while considering the stability-plasticity constraint of
MTL without the need for selecting loss weights, the PLASTA-MTL approach
prevents: i) conflicts between tasks; ii) the dominance of one of the tasks; and
iii) under-performance of tasks compared to STL. This leads to more accurate
image representation learning compared to utilizing a single task and the
conventional deep multi-task learning procedures.
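The two quantities named in this summary, namely the difference of gradient magnitudes (targeted by the PPL function) and the angular distance between task gradients (targeted by the SPL function), can be illustrated with a minimal sketch. It only computes these two quantities for plain gradient vectors and does not reproduce the PPL and SPL definitions of this chapter. All function names and the example gradients are hypothetical.

```python
import math
from typing import Sequence


def norm(vector: Sequence[float]) -> float:
    """Euclidean norm (magnitude) of a gradient vector."""
    return math.sqrt(sum(x * x for x in vector))


def magnitude_gap(grad_global: Sequence[float], grad_task: Sequence[float]) -> float:
    """Absolute difference between two gradient magnitudes."""
    return abs(norm(grad_global) - norm(grad_task))


def angular_distance(grad_a: Sequence[float], grad_b: Sequence[float]) -> float:
    """Angle (in radians) between two task gradients of the same length."""
    norm_a, norm_b = norm(grad_a), norm(grad_b)
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero gradient")
    cosine = sum(a * b for a, b in zip(grad_a, grad_b)) / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Hypothetical gradients of two tasks with respect to shared parameters.
grad_task_1 = [1.0, 0.0]
grad_task_2 = [0.0, 1.0]
print(f"angle between task gradients: {angular_distance(grad_task_1, grad_task_2):.4f} rad")
print(f"magnitude gap: {magnitude_gap([3.0, 4.0], [0.0, 1.0]):.1f}")
```

An angle close to zero indicates that two tasks update the shared representation in a similar direction, whereas an angle above 90 degrees indicates conflicting updates.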

It is worth noting that, in this chapter, we conducted experiments in the context of a
single RS application, CBIR, for the sake of simplicity. However, the global image


representation space learned via our approach can also be used for other applications,
since it applies image representation learning based on the information learned via
multiple tasks to represent the complex semantic content of RS images. We would also like
to point out that the set of all tasks is assumed to be known during the training
of our approach. However, the inclusion of new tasks in the set of considered tasks after
training the PLASTA-MTL approach can further improve the characterization of
RS image content. Accordingly, as a future development of this work, we plan to
study continual learning to add new tasks to the PLASTA-MTL approach after
completing the whole learning procedure, while preserving its plasticity and stability
capabilities for these tasks as well.


Chapter 8

Conclusion and Outlook

In this chapter, we conclude this thesis with: i) a summary of the presented
methodologies in Section 8.1; and ii) an overview of the possible research directions
complementary to the thesis in Section 8.2.

8.1 Conclusion

In this thesis, we have presented six novel contributions to the state of the art
in DL-based representation learning of RS images to foster automatic knowledge
discovery from massive EO archives in effective and efficient ways.

As the first main contribution of this thesis, in Chapter 2 we have proposed a
large-scale benchmark RS image archive (denoted as BigEarthNet) to address
the limitations of existing benchmark datasets, which mostly include single-modal
RS images (e.g., multispectral or SAR) and single-label image annotations with an
insufficient amount of training data for recent DNNs. BigEarthNet includes 590,326
RS image pairs acquired over 10 different European countries. Each pair is made
up of two image patches from the new generation satellites Sentinel-1 and Sentinel-2
acquired over the same geographical area, and is annotated with multiple land-cover
classes (i.e., multi-labels) from the CORINE Land Cover (CLC) database. We have
also proposed an alternative nomenclature for the characteristics of BigEarthNet
image pairs as an evolution of the original CLC labels. We would like to note that
BigEarthNet is a significant advancement for DL-based IRL in RS, as it fulfills
the requirement of training DNNs with a large number of annotated training images.
It also opens up promising research directions for DL-based IRL based on multiple
modalities. Our experimental analysis shows that IRL directly from BigEarthNet
provides a more accurate characterization of RS images compared to a transfer learning
strategy (e.g., utilizing DNN models pre-trained on ImageNet). Together with all the
BigEarthNet data, we have also made several DL models pre-trained on it publicly
available. This eliminates the limitations of using DNNs pre-trained on
general-purpose computer vision datasets for research studies on RS images.

The second main contribution of the thesis is our deep multi-attention driven
approach proposed for multi-label RS image scene classification problems in Chapter
3. The proposed approach is capable of efficiently and effectively describing the
spatial and spectral information content of high-dimensional and high spatial
resolution RS images based on three main steps: 1) spatial and spectral characterization of
image local areas through a novel K-Branch CNN; 2) definition of a multi-attention
driven global descriptor through a novel multi-attention strategy; and 3) classification
of RS image scenes with multi-labels. Due to the proposed K-Branch CNN of
the first step, our approach models the complex information content of RS images
for which the spectral bands can be associated with varying spatial resolutions, while
leading to a significant reduction in the computational complexity. Thanks to the
multi-attention strategy defined in the framework of RNNs, our approach accurately
identifies importance levels for different image local areas, and then defines image
representations based on these scores. Experimental results obtained on BigEarthNet
demonstrate that the proposed approach has a high potential for operational RS
scene classification scenarios, where EO data archives contain RS images with highly
complex spatial and spectral information content, as in new generation satellite image
archives such as Sentinel-2.

As the third main contribution of the thesis, in Chapter 4 a novel triplet selection
method has been proposed for DL-based IRL through the characterization of image
similarities for multi-label CBIR problems in RS. Our method selects a small set of the
most representative and informative triplets by evaluating the relevancy, hardness
and diversity of multi-label RS images. With those image triplets, a metric space,
where semantically similar images are located close to each other, can be modeled
through the triplet loss to perform CBIR in large-scale RS image archives. The selection
of a compact subset of informative and representative triplets in our method enables
effective learning of a metric space on DNNs for accurate multi-label CBIR in RS,
while reducing the total number of triplets and increasing the learning efficiency
in terms of the convergence speed. The experimental analysis in Chapter 4 confirms
that our triplet selection method is much more suitable for operational
CBIR applications than well-known methods, as it significantly reduces
the computational complexity of training DNNs without sacrificing CBIR
performance.
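As a reminder of how a triplet loss shapes such a metric space, a minimal sketch of the standard margin-based form is given below. It is the commonly used formulation with squared Euclidean distances, it is not the triplet selection method proposed in Chapter 4, and the margin value and the example embeddings are hypothetical.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Margin-based triplet loss.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin (assumed value, for illustration).
    """
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# Hypothetical 2-D embeddings of an anchor, a similar and a dissimilar image.
anchor, positive, negative = [0.0, 0.0], [2.0, 0.0], [1.0, 0.0]
print(f"loss of a violated triplet: {triplet_loss(anchor, positive, negative):.1f}")
```

Triplets whose loss is already zero do not contribute to training, which is one reason why selecting a compact set of informative triplets can reduce the training cost.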

The fourth main contribution of the thesis is our SCI-CBIR approach
proposed in Chapter 5 to simultaneously characterize image representations through
hash codes and achieve image compression, and thus eliminate the need for
decompressing RS images prior to CBIR. Our SCI-CBIR approach employs: i) first, DL-based
compression through an encoder-decoder DNN and a probabilistic entropy model;
and ii) then, deep hashing-based indexing through pairwise, bit-balancing and
classification loss functions based on the encoded RS image representations. The novel
multi-stage learning procedure for the training of SCI-CBIR allows image features to be
effectively characterized for both image indexing and compression by automatically
weighting different loss functions and rate-distortion trade-off points. We would like
to emphasize that, due to the proposed approach, the need for decompressing images
prior to indexing is fully eliminated, unlike in the existing CBIR approaches in RS. This
can save a significant amount of time for large-scale CBIR applications on massive
RS image archives, as demonstrated by the experimental results on benchmark
archives.
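To indicate why hash codes make retrieval fast, a minimal sketch of retrieval by Hamming distance over binary codes is given below. It shows only the generic ranking step on already computed codes and does not reproduce the SCI-CBIR approach. The code length, the archive entries and the function names are hypothetical.

```python
from typing import Dict, List


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two hash codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code: int, archive: Dict[str, int]) -> List[str]:
    """Return archive image names sorted by Hamming distance to the query."""
    return sorted(archive, key=lambda name: hamming_distance(query_code, archive[name]))


# Hypothetical 4-bit hash codes of three archive images.
archive_codes = {"image_a": 0b1010, "image_b": 0b0101, "image_c": 0b1000}
print(rank_by_hamming(0b1010, archive_codes))
```

Since the comparison reduces to an XOR and a bit count per archive image, ranking with binary codes scales well to large archives.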

As the fifth main contribution of the thesis, in Chapter 6 we have introduced the
GRID approach to accurately learn RS image representations when training images
are associated with noisy labels. The proposed approach models the complementary
characteristics of discriminative and generative reasoning for IRL under noisy labels
by integrating generative reasoning into discriminative reasoning through a
supervised variational autoencoder. Due to its label noise robust hybrid representation
learning strategy (which employs generative reasoning for the training samples with
noisy labels, and discriminative reasoning for the remaining samples in the training
data), our approach allows discriminative RS image representations to be learned, while
preventing overfitting to noisy labels during training. GRID does not depend
on the type of annotation, the label noise present in the training data, the DNN architecture,
the loss function or the learning task, and can operate with any DL-based IRL method. In
greater detail, unlike the existing methods, GRID requires neither a small clean subset
(training samples with clean labels) of a training set nor a computationally demanding
noise correction strategy prior to training. The experiments conducted for three
different learning tasks with the corresponding loss functions and DNN architectures
at different synthetic label noise injection rates show the success of GRID
independently of the IRL method being considered. This can be a very important advantage
for operational RS applications, which deal with noisy annotations in training data
and are required to perform under different IRL scenarios.

The sixth main contribution of the thesis is our PLASTA-MTL approach
introduced in Chapter 7 for effectively learning RS image representations when
multiple learning tasks are involved in IRL. Our approach: 1) adaptively adjusts the
interactions between task-specific learning procedures by the proposed sequential
optimization algorithm; 2) protects the image representation space from radical
disruptions caused by each task through the proposed stability preserving loss function;
and 3) ensures the sensitivity of the image representation space to new information
from each task through the proposed plasticity preserving loss function. Due to its
stability and plasticity preserving capabilities, our PLASTA-MTL approach overcomes
the well-known multi-task learning problems, which are mainly conflicts between
tasks, the dominance of one of the tasks and under-performance of tasks compared
to single-task learning. Consequently, PLASTA-MTL is capable of learning an RS
image latent space that can better represent the complex semantic content of RS
images compared to IRL under a single learning task. Extensive experimental analysis
conducted for different combinations of four learning tasks confirms the potential
of our approach to describe the complex content of RS images by using multiple
learning tasks. This carries a huge potential for EO applications, which require
modelling the complex patterns of RS image semantics on a large scale.

In conclusion, this thesis tackles several challenges of learning RS image
representations that have emerged in recent years, which have witnessed the wide use of DNNs for
a wide range of research problems. We hope that this thesis can be regarded as an
important step for DL-based automatic knowledge discovery on massive RS image
archives by considering its: i) novel methodologies; ii) BigEarthNet benchmark
archive; iii) theoretical and experimental analyses of the proposed methodologies;
and iv) the public availability of research outcomes. We would like to note that while
our work provides new insights, it also leads to new research questions waiting to be
addressed. In the following section, we discuss two main directions of research as
future developments of this thesis.


8.2 Future Research Directions

As highlighted in Chapters 1 and 2, the availability of multi-source/multi-modal
RS images (e.g., multispectral, hyperspectral, SAR, etc.) associated with the same
geographical area allows for a rich characterization of the Earth's surface and thus for learning
more accurate image representations with DNNs when different data modalities
are jointly considered in a convenient way. To this end, BigEarthNet has been
proposed to contribute to the development of unsupervised, self-supervised and
semi-supervised multi-modal IRL methods for information discovery from big data
archives. However, the development of efficient and effective IRL methods, which
employ information from multiple RS image sensors for learning joint feature
representations among different data modalities, has not been addressed in this thesis.
To pave the way in this direction, in [80], we introduce a novel self-supervised
method designed only for cross-modal CBIR problems on multi-modal RS image
archives. This method is capable of simultaneously preserving intra- and inter-modal
similarities and eliminating inter-modal discrepancies without requiring annotated
training images. This is achieved by considering multi-modal RS images as multiple
views of the same geographical area, which allows IRL in an unsupervised way by
maximizing agreement between the multiple views of a shared context [211]. This
self-supervised strategy can be extended to utilize publicly available multi-sensor
RS images of next-generation Earth observation missions (e.g., Sentinels) on a large
scale for the joint use of RS image representations in atmospheric, oceanic, and land
monitoring. To support this research direction, we plan to enrich the BigEarthNet
archive by extending it to the whole of Europe at zero annotation cost, as the CORINE
Land Cover database is publicly available for all European countries. In addition,
BigEarthNet can also be easily extended to a world scale by enriching it with
Sentinel-1 and Sentinel-2 image patches without annotations or by combining it with
other multi-modal benchmark archives. In parallel, we also plan to develop
self/semi-supervised IRL methodologies that are capable of learning joint RS
image representations on such multi-sensor data for large-scale knowledge discovery
in EO applications through cross/multi-modal image classification and retrieval.

Throughout this thesis, we assume that the training stages of the proposed methods
are performed with pre-defined training sets, while full access to training data is
guaranteed. However, RS image archives of some data providers (e.g., commercial
providers) may not be accessible during training due to commercial concerns and
legal regulations, or it may not be feasible to gather all training data on a centralized
server due to data storage limitations. To address this issue, as a future development
of this thesis, we plan to make the proposed methodologies compatible with federated
and distributed learning of image representations to learn the model parameters
of DNNs on distributed servers without full access to the training data of some data
providers. It may also happen that changes on the ground make it necessary to
re-learn or to update already learned IRL models with new RS images. Re-training
DNNs from scratch with updated training data may not always be feasible
due to the excessive growth of RS image archives. Updating DNNs by
fine-tuning with only more recent RS images may lead to inaccurate learning of RS image
representations, as new training data may require the removal of previous knowledge
encoded by DNNs. To address this issue, as another extension of this thesis, we plan
to investigate continual life-long learning of RS image representations in effective and
efficient ways to extract information from dynamically growing RS image archives.


Bibliography

[1] A. G. Castriotta, “Copernicus sentinel data access annual report,” European Space Agency, Tech. Rep., 2021. [Online]. Available: https://sentinels.copernicus.eu/web/sentinel/-/copernicus-sentinel-data-access-annual-report-2021.

[2] C. Persello, J. D. Wegner, R. Hansch, D. Tuia, P. Ghamisi, M. Koeva, and G. Camps-Valls, “Deep learning and earth observation to support the sustainable development goals: Current approaches, open challenges, and future opportunities,” IEEE Geoscience and Remote Sensing Magazine, pp. 2–30, 2022. DOI: 10.1109/MGRS.2021.3136100.

[3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. DOI: 10.1109/TPAMI.2013.50.

[4] M. Guo, C. Zhou, and J. Liu, “Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 11, pp. 4644–4654, 2019.

[5] F. Ye, W. Luo, M. Dong, D. Li, and W. Min, “Content-Based Remote Sensing Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance,” IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2020. DOI: 10.1109/LGRS.2020.3030858.

[6] F. Ye, M. Dong, W. Luo, X. Chen, and W. Min, “A New Re-Ranking Method Based on Convolutional Neural Network and Two Image-to-Class Distances for Remote Sensing Image Retrieval,” IEEE Access, vol. 7, pp. 141498–141507, 2019.

[7] F. Ye, X. Zhao, W. Luo, D. Li, and W. Min, “Query-Adaptive Remote Sensing Image Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity,” IEEE Access, vol. 8, pp. 116824–116839, 2020.

[8] C. Liu, J. Ma, X. Tang, F. Liu, X. Zhang, and L. Jiao, “Deep Hash Learning for Remote Sensing Image Retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 3420–3443, 2021.

[9] R. Imbriaco, C. Sebastian, E. Bondarev, and P. H. N. de With, “Aggregated deep local features for remote sensing image retrieval,” Remote Sensing, vol. 11, no. 5, p. 493, 2019. DOI: 10.3390/rs11050493.

[10] Y. Boualleg and M. Farah, “Enhanced Interactive Remote Sensing Image Retrieval with Scene Classification Convolutional Neural Networks Model,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2018, pp. 4748–4751.

[11] W. Zhou, S. Newsam, C. Li, and Z. Shao, “Learning Low Dimensional Convolutional Neural Networks for High-Resolution Remote Sensing Image Retrieval,” Remote Sensing, vol. 9, no. 5, p. 489, 2017.


[12] F. Hu, X. Tong, G. Xia, and L. Zhang, “Delving into deep representations for remote sensing image retrieval,” in International Conference on Signal Processing, 2016, pp. 198–203.

[13] F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, “Remote Sensing Image Retrieval Using Convolutional Neural Network Features and Weighted Distance,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 10, pp. 1535–1539, 2018.

[14] C. Ma, F. Chen, J. Yang, J. Liu, W. Xia, and X. Li, “A remote-sensing image-retrieval model based on an ensemble neural networks,” Big Earth Data, vol. 2, no. 4, pp. 351–367, 2018.

[15] R. Cao, Q. Zhang, J. Zhu, Q. Li, Q. Li, B. Liu, and G. Qiu, “Enhancing remote sensing image retrieval using a triplet deep metric learning network,” International Journal of Remote Sensing, vol. 41, no. 2, pp. 740–751, 2020.

[16] L. Fan, H. Zhao, and H. Zhao, “Distribution Consistency Loss for Large-Scale Remote Sensing Image Retrieval,” Remote Sensing, vol. 12, no. 1, p. 175, 2020.

[17] U. Chaudhuri, B. Banerjee, and A. Bhattacharya, “Siamese graph convolutional network for content based remote sensing image retrieval,” Computer Vision and Image Understanding, vol. 184, pp. 22–30, 2019.

[18] U. Chaudhuri, B. Banerjee, A. Bhattacharya, and M. Datcu, “A Zero-Shot Sketch-Based Intermodal Object Retrieval Scheme for Remote Sensing Images,” IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2021. DOI: 10.1109/LGRS.2021.3056392.

[19] M. Zhang, Q. Cheng, F. Luo, and L. Ye, “A triplet nonlocal neural network with dual-anchor triplet loss for high-resolution remote sensing image retrieval,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 2711–2723, 2021.

[20] Y. Li, Y. Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6521–6536, 2018. DOI: 10.1109/TGRS.2018.2839705.

[21] Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma, “Large-scale remote sensing image retrieval by deep hashing neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, pp. 950–965, 2018.

[22] P. Li, L. Han, X. Tao, X. Zhang, C. Grecos, A. Plaza, and P. Ren, “Hashing Nets for Hashing: A Quantized Deep Learning to Hash Framework for Remote Sensing Image Retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 10, pp. 7331–7345, 2020.

[23] W. Xiong, Z. Xiong, Y. Zhang, Y. Cui, and X. Gu, “A Deep Cross-Modality Hashing Network for SAR and Optical Remote Sensing Images Retrieval,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 5284–5296, 2020.

[24] Y. Cao, Y. Wang, J. Peng, L. Zhang, L. Xu, K. Yan, and L. Li, “DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8888–8904, 2020.

[25]

L. Fan, H. Zhao, and H. Zhao, “Global Optimization: Combining Local Loss

With Result Ranking Loss in Remote Sensing Image Retrieval,” IEEE Transactions

on Geoscience and Remote Sensing, vol. 59, no. 8, pp. 7011–7026, 2021. DOI:10.1109/TGRS.2020.3029334.

[26]

Y. Liu, L. Ding, C. Chen, and Y. Liu, “Similarity-Based Unsupervised Deep

Transfer Learning for Remote Sensing Image Retrieval,” IEEE Transactions on

Geoscience and Remote Sensing, vol. 58, no. 11, pp. 7872–7889, 2020.

[27]

S. Roy, E. Sangineto, B. Demir, and N. Sebe, “Metric-learning-based deep

hashing network for content-based retrieval of remote sensing images,” IEEE

Geoscience and Remote Sensing Letters, vol. 18, no. 2, pp. 226–230, 2021. DOI:10.1109/LGRS.2020.2974629.

[28]

X. Tang, X. Zhang, F. Liu, and L. Jiao, “Unsupervised deep feature learning for

remote sensing image retrieval,” Remote Sensing, vol. 10, no. 8, p. 1243, 2018.

DOI:10.3390/rs10081243.

[29]

N. Khurshid, M. Tharani, M. Taj, and F. Z. Qureshi, “A Residual-Dyad En-

coder Discriminator Network for Remote Sensing Image Matching,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 58, no. 3, pp. 2001–2014,

2020.

[30]

Z. Shao, W. Zhou, X. Deng, M. Zhang, and Q. Cheng, “Multilabel Remote

Sensing Image Retrieval Based on Fully Convolutional Network,” IEEE Journal

of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13,

pp. 318–328, 2020.

[31]

R. Dong, W. Fang, H. Fu, L. Gan, J. Wang, and P. Gong, “High-resolution land

cover mapping through learning with noise correction,” IEEE Transactions

on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022. DOI:10.1109/TGRS.2021.3068280.

[32]

G. Hoxha, F. Melgani, and B. Demir, “Toward Remote Sensing Image Retrieval

Under a Deep Image Captioning Perspective,” IEEE Journal of Selected Topics

in Applied Earth Observations and Remote Sensing, vol. 13, pp. 4462–4475, 2020.

[33]

G. Hoxha, S. Chouaf, F. Melgani, and Y. Smara, “Change captioning: A new

paradigm for multitemporal remote sensing image analysis,” IEEE Transactions

on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022. DOI:10.1109/TGRS.2022.3195692.

[34]

G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learning meets

metric learning: Remote sensing image scene classification via learning discriminative

CNNs,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56,

no. 5, pp. 2811–2821, 2018.

[35]

Y. Hua, L. Mou, and X. X. Zhu, “Recurrently exploring class-wise attention in

a hybrid convolutional and bidirectional LSTM network for multi-label aerial

image classification,” ISPRS Journal of Photogrammetry and Remote Sensing,

vol. 149, pp. 188–199, 2019.

[36]

Y. Hua, L. Mou, and X. X. Zhu, “Relation network for multilabel aerial image

classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58,

no. 7, pp. 4558–4572, 2020. DOI:10.1109/TGRS.2019.2963364.

[37]

A. Alshehri, Y. Bazi, N. Ammour, H. Almubarak, and N. Alajlan, “Deep

attention neural network for multi-label classification in unmanned aerial

vehicle imagery,” IEEE Access, vol. 7, pp. 119873–119880, 2019.

[38]

F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding

for face recognition and clustering,” IEEE Conference on Computer Vision and

Pattern Recognition, pp. 815–823, 2015. DOI:10.1109/CVPR.2015.7298682.

[39]

C. Zhou, L. Po, W. Y. F. Yuen, K. W. Cheung, X. Xu, K. W. Lau, Y. Zhao, M. Liu,

and P. H. W. Wong, “Angular deep supervised hashing for image retrieval,”

IEEE Access, vol. 7, pp. 127521–127532, 2019. DOI:10.1109/ACCESS.2019.2939650.

[40]

X. Yang, P. Zhou, and M. Wang, “Person reidentification via structural deep

metric learning,” IEEE Transactions on Neural Networks and Learning Systems,

vol. 30, no. 10, pp. 2987–2998, 2019. DOI:10.1109/TNNLS.2018.2861991.

[41]

Z. Li, J. Tang, L. Zhang, and J. Yang, “Weakly-supervised semantic guided

hashing for social image retrieval,” International Journal of Computer Vision,

vol. 128, no. 8–9, pp. 2265–2278, 2020.

[42]

W. Song, S. Li, and J. A. Benediktsson, “Deep hashing learning for visual and

semantic retrieval of remote sensing images,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 59, pp. 9661–9672, 2021.

[43]

X. Tang, C. Liu, X. Zhang, J. Ma, C. Jiao, and L. Jiao, “Remote sensing image

retrieval based on semi-supervised deep hashing learning,” in Proceedings of

the IEEE International Geoscience and Remote Sensing Symposium, 2019, pp. 879–

882. DOI:10.1109/IGARSS.2019.8898676.

[44]

C. Liu, J. Ma, X. Tang, X. Zhang, and L. Jiao, “Adversarial hash-code learning

for remote sensing image retrieval,” in Proceedings of the IEEE International

Geoscience and Remote Sensing Symposium, 2019, pp. 4324–4327. DOI:10.1109/IGARSS.2019.8900431.

[45]

X. Tang, Y. Yang, J. Ma, Y. M. Cheung, C. Liu, F. Liu, X. Zhang, and L. Jiao,

“Meta-hashing for remote sensing image retrieval,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 60, pp. 1–19, 2022. DOI:10.1109/TGRS.2021.3136159.

[46]

W. Song, Z. Gao, R. Dian, P. Ghamisi, Y. Zhang, and J. A. Benediktsson,

“Asymmetric hash code learning for remote sensing image retrieval,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022. DOI:10.1109/TGRS.2022.3143571.

[47]

H. Kramer, Observation of the Earth and Its Environment: Survey of Missions and

Sensors. Springer Berlin Heidelberg, 2019.

[48]

D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang,

“More diverse means better: Multimodal deep learning meets remote-sensing

imagery classification,” IEEE Transactions on Geoscience and Remote Sensing,

vol. 59, no. 5, pp. 4340–4354, 2021.

[49]

J. Feranec, T. Soukup, G. Hazeu, and G. Jaffrain, European Landscape Dynamics:

CORINE Land Cover Data. CRC Press, 2016.

[50]

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep

learning requires rethinking generalization,” in Proceedings of the International

Conference on Learning Representations, 2017.

[51]

H. Song, M. Kim, D. Park, Y. Shin, and J. G. Lee, “Learning from noisy labels

with deep neural networks: A survey,” IEEE Transactions on Neural Networks

and Learning Systems, 2022. DOI:10.1109/TNNLS.2022.3152527.

[52]

Y. Liu, Z. Han, C. Chen, L. Ding, and Y. Liu, “Eagle-eyed multitask CNNs for

aerial image retrieval and scene classification,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 58, no. 9, pp. 6699–6721, 2020. DOI:10.1109/TGRS.2020.2979011.

[53]

J. Fang, X. Cao, D. Wang, and S. Xu, “Multitask Learning Mechanism for

Remote Sensing Image Motion Deblurring,” IEEE Journal of Selected Topics in

Applied Earth Observations and Remote Sensing, vol. 14, pp. 2184–2193, 2021.

[54]

F. Chen and B. Yu, “Earthquake-Induced Building Damage Mapping Based

on Multi-Task Deep Learning Framework,” IEEE Access, vol. 7, pp. 181396–

181404, 2019.

[55]

R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Multitask learn-

ing for large-scale semantic change detection,” Computer Vision and Image

Understanding, vol. 187, p. 102783, 2019.

[56]

H. Wang, Z. Zhou, H. Zong, and L. Miao, “Wide-Context Attention Network

for Remote Sensing Image Retrieval,” IEEE Geoscience and Remote Sensing

Letters, pp. 1–5, 2020.

[57]

W. Xiong, Z. Xiong, Y. Cui, and Y. Lv, “A Discriminative Distillation Network

for Cross-Source Remote Sensing Image Retrieval,” IEEE Journal of Selected

Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 1234–1247,

2020.

[58]

S. Vandenhende, S. Georgoulis, W. V. Gansbeke, M. Proesmans, D. Dai, and

L. V. Gool, “Multi-Task Learning for Dense Prediction Tasks: A Survey,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.

[59]

X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu, “A modulation module for

multi-task learning with applications in image retrieval,” in Proceedings of the

European Conference on Computer Vision, 2018, pp. 415–432.

[60]

R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in

cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.

[61]

O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,”

in Proceedings of the Advances in Neural Information Processing Systems, 2018,

pp. 525–536.

[62]

G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-label

remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946,

2020. DOI:10.1109/ACCESS.2020.2995805.

[63]

G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides,

M. Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A large scale multi-modal

multi-label benchmark archive for remote sensing image classification

and retrieval,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 3,

pp. 174–180, 2021. DOI:10.1109/MGRS.2021.3089174.

[64]

G. Sumbul, M. Ravanbakhsh, and B. Demir, “Informative and representa-

tive triplet selection for multilabel remote sensing image retrieval,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. DOI:10.1109/TGRS.2021.3124326.

[65]

G. Sumbul and B. Demir, “Plasticity-stability preserving multi-task learning

for remote sensing image retrieval,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 60, pp. 1–16, 2022. DOI:10.1109/TGRS.2022.3160097.

[66]

G. Sumbul, J. Xiang, and B. Demir, “Towards simultaneous image compression

and indexing for scalable content-based retrieval in remote sensing,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022. DOI:10.1109/TGRS.2022.3204914.

[67]

G. Sumbul and B. Demir, “Generative reasoning integrated label noise robust

deep image representation learning,” IEEE Transactions on Image Processing,

2023. DOI:10.1109/TIP.2023.3293776.

[68]

G. Sumbul, J. Kang, and B. Demir, “Deep learning for image search and re-

trieval in large remote sensing archives,” in Deep Learning for the Earth Sciences:

A comprehensive approach to remote sensing, climate science and geosciences, Hoboken,

NJ, USA: Wiley, 2021, ch. 11, pp. 150–160. DOI:10.1002/9781119646181.ch11.

[69]

G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale

benchmark archive for remote sensing image understanding,” in Proceedings

of the IEEE International Geoscience and Remote Sensing Symposium, 2019,

pp. 5901–5904. DOI:10.1109/IGARSS.2019.8900532.

[70]

G. Sumbul and B. Demir, “A novel multi-attention driven system for multi-label

remote sensing image classification,” in Proceedings of the IEEE International

Geoscience and Remote Sensing Symposium, 2019, pp. 5726–5729. DOI:10.1109/IGARSS.2019.8898188.

[71]

G. Sumbul, M. Ravanbakhsh, and B. Demir, “A relevant, hard and diverse

triplet sampling method for multi-label remote sensing image retrieval,” in

Proceedings of the IEEE Mediterranean and Middle-East Geoscience and Remote Sens-

ing Symposium, 2022, pp. 5–8. DOI:10.1109/M2GARSS52314.2022.9839759.

[72]

G. Sumbul, J. Xiang, N. T. Madam, and B. Demir, “A novel framework to

jointly compress and index remote sensing images for efficient content-based

retrieval,” in Proceedings of the IEEE International Geoscience and Remote Sensing

Symposium, 2022, pp. 251–254. DOI:10.1109/IGARSS46834.2022.9884146.

[73]

G. Sumbul and B. Demir, “Label noise robust image representation learning

based on supervised variational autoencoders in remote sensing,” in Pro-

ceedings of the IEEE International Geoscience and Remote Sensing Symposium,

2023.

[74]

A. Preethy Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Remote-sensing image

scene classification with deep neural networks in JPEG 2000 compressed

domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4,

pp. 3458–3472, 2021. DOI:10.1109/TGRS.2020.3007523.

[75]

G. Sumbul, S. Nayak, and B. Demir, “SD-RSIC: Summarization-driven deep

remote sensing image captioning,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 59, no. 8, pp. 6922–6934, 2021. DOI:10.1109/TGRS.2020.3031111.

[76]

A. P. Byju, G. Sumbul, B. Demir, and L. Bruzzone, “Approximating JPEG 2000

wavelet representation through deep neural networks for remote sensing

image scene classification,” in Proceedings of the Image and Signal Processing

for Remote Sensing Conference, vol. 11155, 2019, 111550S. DOI:10.1117/12.2534643.

[77]

K. Zhang, G. Sumbul, and B. Demir, “An approach to super-resolution of

sentinel-2 images based on generative adversarial networks,” in Proceedings of

the IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Sympo-

sium, 2020, pp. 69–72. DOI:10.1109/M2GARSS47143.2020.9105165.

[78]

H. Yessou, G. Sumbul, and B. Demir, “A comparative study of deep learning

loss functions for multi-label remote sensing image classification,” in Proceedings

of the IEEE International Geoscience and Remote Sensing Symposium, 2020,

pp. 1349–1352. DOI:10.1109/IGARSS39084.2020.9323583.

[79]

G. Sumbul and B. Demir, “A novel graph-theoretic deep representation learn-

ing method for multi-label remote sensing image retrieval,” in Proceedings of

the IEEE International Geoscience and Remote Sensing Symposium, 2021, pp. 266–

269. DOI:10.1109/IGARSS47720.2021.9554466.

[80]

G. Sumbul, M. Müller, and B. Demir, “A novel self-supervised cross-modal

image retrieval method in remote sensing,” in Proceedings of the IEEE International

Conference on Image Processing, 2022, pp. 2426–2430. DOI:10.1109/ICIP46576.2022.9897475.

[81]

A. Zell, G. Sumbul, and B. Demir, “Deep metric learning-based semi-supervised

regression with alternate learning,” in Proceedings of the IEEE International Conference

on Image Processing, 2022, pp. 2411–2415. DOI:10.1109/ICIP46576.2022.9897939.

[82]

B. Büyüktas, G. Sumbul, and B. Demir, “Learning across decentralized multi-

modal remote sensing archives with federated learning,” in Proceedings of the

IEEE International Geoscience and Remote Sensing Symposium, 2023.

[83]

J. Henkel, G. Hoxha, G. Sumbul, L. Möllenbrok, and B. Demir, “Annotation

cost efficient active learning for remote sensing image retrieval,” in Proceedings

of the IEEE International Geoscience and Remote Sensing Symposium, 2023.

[84]

Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use

classification,” in Proceedings of the International Conference on Advances in

Geographic Information Systems, 2010, pp. 270–279.

[85]

W. Shao, W. Yang, and G. S. Xia, “Extreme value theory-based calibration for

the fusion of multiple features in high-resolution satellite scene classification,”

International Journal of Remote Sensing, vol. 34, no. 23, pp. 8588–8602, 2013.

[86]

Q. Zou, L. Ni, T. Zhang, and Q. Wang, “Deep learning based feature selection

for remote sensing scene classification,” IEEE Geoscience and Remote Sensing

Letters, vol. 12, no. 11, pp. 2321–2325, 2015.

[87]

B. Zhao, Y. Zhong, G. Xia, and L. Zhang, “Dirichlet-derived multiple topic

scene classification model for high spatial resolution remote sensing imagery,”

IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 2108–2123,

2016.

[88]

G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A

benchmark data set for performance evaluation of aerial scene classification,”

IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981,

2017.

[89]

G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification:

Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10,

pp. 1865–1883, 2017.

[90]

H. Li, X. Dou, C. Tao, Z. Wu, J. Chen, J. Peng, M. Deng, and L. Zhao, “RSI-CB:

A large-scale remote sensing image classification benchmark using crowd-sourced

data,” Sensors, vol. 20, no. 6, 2020.

[91]

P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset

and deep learning benchmark for land use and land cover classification,”

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing,

vol. 12, no. 7, pp. 2217–2226, 2019. DOI:10.1109/JSTARS.2019.2918242.

[92]

W. Zhou, S. Newsam, C. Li, and Z. Shao, “Patternnet: A benchmark dataset

for performance evaluation of remote sensing image retrieval,” ISPRS Journal

of Photogrammetry and Remote Sensing, vol. 145, pp. 197–209, 2018.

[93]

B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabel remote

sensing image retrieval using a semisupervised graph-theoretic method,”

IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1144–

1158, 2018.

[94]

L. Zhao, P. Tang, and L. Huo, “Feature significance-based multibag-of-visual-words

model for remote sensing image scene classification,” Journal of Applied

Remote Sensing, vol. 10, no. 3, pp. 1–21, 2016.

[95]

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-

nition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2016, pp. 770–778.

[96]

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-

scale image recognition,” International Conference on Learning Representations,

2015.

[97]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A.

Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet

Large Scale Visual Recognition Challenge,” International Journal of Computer

Vision, vol. 115, no. 3, pp. 211–252, 2015. DOI:10.1007/s11263-015-0816-y.

[98]

G. Jaffrain, C. Sannier, A. Pennec, and H. Dufourmont, “Corine land cover

2012 - final validation report,” European Environment Agency, Tech. Rep.,

2017. [Online]. Available: https://land.copernicus.eu/user-corner/technical-library/clc-2012-validation-report-1.

[99]

C. Paris, L. Bruzzone, and D. Fernández-Prieto, “A novel approach to the

unsupervised update of land-cover maps by classification of time series of

multispectral images,” IEEE Transactions on Geoscience and Remote Sensing,

vol. 57, no. 7, pp. 4259–4277, 2019, ISSN: 1558-0644. DOI:10.1109/TGRS.2018.2890404.

[100]

S. Arnold, B. Kosztra, G. Banko, G. Smith, G. Hazeu, M. Bock, and N. Valcarcel

Sanz, “The eagle concept—a vision of a future european land monitoring

framework,” in Proceedings of the EARSeL Symposium towards Horizon, vol. 2020,

2013, pp. 551–568.

[101]

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in

Proceedings of the International Conference on Learning Representations, 2014,

pp. 1–41.

[102]

M. Tao, J. Su, Y. Huang, and L. Wang, “Mitigation of radio frequency inter-

ference in synthetic aperture radar data: Current status and future trends,”

Remote Sensing, vol. 11, no. 20, p. 2438, 2019.

[103]

F. Zhang, B. Du, and L. Zhang, “Scene classification via a gradient boosting

random convolutional network framework,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 54, no. 3, pp. 1793–1802, 2016.

[104]

K. Nogueira, O. A. B. Penatti, and J. A. Santos, “Towards better exploiting

convolutional neural networks for remote sensing scene classification,” Pattern

Recognition, vol. 61, pp. 539–556, 2017.

[105]

G. Sumbul, R. G. Cinbis, and S. Aksoy, “Multisource region attention network

for fine-grained object recognition in remote sensing imagery,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4929–4937,

2019.

[106]

S. Roy, E. Sangineto, N. Sebe, and B. Demir, “Semantic-fusion GANs for semi-supervised

satellite image classification,” in Proceedings of the International

Conference on Image Processing, 2018, pp. 684–688.

[107]

X. Lu, H. Sun, and X. Zheng, “A feature aggregation convolutional neural network

for remote sensing scene classification,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 57, no. 10, pp. 7894–7906, 2019.

[108]

J. Xie, N. He, L. Fang, and A. Plaza, “Scale-free convolutional neural network

for remote sensing scene classification,” IEEE Transactions on Geoscience and

Remote Sensing, vol. 57, no. 9, pp. 6916–6928, 2019.

[109]

A. Zeggada, F. Melgani, and Y. Bazi, “A deep learning approach to UAV

image multilabeling,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5,

pp. 694–698, 2017.

[110]

S. Koda, A. Zeggada, F. Melgani, and R. Nishii, “Spatial and structured SVM

for multilabel image classification,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 56, no. 10, pp. 5948–5960, 2018.

[111]

R. Stivaktakis, G. Tsagkatakis, and P. Tsakalides, “Deep learning for multilabel

land cover scene categorization using data augmentation,” IEEE Geoscience

and Remote Sensing Letters, vol. 16, no. 7, pp. 1031–1035, 2019.

[112]

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu-

tation, vol. 9, no. 8, pp. 1735–1780, 1997.

[113]

F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual

prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471,

2000.

[114]

S. Masum, J. P. Chiverton, Y. Liu, B. Vuksanovic, and M. Petridis, “Investi-

gation of machine learning techniques in forecasting of blood pressure time

series data,” in Proceedings of the International Conference on Innovative Tech-

niques and Applications of Artificial Intelligence, 2019, pp. 269–282.

[115]

T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-

based neural machine translation,” in Proceedings of the Conference on Empirical

Methods in Natural Language Processing, 2015, pp. 1412–1421.

[116]

Z. A. Daniels and D. N. Metaxas, “Addressing imbalance in multi-label classification

using structured Hellinger forests,” in Proceedings of the AAAI Conference

on Artificial Intelligence, 2017, pp. 1826–1832.

[117]

D. Mishkin, N. Sergievskiy, and J. Matas, “Systematic evaluation of CNN

advances on the ImageNet,” Computer Vision and Image Understanding, vol. 161,

no. C, pp. 11–19, 2017.

[118]

X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward

neural networks,” in Proceedings of the International Conference on

Artificial Intelligence and Statistics, 2010, pp. 249–256.

[119]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,

“Dropout: A simple way to prevent neural networks from overfitting,” Journal

of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[120]

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network

training by reducing internal covariate shift,” in Proceedings of the International

Conference on Machine Learning, 2015, pp. 448–456.

[121]

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-

scale image recognition,” in Proceedings of the International Conference on Learn-

ing Representations, 2015.

[122]

M. Zhang and Z. Zhou, “A review on multi-label learning algorithms,” IEEE

Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819–1837,

2014.

[123]

R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-

Wesley, 2011, pp. 327–328.

[124]

G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” International

Journal of Data Warehousing and Mining, vol. 3, no. 3, pp. 1–13, 2007.

[125]

G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data

Mining and Knowledge Discovery Handbook, Springer, 2010, pp. 667–685.

[126]

M. Ahmed, “Data summarization: A survey,” Knowledge and Information Sys-

tems, vol. 58, no. 2, pp. 249–273, 2019.

[127]

Y. Yang and S. Newsam, “Geographic image retrieval using local invariant

features,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 2,

pp. 818–832, 2013. DOI:10.1109/TGRS.2012.2205158.

[128]

E. Aptoula, “Remote sensing image retrieval with global morphological tex-

ture descriptors,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52,

no. 5, pp. 3023–3034, 2014. DOI:10.1109/TGRS.2013.2268736.

[129]

I. Tekeste and B. Demir, “Advanced local binary patterns for remote sensing

image retrieval,” IEEE International Geoscience and Remote Sensing Symposium,

pp. 6855–6858, 2018. DOI:10.1109/IGARSS.2018.8518856.

[130]

B. Demir and L. Bruzzone, “Hashing-based scalable remote sensing image

search and retrieval in large archives,” IEEE Transactions on Geoscience and

Remote Sensing, vol. 54, no. 2, pp. 892–904, 2016. DOI:10.1109/TGRS.2015.2469138.

[131]

B. Chaudhuri, B. Demir, L. Bruzzone, and S. Chaudhuri, “Region-based re-

trieval of remote sensing images using an unsupervised graph-theoretic ap-

proach,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 7, pp. 987–991,

2016. DOI:10.1109/LGRS.2016.2558289.

[132]

Y. Li, Y. Zhang, C. Tao, and H. Zhu, “Content-based high-resolution remote

sensing image retrieval via unsupervised feature learning and collaborative

affinity metric fusion,” Remote Sensing, vol. 8, no. 9, p. 709, 2016. DOI:10.3390/rs8090709.

[133]

Y. Boualleg and M. Farah, “Enhanced interactive remote sensing image retrieval

with scene classification convolutional neural networks model,” IEEE

International Geoscience and Remote Sensing Symposium, pp. 4748–4751, 2018.

DOI:10.1109/IGARSS.2018.8518388.

[134]

F. Sabahi, M. O. Ahmad, and M. N. S. Swamy, “An unsupervised learning

based method for content-based image retrieval using Hopfield neural

network,” Proceedings of the International Conference of Signal Processing and

Intelligent Systems, pp. 1–5, 2016. DOI:10.1109/ICSPIS.2016.7869882.

[135]

H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash

coding with deep neural networks,” IEEE Conference on Computer Vision and

Pattern Recognition, pp. 3270–3278, 2015. DOI:10.1109/CVPR.2015.7298947.

[136]

P. Zhu, Y. Tan, L. Zhang, Y. Wang, J. Mei, H. Liu, and M. Wu, “Deep learning

for multilabel remote sensing image annotation with dual-level semantic

concepts,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 6,

pp. 4047–4060, 2020.

[137]

H. Xuan, A. Stylianou, and R. Pless, “Improved embeddings with easy positive

triplet mining,” IEEE International Conference on Computer Vision, pp. 2474–

2482, 2020.

[138]

D. Zhang, Y. Li, and Z. Zhang, “Deep metric learning with spherical embed-

ding,” in Proceedings of the Advances in Neural Information Processing Systems,

vol. 33, 2020, pp. 18772–18783.

[139]

W. Ge, W. Huang, D. Dong, and M. R. Scott, “Deep metric learning with

hierarchical triplet loss,” in Proceedings of the European Conference on Computer

Vision, 2018, pp. 269–285.

[140]

Y. Yuan, K. Yang, and C. Zhang, “Hard-aware deeply cascaded embedding,”

IEEE International Conference on Computer Vision, pp. 814–823, 2017.

[141]

S. Kim, M. Seo, I. Laptev, M. Cho, and S. Kwak, “Deep metric learning beyond

binary supervision,” in Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, 2019, pp. 2283–2292. DOI:10.1109/CVPR.2019.00239.

[142]

S. Zhang, Q. Zhang, X. Wei, Y. Zhang, and Y. Xia, “Person re-identification

with triplet focal loss,” IEEE Access, vol. 6, pp. 78092–78099, 2018. DOI:10.1109/ACCESS.2018.2884743.

[143]

X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss

with general pair weighting for deep metric learning,” in Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5017–5025.

DOI:10.1109/CVPR.2019.00516.

[144]

K. Sohn, “Improved deep metric learning with multi-class n-pair loss objec-

tive,” in Proceedings of the Advances in Neural Information Processing Systems,

vol. 29, 2016.

[145]

X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson, “Ranked

list loss for deep metric learning,” in Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, 2019, pp. 5202–5211. DOI:10.1109/CVPR.2019.00535.

[146]

C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in

deep embedding learning,” in Proceedings of the IEEE International Conference

on Computer Vision, 2017, pp. 2840–2848.

[147]

Z. Zhang, Q. Zou, Y. Lin, L. Chen, and S. Wang, “Improved deep hashing with

soft pairwise similarity for multi-label image retrieval,” IEEE Transactions on

Multimedia, vol. 22, no. 2, pp. 540–553, 2020.

[148]

G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected

convolutional networks,” in Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, 2017, pp. 2261–2269.

[149]

E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in

Proceedings of the International Conference on Learning Representations, 2015.

[150]

S. Deepak and P. Ameer, “Retrieval of brain MRI with tumor using con-

trastive loss based similarity on googlenet encodings,” Computers in Biology

and Medicine, vol. 125, p. 103993, 2020. DOI:10.1016/j.compbiomed.2020.103993.

[151]

H. Xuan, R. Souvenir, and R. Pless, “Deep randomized ensembles for metric

learning,” in Proceedings of the European Conference on Computer Vision, 2018,

pp. 723–734.

[152]

H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via

lifted structured feature embedding,” in Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.

[153]

W. Chen, Y. Liu, W. Wang, E. M. Bakker, T. Georgiou, P. Fieguth, L. Liu, and

M. S. Lew, “Deep learning for instance retrieval: A survey,” IEEE Transactions

on Pattern Analysis and Machine Intelligence, pp. 1–20, 2022. DOI:10.1109/TPAMI.2022.3218591.

[154]

J. Lin, Z. Li, and J. Tang, “Discriminative deep hashing for scalable face

image retrieval,” in Proceedings of the International Joint Conference on Artificial

Intelligence, 2017, pp. 2266–2272.

[155]

P. Li and P. Ren, “Partial randomness hashing for large-scale remote sensing

image retrieval,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 3,

pp. 464–468, 2017. DOI:10.1109/LGRS.2017.2651056.

[156]

T. Reato, B. Demir, and L. Bruzzone, “An unsupervised multicode hash-

ing method for accurate and scalable remote sensing image retrieval,” IEEE

Geoscience and Remote Sensing Letters, vol. 16, no. 2, pp. 276–280, 2019. DOI:

10.1109/LGRS.2018.2870686.

[157]

E. Augé, J. E. Sánchez, A. Kiely, I. Blanes, and J. Serra-Sagristá, “Performance

impact of parameter tuning on the CCSDS-123 lossless multi- and hyperspec-

tral image compression standard,” Journal of Applied Remote Sensing, vol. 7,

no. 1, pp. 1–16, 2013. DOI:10.1117/1.jrs.7.074594.

[158]

M. Ryan and J. Arnold, “The lossless compression of aviris images by vector

quantization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 3,

pp. 546–550, 1997. DOI:10.1109/36.581964.

[159]

P. Hao and Q. Shi, “Reversible integer KLT for progressive-to-lossless com-

pression of multiple component images,” in Proceedings of the International

Conference on Image Processing, vol. 1, 2003, pp. I–633. DOI:10.1109/ICIP.2003.1247041.

[160]

G. P. Abousleman, M. Marcellin, and B. R. Hunt, “Compression of hyper-

spectral imagery using the 3-D DCT and hybrid DPCM/DCT,” IEEE Trans-

actions on Geoscience and Remote Sensing, vol. 33, no. 1, pp. 26–34, 1995. DOI:

10.1109/36.368225.

[161]

W. Sweldens, “The lifting scheme: A custom-design construction of biorthog-

onal wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2,

pp. 186–200, 1996, ISSN: 1063-5203. DOI:

https://doi.org/10.1006/acha.1996.0015.

[162]

A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image

compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–

58, 2001. DOI:10.1109/79.952804.

[163]

European Space Agency (ESA), “Sentinel-2 user handbook,” Sentinel User

Handbook and Exploitation Tools, Tech. Rep., 2015. [Online]. Available:

https://sentinel.esa.int/documents/247904/685211/sentinel-2_user_handbook.

[164]

F. Kong, K. Hu, Y. Li, D. Li, and S. Zhao, “Spectral-spatial feature partitioned

extraction based on CNN for multispectral image compression,” Remote Sensing,

vol. 13, no. 1, 2021, ISSN: 2072-4292.

[165]

Y. Hu, W. Yang, Z. Ma, and J. Liu, “Learning end-to-end lossy image com-

pression: A benchmark,” IEEE Transactions on Pattern Analysis and Machine

Intelligence, vol. 44, no. 8, pp. 4194–4211, 2022.

[166]

F. Kong, S. Zhao, Y. Li, D. Li, and Y. Zhou, “A residual network framework

based on weighted feature channels for multispectral image compression,”

Ad Hoc Networks, vol. 107, p. 102272, 2020, ISSN: 1570-8705.

[167]

F. Kong, K. Hu, Y. Li, D. Li, X. Liu, and T. S. Durrani, “A spectral-spatial

feature extraction method with polydirectional CNN for multispectral image

compression,” IEEE Journal of Selected Topics in Applied Earth Observations and

Remote Sensing, vol. 15, pp. 2745–2758, 2022. DOI:

10.1109/JSTARS.2022.3158281.

[168]

J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear

transform codes for perceptual quality,” in Proceedings of the Picture Coding

Symposium, 2016, pp. 1–5.

[169]

L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression

with compressive autoencoders,” in Proceedings of the International Conference

on Learning Representations, 2017.

[170]

Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression

with discretized Gaussian mixture likelihoods and attention modules,” in

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

2020, pp. 7936–7945. DOI:10.1109/CVPR42600.2020.00796.

[171]

T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “End-to-end learnt

image compression via non-local attention optimization and improved context

modeling,” IEEE Transactions on Image Processing, vol. 30, pp. 3179–3191, 2021.

[172]

J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational

image compression with a scale hyperprior,” in Proceedings of the International

Conference on Learning Representations, 2018.

[173]

D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and hierarchical

priors for learned image compression,” in Proceedings of the Advances in Neural

Information Processing Systems, 2018, pp. 10794–10803.

[174]

A. Preethy Byju, B. Demir, and L. Bruzzone, “A progressive content-based

image retrieval in JPEG 2000 compressed remote sensing archives,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 58, no. 8, pp. 5739–5751,

2020. DOI:10.1109/TGRS.2020.2969374.

[175]

J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image com-

pression,” in Proceedings of the International Conference on Learning Representa-

tions, 2017.

[176]

Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for

image quality assessment,” in Proceedings of the Asilomar Conference on Signals,

Systems & Computers, vol. 2, 2003, pp. 1398–1402. DOI:

10.1109/ACSSC.2003.1292216.

[177]

H. F. Yang, K. Lin, and C. S. Chen, “Supervised learning of semantics-preserving

hash via deep convolutional neural networks,” IEEE Transactions on Pattern

Analysis and Machine Intelligence, vol. 40, no. 2, pp. 437–451, 2018.

[178]

J.-A. Désidéri, “Multiple-gradient descent algorithm (MGDA) for multiobjective

optimization,” Comptes Rendus Mathematique, vol. 350, no. 5, pp. 313–318,

2012, ISSN: 1631-073X. DOI:

https://doi.org/10.1016/j.crma.2012.03.014.

[179]

X. Qi, P. Zhu, Y. Wang, L. Zhang, J. Peng, M. Wu, J. Chen, X. Zhao, N. Zang,

and P. T. Mathiopoulos, “MLRSNet: A multi-label high spatial resolution

remote sensing dataset for semantic scene understanding,” ISPRS Journal of

Photogrammetry and Remote Sensing, vol. 169, pp. 337–350, 2020.

[180]

J. Bergstra, G. Desjardins, G. Lamblin, and Y. Bengio, “Quadratic polynomials

learn better image features (technical report 1337),” Département d’Informatique

et de Recherche Opérationnelle, Université de Montréal, Tech. Rep., 2009.

[181]

S. Su, C. Zhang, K. Han, and Y. Tian, “Greedy hash: Towards fast optimization

for accurate hash coding in CNN,” in Proceedings of the Advances in Neural

Information Processing Systems, 2018, pp. 806–815.

[182]

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient

surgery for multi-task learning,” in Proceedings of the Advances in Neural Infor-

mation Processing Systems, vol. 33, 2020, pp. 5824–5836.

[183]

S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with

attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2019, pp. 1871–1880. DOI:10.1109/CVPR.2019.00197.

[184]

K. Islam, L. M. Dang, S. Lee, and H. Moon, “Image compression with recurrent

neural network and generalized divisive normalization,” in Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021,

pp. 1875–1879.

[185]

F. Kong and R. Henao, “Efficient classification of very large images with tiny

objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2022, pp. 2384–2394.

[186]

S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, “ESPNetv2: A light-weight,

power efficient, and general purpose convolutional neural network,”

in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

2019, pp. 9182–9192.

[187]

R. Raina, Y. Shen, A. McCallum, and A. Ng, “Classification with hybrid

generative/discriminative models,” in Proceedings of the Advances in Neural

Information Processing Systems, vol. 16, 2003.

[188]

R. Zhang, Z. Chen, S. Zhang, F. Song, G. Zhang, Q. Zhou, and T. Lei, “Remote

sensing image scene classification with noisy label distillation,” Remote Sensing,

vol. 12, no. 15, p. 2376, 2020.

[189]

J. Kang, R. Fernandez-Beltran, P. Duan, X. Kang, and A. J. Plaza, “Robust

normalized softmax loss for deep metric learning-based characterization of

remote sensing images with label noise,” IEEE Transactions on Geoscience and

Remote Sensing, vol. 59, no. 10, pp. 8798–8811, 2021, ISSN: 1558-0644. DOI:

10.1109/TGRS.2020.3042607.

[190]

P. Li, X. He, X. Cheng, M. Qiao, D. Song, M. Chen, T. Zhou, J. Li, X. Guo, S.

Hu, and Z. Tian, “An improved categorical cross entropy for remote sensing

image classification based on noisy labels,” Expert Systems with Applications,

vol. 205, p. 117296, 2022. DOI:10.1016/j.eswa.2022.117296.

[191]

T. Burgert, M. Ravanbakhsh, and B. Demir, “On the effects of different types

of label noise in multi-label remote sensing image classification,” IEEE Transactions

on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022. DOI:

10.1109/TGRS.2022.3226371.

[192]

A. K. Aksoy, M. Ravanbakhsh, and B. Demir, “Multi-label noise robust collaborative

learning method for remote sensing image classification,” IEEE

Transactions on Neural Networks and Learning Systems, 2022. DOI:

10.1109/TNNLS.2022.3209992.

[193]

N. Ahmed, R. M. Rahman, M. S. G. Adnan, and B. Ahmed, “Dense prediction

of label noise for learning building extraction from aerial drone imagery,”

International Journal of Remote Sensing, vol. 42, no. 23, pp. 8906–8929, 2021.

[194]

J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang, “Deep

learning from noisy image labels with quality embedding,” IEEE Transactions

on Image Processing, vol. 28, no. 4, pp. 1909–1922, 2019. DOI:

10.1109/TIP.2018.2877939.

[195]

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense

object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence,

vol. 42, no. 2, pp. 318–327, 2020. DOI:10.1109/TPAMI.2018.2858826.

[196]

S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-learning

regularization prevents memorization of noisy labels,” Advances in Neural

Information Processing Systems, vol. 33, pp. 20331–20342, 2020.

[197]

H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement:

A joint training method with co-regularization,” in Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2020, pp. 13723–13732.

DOI:10.1109/CVPR42600.2020.01374.

[198]

T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, and L.

Zelnik-Manor, “Asymmetric loss for multi-label classification,” in Proceedings

of the IEEE International Conference on Computer Vision, 2021, pp. 82–91. DOI:

10.1109/ICCV48922.2021.00015.

[199]

K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference via generative

classifiers for handling noisy labels,” in Proceedings of the International

Conference on Machine Learning, 2019, pp. 3763–3772.

[200]

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings

of the International Conference on Learning Representations, 2014.

[201]

S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals

of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. DOI:

10.1214/aoms/1177729694.

[202]

Z. Feng, K. Kong, M. Chen, T. Zhang, M. Zhu, and W. Chen, “SHOT-VAE:

Semi-supervised deep generative models with label-aware ELBO approxima-

tions,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

[203]

Z. Shao, K. Yang, and W. Zhou, “Performance evaluation of single-label and

multi-label remote sensing image retrieval using a dense labeling dataset,”

Remote Sensing, vol. 10, no. 6, p. 964, 2018.

[204] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[205]

A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to

weigh losses for scene geometry and semantics,” in Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491. DOI:

10.1109/CVPR.2018.00781.

[206]

X. Lu, Y. Zhong, Z. Zheng, Y. Liu, J. Zhao, A. Ma, and J. Yang, “Multi-Scale and

Multi-Task Deep Learning Framework for Automatic Road Extraction,” IEEE

Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 9362–9377,

2019.

[207]

W. Song, S. Li, and J. A. Benediktsson, “Deep Hashing Learning for Visual and

Semantic Retrieval of Remote Sensing Images,” IEEE Transactions on Geoscience

and Remote Sensing, vol. 59, no. 11, pp. 9661–9672, 2021.

[208]

Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “GradNorm: Gradient

normalization for adaptive loss balancing in deep multitask networks,”

in Proceedings of the International Conference on Machine Learning, vol. 80, 2018,

pp. 794–803.

[209]

K. Maninis, I. Radosavovic, and I. Kokkinos, “Attentive single-tasking of

multiple tasks,” in Proceedings of the IEEE Conference on Computer Vision and

Pattern Recognition, 2019, pp. 1851–1860.

[210]

A. Migdalas, P. Pardalos, and P. Värbrand, Multilevel Optimization: Algorithms

and Applications. Springer US, 2013.

[211]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for

contrastive learning of visual representations,” in Proceedings of the Interna-

tional Conference on Machine Learning, vol. 119, 2020, pp. 1597–1607.