This version is available at https://doi.org/10.14279/depositonce-9352
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including reprinting/republishing this material
for advertising or promotional purposes, creating new collective works, for resale or redistribution
to servers or lists, or reuse of any copyrighted component of this work in other works.
Terms of Use
Roy, S., Sangineto, E., Demir, B., & Sebe, N. (2018). Deep Metric and Hash-Code Learning for Content-
Based Retrieval of Remote Sensing Images. IGARSS 2018 - 2018 IEEE International Geoscience and
Remote Sensing Symposium. pp. 4539–4542. https://doi.org/10.1109/igarss.2018.8518381
Subhankar Roy, Enver Sangineto, Begüm Demir, Nicu Sebe
Deep Metric and Hash-Code Learning for
Content-Based Retrieval of Remote
Sensing Images
Accepted manuscript (Postprint) | Conference paper
DEEP METRIC AND HASH-CODE LEARNING FOR
CONTENT-BASED RETRIEVAL OF REMOTE SENSING IMAGES
Subhankar Roy1, Enver Sangineto1, Begüm Demir2 and Nicu Sebe1
1Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy
2Faculty of Electrical Engineering and Computer Science, TU Berlin, Berlin, Germany
ABSTRACT
The growing volume of Remote Sensing (RS) image archives
demands feature learning techniques and hashing functions
which can: (1) accurately represent the semantics in the
RS images; and (2) have quasi real-time performance during
retrieval. This paper aims to address both challenges at the
same time, by learning a semantic-based metric space for
content-based RS image retrieval while simultaneously producing
binary hash codes for an efficient archive search. This double
goal is achieved by training a deep network using a combina-
tion of different loss functions which, on the one hand, aim
at clustering semantically similar samples (i.e., images), and,
on the other hand, encourage the network to produce final ac-
tivation values (i.e., descriptors) that can be easily binarized.
Moreover, since annotated RS training images are too few to
train a deep network from scratch, we propose to split the image
representation problem into two phases. In the first,
we use a general-purpose, pre-trained network to produce an
intermediate representation, and in the second we train our
hashing network using a relatively small set of training im-
ages. Experiments on two aerial benchmark archives show
that the proposed method outperforms previous state-of-the-
art hashing approaches by up to 5.4% using the same number
of hash bits per image.
Index Terms: deep hashing, metric learning, content-based
image retrieval, remote sensing
1. INTRODUCTION
In recent years there has been a tremendous increase in the
volume of remote sensing (RS) image archives due to the con-
tinuous development of satellite technology. Thus, one of the
most important research topics in RS is the development of
scalable content-based RS image retrieval (CBIR) methods, which
aim at retrieving the most similar images to a query image
from massive archives in an accurate and efficient manner. A
CBIR system generally consists of a two-step procedure: (1)
characterization of the content of each image by its descriptor;
and (2) computation of similarities between the query image
and the archive images based on the extracted descriptors.
One of the essential requirements for large-scale CBIR is
fast similarity search. Conventional similarity search
methods, such as nearest neighbour search, are impractical for
large scale RS image archives, particularly when the dimen-
sion of the image descriptor is high. To achieve efficient sim-
ilarity search, hashing-based approximate nearest neighbour
search methods have been recently introduced in RS [1], [2].
Hashing methods encode high-dimensional image descriptors
using compact binary hash codes that significantly reduce the
storage cost and improve the computational efficiency. To this
end, hash functions are initially generated and then applied to
each image descriptor. In [1], kernel-based locality-sensitive
hashing techniques, which learn hash functions in the kernel
space from hand-crafted features (e.g., the bag-of-visual-words
representation based on the scale invariant feature transform),
are applied to RS CBIR problems. However, hand-crafted features
may not accurately represent the high level semantic content
of RS images. This leads to inaccurate retrieval results under
complex RS image retrieval tasks.
To address this problem, inspired by the progress of
deep convolutional neural networks (CNNs), a deep hashing
method has been recently introduced in the framework of RS
image retrieval problems [2]. This method jointly learns the
deep image features (which efficiently characterize the rich
semantic content of RS images and outperform hand-crafted
features) and hash codes (which represent those deep features
with binary bits). To this end, their CNN architecture adopts
the cross-entropy loss to formulate the objective function.
The cross-entropy loss does not define any separation be-
tween positive and negative images due to the absence of a
margin threshold. This leads to poor generalization capabil-
ity, and thus long hash codes and a high number of annotated
training images are required to reach a high retrieval accuracy.
To address these issues, in this paper we present an approach
that learns a semantic-based metric space while simultaneously
producing binary hash codes for fast and accurate
retrieval of RS images in large archives. Unlike [2], the
proposed approach provides more compact binary hash codes
and requires only a small number of annotated training
images.
2. METRIC AND HASH-CODE LEARNING
The lack of large sets of annotated training images makes it challenging
to train deep networks from scratch in RS. We propose to
solve this problem in two stages. In the first stage
we use a pre-trained network (Inception Net [3]), trained on
ImageNet, in order to extract an intermediate image represen-
tation. In the second stage, this intermediate representation
is fed to our Metric and Hash-Code Learning Network (MH-
CLN). The latter is a smaller network which can be trained
from scratch using a relatively small dataset. This network is
trained using a combination of different losses, which simul-
taneously aim at clustering similar images while producing an
easy-to-binarize final representation. Specifically, we use the
triplet loss to learn a metric space where the Euclidean distance
between pairs of points corresponds to the semantic distance
between the corresponding images. Moreover, we use
two other losses: (1) a representation penalty which pushes
the final network activations toward 0 or 1; and (2) a balancing
loss which encourages a balanced number of 0s and 1s in
the final hash code. The details are given below.
Let $I = \{X_1, \ldots, X_P\}$ be the training set of RS images,
where $X_i$ is associated with a class label $y_i \in Y =
\{y_1, y_2, \ldots\}$ (e.g., “airplane”, “parkinglot”, etc.). Our goal is
to learn a hashing function $h: I \to \{0,1\}^K$, which maps
images to binary hash codes of length $K$ such that the generated
binary codes embed the semantics of the corresponding
images. Using these codes, at testing time, retrieving the
most similar images to a given query image $X_q$ is done by
(efficiently) comparing their binary hash codes bitwise.
In the first stage of the proposed approach, each image
in $I$ is fed to the pre-trained Inception Net [3] and a feature
vector composed of the 2048 neuron activations of the
layer just before the softmax layer (pool3) is extracted. Let
$G = \{g_1, g_2, \ldots, g_P\}$, $g_i \in \mathbb{R}^{2048}$, be the extracted features
corresponding to the set of images in $I$. Although the Inception
Net was trained on a completely different set of images
(ImageNet), its high-level features capture general-purpose
visual semantics and we use this representation as the starting
point of our representation process. Note that the Inception
Net is not fine-tuned but used as a black box to extract $g_i$
from $X_i$: $g_i$ is used as the input of our MHCLN (Fig. 1).
Fig. 1. Our two-step image representation process. Top: Inception
Net, pretrained on ImageNet images. Bottom: Inception
Net is used to extract an intermediate image representation,
which is then fed to our MHCLN.

Fig. 2. The intuition behind the triplet loss: after training, a
positive sample is “moved” closer to the anchor sample than
the negative samples of the other classes.

In the second stage of the proposed approach we train
(from scratch) our MHCLN, aiming at mapping each $g_i$ into
a semantically significant metric space: $f: \mathbb{R}^{2048} \to \mathbb{R}^K$.
The final hashing function is obtained by means of a quantization
of $\mathbb{R}^K$. In order to learn the metric space we adopt
a triplet loss. The intuition behind the triplet loss (see Fig. 2)
is that similar images should be clustered together in the target
metric space, while different images should be pushed far
apart. To achieve this result, we use the class label ($y_i$) associated
with each $X_i$ and we impose that images of the same
class should be closer to each other than images of different
classes. More specifically, from $G$ we extract a set of triplets
$T = \{(g_i^a, g_i^p, g_i^n)\}$, where $g_i^a$ (called anchor) is a randomly
sampled feature vector associated with label $y_i$, $g_i^p$ is a positive
sample (i.e., an image associated with the same class label
$y_i$) and $g_i^n$ is a negative sample (i.e., an image associated with
a different class label $y_j \neq y_i$). Using $T$ and a mini-batch of
cardinality $M$, randomly extracted from $T$, our triplet loss is
defined as follows:

$$L_{Metric} = \sum_{i=1}^{M} \max\left(0, \|f(g_i^a) - f(g_i^p)\|_2^2 - \|f(g_i^a) - f(g_i^n)\|_2^2 + \alpha\right), \qquad (1)$$

where $\alpha$ is a minimum margin that is imposed between the
positive and the negative distances.
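
As an illustration of Eq. (1), the following minimal Python sketch evaluates the triplet loss on embeddings that are assumed to have already been produced by f. The function names `squared_l2` and `triplet_loss` and the toy 2-D embeddings are illustrative assumptions and are not taken from the released code.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum of the hinge terms of Eq. (1) over a mini-batch of embedded triplets."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        # A triplet contributes only when the negative is not yet at least
        # alpha farther from the anchor than the positive.
        total += max(0.0, squared_l2(a, p) - squared_l2(a, n) + alpha)
    return total


# Toy mini-batch (M = 2) of 2-D embeddings.
anchors = [[0.0, 0.0], [1.0, 1.0]]
positives = [[0.0, 1.0], [1.0, 1.0]]
negatives = [[3.0, 0.0], [1.0, 1.25]]
loss = triplet_loss(anchors, positives, negatives, alpha=0.2)
```

In this toy batch the first triplet already satisfies the margin and contributes zero, while the second violates it (its negative lies at squared distance 0.0625 from the anchor), so the loss is approximately 0.1375.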
Table 1. mAP and average retrieval time for the KSLSH [1] and the proposed MHCLN for the UCMD archive.
              Image Features      # Hash Bits K
                                  K=16                K=24                K=32
Methods       mAP    Time (ms)    mAP    Time (ms)    mAP    Time (ms)    mAP    Time (ms)
SVM           0.556  92.3         -      -            -      -            -      -
KSLSH [1]     -      -            0.557  25.3         0.594  25.5         0.630  25.6
Our MHCLN     -      -            0.875  25.3         0.890  25.5         0.904  25.6

Our hashing network consists of 3 fully-connected
layers with 1024, 512, and $K$ neurons, respectively,
where $K$ is the number of desired bits in the final
hashing-based image representation. We use Leaky ReLU
non-linearities in the first two layers and a sigmoid in the last
layer, which produces neuron activations in [0,1]. In order to
push the latter toward the extremes of the range, similarly to
[4], we use a second loss which aims at maximizing the sum
of the squared errors between the last layer activations and
0.5:

$$L_{Push} = -\frac{1}{K} \sum_{i=1}^{P} \|f(g_i) - 0.5 \cdot \mathbf{1}\|^2, \qquad (2)$$

where $\mathbf{1}$ is the $K$-dimensional vector with all elements equal to 1.
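
The three-layer hashing network and the push loss of Eq. (2) can be sketched as follows, assuming NumPy is available. This is only an untrained forward pass: the weight initialisation, the Leaky ReLU slope of 0.2, and the names `init_mhcln`, `mhcln_forward` and `push_loss` are illustrative assumptions, since the text does not specify them. The leading minus sign reflects that the squared distance from 0.5 is maximized while the overall objective is minimized.

```python
import numpy as np


def init_mhcln(k, in_dim=2048, hidden=(1024, 512), seed=0):
    """Randomly initialise (weights, bias) for the three fully-connected layers."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, *hidden, k]
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mhcln_forward(g, params):
    """Map a batch of features of shape (P, 2048) to activations (P, K) in [0, 1]."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = leaky_relu(g @ w1 + b1)
    h = leaky_relu(h @ w2 + b2)
    return sigmoid(h @ w3 + b3)


def push_loss(v):
    """Eq. (2): negative sum of squared distances from 0.5, scaled by 1/K."""
    k = v.shape[1]
    return -np.sum((v - 0.5) ** 2) / k


params = init_mhcln(k=32)
g = np.random.default_rng(1).normal(size=(4, 2048))  # stand-in for Inception features
v = mhcln_forward(g, params)
```

The push loss reaches its lowest value when every activation is exactly 0 or 1, and it is zero when all activations equal 0.5.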
Finally, inspired by [4], we use a third loss function which
aims at balancing the number of 1s and 0s in the binary code
of each image representation:

$$L_{Balancing} = \sum_{i=1}^{P} \left(\mathrm{mean}(f(g_i)) - 0.5\right)^2, \qquad (3)$$

where $\mathrm{mean}(f(g_i))$ computes the mean of the activation values
in the last layer of the hashing function.
The three losses are combined in our final objective:

$$L = L_{Metric} + \lambda_1 L_{Push} + \lambda_2 L_{Balancing}, \qquad (4)$$

with $\lambda_1 = 0.001$ and $\lambda_2 = 1$, selected using cross-validation.
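
A minimal sketch of the balancing loss of Eq. (3) and of the weighted combination of Eq. (4) is given below. The function names `balancing_loss` and `total_loss` are illustrative assumptions, and the metric and push terms are passed in as already computed numbers.

```python
def balancing_loss(activations):
    """Eq. (3): squared deviation of each code's mean activation from 0.5."""
    return sum((sum(v) / len(v) - 0.5) ** 2 for v in activations)


def total_loss(l_metric, l_push, l_balancing, lambda1=0.001, lambda2=1.0):
    """Eq. (4): weighted sum of the three loss terms."""
    return l_metric + lambda1 * l_push + lambda2 * l_balancing


# One balanced code (two 0s, two 1s) and one unbalanced code (all 1s).
balanced = [[0.0, 1.0, 1.0, 0.0]]
unbalanced = [[1.0, 1.0, 1.0, 1.0]]

l_bal_balanced = balancing_loss(balanced)      # mean is 0.5, so no penalty
l_bal_unbalanced = balancing_loss(unbalanced)  # mean is 1.0, penalty (0.5)^2
objective = total_loss(l_metric=0.5, l_push=-0.25, l_balancing=l_bal_unbalanced)
```

With these weights the balancing term enters at full strength, whereas the push term is scaled down by three orders of magnitude.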
Once the network is trained, the final hashing function
$h(\cdot)$ is obtained by binarizing the values in $\mathbb{R}^K$. Specifically,
given a test image $X$, which corresponds to an Inception-Net
feature vector $g$, we compute a binary code $b = h(X)$, where,
for each $1 \le n \le K$:

$$b_n = (\mathrm{sign}(v_n - 0.5) + 1)/2, \qquad (5)$$

where $v = f(g)$ and $v_n$ is the $n$-th component of $v$. Finally, in
order to retrieve an image $X_j$ semantically similar to a query
image $X_q$, we use the Hamming distance between $h(X_q)$ and
$h(X_j)$.
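
The binarization of Eq. (5) and the Hamming-distance ranking can be illustrated with the short sketch below. Packing the bits into a Python integer, the helper names, and the convention that an activation of exactly 0.5 maps to 0 are assumptions made for this example only.

```python
def binarize(v):
    """Eq. (5) for activations in [0, 1]; a value of exactly 0.5 maps to 0 here."""
    return [1 if x > 0.5 else 0 for x in v]


def pack_bits(bits):
    """Pack a list of 0/1 values into a single integer hash code."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code


def hamming(code_a, code_b):
    """Number of differing bits between two packed codes."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code, archive_codes, top_k):
    """Indices of the top_k archive codes closest to the query in Hamming distance."""
    order = sorted(range(len(archive_codes)),
                   key=lambda j: hamming(query_code, archive_codes[j]))
    return order[:top_k]


query = pack_bits(binarize([0.9, 0.1, 0.8, 0.2]))  # bits 1010
archive = [pack_bits(b) for b in ([1, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0])]
ranked = retrieve(query, archive, top_k=2)         # [2, 0]
```

Because the comparison reduces to an XOR followed by a bit count, its cost depends on the code length K rather than on the dimensionality of the original descriptor.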
3. EXPERIMENTAL RESULTS
Experiments were conducted on two different benchmark
archives. The first one is the widely used UC Merced
(UCMD) [5] containing 2100 images from 21 different cat-
egories, where each category includes 100 images (each of
size 256 ×256 pixels with a spatial resolution of 30cm). The
second one is the Aerial Images Dataset (AID) [6] containing
10000 aerial images from 30 different categories, where each
category includes 220 to 420 images (each of size 600 ×600
pixels with a spatial resolution ranging from 50cm to 8m).
The proposed MHCLN (code is available at
https://github.com/MLEnthusiast/MHCLN) was trained using
mini-batches of $M = 30$ triplets, each comprising 30
anchors, 30 positives and 30 negatives. The margin
threshold $\alpha$ was set to 0.2. For the loss function optimization,
the Adam optimizer was used with a small learning rate
$\eta = 10^{-4}$. The other two hyper-parameters of the Adam
optimizer, $\beta_1$ and $\beta_2$, were set to 0.5 and 0.9, respectively.
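
One possible way to draw a mini-batch of M = 30 triplets from class labels is sketched below. The text only states that triplets are sampled randomly, so choosing the anchor class uniformly, the function name `sample_triplets` and the toy labels are assumptions of this example rather than a description of the released code.

```python
import random


def sample_triplets(labels, m, rng):
    """Sample m (anchor, positive, negative) index triplets from class labels."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    # Only classes with at least two samples can supply an anchor-positive pair.
    anchor_classes = [y for y, idxs in by_class.items() if len(idxs) >= 2]
    triplets = []
    for _ in range(m):
        y = rng.choice(anchor_classes)
        a, p = rng.sample(by_class[y], 2)                     # same class, distinct images
        y_neg = rng.choice([c for c in by_class if c != y])   # any other class
        n = rng.choice(by_class[y_neg])
        triplets.append((a, p, n))
    return triplets


labels = ["airplane"] * 4 + ["freeway"] * 4 + ["parkinglot"] * 4
batch = sample_triplets(labels, m=30, rng=random.Random(0))
```

Each returned triplet holds indices into the feature set G, so the corresponding embeddings can be looked up and passed to the triplet loss.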
The performance of the proposed approach is evaluated
through the mean average precision (mAP) score, also used
in [2]. For the UCMD archive results achieved with the pro-
posed MHCLN are compared with those obtained by: 1) Sup-
port Vector Machine (SVM) classifier; 2) the Kernel-based
Supervised Locality Sensitive Hashing (KSLSH) [1]; and 3)
the Deep Hashing Neural Networks (DHNN) [2]. Results of
each method are provided in terms of computational time and
the mAP score, both evaluated for the top-20 retrieved images.
In the experiments: i) a Gaussian Radial Basis Function
kernel was used for the SVM and KSLSH; and ii) the number
$K$ of hash bits was varied in the range [16, 64] with a step size
of 8. We have considered two
different scenarios. In the first scenario, we did not include
any data augmentation and selected 60% of the images associated
with each category as training images (which are used
to train the MHCLN), while the rest are considered as test images
(which are used to evaluate the retrieval performance).
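
For concreteness, the sketch below computes a mean average precision over the top-k retrieved images, where a retrieved image counts as relevant if it shares the class label of the query. The text does not spell out which mAP variant is used, so this is one common definition and the function names are illustrative assumptions.

```python
def average_precision_at_k(relevance, k):
    """AP over the top-k results; relevance is a ranked list of 0/1 flags."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_labels, query_labels, k):
    """Average of AP@k over all queries."""
    aps = []
    for retrieved, q in zip(ranked_labels, query_labels):
        relevance = [1 if y == q else 0 for y in retrieved]
        aps.append(average_precision_at_k(relevance, k))
    return sum(aps) / len(aps)


ranked_labels = [["airplane", "freeway", "airplane"],
                 ["freeway", "freeway", "airplane"]]
query_labels = ["airplane", "freeway"]
score = mean_average_precision(ranked_labels, query_labels, k=3)
```

In this toy case the first query has an AP of 5/6 and the second an AP of 1, giving a score of 11/12.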
Table 1 shows the results of the first scenario obtained
by the SVM, the KSLSH and the proposed MHCLN when
K=16, 24 and 32. Since the code of the DHNN is not
available, we could not report
its results within this scenario. From Table 1, one can see that
the proposed MHCLN provides a mAP 31.8% higher (0.875 vs. 0.557)
than the KSLSH for K=16 under the same retrieval time. In
addition, for K=16 the proposed MHCLN yields a mAP 31.9% higher
than the SVM (0.875 vs. 0.556) with a significantly reduced retrieval
time. From our analysis, we have also seen that increasing
the number Kof hash bits leads to higher mAP at the cost of
slightly increasing the retrieval time.
Fig. 3 shows a single trial of the retrieval results with the
query image selected from the airplane category by apply-
ing the KSLSH [1] and the proposed MHCLN. The retrieval
order of each image is given below the related image. By
visually analyzing the results, one can see that the proposed
method retrieves semantically more similar images from the
archive. As an example, the 4th and 19th retrieved images
by the KSLSH method belong to the freeway and storage
tanks categories, respectively, whereas those by the proposed
method belong to the airplane category.
In the second scenario, we have considered data augmentation
as suggested in [2] and compared the proposed MHCLN
with the DHNN [2]. To have a fair comparison, the 2100 images
are rotated by 90°, 180° and 270°, producing 8400 images
in total (including the originals). Then, among these images,
1000 images are randomly
chosen as test images while the remaining 7400 images are
selected as training and searching images as suggested in [2].
Table 2 shows the results obtained by the proposed MHCLN
and the DHNN when K=32 and K=64 for the top-50 retrieved
images. By analyzing the results one can see that the hash codes
obtained by the proposed method are generally more distinctive
than those of the DHNN when a small number of hash bits
is considered. As an example, the proposed method yields
a 5.4% better mAP (0.993 vs. 0.939) when K=32. It is worth noting that the data
augmentation approach adopted in [2] may lead to the pres-
ence of rotated but identical images in both the train and test
sets. This causes the network to memorize the test samples
(rotated variants) during training. However, we have adopted
this evaluation approach to fairly compare with their method.
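
The rotation-based augmentation of the second scenario can be illustrated with the sketch below, which represents each image as a nested list of pixel values. The helper names are illustrative assumptions, and the 2x2 stand-in image is used only to keep the example small.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(images):
    """Return each image followed by its 90-, 180- and 270-degree rotations."""
    augmented = []
    for img in images:
        current = img
        augmented.append(current)
        for _ in range(3):
            current = rotate90(current)
            augmented.append(current)
    return augmented


tile = [[1, 2],
        [3, 4]]
archive = [tile] * 2100                  # stand-in for the 2100 UCMD images
augmented = augment_with_rotations(archive)
```

Every original image yields four entries in the augmented set, which is why a random split of this set can place rotated copies of one image in both the training and the test partitions.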
Fig. 3. (a) Query image, (b) images retrieved by KSLSH [1]
and (c) images retrieved by the proposed MHCLN.
For the AID archive we have chosen a 60:40 split of im-
ages from each category, similar to the UCMD archive. For
this archive, results achieved with the proposed MHCLN are
compared with the SVM and the KSLSH. The results
show the same relative behavior as for
the UCMD archive. As an example, the proposed
MHCLN obtains a 40.66% higher mAP, for K=32,
when compared to the KSLSH under the same retrieval time.
In addition the proposed MHCLN yields a mAP of 91.14%,
which is 0.37% higher than that obtained by the SVM with
retrieval time reduced by one order of magnitude.
Table 2. mAP obtained by the DHNN [2] and the proposed
MHCLN with data augmentation for the UCMD archive.
              # Hash Bits K
Methods       K=32     K=64
DHNN [2]      0.939    0.971
Our MHCLN     0.993    0.995

4. CONCLUSION
In this paper, we have introduced a deep metric and hash-code
learning approach for fast and accurate image search
and retrieval in large RS data archives. The proposed approach
is defined based on two main stages. In the first stage,
an intermediate representation is obtained for each image in
the archive by exploiting a pre-trained network (Inception
Net), while in the second stage a hashing network is trained
by considering different losses (i.e., a triplet loss, a representation
penalty and a balancing loss) to represent each image
by a binary hash code. Experimental results point out that the
hash codes obtained by the proposed approach: 1) efficiently
characterize the complex content of RS images; 2) enable fast
image search and retrieval through compact hash codes; and
3) can be learnt using relatively few annotated training images.
As a future development of this work, we will extend
our approach to the framework of other deep networks.
5. ACKNOWLEDGEMENTS
This work was supported by the European Research Coun-
cil under the ERC Starting Grant BigEarth-759764. We also
want to thank the NVIDIA Corporation for the donation of
the GPUs used in this project.
6. REFERENCES
[1] B. Demir and L. Bruzzone, “Hashing-based scalable
remote sensing image search and retrieval in large
archives,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 54, no. 2, pp. 892–904, 2016.
[2] Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma, “Large-scale
remote sensing image retrieval by deep hashing neural
networks,” IEEE Transactions on Geoscience and Remote
Sensing, 2017.
[3] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the inception architecture for computer
vision,” in CVPR, 2016.
[4] H.-F. Yang, K. Lin, and C.-S. Chen, “Supervised learning
of semantics-preserving hash via deep convolutional
neural networks,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2017.
[5] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial
extensions for land-use classification,” in SIGSPATIAL,
2010, pp. 270–279.
[6] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L.
Zhang, and X. Lu, “AID: A benchmark data set for performance
evaluation of aerial scene classification,” IEEE
Transactions on Geoscience and Remote Sensing, 2017.