Roy, S., Sangineto, E., Sebe, N., & Demir, B. (2018). Semantic-Fusion Gans for Semi-Supervised Satellite

Image Classification. Presented at the 2018 25th IEEE International Conference on Image Processing

(ICIP). https://doi.org/10.1109/icip.2018.8451836

Subhankar Roy, Enver Sangineto, Begüm Demir, Nicu Sebe

Semantic-Fusion GANs for Semi-Supervised Satellite Image Classification

Conference paper | Accepted manuscript (Postprint)

This version is available at https://doi.org/10.14279/depositonce-9345.2

SEMANTIC-FUSION GANS FOR SEMI-SUPERVISED SATELLITE IMAGE

CLASSIFICATION

Subhankar Roy1, Enver Sangineto1, Begüm Demir2 and Nicu Sebe1

1Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy

2Faculty of Electrical Engineering and Computer Science, TU Berlin, Berlin, Germany

ABSTRACT

Most public satellite image datasets contain only a small number of annotated images. The lack of a sufficient quantity of labeled training data is a bottleneck for the use of modern deep-learning based classification approaches in this domain. In this paper we propose a semi-supervised approach to deal with this problem. We use the discriminator (D) of a Generative Adversarial Network (GAN) as the final classifier, and we train D using both labeled and unlabeled data. The main novelty we introduce is the representation of the visual information fed to D by means of two different channels: the original image and its “semantic” representation, the latter being obtained by means of an external network trained on ImageNet. The two channels are fused in D and jointly used to classify fake images, real labeled images and real unlabeled images. We show that, using only 100 labeled images, the proposed approach achieves an accuracy close to 69% and a significant improvement with respect to other GAN-based semi-supervised methods. Although we have tested our approach only on satellite images, we do not use any domain-specific knowledge. Thus, our method can be applied to other semi-supervised domains.

Index Terms: semi-supervised learning, generative adversarial networks, satellite image classification

1. INTRODUCTION

Satellite image classification is challenging in part because of the lack of large annotated training datasets, which has so far prevented the systematic adoption of modern deep-learning based approaches in this field. Common deep-learning methods (e.g., ResNets [1]) achieve a high image classification accuracy when trained in a supervised regime with plenty of annotated data [2]. However, although a few satellite datasets containing thousands of images have very recently been publicly released, most current application scenarios in this field are based on training datasets of only a few hundred labeled images.

On the other hand, recent trends in deep learning research have shown that deep networks can be trained in a semi-supervised regime. For instance, Salimans et al. [3] showed that Generative Adversarial Networks (GANs) [4] can be used to boost the accuracy of a classifier using semi-supervised data. The main idea is that the classifier corresponds to the discriminator D of a GAN, trained together with a generator G. However, different from a standard GAN, where D is asked to discriminate between “real” and “fake” images (the latter being produced by G), in the semi-supervised framework proposed in [3], D is also asked to predict the correct class of the subset of images which are associated with labels. Intuitively, the gain comes from the exploitation of the additional unlabeled images, from which D needs to extract dataset-specific visual information which allows it to discriminate these images from the fake ones.

Fig. 1. SF-GAN overview: The generator G produces fake images by sampling from the noise distribution $p_z$. The discriminator D has access to $X_{real}$ (containing both labeled and unlabeled images) and $X_{fake}$, as well as their semantic representation $s(\cdot)$, obtained using a pre-trained deep network. D outputs a probability distribution over K + 1 classes, where the first K classes are real and the final class is fake.

In this paper we build on this idea by adding semantics. Specifically, we exploit an external network, trained on ImageNet (which contains no satellite images), to extract generic visual information from our domain-specific images. We feed the satellite images to the Inception Net [5] and we extract a high-level representation of these images using the activation values of its last convolutional layer. Then we fuse this representation with an analogous representation obtained in the last convolutional layer of D. In this way, the decision of D depends (also) on generic visual semantics, extracted by means of the Inception Net, where the latter leverages the large dataset (ImageNet) it has been trained on (see Fig. 1). We call this approach Semantic Fusion GAN (SF-GAN) and we empirically show that SF-GAN achieves a large accuracy boost with respect to both “standard” supervised-trained deep networks and semi-supervised GANs, especially when the cardinality of the labeled training subset is very small.

2. RELATED WORK

Semi-supervised learning has been largely addressed in past years using kernel-based methods. For instance, Chang and Chen [6] extend Locality-Constrained Linear Coding (LLC) [7] to a semi-supervised scenario where a kernelized LLC is used to learn the underlying data manifold, given only a subset of labeled images. Blanchart and Datcu [8] use SVMs in a semi-supervised setting for satellite image classification.

More recently, Salimans et al. [3] showed that the combination of a supervised and an unsupervised loss in a GAN framework helps boost accuracy on the target classification problem (more details in Sec. 3). Springenberg [9] extends this idea by combining the optimization of the Shannon entropy as the adversarial objective with the minimization of the cross-entropy loss for the labeled samples. The feature matching loss, introduced in [3], which compares real and fake images using an intermediate layer of the discriminator, is extended in [10] (perceptual loss) using the feature space of a layer of an externally-trained network. We also use an externally-trained network to inject “semantics” into our framework. However, while the perceptual loss in [10] can be used only for conditional GANs, in which the generator’s outcome depends on a real input image, our SF-GAN operates in an unconditional regime. Moreover, differently from [10], the external network in our case is not used as an auxiliary loss function but for providing semantic information to aid the discriminator decision.

Semi-supervised classification using GANs is also proposed in [11], where the discriminator outputs a multi-class probability distribution. Unsupervised and fully-supervised learning are combined in [12] in a two-stage approach. In the first stage, unlabeled data are used in the GAN setting to train the discriminator D. Once fully trained, D is used as a feature extractor to obtain a representation of the labeled samples. In the second stage these representations are used to train an SVM classifier in a standard supervised framework.

3. PROBLEM SETTING

In this section we review the standard GAN [4] and the semi-supervised GAN approach [3] and we introduce our notation. Our proposed SF-GAN is presented in the next section.

Let $X = \{x_1, ..., x_N\}$ be the set of training images, only part of which are associated with class labels. Specifically, $X_l = \{x_1, ..., x_M\}$ is the subset of images associated with labels, which are collected in $Y = \{y_1, ..., y_M\}$, $y_i \in \{1, ..., K\}$. On the other hand, $X_u = \{x_{M+1}, ..., x_N\}$ is the subset of unlabeled images, where typically $M \ll N$. The goal of a semi-supervised approach is to train a classifier simultaneously exploiting both $(X_l, Y)$ and $X_u$.
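The notation above can be made concrete with a minimal sketch that simulates the semi-supervised setting from a fully labeled pool. The function name `split_semi_supervised`, the `seed` parameter, and the uniform random choice of the labeled subset are illustrative assumptions; the text does not specify how the M labeled images are selected.

```python
import random


def split_semi_supervised(images, labels, num_labeled, seed=0):
    """Return (x_l, y, x_u): M labeled images, their labels, and N - M unlabeled images."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    if not 0 <= num_labeled <= len(images):
        raise ValueError("num_labeled must lie between 0 and N")
    indices = list(range(len(images)))
    random.Random(seed).shuffle(indices)  # assumption: uniform random selection
    labeled, unlabeled = indices[:num_labeled], indices[num_labeled:]
    x_l = [images[i] for i in labeled]
    y = [labels[i] for i in labeled]
    x_u = [images[i] for i in unlabeled]  # the labels of these images are discarded
    return x_l, y, x_u


# Toy pool: N = 20 placeholder "images" with labels in {1, ..., K}, K = 4.
images = [f"img_{i}" for i in range(20)]
labels = [(i % 4) + 1 for i in range(20)]
x_l, y, x_u = split_semi_supervised(images, labels, num_labeled=5)
print(len(x_l), len(y), len(x_u))  # 5 5 15
```

A class-balanced selection would be an equally plausible choice when M is very small; the sketch keeps the simpler uniform draw.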

The standard GAN framework consists of two antagonistic networks: a generator G and a discriminator D. G takes as input a noise vector, randomly generated using an a-priori distribution ($z \sim p_z$), and deterministically generates a fake image $\hat{x} = G(z; \theta_G)$, typically using an up-convolutional neural network [12], where $\theta_G$ are the parameters of G. On the other hand, D takes as input an image, which is either real, $x$, or fake, $\hat{x}$. The outcome of D is a binary classification probability of the input image being extracted from the real dataset or produced by G, which can be denoted as $p_D(x) = D(x; \theta_D)$, $\theta_D$ being the parameters of D. The goal of D is to assign a high probability to $x \sim p_{data}$ and a low probability to $\hat{x} = G(z)$, $z \sim p_z$. On the other hand, G aims to maximize the probability of the fake images being classified as real without having access to the real data. The overall GAN objective function can be written as follows:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log(D(x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

Salimans et al. [3] extend the above framework to deal with semi-supervised learning by adding K final neurons to D, one per target class. The outcome of D is now a multi-class prediction represented by a (K + 1)-dimensional logit output which comprises K real classes and a (K + 1)-th class representing the fake images. The loss function of D is consequently split into a supervised and an unsupervised loss: $L_D = L_{sup} + L_{unsup}$, where:

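$$L_{sup} = -\mathbb{E}_{x, y \sim p_{data}(x, y)}[\log(p_D(y \mid x, y < K + 1))] \tag{2}$$

and

$$L_{unsup} = -\mathbb{E}_{x \sim p_{data}(x)}[\log(1 - p_D(y = K + 1 \mid x))] - \mathbb{E}_{z \sim p_z(z)}[\log(p_D(y = K + 1 \mid G(z)))] \tag{3}$$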
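As an illustration of Eqs. 2 and 3, the following sketch evaluates both losses directly from (K + 1)-dimensional logits, with the expectations replaced by batch averages. It is a minimal sketch and not the authors' implementation: the function names are hypothetical, the fake class is assumed to be the last logit, and labels are assumed to be 0-based indices into the K real classes. Conditioning on y < K + 1 in Eq. 2 amounts to a softmax over the first K logits only.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def supervised_loss(logits, label):
    """Eq. 2 for one labeled image: -log p_D(y | x, y < K + 1).

    `logits` has K + 1 entries (the last one is the fake class) and `label`
    is a 0-based index into the K real classes.
    """
    real_logits = logits[:-1]
    return logsumexp(real_logits) - real_logits[label]


def unsupervised_loss(real_logits_batch, fake_logits_batch):
    """Eq. 3, with the expectations replaced by batch averages."""
    # -log(1 - p_D(y = K + 1 | x)) for real images
    real_term = sum(logsumexp(l) - logsumexp(l[:-1]) for l in real_logits_batch)
    # -log p_D(y = K + 1 | G(z)) for generated images
    fake_term = sum(logsumexp(l) - l[-1] for l in fake_logits_batch)
    return real_term / len(real_logits_batch) + fake_term / len(fake_logits_batch)


# Toy example with K = 2 real classes (three logits per image).
real_batch = [[2.0, 0.0, -1.0]]
fake_batch = [[0.0, 0.0, 3.0]]
print(supervised_loss(real_batch[0], label=0))
print(unsupervised_loss(real_batch, fake_batch))
```

Working with log-sum-exp differences avoids computing 1 - p explicitly, which would lose precision when the fake-class probability is close to 1.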

The loss function of G remains unchanged. In the next section we show how to modify the posterior probabilities computed by D (i.e., $p_D(x)$) in order to embed visual semantics extracted from a generic, external network.

4. PROPOSED SF-GAN

The main idea behind SF-GAN is to enrich the image representation fed to D using generic visual semantics extracted by means of an external network, trained on a generic, large and fully-supervised dataset (ImageNet). Specifically, let $s(x)$ be the vector of the activation values of the last convolutional layer (Mixed 7c) of the Inception Net [5] when image $x$ is given as input. We write $D(x, s(x))$ to highlight the dependence of D on both the original image $x$ and its semantic representation $s(x)$ (see below for details). The posterior probability of class $k$ is computed using:

$$p_D(y = k \mid x, s(x)) = \frac{e^{D_k(x, s(x))}}{\sum_{k'=1}^{K+1} e^{D_{k'}(x, s(x))}}, \tag{4}$$

where $D_k(\cdot, \cdot)$ is the score assigned to class $k$ by D. Using Eq. 4 to compute $p_D(\cdot)$ in Eqs. 2-3, we obtain our discriminator loss. For training G, we use a standard generator loss with the addition of the feature matching loss (see Sec. 2).
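For reference, the feature matching term mentioned above can be sketched as follows, using the definition given in [3]: the squared L2 distance between the batch-mean intermediate features of real and generated images. This is an assumption about the exact form used here; the text does not state the norm or the weighting applied in SF-GAN, and the function name is hypothetical.

```python
def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between batch-mean features of real and generated images.

    Each argument is a list of equal-length feature vectors (one per image),
    e.g. intermediate activations of the discriminator.
    """
    dim = len(real_features[0])
    mean_real = [sum(f[j] for f in real_features) / len(real_features) for j in range(dim)]
    mean_fake = [sum(f[j] for f in fake_features) / len(fake_features) for j in range(dim)]
    return sum((r - f) ** 2 for r, f in zip(mean_real, mean_fake))


# Toy batches of two 2-dimensional feature vectors each.
real = [[1.0, 2.0], [3.0, 4.0]]  # batch mean: [2.0, 3.0]
fake = [[0.0, 0.0], [2.0, 2.0]]  # batch mean: [1.0, 1.0]
print(feature_matching_loss(real, fake))  # 5.0
```

The generator minimizes this quantity, so it is pushed to match the first-order statistics of the discriminator features rather than to fool the final output directly.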

Fig. 2. The proposed SF-GAN discriminator D takes as input both a 64 × 64 × 3 RGB image $x$ and its semantic representation $s(x)$ and outputs a (K + 1)-dimensional logit. The vector $s(x)$ is fused with $f(x)$, the internal representation of $x$, in the penultimate layer of D.

4.1. THE DISCRIMINATOR ARCHITECTURE

As shown in Fig. 2, D takes as input an RGB image (either real or fake) of spatial dimension 64 × 64. This input is passed through a sequence of convolutional layers, batch normalizations and Leaky ReLU non-linearities, finally producing a 4 × 4 × 128 tensor, where 4 × 4 is the spatial resolution and 128 is the number of feature maps. We extract a representation $f(x)$ from this tensor using Global Average Pooling (GAP) [13]. GAP averages the information content of the feature maps spatially, each map being averaged independently of the others. In our case, the content of each feature map is averaged over the 4 × 4 spatial grid to produce a single scalar value. $f(x)$ is the concatenation of all the 128 average values and is further concatenated with $s(x)$. The latter is obtained by feeding $x$ to a pre-trained Inception Net. Using the last convolutional layer of the Inception Net we obtain a representation of $x$ as a tensor of dimension 8 × 8 × 2048. As in D, we apply GAP to this second tensor to get a 2048-dimensional feature vector $s(x)$. After fusion, $[f(x), s(x)]$ is processed by a final fully-connected layer which outputs the (K + 1)-dimensional logit.
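The fusion step described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow (GAP on both tensors, concatenation, one fully-connected layer, softmax as in Eq. 4); the tensors are filled with placeholder constants, the weights are untrained zeros, and all function names are hypothetical.

```python
import math


def global_average_pool(tensor):
    """Average an H x W x C tensor (nested lists) over its H x W grid, per channel."""
    height, width, channels = len(tensor), len(tensor[0]), len(tensor[0][0])
    sums = [0.0] * channels
    for row in tensor:
        for cell in row:
            for c, value in enumerate(cell):
                sums[c] += value
    return [total / (height * width) for total in sums]


def fuse(d_tensor, inception_tensor):
    """Return [f(x), s(x)]: the concatenation of the two pooled vectors."""
    return global_average_pool(d_tensor) + global_average_pool(inception_tensor)


def class_probabilities(fused, weights, biases):
    """Final fully-connected layer followed by the softmax of Eq. 4."""
    logits = [sum(w * v for w, v in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


K = 10  # number of real classes; the extra output is the fake class
d_tensor = [[[1.0] * 128 for _ in range(4)] for _ in range(4)]           # placeholder for the 4 x 4 x 128 output of D
inception_tensor = [[[0.5] * 2048 for _ in range(8)] for _ in range(8)]  # placeholder for the 8 x 8 x 2048 Inception output
fused = fuse(d_tensor, inception_tensor)
weights = [[0.0] * len(fused) for _ in range(K + 1)]  # hypothetical, untrained weights
biases = [0.0] * (K + 1)
probs = class_probabilities(fused, weights, biases)
print(len(fused), len(probs))  # 2176 11
```

The fused vector has 128 + 2048 = 2176 entries, which matches the input size of the last fully-connected layer listed for D.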

| Generator layer | Configuration | Discriminator layer | Configuration |
|---|---|---|---|
| FC 1 | 2048 | Conv 1 | filter: 64x[3,3,3]; stride: 2 |
| UpConv 1 | filter: 64x[5,5,128]; stride: 0.5 | Conv 2 | filter: 64x[3,3,64]; stride: 2 |
| UpConv 2 | filter: 32x[5,5,64]; stride: 0.5 | Conv 3 | filter: 64x[3,3,64]; stride: 2 |
| UpConv 3 | filter: 32x[5,5,32]; stride: 0.5 | Conv 4 | filter: 64x[3,3,64]; stride: 2 |
| UpConv 4 | filter: 3x[5,5,32]; stride: 0.5 | Conv 5 | filter: 128x[3,3,64]; stride: 1 |
| - | - | Conv 6 | filter: 128x[3,3,128]; stride: 1 |
| - | - | Conv 7 | filter: 128x[3,3,128]; stride: 1 |
| - | - | Avg pool 7 | pool: 4x4 |
| - | - | FC 8 | 2176 (=128+2048) |

Table 1. Details of G and D. The filter configuration is described as: number of filters x [height, width, input channels].
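The layer configurations in Table 1 can be checked against the tensor sizes quoted in the text with a short calculation. The sketch assumes 'same' padding (output size = ceil(input size / stride)) and that the 2048 outputs of FC 1 are reshaped to a 4 × 4 × 128 tensor; neither is stated explicitly in the text, though the second is consistent with the 128 input channels of UpConv 1. The helper name `trace_sizes` is hypothetical.

```python
import math


def trace_sizes(size, strides):
    """Spatial size after each layer, assuming 'same' padding (an assumption)."""
    sizes = [size]
    for stride in strides:
        size = math.ceil(size / stride)  # a fractional stride of 0.5 doubles the size
        sizes.append(size)
    return sizes


disc_sizes = trace_sizes(64, [2, 2, 2, 2, 1, 1, 1])  # Conv 1 to Conv 7
gen_sizes = trace_sizes(4, [0.5, 0.5, 0.5, 0.5])     # UpConv 1 to UpConv 4
print(disc_sizes)  # [64, 32, 16, 8, 4, 4, 4, 4]
print(gen_sizes)   # [4, 8, 16, 32, 64]
print(4 * 4 * 128, 128 + 2048)  # 2048 2176
```

Under these assumptions the discriminator reduces a 64 × 64 input to the 4 × 4 grid pooled by the 4x4 average pooling layer, and the generator grows a 4 × 4 grid back to 64 × 64.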

4.2. IMPLEMENTATION DETAILS

Since the number of labeled images is usually small, we use dropout [14] in the discriminator network to help regularize the learning process. We do not use batch normalization in the intermediate layer (Conv 7) utilized for computing the feature matching loss. This is done in order to keep the mean of the intermediate features of the real data different from that of the generated samples.

The generator G is a standard DCGAN [12] network composed of a sequence of up-convolutional layers with fractional stride, each layer except the last being followed by a batch normalization layer and a Leaky ReLU non-linearity. Table 1 shows the architectural details of both G and D.

5. EXPERIMENTAL RESULTS

In our experiments we use the recently published EuroSAT dataset [15], composed of 27,000 annotated satellite images acquired by the Sentinel-2 satellite and grouped into 10 different land-use categories, where each image belongs to a single category (e.g., “Industrial”, “Residential”, etc.). Each image consists of 13 bands; however, as in [15], we consider only the RGB bands in our experiments. The image spatial resolution is 64 × 64. Following the protocol in [15], we use 21,600 images for training. Moreover, we further split the remaining 5,400 images into 4,860 samples used for testing and 540 images used for validation.

Note that this dataset is much larger than common public satellite image datasets; we chose EuroSAT in order to show results obtained by varying the amount of labels accessible during the training process. Specifically, we simulate a scenario in which we have access to only a limited amount of labeled data M ($M = |X_l|$, see Sec. 3), varying M between 100 and 21,600. For a fixed value of M, the remaining training data are used without labels ($X_u$). We train our SF-GAN¹ using Adam with $\beta_1 = 0.5$ and $\beta_2 = 0.9$ and a batch size of 128. G and D are trained for 30 epochs, and in every epoch the learning rate is shrunk by a factor of 0.9, starting from an initial value of $3 \times 10^{-4}$.

| Method | Training regime | M = 100 (0.46%) | M = 1000 (4.6%) | M = 2000 (9.25%) | M = 21,600 (100%) |
|---|---|---|---|---|---|
| CNN (from scratch) | Supervised | 29.3 | 46.1 | 59.0 | 83.2 |
| Inception Net [5] (fine tuned) | Supervised | 63.9 | 84.6 | 87.9 | 91.5 |
| SS-GAN [3] | Semi-supervised | 63.0 | 75.8 | 78.3 | 86.9 |
| Proposed SF-GAN | Semi-supervised | 68.6 | 86.1 | 89.0 | 93.2 |

Table 2. Classification accuracy (%) on the EuroSAT test set for different numbers of labels M (in parentheses, the percentage over the full training set).
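The learning-rate decay used during training can be written as a one-line schedule. The sketch assumes that the decay is applied once after each completed epoch, so that the first epoch runs at the initial value; the text does not say whether the first decay precedes or follows the first epoch, and the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=3e-4, decay=0.9):
    """Learning rate used in `epoch` (0-based), decayed once per completed epoch."""
    return initial_lr * decay ** epoch


schedule = [learning_rate(e) for e in range(30)]  # 30 training epochs
print(f"first epoch: {schedule[0]:.2e}, last epoch: {schedule[-1]:.2e}")
```

In a deep-learning framework the same behaviour is usually obtained with an exponential-decay scheduler stepped once per epoch.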

We compare the classification accuracy of SF-GAN with: 1) a Convolutional Neural Network (CNN) trained from scratch, with a network capacity similar to that of the SF-GAN discriminator; 2) Inception Net [5] with its final layer fine-tuned on EuroSAT; and 3) the Semi-Supervised GAN (SS-GAN) approach proposed in [3]. Note that, to the best of our knowledge, no other semi-supervised method has been tested on EuroSAT yet. The results reported in Table 2 show that, as expected, when M = 100, the CNN trained from scratch performs very poorly. Note that, since the CNN is trained in a fully-supervised fashion, it cannot use $X_u$. The same applies to the fine-tuning of the Inception Net. Conversely, with the same number of labeled images, M = 100, the proposed SF-GAN surpasses all the other classification methods, including SS-GAN [3]. As we increase the number of labels M, the accuracy increases monotonically for every method. For instance, at M = 2,000, the accuracy of the fine-tuned Inception Net (87.9%) comes close to that of our method (89.0%). However, when compared to [3], our method is still 10.7 percentage points better. Interestingly, SF-GAN achieves a higher accuracy than Inception Net even when all the training data are associated with their corresponding labels. This is likely due to the fact that the discriminator D in SF-GAN has access to $X_{fake}$ (see Fig. 1), an additional source of information which is not available to the Inception Net, and needs to additionally discriminate fake images from real ones.

Finally, in our experiments we observed that SF-GAN converges in fewer epochs than SS-GAN [3]. As shown in Fig. 3, the accuracy on the validation set of our SF-GAN converges after epoch 9, whereas that of SS-GAN is still rising even after the 15-th epoch. Note that the Inception Net needs 200 epochs to converge; however, since only the last layer is involved in the fine-tuning process, its overall training time is shorter. Both the faster convergence and the higher final accuracy of SF-GAN with respect to SS-GAN show that the injection of semantic information into D helps the discriminator (and, consequently, also the generator) to quickly learn the underlying real data distribution.

¹ Code is available at https://github.com/MLEnthusiast/SFGAN

Fig. 3. Accuracy on the validation set over different training epochs of the tested methods when M = 100.

6. CONCLUSIONS

In this paper we proposed SF-GAN, a semi-supervised classification approach based on a GAN framework, for satellite image classification with scarcity of annotated data. The SF-GAN discriminator fuses the high-level representation of an image, obtained using a pre-trained, external deep network, with the image representation of the standard DCGAN discriminator. Experimental results show that the proposed architecture: 1) achieves a significantly higher overall accuracy when compared with other semi-supervised and fully-supervised classification methods, especially in a scenario in which only a few images are annotated; 2) achieves a faster convergence while training.

Even if the proposed method has been tested with satellite images, no domain-specific constraint or a-priori knowledge is used in our approach. Consequently, we believe that SF-GANs can be easily adopted in other semi-supervised image classification tasks.

Acknowledgements: This work was supported by the European Research Council under the ERC Starting Grant BigEarth-759764. We also want to thank the NVIDIA Corporation for the donation of the GPUs used in this project.

7. REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[3] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.

[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.

[6] Y.-J. Chang and T. Chen, “Semi-supervised learning with kernel locality-constrained linear coding,” in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 2977–2980.

[7] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3360–3367.

[8] P. Blanchart and M. Datcu, “A semi-supervised algorithm for auto-annotation and unknown structures discovery in satellite image databases,” IEEE journal of selected topics in applied earth observations and remote sensing, vol. 3, no. 4, pp. 698–717, 2010.

[9] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” arXiv preprint arXiv:1511.06390, 2015.

[10] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.

[11] A. Odena, “Semi-supervised learning with generative adversarial networks,” arXiv preprint arXiv:1606.01583, 2016.

[12] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.

[13] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.

[14] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[15] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” arXiv preprint arXiv:1709.00029, 2017.