This repo implements several deep learning methods for semantic segmentation of remote sensing images.
- Python 3.x
- Tensorflow (<2.0)
- SKLearn
- Scipy
- SKImage
- ImageIO
```
pip install scikit-learn scipy scikit-image imageio
```
You can train a new model using the following command:
```
python3 main.py --operation training \
--output_path fcn/256/ \
--dataset arvore \
--dataset_input_path datasets/arvore/image.tif \
--dataset_gt_path datasets/arvore/gt.tif \
--num_classes 2 \
--model_name fcn_50_3_2x \
--learning_rate 0.01 \
--weight_decay 0.05 \
--batch_size 16 \
--niter 200000 \
--reference_crop_size 256 \
--reference_stride_crop 256 \
--values 256 \
--distribution_type single_fixed
```
where:
- `operation` represents the process that will be performed. Two options: `training`, to train a new model, and `generate_map`, to classify the whole image and generate an output map using a trained model (the flag `--model_path` must be informed in this case; see the example after this list)
- `output_path` is the path to save models, images, etc.
- `dataset` is the name of the current dataset. This is only used in the `dataloaders/factory.py`
- `dataset_input_path` is the path to the input image
- `dataset_gt_path` is the path to the input ground-truth
- `num_classes` is the number of classes
- `model_name` is the model name. This is only used in the `networks/factory.py`. Options are:
  - For FCN: `fcn_25_3_2x_icpr`, `fcn_50_3_2x`
  - For U-Net: `unet`
  - For SegNet: `segnet`, `segnet_4`
  - For DeepLabV3+: `deeplabv3+`
  - For DCNN: `dilated_grsl_rate8`
- `learning_rate` corresponds to the learning rate used in the Stochastic Gradient Descent
- `weight_decay` represents the weight decay used to regularize the learning
- `batch_size` is the size of the batch
- `niter` is the number of training iterations of the algorithm (used instead of epochs)
- `reference_crop_size` represents the reference crop size used to map the input. This will be used to probe the input image, creating a set of positions (x, y) that will be further used to generate the patches during the processing. This must be an integer.
- `reference_stride_crop` represents the reference stride size used to map the input. This must be an integer.
- `distribution_type` represents the probability distribution that should be used to select the patch sizes. Options are: `single_fixed`, which uses one patch size (provided in the flag `values`) during the whole training, and `multi_fixed`, which equally divides the probability among the provided values.
- `values` represents the patch sizes that will be used (together with the distribution) during the processing. This can be a single value if used with `distribution_type = single_fixed`, or a sequence of integers separated by commas (e.g., 50,75,100) if used with the other distributions.
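To classify a whole image with a trained model, run the same script in `generate_map` mode. Aside from `--operation` and `--model_path`, which are documented above, the exact set of flags below simply mirrors the training example and is an assumption (the `--model_path` value is a placeholder):

```
python3 main.py --operation generate_map \
--output_path fcn/256/ \
--dataset arvore \
--dataset_input_path datasets/arvore/image.tif \
--dataset_gt_path datasets/arvore/gt.tif \
--num_classes 2 \
--model_name fcn_50_3_2x \
--model_path fcn/256/model \
--batch_size 16 \
--reference_crop_size 256 \
--reference_stride_crop 256 \
--values 256 \
--distribution_type single_fixed
```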
To use your own data, you need to implement a dataloader (in the dataloaders folder) and map this dataloader in the `dataloaders/factory.py`. This dataloader is responsible for loading the original input data and labels into memory and for manipulating these data. Your dataloader needs to have, at least, the following attributes:
- `data`, or `train_data` and `test_data`, which will store the original data
- `labels`, or `train_labels` and `test_labels`, which will store the original ground-truth
- `train_distrib` and `test_distrib`, which will store the positions (x, y) used to dynamically create the patches during the processing. These can be created internally using a method similar to `dataloaders/utils.py -> create_distrib_multi_images()`; the flags `reference_crop_size` and `reference_stride_crop` are used to create them.
- `num_classes`, which will store the number of classes
- `_mean` and `_std`, which will store the mean and standard deviation for normalization purposes. These can be created internally using a method similar to `dataloaders/utils.py -> create_or_load_mean()`
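A minimal sketch of such a dataloader is shown below. The class name, constructor arguments, and the naive grid used to build the position lists are illustrative assumptions; only the attribute names follow the list above:

```python
import imageio
import numpy as np


class MyDataLoader(object):
    """Illustrative dataloader exposing the attributes expected by the training code."""

    def __init__(self, dataset_input_path, dataset_gt_path,
                 reference_crop_size, reference_stride_crop):
        # Load the original input image and ground-truth into memory.
        self.data = imageio.imread(dataset_input_path).astype(np.float32)
        self.labels = imageio.imread(dataset_gt_path).astype(np.int32)

        self.num_classes = 2  # number of classes in the ground-truth

        # Positions (x, y) used to dynamically crop the patches during processing.
        # The repository offers helpers such as
        # dataloaders/utils.py -> create_distrib_multi_images(); this is a naive grid.
        positions = [(x, y)
                     for x in range(0, self.data.shape[0] - reference_crop_size + 1,
                                    reference_stride_crop)
                     for y in range(0, self.data.shape[1] - reference_crop_size + 1,
                                    reference_stride_crop)]
        self.train_distrib = positions
        self.test_distrib = positions

        # Mean/std for normalization (cf. dataloaders/utils.py -> create_or_load_mean()).
        self._mean = np.mean(self.data, axis=(0, 1))
        self._std = np.std(self.data, axis=(0, 1))
```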
After creating your dataloader, just run the code and everything else should work.
Pixelwise classifies each pixel of the input image independently. Precisely, each pixel is represented by a context window, i.e., an overlapping fixed-size patch centered on that pixel, which helps to capture the spatial patterns around it. Note that these context windows are really necessary because the pixel itself does not have enough information to be classified. Such patches are, in fact, used to train and evaluate the network. In both processes, the ConvNet outputs a class for each input context window, which is then assigned to the central pixel of the window.
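As an illustration of the idea (not the repository's actual implementation), the sketch below extracts the context window of a given pixel, padding the image borders so that every pixel gets a full-size window:

```python
import numpy as np


def context_window(image, row, col, window_size=25):
    """Return the fixed-size patch centered on pixel (row, col)."""
    half = window_size // 2
    # Pad the spatial dimensions so border pixels also get full windows.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode='reflect')
    return padded[row:row + window_size, col:col + window_size, :]


# Example: the context window of pixel (10, 20) of a 3-band image.
img = np.random.rand(256, 256, 3).astype(np.float32)
patch = context_window(img, 10, 20)  # shape: (25, 25, 3)
```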
```
@inproceedings{knogueira_sibgrapi_2015,
  author={K. {Nogueira} and W. O. {Miranda} and J. A. D. {Santos}},
  booktitle={2015 28th SIBGRAPI Conference on Graphics, Patterns and Images},
  title={Improving Spatial Feature Representation from Aerial Scenes by Using Convolutional Networks},
  year={2015},
  pages={289-296},
  month={Aug}
}
```
The Fully Convolutional Network (FCN) was one of the first deep learning-based techniques proposed to perform semantic segmentation. This network extracts features and generates an initial coarse classification map using a set of convolutional layers that, due to their internal configuration, output a spatially reduced result (when compared to the original input). In order to restore the original resolution and output the thematic map, this approach employs deconvolution layers (also known as transposed convolutions) that learn how to upsample the initial classification map and produce the final dense prediction.
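A minimal sketch of this idea in TensorFlow 1.x (not the repository's `fcn_*` architectures; layer sizes are arbitrary assumptions): a few strided convolutions produce a coarse score map, and a transposed convolution learns to upsample it back to the input resolution.

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 256, 256, 3])
num_classes = 2

# Encoder: strided convolutions reduce the spatial resolution (256 -> 128 -> 64).
conv1 = tf.layers.conv2d(x, 32, 3, strides=2, padding='same', activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, 64, 3, strides=2, padding='same', activation=tf.nn.relu)

# Coarse per-class score map at 1/4 of the input resolution.
coarse = tf.layers.conv2d(conv2, num_classes, 1, padding='same')

# Learned upsampling (transposed convolution) restores the original resolution.
logits = tf.layers.conv2d_transpose(coarse, num_classes, 8, strides=4, padding='same')
# logits: [None, 256, 256, num_classes], one score per class per pixel.
```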
```
@inproceedings{long2015fully,
  title={Fully convolutional networks for semantic segmentation},
  author={Long, Jonathan and Shelhamer, Evan and Darrell, Trevor},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={3431--3440},
  year={2015}
}
```
U-Net was one of the first networks to propose an encoder-decoder architecture to perform semantic segmentation. In this design, the encoder is usually composed of several convolution and pooling layers and is responsible for extracting the features and generating an initial coarse prediction map. The decoder, commonly composed of convolution, deconvolution and/or unpooling layers, is responsible for further processing the initial prediction map, gradually increasing its spatial resolution and generating the final prediction. Note that, normally, the decoder can be seen as a mirrored/symmetrical version of the encoder, with the same number of layers but replacing some of the operations with their counterparts (i.e., convolution with deconvolution, pooling with unpooling, etc.).
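A toy sketch of this encoder/decoder symmetry in TensorFlow 1.x (layer counts and widths are assumptions, not the repository's `unet` model); U-Net additionally concatenates encoder features into the mirrored decoder stage:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 256, 256, 3])
num_classes = 2

# Encoder: convolution + pooling (256 -> 128 -> 64).
enc1 = tf.layers.conv2d(x, 32, 3, padding='same', activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(enc1, 2, 2)
enc2 = tf.layers.conv2d(pool1, 64, 3, padding='same', activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(enc2, 2, 2)

# Decoder: mirrored upsampling path (64 -> 128 -> 256) with skip connections.
up1 = tf.layers.conv2d_transpose(pool2, 64, 2, strides=2, padding='same')
dec1 = tf.layers.conv2d(tf.concat([up1, enc2], axis=-1), 64, 3,
                        padding='same', activation=tf.nn.relu)
up2 = tf.layers.conv2d_transpose(dec1, 32, 2, strides=2, padding='same')
dec2 = tf.layers.conv2d(tf.concat([up2, enc1], axis=-1), 32, 3,
                        padding='same', activation=tf.nn.relu)

logits = tf.layers.conv2d(dec2, num_classes, 1, padding='same')
```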
```
@inproceedings{ronneberger2015u,
  title={U-net: Convolutional networks for biomedical image segmentation},
  author={Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas},
  booktitle={International Conference on Medical image computing and computer-assisted intervention},
  pages={234--241},
  year={2015},
  organization={Springer}
}
```
SegNet is another type of encoder-decoder network proposed specifically for semantic segmentation. However, unlike the previous models, this network employs unpooling operations, instead of deconvolution layers, in the decoder to increase the spatial resolution of the coarse map generated by the encoder.
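A small NumPy illustration of the unpooling idea (pooling indices are stored during encoding and reused to place values back during decoding); this is a conceptual sketch, not the repository's `segnet` implementation:

```python
import numpy as np


def max_pool_with_indices(x, k=2):
    """k x k max pooling over a 2-D map that also records where each max came from."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k), dtype=x.dtype)
    indices = np.zeros((h // k, w // k), dtype=np.int64)  # flat positions in x
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            local = np.argmax(window)
            pooled[i, j] = window.flat[local]
            indices[i, j] = (i * k + local // k) * w + (j * k + local % k)
    return pooled, indices


def unpool_with_indices(pooled, indices, shape):
    """SegNet-style unpooling: values return to their recorded positions, zeros elsewhere."""
    out = np.zeros(shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out


feat = np.random.rand(4, 4).astype(np.float32)
pooled, idx = max_pool_with_indices(feat)                 # (2, 2) map + argmax positions
restored = unpool_with_indices(pooled, idx, feat.shape)   # sparse (4, 4) map
```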
```
@article{badrinarayanan2017segnet,
  title={Segnet: A deep convolutional encoder-decoder architecture for image segmentation},
  author={Badrinarayanan, Vijay and Kendall, Alex and Cipolla, Roberto},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={39},
  number={12},
  pages={2481--2495},
  year={2017},
  publisher={IEEE}
}
```
DeepLabV3+ is another encoder-decoder network. In order to aggregate multi-scale information, this method uses: (i) features extracted from multiple layers, and (ii) multiple parallel dilated convolutions in a module called Atrous Spatial Pyramid Pooling.
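A simplified sketch of the Atrous Spatial Pyramid Pooling idea in TensorFlow 1.x (the branch count and dilation rates are assumptions, not the exact `deeplabv3+` configuration):

```python
import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 64, 64, 256])  # encoder features

# Parallel branches with different dilation (atrous) rates capture multiple scales.
branch1 = tf.layers.conv2d(features, 256, 1, padding='same', activation=tf.nn.relu)
branch2 = tf.layers.conv2d(features, 256, 3, padding='same', dilation_rate=6,
                           activation=tf.nn.relu)
branch3 = tf.layers.conv2d(features, 256, 3, padding='same', dilation_rate=12,
                           activation=tf.nn.relu)
branch4 = tf.layers.conv2d(features, 256, 3, padding='same', dilation_rate=18,
                           activation=tf.nn.relu)

# Image-level context: global average pooling, then broadcast back to the feature size.
image_pool = tf.reduce_mean(features, axis=[1, 2], keepdims=True)
image_pool = tf.layers.conv2d(image_pool, 256, 1, activation=tf.nn.relu)
image_pool = tf.image.resize_bilinear(image_pool, tf.shape(features)[1:3])

# Concatenate all branches and fuse them with a 1x1 convolution.
aspp = tf.concat([branch1, branch2, branch3, branch4, image_pool], axis=-1)
aspp = tf.layers.conv2d(aspp, 256, 1, padding='same', activation=tf.nn.relu)
```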
```
@inproceedings{chen2018encoder,
  title={Encoder-decoder with atrous separable convolution for semantic image segmentation},
  author={Chen, Liang-Chieh and Zhu, Yukun and Papandreou, George and Schroff, Florian and Adam, Hartwig},
  booktitle={Proceedings of the European conference on computer vision (ECCV)},
  pages={801--818},
  year={2018}
}
```
The Dynamic Dilated ConvNet (DDCNN) proposes a novel multi-scale training strategy that uses dynamically generated input images to train, until convergence, a dilated model that never downsamples the input data. Technically, this technique receives as input the original images and a probability distribution over the possible input sizes, i.e., over the sizes that might be used to generate the input patches. In each iteration of the training procedure, a size is randomly selected from this distribution and is then used to create an entirely new batch. By processing these batches, each composed of several images with one specific pre-selected size, the model is capable of capturing multi-scale information. Furthermore, in the prediction step, the algorithm selects the best resolution based on scores accumulated during the training phase for each evaluated input size. The technique then processes the testing images using batches composed of images of the best-evaluated size.
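A tiny sketch of the dynamic batch-creation step (illustrative only; the patch sizes, probabilities, and variable names are assumptions). At each iteration, one patch size is drawn from the distribution and the whole batch is cropped at that size:

```python
import numpy as np

patch_sizes = [50, 75, 100]            # candidate input sizes (the --values flag)
probabilities = [1 / 3, 1 / 3, 1 / 3]  # e.g., multi_fixed: equal probability per size

image = np.random.rand(512, 512, 3).astype(np.float32)
batch_size = 16

for iteration in range(3):
    # Draw one size per iteration; the whole batch uses this single size.
    size = np.random.choice(patch_sizes, p=probabilities)
    xs = np.random.randint(0, image.shape[0] - size, batch_size)
    ys = np.random.randint(0, image.shape[1] - size, batch_size)
    batch = np.stack([image[x:x + size, y:y + size, :] for x, y in zip(xs, ys)])
    print(iteration, batch.shape)  # e.g., (16, 75, 75, 3)
```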
```
@article{knogueira_tgrs_2019,
  author={K. Nogueira and M. Dalla Mura and J. Chanussot and W. R. Schwartz and J. A. dos Santos},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={Dynamic Multicontext Segmentation of Remote Sensing Images Based on Convolutional Networks},
  year={2019},
  volume={57},
  number={10},
  pages={7503--7520},
  month={Oct}
}
```