Authors: Kaihua Zhang, Jin Chen, Bo Liu, Qingshan Liu
Object co-segmentation aims to segment the objects shared across multiple relevant images, and has numerous applications in computer vision. This paper presents a spatial and semantic modulated deep network framework for object co-segmentation. A backbone network extracts multi-resolution image features. Taking the multi-resolution features of the relevant images as input, a spatial modulator learns a mask for each image: it captures the correlations among image feature descriptors via unsupervised learning, and the learned mask roughly localizes the shared foreground object while suppressing the background. The semantic modulator is formulated as a supervised image classification task, for which we propose a hierarchical second-order pooling module that transforms the image features for classification. The outputs of the two modulators manipulate the multi-resolution features through a shift-and-scale operation so that the features focus on segmenting co-object regions. The proposed model is trained end-to-end without any intricate post-processing. Extensive experiments on four image co-segmentation benchmark datasets demonstrate the superior accuracy of the proposed method compared to state-of-the-art methods.
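The shift-and-scale operation is a FiLM-style feature modulation. Below is a minimal sketch, assuming the semantic modulator emits a per-channel scale and the spatial modulator's mask acts as a spatial shift; the shapes and names are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ShiftScaleModulation(nn.Module):
    """Minimal sketch of feature modulation: out = gamma * feat + beta.

    gamma/beta stand in for the outputs of the semantic and spatial
    modulators; their exact shapes in the paper may differ.
    """

    def forward(self, feat, gamma, beta):
        # feat:  (N, C, H, W) backbone features
        # gamma: (N, C, 1, 1) channel-wise scale from the semantic modulator
        # beta:  (N, 1, H, W) spatial shift from the spatial modulator's mask
        return gamma * feat + beta

if __name__ == "__main__":
    mod = ShiftScaleModulation()
    feat = torch.randn(2, 64, 32, 32)
    gamma = torch.randn(2, 64, 1, 1)
    beta = torch.randn(2, 1, 32, 32)
    print(mod(feat, gamma, beta).shape)  # torch.Size([2, 64, 32, 32])
```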
To compare fairly with recent deep learning methods, we conduct extensive evaluations on four widely used benchmark datasets: a subset of MSRC, Internet, a subset of iCoseg, and PASCAL-VOC. Among them:
- The MSRC subset includes 7 classes (bird, car, cat, cow, dog, plane, sheep), each containing 10 images.
- The Internet dataset has 3 categories: airplane, car, and horse. Each class has 100 images, including some with noisy labels.
- The iCoseg subset contains 8 categories, each with a different number of images.
- PASCAL-VOC is the most challenging dataset, with 1037 images of 20 categories selected from the PASCAL-VOC 2010 dataset.
- VGG16-backbone: Google Drive.
- HRNet-backbone: Google Drive.
- Ubuntu 16.04, Nvidia RTX 2080Ti
- Python 3
- PyTorch>=1.0, TorchVision>=0.2.2
- Numpy==1.16.2, Pillow, pycocotools
- Download the dataset we have processed from Google Drive.
- Download the VGG16-backbone pretrained model from Google Drive.
- Modify the path config in coseg_test.py and run it (a sketch of the paths to set follows below).
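The variable names below are hypothetical; check coseg_test.py for the actual config fields. A minimal sketch of the kind of paths to set:

```python
# Hypothetical names -- adapt to the actual config in coseg_test.py.
DATA_ROOT = "/path/to/processed_dataset"          # dataset downloaded above
BACKBONE_WEIGHTS = "/path/to/vgg16_backbone.pth"  # VGG16-backbone model
SAVE_DIR = "/path/to/output_masks"                # where predictions are written
```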
- Download the COCO2017 dataset for training the whole network.
- Download the test datasets for the validation and test phases.
- Download the VGG16 pretrained weights from Google Drive. They are the official PyTorch model weights, except that the last several layers are deleted (see the sketch after this list).
- Download dict.npy from Google Drive.
- Modify the path config in main.py and run it.
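Since the weights are the official PyTorch VGG16 weights with the last several layers removed, they can be reproduced roughly as follows. Exactly which layers are deleted and the output filename are assumptions:

```python
import torch
import torchvision

# Start from the official torchvision VGG16 weights, as noted above.
vgg = torchvision.models.vgg16(pretrained=True)

# Drop the trailing layers (here: the fully connected classifier head);
# exactly which layers the released weights remove is an assumption.
backbone = vgg.features

torch.save(backbone.state_dict(), "vgg16_backbone.pth")  # hypothetical filename
```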
- Following the suggestion of the AAAI20 reviewers, we do not release the HRNet-backbone trained model, to allow fair comparison with other methods.
- There are slight differences in the 'Fusion' part of the model, but they have little impact on the results.
- There is an erroneous value in Table 2: our HRNet J-index for 'Car' on the Internet dataset should be 73.9, not 82.5.
- The BaiduPan share link is broken; contact me if you need it.
- Create GitHub repo (2019.11.18)
- Release arXiv PDF (2019.12.2)
- Release AAAI20 PDF (2020.7.3)
- Release all results (2020.7.3)
- Release test and train code (2021.6.4)