Authors: Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu*, Yifan Liu. *Corresponding author
[paper] [github] [docker image] [pretrained models] [visualization] [visualization of class queries]
Abstract: Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and find that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP is about 5 times faster during inference.
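The baseline mentioned in the abstract (labelling each patch by comparing its embedding with the class text embeddings) can be illustrated with a minimal sketch. This is not the ZegCLIP implementation: the function names, the toy 2-dimensional embeddings, and the use of cosine similarity with an argmax are assumptions made for illustration, and plain Python lists stand in for real CLIP features.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def patch_text_mask(patch_embeddings, text_embeddings):
    """Assign each patch the class whose text embedding is most similar.

    patch_embeddings: H x W grid (nested lists) of D-dim patch vectors.
    text_embeddings:  list of C D-dim vectors, one per class prompt.
    Returns an H x W grid of class indices (a coarse semantic mask).
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    mask = []
    for row in patch_embeddings:
        mask_row = []
        for patch in row:
            p = l2_normalize(patch)
            sims = [sum(a * b for a, b in zip(p, t)) for t in texts]
            # Index of the most similar class prompt for this patch.
            mask_row.append(max(range(len(sims)), key=sims.__getitem__))
        mask.append(mask_row)
    return mask

# Toy data (hypothetical): 2 classes, a 2 x 2 grid of patches, D = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[[0.9, 0.1], [0.2, 0.8]],
           [[0.7, 0.3], [0.1, 2.0]]]
mask = patch_text_mask(patches, text)
print(mask)  # [[0, 1], [0, 1]]
```

In a real pipeline the per-patch class map would be upsampled to the image resolution; the sketch stops at the patch grid.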
Option 1:
- Install PyTorch
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
- Install the mmsegmentation library and some required packages.
pip install mmcv-full==1.4.4 mmsegmentation==0.24.0
pip install scipy timm==0.3.2
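To confirm that the pinned versions from Option 1 ended up in the active environment, a small check along the following lines may help. It is a sketch, not part of the repository: the `check_pins` helper is hypothetical, the distribution names (for example `torch` for the conda `pytorch` package) are assumptions, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

# Versions copied from the install commands; distribution names are assumptions.
PINS = {
    "torch": "1.10.1",
    "torchvision": "0.11.2",
    "torchaudio": "0.10.1",
    "mmcv-full": "1.4.4",
    "mmsegmentation": "0.24.0",
    "timm": "0.3.2",
}

def check_pins(pins):
    """Return {name: (expected, installed_or_None, matches)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu102" when comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report

for name, (expected, installed, ok) in check_pins(PINS).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A package reported as `None` is simply not visible to the current interpreter, which usually means the wrong conda environment is active.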
Option 2:
- Directly use the same image we provide on Docker Hub:
docker pull ziqinzhou/zegclip:latest
According to MMseg:
Download the pretrained model here: Path/to/
| Dataset | Setting | pAcc | mIoU(S) | mIoU(U) | hIoU | Model Zoo |
| --- | --- | --- | --- | --- | --- | --- |
| PASCAL VOC 2012 | Inductive | 94.6 | 91.9 | 77.8 | 84.3 | [Google Drive] |
| PASCAL VOC 2012 | Transductive | 96.2 | 92.3 | 89.9 | 91.1 | [Google Drive] |
| PASCAL VOC 2012 | Fully | 96.3 | 92.4 | 90.9 | 91.6 | [Google Drive] |
| COCO Stuff 164K | Inductive | 62.0 | 40.2 | 41.1 | 40.8 | [Google Drive] |
| COCO Stuff 164K | Transductive | 69.2 | 40.7 | 59.9 | 48.5 | [Google Drive] |
| COCO Stuff 164K | Fully | 69.9 | 40.7 | 63.2 | 49.6 | [Google Drive] |
Note that we report results averaged over several trained models and provide one of those models.
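The hIoU column is the harmonic mean of the seen-class and unseen-class mIoU, as is standard in zero-shot segmentation. The sketch below recomputes it for one row of the results table; the function name is hypothetical. Because the reported numbers are averages over several runs, recomputing hIoU from the averaged mIoU(S) and mIoU(U) may differ from the reported value by a few tenths for some rows.

```python
def harmonic_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen and unseen mIoU (both in percent)."""
    total = miou_seen + miou_unseen
    if total == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / total

# PASCAL VOC 2012, inductive setting: mIoU(S) = 91.9, mIoU(U) = 77.8.
voc_inductive = round(harmonic_iou(91.9, 77.8), 1)
print(voc_inductive)  # 84.3
```

The harmonic mean penalizes a large gap between seen and unseen performance, which is why it is preferred over a plain average for this setting.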
| Dataset | #Params(M) | Flops(G) | FPS |
| --- | --- | --- | --- |
| PASCAL VOC 2012 | 13.8 | 110.4 | 9.0 |
| COCO Stuff 164K | 14.6 | 123.9 | 6.7 |
Note that all experiments were conducted on a single 1080Ti GPU, and #Params(M) denotes the number of learnable parameters.
bash configs/coco/ Path/to/coco/zero_12_100
bash configs/voc12/ Path/to/voc12/zero_12_10
bash ./configs/coco/ Path/to/coco/zero_12_100_st --load-from=Path/to/coco/zero_12_100/iter_40000.pth
bash ./configs/voc12/ Path/to/voc12/zero_12_10_st --load-from=Path/to/voc12/zero_12_10/iter_10000.pth
bash configs/coco/ Path/to/coco/fully_12_100
bash configs/voc12/ Path/to/voc12/fully_12_10
python ./path/to/config ./path/to/model.pth --eval=mIoU
For example:
CUDA_VISIBLE_DEVICES="0" python configs/coco/ Path/to/coco/zero_12_100/latest.pth --eval=mIoU
CUDA_VISIBLE_DEVICES="0" python ./configs/cross_dataset/ Path/to/coco/vpt_seg_zero_80k_12_100_multi/iter_80000.pth --eval=mIoU
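For readers unfamiliar with the metric selected by `--eval=mIoU`, the following is a minimal sketch of mean intersection-over-union on flattened label lists. It is not the MMSegmentation implementation, which accumulates statistics over the whole dataset; the function name and the `ignore_index=255` default are assumptions for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # pixels without a valid label are skipped
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Toy example: 5 pixels, 3 classes, the last pixel is ignored.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 255]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))  # 0.583
```

Restricting the averaged classes to the seen or the unseen subset would give quantities analogous to mIoU(S) and mIoU(U).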
Our work is closely related to the following projects, which inspired our implementation. We gratefully thank the authors.
- Maskformer:
- Zegformer:
- zsseg:
- MaskCLIP:
- SegViT:
- DenseCLIP:
- Visual Prompt Tuning:
If you find this project useful, please consider citing:
title={ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation},
author={Zhou, Ziqin and Lei, Yinjie and Zhang, Bowen and Liu, Lingqiao and Liu, Yifan},
journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},