A preview version of this paper is available on arXiv.
The conference poster is available in this GitHub repo.
The long-paper presentation video is available on Google Drive and YouTube.
Qualitative results and comparisons with previous SOTAs are available on YouTube.
Welcome to star ⭐, comment 💬, and collaborate 😀 !!
- 2022.11.16: All code has been cleaned and released ~
- 2022.10.21: Add the robustness-evaluation dataloader for other models, e.g., AOT~
- 2022.10.1: Add the code of the key implementations of this work~
- 2022.9.25: Add the poster of this work~
- 2022.8.27: Add presentation video and PPT for this work~
- 2022.7.10: Add future works towards robust VOS!
- 2022.7.5: Our ArXiv-version paper is available.
- 2022.7.1: Repo init. Please stay tuned~
In the booming video era, video segmentation attracts increasing research attention in the multimedia community.
Semi-supervised video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. Because they neglect object-level cues, pixel-level approaches leave the tracking vulnerable to perturbations and even unable to discriminate among similar objects.
Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object to be expressive and discriminative. Accordingly, we propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
First, we construct the object representations by applying an adaptive object proxy (AOP) aggregation method, where the proxies represent arbitrary-shaped segments via multi-level clustering on the reference frames.
Then, prototype masks are initially generated from the reference-target correlations based on AOP. Afterwards, such proto-masks are further calibrated through network modulation, conditioned on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve iteratively to become discriminative.
Extensive experiments are conducted on the standard VOS benchmarks, YouTube-VOS 2018/2019 and DAVIS 2017. Our model achieves state-of-the-art performance among published works and also exhibits significantly superior robustness against perturbations.
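The adaptive-proxy matching described above can be sketched as follows. This is a simplified single-level NumPy illustration (the function names, plain k-means, and cosine similarity are our assumptions for exposition, not the released implementation):

```python
import numpy as np

def kmeans_proxies(features, k, iters=10, seed=0):
    """Cluster reference-pixel features into k proxy centers (plain k-means)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared distances between every pixel feature and every center: (N, k)
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def proxy_matching(ref_feats, ref_mask, tgt_feats, k=8):
    """Correlate target pixels with adaptive object proxies.

    ref_feats: (N, C) reference pixel features
    ref_mask:  (N,) binary mask marking the object's reference pixels
    tgt_feats: (M, C) target pixel features
    returns:   (M, k) cosine similarity between target pixels and proxies
    """
    fg = ref_feats[ref_mask.astype(bool)]          # foreground pixels only
    proxies = kmeans_proxies(fg, min(k, len(fg)))  # arbitrary-shaped segments as proxies
    p = proxies / (np.linalg.norm(proxies, axis=1, keepdims=True) + 1e-8)
    t = tgt_feats / (np.linalg.norm(tgt_feats, axis=1, keepdims=True) + 1e-8)
    return t @ p.T
```

The resulting (target-pixel × proxy) correlation map is what the proto-masks would be decoded from; matching against a handful of proxies instead of every reference pixel is what makes the correlation cheaper and less sensitive to per-pixel noise.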
- Python3
- pytorch >= 1.4.0
- torchvision
- opencv-python
- Pillow
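A quick sanity check for the requirements above can be scripted; this helper is our convenience addition, not part of the repo:

```python
import sys

def check_env():
    """Return a list of problems with the environment (empty list = all good)."""
    problems = []
    if sys.version_info < (3,):
        problems.append("Python3 is required")
    try:
        import torch
        # torch versions look like "1.4.0" or "1.13.1+cu117"
        major, minor = (int(x) for x in torch.__version__.split(".")[:2])
        if (major, minor) < (1, 4):
            problems.append(f"pytorch >= 1.4.0 required, found {torch.__version__}")
    except ImportError:
        problems.append("pytorch is not installed")
    return problems
```

Running `check_env()` after installation should return an empty list.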
You can also use the docker image below to set up your environment directly. However, this docker image may contain some redundant packages.
docker image: xxiaoh/vos:10.1-cudnn7-torch1.4_v3
A more lightweight version can be created by modifying the provided Dockerfile.
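Assuming a standard Docker setup with the NVIDIA runtime for GPU access (the `--gpus` flag and the mount path below are our assumptions, adjust to your machine):

```shell
# pull the prebuilt environment image
docker pull xxiaoh/vos:10.1-cudnn7-torch1.4_v3

# start an interactive container with the current repo mounted at /workspace
docker run --gpus all -it --rm -v "$(pwd)":/workspace \
    xxiaoh/vos:10.1-cudnn7-torch1.4_v3
```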
- Datasets
  - YouTube-VOS: A commonly-used large-scale VOS dataset.
    - `datasets/YTB/2019`: version 2019, download link. `train` is required for training. `valid` (6fps) and `valid_all_frames` (30fps, optional) are used for evaluation.
    - `datasets/YTB/2018`: version 2018, download link. Only `valid` (6fps) and `valid_all_frames` (30fps, optional) are required for this project and used for evaluation.
  - DAVIS: A commonly-used small-scale VOS dataset.
    - `datasets/DAVIS`: TrainVal (480p) contains both the training and validation splits. Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but not required.
- Pretrained weights for the backbone
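To catch path mistakes early, the expected dataset layout can be verified with a small script (the helper name is ours; the paths follow the directory notes above):

```python
from pathlib import Path

# Directories this project expects (YouTube-VOS 2018 `train` is not needed;
# `valid_all_frames` splits are optional, so they are not checked here).
EXPECTED_DIRS = [
    "datasets/YTB/2019/train",
    "datasets/YTB/2019/valid",
    "datasets/YTB/2018/valid",
    "datasets/DAVIS",
]

def missing_dataset_dirs(root="."):
    """Return the expected dataset directories that are absent under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

An empty return value means all required dataset directories are in place.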
The key implementation of matching with the adaptive-proxy-based representation is provided in THIS FILE. For other implementation and training/evaluation details, please refer to RPCMVOS or CFBI.
The key implementation of the preliminary robust VOS benchmark evaluation is provided in THIS FILE.
The whole project code is provided in THIS FOLDER.
Feel free to contact me if you have any problems with the implementation~
- For evaluation, please use official YouTube-VOS servers (2018 server and 2019 server), official DAVIS toolkit (for Val), and official DAVIS server (for Test-dev).
- Extension of the proposed clustering-based adaptive proxy representation to other dense-tracking tasks in a more efficient and robust way
- Leverage the robust layered representation, i.e., intermediate masks, for robust mask calibration in other segmentation tasks
- More diverse perturbation/corruption types can be studied for video segmentation tasks like VOS and VIS
- Adversarial attack and defense for VOS models remain open questions for further exploration
- VOS model robustness verification and theoretical analysis
- Model enhancement from the perspective of data management
(to be continued...)
If you find this work useful for your research, please consider citing:
@inproceedings{xu2022towards,
title={Towards Robust Video Object Segmentation with Adaptive Object Calibration},
author={Xu, Xiaohao and Wang, Jinglu and Ming, Xiang and Lu, Yan},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={2709--2718},
year={2022}
}
CFBI: https://github.com/z-x-yang/CFBI
Deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch
GCT: https://github.com/z-x-yang/GCT
Semi-supervised video object segmentation repo/paper links:
ARKitTrack [CVPR 2023]: https://arxiv.org/pdf/2303.13885.pdf
TarVis [Arxiv 2023]: https://arxiv.org/pdf/2301.02657.pdf
LBLVOS [AAAI 2023]: https://arxiv.org/pdf/2212.02112.pdf
DeAOT [NeurIPS 2022]: https://arxiv.org/pdf/2210.09782.pdf
BATMAN [ECCV 2022 Oral]: https://arxiv.org/pdf/2208.01159.pdf
XMem [ECCV 2022]: https://github.com/hkchengrex/XMem
TBD [ECCV 2022]: https://github.com/suhwan-cho/TBD
QDMN [ECCV 2022]: https://github.com/workforai/QDMN
GSFM [ECCV 2022]: https://github.com/workforai/GSFM
SWEM [CVPR 2022]: https://tianyu-yang.com/resources/swem.pdf
RDE [CVPR 2022]: https://arxiv.org/pdf/2205.03761.pdf
CoVOS [CVPR 2022]: https://github.com/kai422/CoVOS
RPCM [AAAI 2022 Oral]: https://github.com/JerryX1110/RPCMVOS
AOT [NeurIPS 2021]: https://github.com/z-x-yang/AOT
STCN [NeurIPS 2021]: https://github.com/hkchengrex/STCN
JOINT [ICCV 2021]: https://github.com/maoyunyao/JOINT
HMMN [ICCV 2021]: https://github.com/Hongje/HMMN
DMN-AOA [ICCV 2021]: https://github.com/liang4sx/DMN-AOA
MiVOS [CVPR 2021]: https://github.com/hkchengrex/MiVOS
SSTVOS [CVPR 2021 Oral]: https://github.com/dukebw/SSTVOS
GraphMemVOS [ECCV 2020]: https://github.com/carrierlxk/GraphMemVOS
AFB-URR [NeurIPS 2020]: https://github.com/xmlyqing00/AFB-URR
CFBI [ECCV 2020]: https://github.com/z-x-yang/CFBI
FRTM-VOS [CVPR 2020]: https://github.com/andr345/frtm-vos
STM [ICCV 2019]: https://github.com/seoungwugoh/STM
FEELVOS [CVPR 2019]: https://github.com/kim-younghan/FEELVOS
(The list may be incomplete; feel free to contact me by opening an issue and I'll add them!)
The 1st Large-scale Video Object Segmentation Challenge: https://competitions.codalab.org/competitions/19544#learn_the_details
The 2nd Large-scale Video Object Segmentation Challenge - Track 1: Video Object Segmentation: https://competitions.codalab.org/competitions/20127#learn_the_details
The Semi-Supervised DAVIS Challenge on Video Object Segmentation @ CVPR 2020: https://competitions.codalab.org/competitions/20516#participate-submit_results
DAVIS: https://davischallenge.org/
YouTube-VOS: https://youtube-vos.org/
Papers with code for Semi-VOS: https://paperswithcode.com/task/semi-supervised-video-object-segmentation
This work is heavily built upon CFBI and RPCMVOS. Thanks to the authors of CFBI for releasing such a wonderful code repo for further work to build upon!
Xiaohao Xu: [email protected]
This project is released under the MIT license. See LICENSE for additional details.