`Awesome Self-Supervised Learning in Videos`

This repository originates from our survey paper "Unifying Video Self-Supervised Learning across Families of Tasks: A Survey". The authors (Ishan Dave*, Malitha Gunawardhana*, Limalka Sadith, Honglu Zhou, Liel David, Daniel Harari, Mubarak Shah, Muhammad Haris Khan) will continue to update it over time.

Abstract: Video self-supervised learning (VideoSSL) offers significant potential for reducing annotation costs and enhancing a wide range of downstream tasks in video understanding. The ultimate goal of VideoSSL is to achieve human-level video intelligence across a spectrum of tasks, from low-level tasks such as pixel temporal correspondence to high-level complex spatio-temporal tasks like action recognition. However, most existing VideoSSL methods focus on isolated aspects of this spectrum and fail to integrate different levels of task complexity. Our study presents the first comprehensive survey that connects all families of VideoSSL methods. We provide a detailed review of the full spectrum of VideoSSL, from low to high levels, by conceptually linking their self-supervised learning objectives and including a comprehensive categorization. Our extensive evaluation highlights the strengths and limitations of each SSL objective across various downstream task families. We also detail the challenges in current VideoSSL research such as data curation, interpretability, deployment, and privacy concerns, an area that previous surveys have not thoroughly explored. In addressing these challenges, we recognize the strengths of existing methods and outline future directions for research.

Overview of the three major families of video self-supervised learning methods. Dave and Gunawardhana et al. (2024)

This repository contains a collection of state-of-the-art self-supervised learning approaches for video, covering various downstream tasks such as action recognition and video retrieval. With the exponential growth of video data, there is an increasing need for automatic video analysis methods that can learn from large amounts of unlabeled data. Self-supervised learning provides an effective solution to this problem by allowing models to learn from the data itself without explicit supervision.

Acknowledgments

This research was supported by the joint grant P007 from Mohamed Bin Zayed University of Artificial Intelligence and the Weizmann Institute of Science. The authors would like to express their sincere gratitude for this generous support, which made the study possible.

Citing

If you find our work useful, please consider giving the repository a star ⭐ and citing our survey.

@article{dave2024unifying,
  title={Unifying Video Self-Supervised Learning across Families of Tasks: A Survey},
  author={Dave, Ishan and Gunawardhana, Malitha and Sadith, Limalka and Zhou, Honglu and David, Liel and Harari, Daniel and Shah, Mubarak and Khan, Muhammad Haris},
  year={2024},
  publisher={Preprints}
}

In this repository, we have gathered some of the most promising self-supervised learning approaches for video analysis and organized them based on their publication year. Whether you are new to self-supervised learning in videos or an experienced researcher in the field, we hope that this repository will serve as a valuable resource for exploring the latest advances in this exciting area of research.

Let's collaborate and enrich this list together! Reach out to me or submit a pull request. Your contributions are highly appreciated.

Surveys

Unifying Video Self-Supervised Learning across Families of Tasks: A Survey (2024)
Preprint
Ishan Dave*, Malitha Gunawardhana*, Limalka Sadith, Honglu Zhou, Liel David, Daniel Harari, Mubarak Shah, Muhammad Haris Khan
[Paper]
Self-Supervised Learning for Videos: A Survey (2022)
ACM Computing Surveys
Madeline C. Schiappa, Yogesh S. Rawat, and Mubarak Shah
[Paper]

Benchmarking

How Effective are Self-Supervised Models for Contact Identification in Videos (2024)
arXiv preprint
Malitha Gunawardhana, Limalka Sadith, Liel David, Daniel Harari, Muhammad Haris Khan
[Paper] [Code]
Benchmarking self-supervised video representation learning (2023)
arXiv preprint arXiv:2306.06010
Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh Singh Rawat
[Paper] [Page]
A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition (2023)
arXiv preprint arXiv:2303.13505
Deng, A., Yang, T., & Chen, C.
[Paper]
How Severe Is Benchmark-Sensitivity in Video Self-supervised Learning? (2022, October)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees Snoek
[Paper] [Github] [Page]

Representation Learning

2024

ViC-MAE: Self-supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders (2024)
European Conference on Computer Vision (ECCV), 2024
Jefferson Hernandez, Ruben Villegas, Vicente Ordonez
[Paper][Code]
Learning to Predict Activity Progress by Self-Supervised Video Alignment (2024)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Gerard Donahue, Ehsan Elhamifar
[Paper][Code]
Repeat and learn: Self-supervised visual representations learning by Repeated Scene Localization (2024)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen
[Paper][Code]
ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations (2024)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen
[Paper]
Self-supervised Learning of Semantic Correspondence Using Web Videos (2024)
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024
Donghyeon Kwon, Minsu Cho, Suha Kwak
[Paper]
Video Compression and Action Recognition in Self-supervised Learning (2024)
2024 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC)
Zongbo Hao, Conghui Hao, Kecheng He
[Paper]
CycleCL: Self-supervised Learning for Periodic Videos (2024)
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024
Matteo Destro, Michael Gygli
[Paper]
Self-Supervised Learning via Multi-Transformation Classification for Action Recognition (2024)
2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
Duc-Quang Vu, Ngan Le, Jia-Ching Wang
[Paper]
Self-supervised object-centric learning for videos (2024)
Advances in Neural Information Processing Systems (NeurIPS) 2024
Görkay Aydemir, Weidi Xie, Fatma Guney
[Paper]
Motion-guided spatiotemporal multitask feature discrimination for self-supervised video representation learning (2024)
Pattern Recognition (Elsevier)
Shuai Bi, Zhengping Hu, Hehao Zhang, Jirui Di, Zhe Sun
[Paper]
What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions (2024)
Computer Vision and Pattern Recognition (CVPR), 2024
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
[Paper]
BIMM: Brain Inspired Masked Modeling for Video Representation Learning (2024)
arXiv
Zhifan Wan, Jie Zhang, Changzhen Li, Shiguang Shan
[Paper]
Clustering-based multi-featured self-supervised learning for human activities and video retrieval (2024)
Applied Intelligence - Springer
Muhammad Hafeez Javed, Zeng Yu, Taha M. Rajeh, Fahad Rafique & Tianrui Li
[Paper]
Positive and negative sampling strategies for self-supervised learning on audio-video data (2024)
arXiv
Shanshan Wang, Soumya Tripathy, Toni Heittola, Annamaria Mesaros
[Paper]
Self-supervised learning of video representations from a child's perspective (2024)
arXiv
Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake
[Paper]
Collaboratively Self-supervised Video Representation Learning for Action Recognition (2024)
arXiv
Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan
[Paper]
Language-based Action Concept Spaces Improve Video Self-Supervised Learning (2024)
Advances in Neural Information Processing Systems 36 (2024)
Kanchana Ranasinghe, Michael S Ryoo
[Paper]
Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts (2024)
Advances in Neural Information Processing Systems 36 (2024)
Pritam Sarkar, Ahmad Beirami, Ali Etemad
[Paper] [Project Page]
Self-supervised video pretraining yields robust and more human-aligned visual representation (2024)
Advances in Neural Information Processing Systems 36 (2024)
Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
[Paper]
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision (2024)
AAAI Conference on Artificial Intelligence, Main Technical Track (AAAI), 2024
Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
[Paper] [Project Page]
GLOCAL: A self-supervised learning framework for global and local motion estimation (2024)
Pattern Recognition Letters
Yihao Zheng, Kunming Luo, Shuaicheng Liu, Zun Li, Ye Xiang, Lifang Wu, Bing Zeng, Chang Wen Chen
[Paper]

2023

OmniMAE: Single Model Masked Pretraining on Images and Videos (2023)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
[Paper] [Github]
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition (2023)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah
[Paper] [Github] [Project Page]
Self-supervised Video Representation Learning via Capturing Semantic Changes Indicated by Saccades (2023)
IEEE Transactions on Circuits and Systems for Video Technology
Qiuxia Lai, Ailing Zeng, Ye Wang, Lihong Cao, Yu Li, Qiang Xu
[Paper]
Attentive spatial-temporal contrastive learning for self-supervised video representation (2023)
Image and Vision Computing Journal
Xingming Yang, Sixuan Xiong, Kewei Wu, Dongfeng Shan, Zhao Xie
[Paper]
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning (2023)
18th International Conference on Machine Vision and Applications (MVA) 2023
Srijan Das, Michael Ryoo
[Paper]
CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing (2023)
Proceedings of the 31st ACM International Conference on Multimedia
Rukai Wei, Yu Liu, Jingkuan Song, Heng Cui, Yanzhao Xie, Ke Zhou
[Paper]
Data-Efficient Masked Video Modeling for Self-supervised Action Recognition (2023)
Proceedings of the 31st ACM International Conference on Multimedia
Qiankun Li, Xiaolong Huang, Zhifan Wan, Lanqing Hu, Shuzhe Wu, Jie Zhang, Shiguang Shan, Zengfu Wang
[Paper]
MAR: Masked Autoencoders for Efficient Action Recognition (2023)
IEEE Transactions on Multimedia
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang
[Paper] [Github]
Temporal Transformer Networks with Self-Supervision for Action Recognition (2023)
IEEE Internet of Things Journal
Yongkang Zhang, Jun Li, Guoming Wu, Han Zhang, Zhiping Shi, Zhaoxun Liu, Zizhang Wu
[Paper]
CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition (2023)
arXiv preprint arXiv:2301.06018
Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng, Jiashi Feng
[Paper]
Learning Representational Invariances for Data-Efficient Action Recognition (2023)
Computer Vision and Image Understanding, 227, 103597
Yuliang Zou, Jinwoo Choi, Qitong Wang, Jia-Bin Huang
[Paper] [Github]
SOR-TC: Self-attentive octave ResNet with temporal consistency for compressed video action recognition (2023)
Neurocomputing, 533, 191-205
Junsan Zhang, Xiaomin Wang, Yao Wan, Leiquan Wang, Jian Wang, Philip S. Yu
[Paper]
VicTR: Video-conditioned Text Representations for Activity Recognition (2023)
arXiv preprint arXiv:2304.02560
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
[Paper]
Masked Motion Encoding for Self-Supervised Video Representation Learning (2023)
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Xinyu Sun, Peihao Chen, Liangwei Chen, Thomas H. Li, Mingkui Tan, Chuang Gan
[Paper] [Github]
Spatiotemporal consistency enhancement self-supervised representation learning for action recognition (2023)
Signal, Image and Video Processing
Shuai Bi, Zhengping Hu, Mengyao Zhao, Shufang Li & Zhe Sun
[Paper]
Self-Supervised Video-Based Action Recognition With Disturbances (2023)
IEEE Transactions on Image Processing
Wei Lin, Xinghao Ding, Yue Huang, Huanqiang Zeng
[Paper]
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning (2023)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6312-6322)
Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang
[Paper] [Github]
Enhancing motion visual cues for self-supervised video representation learning (2023)
In Engineering Applications of Artificial Intelligence, Volume 123, Pages 106203
Mu Nie, Zhibin Quan, Weiping Ding, and Wankou Yang
[Paper]
Continuous frame motion sensitive self-supervised collaborative network for video representation learning (2023)
In Advanced Engineering Informatics, Volume 56, Pages 101941
Shuai Bi, Zhengping Hu, Mengyao Zhao, Hehao Zhang, Jirui Di, and Zhe Sun
[Paper]
Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition (2023)
In Signal, Image and Video Processing, Pages 1--8
Shuai Bi, Zhengping Hu, Mengyao Zhao, Hehao Zhang, Jirui Di, and Zhe Sun
[Paper]
Self-Supervised Learning from Untrimmed Videos via Hierarchical Consistency (2023)
In IEEE Transactions on Pattern Analysis and Machine Intelligence
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Changxin Gao, Rong Jin, and Nong Sang
[Paper]
Self-Supervised Video Representation Learning by Video Incoherence Detection (2023)
In IEEE Transactions on Cybernetics
Haozhi Cao, Yuecong Xu, Kezhi Mao, Lihua Xie, Jianxiong Yin, Simon See, Qianwen Xu, and Jianfei Yang
[Paper]
Audio-Visual Contrastive Learning with Temporal Self-Supervision (2023)
Preprint on arXiv
Simon Jenni, Alexander Black, and John Collomosse
[Paper]
Video Test-Time Adaptation for Action Recognition (2023)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, and Horst Bischof
[Paper] [GitHub]

Self-Supervised Video Representation Learning via Latent Time Navigation (2023)
Preprint on arXiv
Di Yang, Yaohui Wang, Quan Kong, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, and Francois Bremond
[Paper]
Temporal Contrastive Learning with Curriculum (2023)
In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Shuvendu Roy and Ali Etemad
[Paper]
Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos (2023)
Preprint on arXiv
David Fan, Deyu Yang, Xinyu Li, Vimal Bhat, and Rohith MV
[Paper]
Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization (2023)
Preprint on arXiv
Fida Mohammad Thoker, Hazel Doughty, and Cees Snoek
[Paper]
Multi-scale Compositional Constraints for Representation Learning on Videos (2023)
In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Georgios Paraskevopoulos, Chandrashekhar Lavania, Lovish Chum, and Shiva Sundaram
[Paper]
FLAVR: Flow-agnostic Video Representations for Fast Frame Interpolation (2023)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran
[Paper]
HomE: Homography-Equivariant Video Representation Learning (2023)
Preprint on arXiv
Anirudh Sriram, Adrien Gaidon, Jiajun Wu, Juan Carlos Niebles, Li Fei-Fei, and Ehsan Adeli
[Paper] [GitHub]
ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints (2023)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Srijan Das and Michael S Ryoo
[Paper]
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking (2023)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao
[Paper]
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity (2023)
Proceedings of the AAAI Conference on Artificial Intelligence
Pritam Sarkar, Ali Etemad
[Paper] [Github]
Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding (2023)
arXiv preprint
Xiong, Y., Zhao, L., Gong, B., Yang, M. H., Schroff, F., Liu, T., ... & Yuan, L.
[Paper]
PreViTS: Contrastive Pretraining with Video Tracking Supervision (2023)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1560-1570)
Chen, B., Selvaraju, R. R., Chang, S. F., Niebles, J. C., & Naik, N.
[Paper]
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning (2023)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2225-2234)
Zhang, H., Liu, D., Zheng, Q., & Su, B.
[Paper]
Learning Fine-Grained Features for Pixel-wise Video Correspondences (2023)
arXiv preprint arXiv:2308.03040
Li, R., Zhou, S., & Liu, D.
[Paper]
Cali-NCE: Boosting Cross-Modal Video Representation Learning With Calibrated Alignment (2023)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6316-6326)
Zhao, N., Jiao, J., Xie, W., & Lin, D.
[Paper]

2022

SPAct: Self-supervised Privacy Preservation for Action Recognition (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20164-20173)
Ishan Rajendrakumar Dave, Chen Chen, Mubarak Shah
[Paper] [Github]
Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning (2022, June)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 3, pp. 3300-3308)
Manlin Zhang, Jinpeng Wang, Andy J. Ma
[Paper] [Github]
Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders (2022)
arXiv preprint arXiv:2210.04154
Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, Zehuan Yuan
[Paper] [Github]
Self-supervised Video Transformer (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2874-2884)
Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Michael S. Ryoo
[Paper] [Github]
Exploring Relations in Untrimmed Videos for Self-Supervised Learning (2022)
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(1s), 1-21
Dezhao Luo, Bo Fang, Yu Zhou, Yucan Zhou, Dayan Wu, Weiping Wang
[Paper]
MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning (2022, October)
In Proceedings of the 30th ACM International Conference on Multimedia (pp. 1348-1357)
Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Dongliang He, Weiping Wang
[Paper]
TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning (2022)
IEEE Transactions on Image Processing, 31, 1978-1993
Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin
[Paper] [Github]
Cross-Architecture Self-supervised Video Representation Learning (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19270-19279)
Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han, Weilin Huang
[Paper]
Contrastive spatio-temporal pretext learning for self-supervised video representation (2022, June)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 3, pp. 3380-3389)
Yujia Zhang, Lai-Man Po, Xuyuan Xu, Mengyang Liu, Yexin Wang, Weifeng Ou, Yuzhi Zhao, Wing-Yin Yu
[Paper] [Github]
TransRank: Self-supervised video representation learning via ranking-based transformation recognition (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3000-3010)
Haodong Duan, Nanxuan Zhao, Kai Chen, Dahua Lin
[Paper] [Github]
Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13821-13831)
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, Nong Sang
[Paper] [Github]
Motion-aware contrastive video representation learning via foreground-background merging (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9716-9726)
Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, Hongkai Xiong
[Paper] [Github]
Self-Supervised Video Representation Learning with Motion-Contrastive Perception (2022, July)
In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE
Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, Rui Feng
[Paper]
Self-supervised video representation learning using improved instance-wise contrastive learning and deep clustering (2022)
IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 6741-6752
Yisheng Zhu, Hui Shuai, Guangcan Liu, Qingshan Liu
[Paper]
TCLR: Temporal contrastive learning for video representation (2022)
Computer Vision and Image Understanding, 219, 103406
Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, Mubarak Shah
[Paper] [Github]
Self-supervised motion perception for spatiotemporal representation learning (2022)
IEEE Transactions on Neural Networks and Learning Systems
Chang Liu, Yuan Yao, Dezhao Luo, Yu Zhou, Qixiang Ye
[Paper] [Github]
Self-supervised spatiotemporal representation learning by exploiting video continuity (2022, June)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 2, pp. 1564-1573)
Hanwen Liang, Niamul Quader, Zhixiang Chi, Lizhe Chen, Peng Dai, Juwei Lu, Yang Wang
[Paper]
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning (2022)
arXiv preprint arXiv:2212.11187
Julien Denize, Jaonary Rabarisoa, Astrid Orcesi, Romain Hérault
[Paper] [Github]
Probabilistic representations for video contrastive learning (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14711-14721)
Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn
[Paper]
Contextualized spatio-temporal contrastive learning with self-supervision (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13977-13986)
Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu
[Paper] [Github]
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training (2022)
In Advances in Neural Information Processing Systems, 2022
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
[Paper] [Github]
Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection (2022)
arXiv preprint arXiv:2211.10636
Sunil Hwang, Jaehong Yoon, Youngwan Lee, Sung Ju Hwang
[Paper] [Github]
Self-supervised video representation learning with cross-stream prototypical contrasting (2022)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 108-118)
Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu
[Paper] [Github]
SLIC: Self-supervised learning with iterative clustering for human action videos (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16091-16101)
Salar Hosseini Khorasgani, Yuxuan Chen, Florian Shkurti
[Paper]
GOCA: guided online cluster assignment for self-supervised video representation Learning (2022, October)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Huseyin Coskun, Alireza Zareian, Joshua L. Moore, Federico Tombari, Chen Wang
[Paper] [Github]
TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning (2022)
In Proceedings of the Asian Conference on Computer Vision (pp. 1539-1555)
Fengrui Tian, Jiawei Fan, Xie Yu, Shaoyi Du, Meina Song, Yu Zhao
[Paper]
Static and Dynamic Concepts for Self-supervised Video Representation Learning (2022, November)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin
[Paper]
Audio-Visual Contrastive Learning for Self-Supervised Action Recognition (2022)
arXiv preprint arXiv:2204.13386
Yang Liu, Ying Tan, Haoyuan Lan
[Paper]
SOS! Self-supervised Learning over Sets of Handled Objects in Egocentric Action Recognition (2022, November)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Victor Escorcia, Ricardo Guerrero, Xiatian Zhu, Brais Martinez
[Paper]
Self-Supervised Video Representation Learning with Cascade Positive Retrieval (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4070-4079)
Cheng-En Wu, Farley Lai, Yu Hen Hu, Asim Kadav
[Paper] [Github]
Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment (2022)
IEEE Journal of Selected Topics in Signal Processing, 16(6), 1467-1479
Shanshan Wang, Archontis Politis, Annamaria Mesaros
[Paper]
Hierarchically decoupled spatial-temporal contrast for self-supervised video representation learning (2022)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3235-3245)
Zehua Zhang, David Crandall
[Paper]
Spatio-temporal self-supervision enhanced transformer networks for action recognition (2022, July)
In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE
Yongkang Zhang, Han Zhang, Guoming Wu, Jun Li
[Paper]
Inter-Intra Cross-Modality Self-Supervised Video Representation Learning by Contrastive Clustering (2022)
In 2022 26th International Conference on Pattern Recognition (ICPR) (pp. 4815-4821). IEEE
Jiutong Wei, Guan Luo, Bing Li, Weiming Hu
[Paper]
Self-Supervised Scene-Debiasing for Video Representation Learning via Background Patching (2022)
IEEE Transactions on Multimedia
Maregu Assefa, Wei Jiang, Kumie Gedamu, Getinet Yilma, Bulbula Kumeda, Melese Ayalew
[Paper]
SCVRL: Shuffled Contrastive Video Representation Learning (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4132-4141)
Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, Davide Modolo
[Paper]
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning (2022)
arXiv preprint arXiv:2211.13929
Pritam Sarkar, Ali Etemad
[Paper] [Github]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (2022)
arXiv preprint arXiv:2212.03191
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
[Paper] [Github]
Video Motion Perception for Self-supervised Representation Learning (2022)
31st International Conference on Artificial Neural Networks, Bristol, UK, September 6–9, 2022
Wei Li, Dezhao Luo, Bo Fang, Xiaoni Li, Yu Zhou, Weiping Wang
[Paper]
An improved inter-intra contrastive learning framework on self-supervised video representation (2022)
IEEE Transactions on Circuits and Systems for Video Technology
Li Tao, Xueting Wang, Toshihiko Yamasaki
[Paper]
Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
[Paper] [Github]
LgNet: A local-global network for action recognition and beyond (2022)
IEEE Transactions on Multimedia
Jiaqi Zhou, Zehua Fu, Qiuyu Huang, Qingjie Liu, Yunhong Wang
[Paper]
Motion Sensitive Contrastive Learning for Self-supervised Video Representation (2022)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang
[Paper]
Unsupervised Video-based Action Recognition With Imagining Motion And Perceiving Appearance (2022)
IEEE Transactions on Circuits and Systems for Video Technology
Wei Lin, Xiaoyu Liu, Yihong Zhuang, Xinghao Ding, Xiaotong Tu, Yue Huang, Huanqiang Zeng
[Paper]
Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval (2022)
In 2022 National Conference on Communications
Vidit Kumar
[Paper]
Federated Self-supervised Learning for Video Understanding (2022)
In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022
Yasar Abbas Ur Rehman, Yan Gao, Jiajun Shen, Pedro Porto Buarque de Gusmão, Nicholas Lane
[Paper] [Github]
Contrastive predictive coding with transformer for video representation learning (2022)
Neurocomputing
Yue Liu, Junqi Ma, Yufei Xie, Xuefeng Yang, Xingzhen Tao, Lin Peng, Wei Gao
[Paper] [Github]
Video representation learning by identifying spatio-temporal transformation (2022)
Applied Intelligence
Sheng Geng, Shimin Zhao, Hu Liu
[Paper]
On temporal granularity in self-supervised video representation learning (2022)
In 33rd British Machine Vision Conference 2022, BMVC 2022
Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, and Yin Cui
[Paper] [Github]
LAVA: Language Audio Vision Alignment for Data-Efficient Video Pre-Training (2022)
First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022
Sumanth Gurram, Andy Fang, David Chan, John Canny
[Paper]
It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training (2022)
arXiv preprint arXiv:2210.05234
Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang
[Paper]
MAC: Mask-Augmentation for Motion-Aware Video Representation Learning (2022)
In 33rd British Machine Vision Conference 2022, BMVC 2022
Arif Akar, Ufuk Umut Senturk, and Nazli Ikizler-Cinbis
[Paper] [Github]
Temporal-Invariant Video Representation Learning with Dynamic Temporal Resolutions (2022)
IEEE International Conference on Advanced Video and Signal Based Surveillance
Seong-Yun Jeong, Ho-Joong Kim, Myeong-Seok Oh, Gun-Hee Lee, Seong-Whan Lee
[Paper]
Frequency Selective Augmentation for Video Representation Learning (2022)
arXiv preprint arXiv:2204.03865
Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee, Junmo Kim
[Paper]
Dual Contrastive Learning for Spatio-temporal Representation (2022)
In ACM International Conference on Multimedia
Shuangrui Ding, Rui Qian, and Hongkai Xiong
[Paper]
Consistent Intra-video Contrastive Learning with Asynchronous Long-term Memory Bank (2022)
In IEEE Transactions on Circuits and Systems for Video Technology
Zelin Chen, Kun-Yu Lin, Wei-Shi Zheng
[Paper]
Controllable Augmentations for Video Representation Learning (2022)
arXiv preprint arXiv:2203.16632
Rui Qian, Weiyao Lin, John See, and Dian Li
[Paper]
MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning (2022)
arXiv preprint arXiv:2212.10870
Yuan Liu, Jiacheng Chen, Hao Wu
[Paper]
On Negative Sampling for Audio-Visual Contrastive Learning from Movies (2022)
arXiv preprint arXiv:2205.00073
Mahdi M. Kalayeh, Shervin Ardeshir, Lingyi Liu, Nagendra Kamath, Ashok Chandrashekar
[Paper]
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13801-13810)
Minghao Chen, Fangyun Wei, Chong Li, Deng Cai
[Paper] [Github]
Masked feature prediction for self-supervised visual pre-training (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14668-14678)
Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C.
[Paper]
Static and dynamic concepts for self-supervised video representation learning (2022)
In European Conference on Computer Vision (pp. 145-164)
Qian, R., Ding, S., Liu, X., & Lin, D.
[Paper]
Pixel-level Correspondence for Self-Supervised Learning from Video (2022)
arXiv preprint arXiv:2207.03866
Sharma, Y., Zhu, Y., Russell, C., & Brox, T.
[Paper]
Temporal alignment networks for long-term video (2022)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2906-2916)
Han, T., Xie, W., & Zisserman, A.
[Paper]
Simvtp: Simple video text pre-training with masked autoencoders (2022)
arXiv preprint arXiv:2212.03490
Ma, Y., Yang, T., Shan, Y., & Li, X.
[Paper]
Learning audio-visual speech representation by masked multimodal cluster prediction (2022)
arXiv preprint arXiv:2201.02184
Shi, B., Hsu, W. N., Lakhotia, K., & Mohamed, A.
[Paper]

2021

Watching too much television is good: Self-supervised audio-visual representation learning from movies and tv shows (2021)
arXiv preprint arXiv:2106.08513
Mahdi M. Kalayeh, Nagendra Kamath, Lingyi Liu
[Paper]
Temporally coherent embeddings for self-supervised video representation learning (2021, January)
In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 8914-8921)
Joshua Knights, Ben Harwood, Daniel Ward, Anthony Vanderkop, Olivia Mackenzie-Ross, Peyman Moghadam
[Paper] [Github]
Audio-visual instance discrimination with cross-modal agreement (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12475-12486)
Pedro Morgado, Nuno Vasconcelos, Ishan Misra
[Paper] [Github]
Removing the background by adding the background: Towards background robust self-supervised video representation learning (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11804-11813)
Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, Xing Sun
[Paper] [Github]
Enhancing unsupervised video representation learning by decoupling the scene and the motion (2021, May)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 11, pp. 10129-10137)
Jinpeng Wang, Yuting Gao, Ke Li, Jianguo Hu, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, Xing Sun
[Paper] [Github]
Self-supervised video representation learning by uncovering spatio-temporal statistics (2021)
IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3791-3806
Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-hui Liu
[Paper] [Github]
Seco: Exploring sequence supervision for unsupervised representation learning (2021, May)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 12, pp. 10656-10664)
Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei
[Paper] [Github]
Enhancing self-supervised video representation learning via multi-level feature optimization (2021)
In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7990-8001)
Rui Qian, Yuxi Li, Huabin Liu, John See, Shuangrui Ding, Xian Liu, Dian Li, Weiyao Lin
[Paper] [Github]
RSPnet: Relative speed perception for unsupervised video representation learning (2021, May)
In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 2, pp. 1045-1053)
Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan
[Paper] [Github]
Videomoco: Contrastive video representation learning with temporally adversarial examples (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11205-11214)
Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, Wei Liu
[Paper] [Github]
On compositions of transformations in contrastive self-supervised learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9577-9587)
Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
[Paper] [Github]
Unsupervised visual representation learning by tracking patches in video (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2563-2572)
Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, Zhiwei Xiong
[Paper] [Github]
A large-scale study on unsupervised spatiotemporal representation learning (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3299-3309)
Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He
[Paper] [Github]
Cocon: Cooperative-contrastive learning (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3384-3393)
Nishant Rai, Ehsan Adeli, Kuan-Hui Lee, Adrien Gaidon, Juan Carlos Niebles
[Paper] [Github]
VATT: Transformers for multimodal self-supervised learning from raw video, audio and text (2021)
Advances in Neural Information Processing Systems, 34, 24206-24221
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
[Paper] [Github]
ASCNet: Self-supervised video representation learning with appearance-speed consistency (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8096-8105)
Deng Huang, Wenhao Wu, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu, Xiangmiao Wu, Mingkui Tan, Errui Ding
[Paper]
Self-supervised visual learning by variable playback speeds prediction of a video (2021)
IEEE Access, 79562-79571
Hyeon Cho, Taehoon Kim, Hyungjin Chang, Wonjun Hwang
[Paper] [Github]
Self-supervised video representation learning with meta-contrastive network (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8239-8249)
Yuanze Lin, Xun Guo, Yan Lu
[Paper]
Self-Supervised Video Representation Learning by Video Incoherence Detection (2021)
arXiv preprint arXiv:2109.12493
Haozhi Cao, Yuecong Xu, Jianfei Yang, Kezhi Mao, Lihua Xie, Jianxiong Yin, Simon See
[Paper]
Long short view feature decomposition via contrastive video representation learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9244-9253)
Nadine Behrmann, Mohsen Fayyaz, Juergen Gall, Mehdi Noroozi
[Paper]
Time-equivariant contrastive video representation learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9970-9980)
Simon Jenni, Hailin Jin
[Paper]
Self-supervised video representation learning by context and motion decoupling (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13886-13895)
Lianghua Huang, Yu Liu, Bin Wang, Pan Pan, Yinghui Xu, Rong Jin
[Paper]
Unsupervised video representation learning by bidirectional feature prediction (2021)
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1670-1679)
Nadine Behrmann, Juergen Gall, Mehdi Noroozi
[Paper]
Self-supervised learning of compressed video representations (2021)
In International Conference on Learning Representations
Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song
[Paper]
Spatiotemporal contrastive video representation learning (2021)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6964-6974)
Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui
[Paper] [Github]
Modist: Motion distillation for self-supervised video representation learning (2021)
arXiv preprint arXiv:2106.09703
Fanyi Xiao, Joseph Tighe, Davide Modolo
[Paper]
Broaden your views for self-supervised video learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1255-1265)
Adria Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altche, Michal Valko, Jean-Bastien Grill, Aaron van den Oord, Andrew Zisserman
[Paper] [Github]
Vi2CLR: Video and image for visual contrastive learning of representation (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1502-1512)
Ali Diba, Vivek Sharma, Reza Safdari, Dariush Lotfi, M. Saquib Sarfraz, Rainer Stiefelhagen, Luc Van Gool
[Paper]
Contrast and order representations for video self-supervised learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7939-7949)
Kai Hu, Jie Shao, Yuan Liu, Bhiksha Raj, Marios Savvides, Zhiqiang Shen
[Paper]
Motion-augmented self-training for video recognition at smaller scale (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10429-10438)
Kirill Gavrilyuk, Mihir Jain, Ilia Karmanov, Cees G. M. Snoek
[Paper]
Composable augmentation encoding for video representation learning (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8834-8844)
Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
[Paper]
Video contrastive learning with global context (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3195-3204)
Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Soren Schwertfeger, Cyrill Stachniss, Mu Li
[Paper] [Github]
Motion-focused contrastive learning of video representations (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2105-2114)
Rui Li, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei
[Paper] [Github]
Back to the Future: Cycle Encoding Prediction for Self-supervised Video Representation Learning (2021, November)
In The 32nd British Machine Vision Conference
Xinyu Yang, Majid Mirmehdi, Tilo Burghardt
[Paper] [Github]
Learning temporal dynamics from cycles in narrated video (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1480-1489)
Epstein, D., Wu, J., Schmid, C., & Sun, C.
[Paper]
Crossclr: Cross-modal contrastive learning for multi-modal video representations (2021)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1450-1459)
Zolfaghari, M., Zhu, Y., Gehler, P., & Brox, T.
[Paper]

2020

Self-Supervised Learning to Detect Key Frames in Videos (2020, October)
MDPI-Sensors
Xiang Yan, Syed Zulqarnain Gilani, Mingtao Feng, Liang Zhang, Hanlin Qin, and Ajmal Mian
[Paper]
Self-supervised motion representation via scattering local motion cues (2020, October)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
Yuan Tian, Zhaohui Che, Wenbo Bao, Guangtao Zhai, Zhiyong Gao
[Paper]
Self-supervised video representation learning using inter-intra contrastive framework (2020, October)
In Proceedings of the 28th ACM International Conference on Multimedia (pp. 2193-2201)
Li Tao, Xueting Wang, Toshihiko Yamasaki
[Paper] [Github]
Video representation learning with visual tempo consistency (2020)
arXiv preprint arXiv:2006.15489
Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou
[Paper] [Github]
Self-supervised temporal discriminative learning for video representation learning (2020)
arXiv preprint arXiv:2008.02129
Jinpeng Wang, Yiqi Lin, Andy J. Ma, Pong C. Yuen
[Paper] [Github]
Self-supervised learning by cross-modal audio-video clustering (2020)
Advances in Neural Information Processing Systems, 33, 9758-9770
Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
[Paper] [Github]
Self-supervised video representation learning by pace prediction (2020)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
Jiangliu Wang, Jianbo Jiao, Yun-Hui Liu
[Paper] [Github]
Unsupervised learning from video with deep neural embeddings (2020)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9563-9572)
Chengxu Zhuang, Tianwei She, Alex Andonian, Max Sobol Mark, Daniel Yamins
[Paper] [Github]
Unsupervised learning of video representations via dense trajectory clustering (2020)
In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020
Pavel Tokmakov, Martial Hebert, Cordelia Schmid
[Paper] [Github]
Video representation learning by recognizing temporal transformations (2020)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
Simon Jenni, Givi Meishvili, Paolo Favaro
[Paper] [Github]
Video playback rate perception for self-supervised spatio-temporal representation learning (2020)
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6548-6557)
Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye
[Paper] [Github]
Self-supervised co-training for video representation learning (2020)
Advances in Neural Information Processing Systems, 33, 5679-5690
Tengda Han, Weidi Xie, Andrew Zisserman
[Paper] [Github]
Video cloze procedure for self-supervised spatio-temporal learning (2020, April)
In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 11701-11708)
Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, Weiping Wang
[Paper] [Github]
End-to-end learning of visual representations from uncurated instructional videos (2020)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9879-9889)
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
[Paper] [Github]
Speednet: Learning the speediness in videos (2020)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9922-9931)
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, Tali Dekel
[Paper] [Github]
Contrastive multiview coding (2020)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16 (pp. 776-794)
Yonglong Tian, Dilip Krishnan, Phillip Isola
[Paper] [Github]
Self-supervised video representation learning by maximizing mutual information (2020)
Signal processing: Image communication, 88, 115967
Fei Xue, Hongbing Ji, Wenbo Zhang, Yi Cao
[Paper]
Memory-augmented dense predictive coding for video representation learning (2020)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
Tengda Han, Weidi Xie, Andrew Zisserman
[Paper] [Github]
Evolving losses for unsupervised video representation learning (2020)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 133-142)
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
[Paper]
Audiovisual slowfast networks for video recognition (2020)
arXiv preprint arXiv:2001.08740
Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
[Paper] [Github]
Cycle-contrast for self-supervised video representation learning (2020)
Advances in Neural Information Processing Systems, 33, 8089-8100
Quan Kong, Wenpeng Wei, Ziwei Deng, Tomoaki Yoshinaga, Tomokazu Murakami
[Paper]
Can temporal information help with contrastive self-supervised learning? (2020)
arXiv preprint arXiv:2011.13046
Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu
[Paper]
Self-supervised multimodal versatile networks (2020)
Advances in Neural Information Processing Systems, 33, 25-37
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelovic, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
[Paper]
Watching the world go by: Representation learning from unlabeled videos (2020)
arXiv preprint arXiv:2003.07990
Daniel Gordon, Kiana Ehsani, Dieter Fox, Ali Farhadi
[Paper]
Pretext-contrastive learning: Toward good practices in self-supervised video representation leaning (2020)
arXiv preprint arXiv:2010.15464
Li Tao, Xueting Wang, Toshihiko Yamasaki
[Paper] [Github]
Univl: A unified video and language pre-training model for multimodal understanding and generation (2020)
arXiv preprint arXiv:2002.06353
Luo, H., Ji, L., Shi, B., Huang, H., Duan, N., Li, T., ... & Zhou, M.
[Paper]
Self-supervised learning of audio-visual objects from video (2020)
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
Afouras, T., Owens, A., Chung, J. S., & Zisserman, A.
[Paper]
Parameter efficient multimodal transformers for video representation learning (2020)
arXiv preprint arXiv:2012.04124
Lee, S., Yu, Y., Kim, G., Breuel, T., Kautz, J., & Song, Y.
[Paper]
Active contrastive learning of audio-visual video representations (2020)
arXiv preprint arXiv:2009.09805
Ma, S., Zeng, Z., McDuff, D., & Song, Y.
[Paper]
Speech2action: Cross-modal supervision for action recognition (2020)
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10317-10326)
Nagrani, A., Sun, C., Ross, D., Sukthankar, R., Schmid, C., & Zisserman, A.
[Paper]
Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning (2020)
In Proceedings of the 28th ACM International Conference on Multimedia (pp. 3884-3892)
Cheng, Y., Wang, R., Pan, Z., Feng, R., & Zhang, Y.
[Paper]

2019

Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics (2019)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4006-4015)
Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu
[Paper] [Github]
Video representation learning by dense predictive coding (2019)
In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Korea (South), 2019 (pp. 1483-1492)
Tengda Han, Weidi Xie, Andrew Zisserman
[Paper] [Github]
Self-supervised spatiotemporal learning via video clip order prediction (2019)
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10334-10343)
Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang
[Paper] [Github]
Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition (2019, January)
In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 179-189)
Unaiza Ahsan, Rishi Madhok, Irfan Essa
[Paper]
Self-supervised video representation learning with space-time cubic puzzles (2019, July)
In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 8545-8552)
Dahun Kim, Donghyeon Cho, In So Kweon
[Paper]
Learning video representations using contrastive bidirectional transformer (2019)
arXiv preprint arXiv:1906.05743
Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid
[Paper]
Dynamonet: Dynamic action and motion network (2019)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6192-6201)
Ali Diba, Vivek Sharma, Luc Van Gool, Rainer Stiefelhagen
[Paper]
Temporal cycle-consistency learning (2019)
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1801-1810)
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A.
[Paper]
Videobert: A joint model for video and language representation learning (2019)
In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7464-7473)
Sun, C., Myers, A., Vondrick, C., Murphy, K., & Schmid, C.
[Paper]

2018

Geometry guided convolutional neural networks for self-supervised video representation learning (2018)
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5589-5597)
Chuang Gan, Boqing Gong, Kun Liu, Hao Su, Leonidas J. Guibas
[Paper]
Self-supervised spatiotemporal feature learning via video rotation prediction (2018)
arXiv preprint arXiv:1811.11387
Longlong Jing, Xiaodong Yang, Jinggen Liu, Yingli Tian
[Paper]
Cooperative learning of audio and video models from self-supervised synchronization (2018)
Advances in Neural Information Processing Systems, 31
Bruno Korbar, Du Tran, Lorenzo Torresani
[Paper]
Audio-visual scene analysis with self-supervised multisensory features (2018)
In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 631-648)
Andrew Owens, Alexei A. Efros
[Paper] [Github]
Compressed video action recognition (2018)
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6026-6035)
Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl
[Paper]
Improving spatiotemporal self-supervision by deep reinforcement learning (2018)
In Proceedings of the European conference on computer vision (ECCV) (pp. 770-786)
Uta Buchler, Biagio Brattoli, Bjorn Ommer
[Paper]
Learning and using the arrow of time (2018)
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8052-8060)
Donglai Wei, Joseph Lim, Andrew Zisserman, William T. Freeman
[Paper]

2017

Unsupervised representation learning by sorting sequences (2017)
In Proceedings of the IEEE international conference on computer vision (pp. 667-676)
Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, Ming-Hsuan Yang
[Paper]
Self-supervised video representation learning with odd-one-out networks (2017)
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3636-3645)
Basura Fernando, Hakan Bilen, Efstratios Gavves, Stephen Gould
[Paper]

2016

Shuffle and learn: unsupervised learning using temporal order verification (2016)
In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14 (pp. 527-544)
Ishan Misra, C. Lawrence Zitnick, Martial Hebert
[Paper]
