A paper list for language-based visual trackers
[Note] Please submit an issue if you find more related papers on this topic.
[1] Tracking by Natural Language Specification. Li Z, Tao R, Gavves E, et al. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 6495-6503. [Paper] [Code]
[2] Describe and Attend to Track: Learning Natural Language Guided Structural Representation and Visual Attention for Object Tracking. Wang X, Li C, Yang R, et al. arXiv preprint arXiv:1811.10014, 2018. [Paper]
[3] [Benchmark-Dataset] LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking. Fan H, Lin L, Yang F, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 5374-5383. [Paper] [Project] [Journal-Version]
[4] [Benchmark-Dataset] Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark. Wang X, Shu X, Zhang Z, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 13763-13773. [Paper] [Eval-Toolkit] [Project]
[5] Grounding-Tracking-Integration. Yang Z, Kumar T, Chen T, et al. IEEE Transactions on Circuits and Systems for Video Technology, 2020. [Paper]
[6] Capsule-based Object Tracking with Natural Language Specification. Ding Ma, Xiangqian Wu. Proceedings of the 29th ACM International Conference on Multimedia (ACM MM '21), October 2021. [Paper]
[7] Siamese Natural Language Tracker: Tracking by Natural Language Descriptions With Siamese Trackers. Qi Feng, Vitaly Ablavsky, Qinxun Bai, Stan Sclaroff. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 5851-5860. [Paper] [Code]
[8] Siamese Tracking with Lingual Object Constraints. Filtenborg M, Gavves E, Gupta D. arXiv preprint arXiv:2011.11721, 2020. [Paper] [Code]
[9] Real-time Visual Object Tracking with Natural Language Description. Feng Q, Ablavsky V, Bai Q, et al. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020: 700-709. [Paper]
[10] [Benchmark-Dataset] WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking. Chunhui Zhang, Guanjie Huang, Li Liu, Shan Huang, Yinan Yang, Yuxuan Zhang, Xiang Wan, Shiming Ge. [Paper] [Github]
[11] CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions. Feng Q, Ablavsky V, Sclaroff S. arXiv preprint arXiv:2101.04741, 2021. [Paper] [Dataset]
[12] SBNet: Segmentation-based Network for Natural Language-based Vehicle Search. Lee S, Woo T, Lee S H. Proceedings of the IEEE/CVF CVPR Workshops, 2021: 4054-4060. [Paper] [Code]
[13] Spatio-temporal Person Retrieval via Natural Language Queries. Yamaguchi M, Saito K, Ushiku Y, et al. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017: 1453-1462. [Paper] [Project]
[14] Person Tube Retrieval via Language Description. Fan H, Yang Y. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 10754-10761. [Paper]
[15] Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video. Chen Z, Ma L, Luo W, et al. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019: 1884-1894. [Paper] [Dataset & Code]
[16] Referring to Objects in Videos Using Spatio-Temporal Identifying Descriptions. Wiriyathammabhum P, Shrivastava A, Morariu V, et al. Proceedings of the Second Workshop on Shortcomings in Vision and Language, 2019: 14-25. [Paper]
[17] STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. Su R, Yu Q, Xu D. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 1533-1542. [Paper]
[18] Connecting Language and Vision for Natural Language-based Vehicle Retrieval. Bai S, Zheng Z, Wang X, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 4034-4043. [Paper] [Code]
[19] "Semantics-aware spatial-temporal binaries for cross-modal video retrieval." Qi, Mengshi, et al. IEEE Transactions on Image Processing 30 (2021): 2989-3004. [Paper]
[20] Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos. Song S, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. [Paper] [Code]
[21] [NeurIPS-2022] Divert More Attention to Vision-Language Tracking. Guo M, Zhang Z, Fan H, Jing L. Advances in Neural Information Processing Systems, 2022, 35: 4446-4460. [arXiv] [Code]
[22] Cross-Modal Target Retrieval for Tracking by Natural Language. Li Y, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Paper]
[23] BAPO: A Large-Scale Multimodal Corpus for Ball Possession Prediction in American Football Games. Ziruo Yi, Eduardo Blanco, Heng Fan, Mark V. Albert. 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR). [Paper]
[24] [PRL-2023] Transformer Vision-Language Tracking via Proxy Token Guided Cross-Modal Fusion. Zhao H, Wang X, Wang D, Lu H, Ruan X. Pattern Recognition Letters, 2023, 168: 10-16. [Paper]
[25] [CVPR-2023] Joint Visual Grounding and Tracking with Natural Language Specification. Zhou L, Zhou Z, Mao K, He Z. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 23151-23160. [Paper] [Code]
[26] Type-to-Track: Retrieve Any Object via Prompt-based Tracking. Pha Nguyen, Kha Gia Quach, Kris Kitani, Khoa Luu. [Paper]
[27] [IEEE TMM-2023] One-stream Vision-Language Memory Network for Object Tracking. H. Zhang, J. Wang, J. Zhang, T. Zhang, B. Zhong. IEEE Transactions on Multimedia, 2023, doi: 10.1109/TMM.2023.3285441. [Paper]
[28] [ACM MM-2023] All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment. Chunhui Zhang, Xin Sun, Li Liu, Yiqian Yang, Qiong Liu, Xi Zhou, Yanfeng Wang. [Paper]
[29] [IEEE TCSVT-2023] Towards Unified Token Learning for Vision-Language Tracking. Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, Xianxian Li. IEEE Transactions on Circuits and Systems for Video Technology, 2023. [Paper]
[30] [ICCV-2023] CiteTracker: Correlating Image and Text for Visual Tracking. Xin Li, Yuqing Huang, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang. [Paper] [Code]
[31] [arXiv-2023] Divert More Attention to Vision-Language Object Tracking. Guo M, Zhang Z, Jing L, Ling H, Fan H. arXiv preprint arXiv:2307.10046, 2023. [Paper]
[32] [AAAI-2024] Unifying Visual and Vision-Language Tracking via Contrastive Learning. Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang. [Paper] [Code]
[33] VastTrack: Vast Category Visual Object Tracking. Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, Libo Zhang. [Paper]
[34] Beyond MOT: Semantic Multi-Object Tracking. Yunhao Li, Hao Wang, Qin Li, Xue Ma, Jiali Yao, Shaohua Dong, Heng Fan, Libo Zhang. [Paper]
[35] [IEEE TCSVT-2024] One-Stream Stepwise Decreasing for Vision-Language Tracking. G. Zhang, B. Zhong, Q. Liang, Z. Mo, N. Li, S. Song. IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2024.3395352. [Paper]
[36] [CVPR Workshop-2024, Oral] DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM. Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, Kaiqi Huang. arXiv:2405.12139. [Paper]
[37] [arXiv:2405.19818] WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark. Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang. [Paper] [Code]
[38] Context-Aware Integration of Language and Visual References for Natural Language Tracking. Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 19208-19217.
[39] Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark. Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang. [Paper]
[40] [ECCV-2024] SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild. Pengfei Wang, Xiaofei Hui, Jing Wu, Zile Yang, Kian Eng Ong, Xinge Zhao, Beijia Lu, Dezhao Huang, Evan Ling, Weiling Chen, Keng Teck Ma, Minhoe Hur, Jun Liu. [Paper]
[41] ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model. [Paper]
[42] [arXiv:2411.13183] ClickTrack: Towards Real-time Interactive Single Object Tracking. [Paper]
[43] How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking. Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang. [Paper]
[44] MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking. Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He. [Paper]