The official implementation of "Single-Model and Any-Modality for Video Object Tracking", CVPR 2024 (arXiv).
We propose Un-Track, a Unified Tracker of a single set of parameters for any modality, which learns their common latent space with only the RGB-X pairs. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture and without the need for modality-specific fine-tuning.
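For intuition, here is a minimal sketch of how a single shared projection could bind an auxiliary modality X (depth, thermal, or event) into the RGB token stream with one set of parameters; the module name, dimensions, and low-rank design are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class SharedBinding(nn.Module):
    """Conceptual sketch: one shared low-rank projection serves every RGB-X pair."""

    def __init__(self, dim=768, rank=8):
        super().__init__()
        # A single set of parameters, reused for any modality.
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)

    def forward(self, rgb_tokens, x_tokens=None):
        # If the auxiliary modality is missing, fall back to RGB alone.
        if x_tokens is None:
            x_tokens = rgb_tokens
        # Bind the X tokens to the RGB stream through the shared latent space.
        return rgb_tokens + self.up(self.down(x_tokens))
```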
Our checkpoint can be found here (Google Drive).
Put the checkpoint into the "models" folder.
You should then be able to reproduce our Un-Track results, which can also be downloaded here (Google Drive).
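As a quick sanity check that the download is intact, a snippet like the following can be used; the checkpoint filename and the structure of the saved dictionary are assumptions, not the repository's exact API.

```python
import torch

# Placeholder filename: use whatever the Google Drive archive actually contains.
ckpt = torch.load("models/UnTrack.pth", map_location="cpu")
# Inspect the top-level keys (assumes the checkpoint is a plain dict of tensors
# or a dict wrapping the state_dict).
print(list(ckpt.keys()))
```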
A comparison against ViPT (SOTA specialized method) and SeqTrack (SOTA tracker) can be found in the following video:
Our shared embedding is conceptually similar to a Mixture-of-Experts (MoE) model.
The difference is that we manually force the network to pick the best expert, according to the sensor prior, for feature processing.
We have also developed a generalist and blind tracker, where the MoE is formally introduced and dynamically assigns the most appropriate expert for feature processing.
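The contrast between the two routing schemes can be sketched as follows; the expert count, layer shapes, and sensor-prior encoding are assumptions for illustration, not the actual implementations.

```python
import torch
import torch.nn as nn

class ModalityExperts(nn.Module):
    """Sketch of hard routing by sensor prior vs. learned (blind) gating."""

    def __init__(self, dim=768, num_experts=3):  # e.g. depth / thermal / event
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # used only by the blind variant

    def forward(self, tokens, sensor_id=None):
        if sensor_id is not None:
            # Un-Track style: the known sensor prior hard-selects the expert.
            return self.experts[sensor_id](tokens)
        # Generalist/blind style: a learned gate mixes experts dynamically.
        weights = torch.softmax(self.gate(tokens), dim=-1)          # (B, N, E)
        outs = torch.stack([e(tokens) for e in self.experts], -1)   # (B, N, D, E)
        return (outs * weights.unsqueeze(2)).sum(-1)                # (B, N, D)
```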
More details can be found in the [Preprint] or on [GitHub].
This repository is heavily based on ViPT and OSTrack. Thanks to their great work!