Stable Control Representations

Vision- and language-guided embodied AI requires a fine-grained understanding of the physical world through language and visual inputs. Such capabilities are difficult to learn solely from task-specific data, which has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations, such as those in CLIP, have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding, a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and, as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations (SCR), which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.

Installation

Please follow the instructions in INSTALLATION.md to install the model and associated benchmarks.

Directory structure

vc_models: contains config files for SCR and baseline models, the model loading code, and some project utilities.
- See README for more details.
cortexbench: embodied AI downstream tasks to evaluate SCR.
third_party: third-party submodules that aren't expected to change often.

Reproducing Results with the SCR Model

To reproduce the results with the SCR model, please follow the README instructions for each of the benchmarks in cortexbench.

Citing SCR

If you use SCR in your research, please cite the following paper:

@inproceedings{gupta2024scr,
      title={Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control},
      author={Gunshi Gupta and Karmesh Yadav and Yarin Gal and Dhruv Batra and Zsolt Kira and Cong Lu and Tim G. J. Rudner},
      year={2024},
      eprint={2405.05852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

We are thankful to the creators of Stable Diffusion for releasing the model, which has significantly contributed to the progress in the field. Additionally, we extend our thanks to the authors of Visual Cortex for releasing the code for CortexBench evaluations.
