
PyTorch S3D Text-Video trained HowTo100M

This repo contains a PyTorch S3D Text-Video model trained from scratch on HowTo100M using MIL-NCE [1]. If you use this model, we would appreciate it if you could cite [1] and [2] :).

The official TensorFlow Hub version of this model can be found here: https://tfhub.dev/deepmind/mil-nce/s3d/1, with a Colab showing how to use it here: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/text_to_video_retrieval_with_s3d_milnce.ipynb

Getting the data

You will first need to download the model weights and the word dictionary.

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

How to use it?

The following code explains how to instantiate the S3D Text-Video model with the pretrained weights and run inference on a few examples.

import torch as th
from s3dg import S3D

# Instantiate the model with the word dictionary and a 512-d joint embedding space
net = S3D('s3d_dict.npy', 512)

# Load the model weights
net.load_state_dict(th.load('s3d_howto100m.pth'))
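# If you run on a CPU-only machine, you may need map_location (this variant is an
# assumption, not part of the original instructions):
# net.load_state_dict(th.load('s3d_howto100m.pth', map_location='cpu'))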

# Video input should be of size Batch x 3 x T x H x W and normalized to [0, 1] 
video = th.rand(2, 3, 32, 224, 224)

# Evaluation mode
net = net.eval()
 
# Video inference
video_output = net(video)

# Text inference
text_output = net.text_module(['open door', 'cut tomato'])

NB: The video network is fully convolutional (with global average pooling over time and space at the end). However, we recommend using T=32 frames (the same as during training); T=16 frames also works reasonably well. For H and W we have been using values between 200 and 256.
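For reference, here is a minimal sketch of turning raw frames into the expected input tensor (the random frame array and the preprocess_frames helper are assumptions for illustration, not part of this repo):

import numpy as np
import torch as th

def preprocess_frames(frames):
    # frames: numpy array of shape T x H x W x 3 with uint8 values in [0, 255]
    video = th.from_numpy(frames).float() / 255.0  # normalize to [0, 1]
    video = video.permute(3, 0, 1, 2)              # -> 3 x T x H x W
    return video.unsqueeze(0)                      # -> 1 x 3 x T x H x W

# Example: 32 frames of size 224 x 224, as recommended above
frames = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)
video = preprocess_frames(frames)  # 1 x 3 x 32 x 224 x 224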

video_output is a dictionary containing two keys:

  • video_embedding: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.
  • mixed_5c: This is the global average pooled feature from S3D, of dimension 1024. It should be used for classification on downstream tasks (see the sketch after this list).
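As an illustration, here is a minimal sketch of plugging mixed_5c into a downstream classifier (the linear head and num_classes are hypothetical, not provided by this repo):

num_classes = 10  # assumption: number of classes of your downstream task
classifier = th.nn.Linear(1024, num_classes)

features = video_output['mixed_5c']  # Batch x 1024
logits = classifier(features)        # Batch x num_classes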

text_output is also a dictionary with a single key:

  • text_embedding: It is the text embedding (size 512) from the joint text-video space. To compute the similarity score between text and video, you would compute the dot product between text_embedding and video_embedding.

Computing all the pairwise video-text similarities:

The similarity scores can be computed with a dot product between the text_embedding and the video_embedding.

video_embedding = video_output['video_embedding']
text_embedding = text_output['text_embedding']
# We compute all the pairwise similarity scores between video and text.
similarity_matrix = th.matmul(text_embedding, video_embedding.t())
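As a usage example, the similarity matrix can be used directly for text-to-video retrieval (this snippet is a sketch, not part of the original code):

# Row i contains the scores of text query i against all videos
best_video_per_text = similarity_matrix.argmax(dim=1)
print(best_video_per_text)  # index of the highest-scoring video for each text query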

References

[1] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman, End-to-End Learning of Visual Representations from Uncurated Instructional Videos. https://arxiv.org/abs/1912.06430

[2] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev and J. Sivic, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. https://arxiv.org/abs/1906.03327

Bibtex:

@inproceedings{miech19howto100m,
   title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
   author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
   booktitle={ICCV},
   year={2019},
}

@inproceedings{miech19endtoend,
   title={{E}nd-to-{E}nd {L}earning of {V}isual {R}epresentations from {U}ncurated {I}nstructional {V}ideos},
   author={Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
   booktitle={CVPR},
   year={2020},
}

Acknowledgements

We would like to thank Yana Hasson for the help provided in the non-trivial porting of the original TensorFlow weights to PyTorch.