
readme: video_features (api change) + minor fixes
v-iashin committed Dec 8, 2021
1 parent feaaf75 commit eee222d
Showing 1 changed file with 10 additions and 10 deletions: README.md
@@ -7,7 +7,7 @@ BMVC 2021 – Oral Presentation
[[BMVC Proceedings](https://www.bmvc2021-virtualconference.com/conference/papers/paper_1213.html)]
[[Poster (for PAISS)](https://v-iashin.github.io/images/specvqgan/poster.pdf)]
[[Presentation on YouTube](https://www.youtube.com/watch?v=Bucb3nAa398)] ([Can't watch YouTube?](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/SpecVQGAN%20YouTube.mp4))

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pxTIMweAKApJZ3ZFqyBee3HtMqFpnwQ0?usp=sharing)

@@ -131,7 +131,7 @@ Each part holds ~3k clips and can be used independently as a subset of the whole

For `BN Inception` features, we employ the same procedure as [RegNet](https://github.com/PeihaoChen/regnet#data-preprocessing).

-For `ResNet50` features, we rely on [video_features](https://v-iashin.github.io/video_features/models/resnet/)
+For `ResNet50` features, we rely on the [video_features (branch `specvqgan`)](https://github.com/v-iashin/video_features/tree/specvqgan)
repository and use these commands:
```bash
# VAS (few hours on three 2080Ti)
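# NOTE: the actual extraction commands are collapsed in this diff view.
# As a hedged sketch only – the flag names and paths below are assumptions
# about the video_features CLI, not verbatim from this README; see the
# linked `specvqgan` branch for the exact invocation:
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --file_with_video_paths ./vas_video_paths.txt \
    --on_extraction save_numpy \
    --output_path ./data/vas/features
```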
@@ -299,9 +299,9 @@ The setting **(a)**: the transformer is trained on *VGGSound* to sample from the *VGGSound* codebook
| 212 Feats | BN Inception | 9.6 | 6.8 | 11.8 | [79895ac08303b1536809cad1ec9a7502](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-06-03T07-27-58_vggsound_transformer.tar.gz) |
| 212 Feats | ResNet50 | 10.5* | 6.9* | 11.8 | [b222cc0e7aeb419f533d5806a08669fe](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-07-30T21-34-41_vggsound_transformer.tar.gz) |

-\* – calculated on 1 sampler per video the test set instead of 10 samples per video as the rest.
+\* – calculated on 1 sample per video of the test set instead of the 10 samples per video used for the rest.
Evaluating a model on a larger number of samples per video is an expensive procedure.
-When evaluative on 10 samples per video, one might expect that the values might improve a bit (~+0.1).
+When evaluated on 10 samples per video, one might expect the values to improve a bit (~+0.1).
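Each entry in the Link / MD5SUM column pairs a tarball URL with its checksum, so a download can be verified before unpacking. A minimal sketch using the setting (a) ResNet50 checkpoint from the table above (standard `wget`/`md5sum`/`tar` tools assumed):

```bash
# Download the pre-trained transformer (setting (a), ResNet50 features).
wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-07-30T21-34-41_vggsound_transformer.tar.gz

# Check the archive against the MD5SUM listed in the table (note: two spaces).
echo "b222cc0e7aeb419f533d5806a08669fe  2021-07-30T21-34-41_vggsound_transformer.tar.gz" | md5sum -c -

# Unpack only if the checksum matched.
tar -xzf 2021-07-30T21-34-41_vggsound_transformer.tar.gz
```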

The setting **(b)**: the transformer is trained on *VAS* to sample from the *VGGSound* codebook
| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
@@ -314,9 +314,9 @@ The setting **(b)**: the transformer is trained on *VAS* to sample from the *VGGSound* codebook
| 212 Feats | BN Inception | 20.5 | 6.0 | 11.8 | [1c4e56077d737677eac524383e6d98d3](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-06-03T05-38-40_vas_transformer.tar.gz) |
| 212 Feats | ResNet50 | 20.8* | 6.2* | 11.8 | [6e553ea44c8bc7a3310961f74e7974ea](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-07-29T11-52-28_vas_transformer.tar.gz) |

-\* – calculated on 10 sampler per video the validation set instead of 100 samples per video as the rest.
+\* – calculated on 10 samples per video of the test set instead of the 100 samples per video used for the rest.
Evaluating a model on a larger number of samples per video is an expensive procedure.
-When evaluative on 10 samples per video, one might expect that the values might improve a bit (~+0.1).
+When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).

The setting **(c)**: the transformer is trained on *VAS* to sample from the *VAS* codebook
| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
@@ -329,9 +329,9 @@ The setting **(c)**: the transformer is trained on *VAS* to sample from the *VAS* codebook
| 212 Feats | BN Inception | 25.4 | 5.9 | 11.6 | [4542632b3c5bfbf827ea7868cedd4634](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-06-09T15-17-18_vas_transformer.tar.gz) |
| 212 Feats | ResNet50 | 22.6* | 5.8* | 11.6 | [dc2b5cbd28ad98d2f9ca4329e8aa0f64](https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/models/2021-07-29T13-34-39_vas_transformer.tar.gz) |

-\* – calculated on 10 sampler per video the validation set instead of 100 samples per video as the rest.
+\* – calculated on 10 samples per video of the test set instead of the 100 samples per video used for the rest.
Evaluating a model on a larger number of samples per video is an expensive procedure.
-When evaluative on 10 samples per video, one might expect that the values might improve a bit (~+0.1).
+When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).

A transformer can also be trained to generate a spectrogram given a specific **class**.
We also provide pre-trained models for all three settings:
@@ -440,7 +440,7 @@ cd ./specvqgan/modules/losses/vggishish
# vggish-ish
python train_vggishish.py config=./configs/vggish.yaml device='cuda:0'
# melception
-python train_melception.py config=./configs/melception.yaml device='cuda:1'
+python train_melception.py config=./configs/melception.yaml device='cuda:0'
```
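The two classifiers are trained by separate scripts, so with a second GPU they can also run concurrently by reusing the `device` override shown above (a sketch; `cuda:1` assumes a second GPU is present):

```bash
# Launch each training on its own GPU and wait for both to finish.
python train_vggishish.py config=./configs/vggish.yaml device='cuda:0' &
python train_melception.py config=./configs/melception.yaml device='cuda:1' &
wait
```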

## Training MelGAN
@@ -565,7 +565,7 @@ Please use this bibtex if you would like to cite our work
# Acknowledgments
Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.

-We also acknowledge the following codebases:
+We also acknowledge the following work:
- The code base is built upon an amazing [taming-transformers](https://github.com/CompVis/taming-transformers) repo.
Check it out if you are into high-res image generation.
- The implementation of some evaluation metrics is partially borrowed and adapted from [torch-fidelity](https://github.com/toshas/torch-fidelity).
