This repository contains all code used by the first-place team "Next top GB model" (David Austin and Miha Skalic) in Kaggle's 2nd YouTube-8M Video Understanding Challenge.
The repository is a fork of Google's YouTube-8M starter code and borrows from Wang et al., Miech et al., and Skalic et al. Code is released under the Apache License, Version 2.0.
This README walks through a specific example that reproduces training, evaluation, distillation, quantization, and graph combination for a single model type.
If you find this code useful, please cite:

```
@inproceedings{skalic2018building,
  title={Building A Size Constrained Predictive Models for Video Classification},
  author={Skalic, Miha and Austin, David},
  booktitle={European Conference on Computer Vision},
  pages={297--305},
  year={2018},
  organization={Springer}
}
```
All models herein were trained in single-GPU mode, and the instructions that follow reproduce that setup. The overall flow for training each model is as follows:
1. Train the model.
2. Evaluate the model.
3. (Optional) Apply an exponential moving average (EMA) of the weights.
4. Quantize the model.
5. Run inference on the quantized model.
6. (Optional) If running distillation, create a distillation dataset, then repeat from step 1.
7. Combine multiple graphs into a single graph.
This README walks through all commands needed to train both a stand-alone model and a distillation model. For the configurations of the other models, see all_models.txt.

All code was run with Python 2.7 and TensorFlow 1.8.0, and all models were trained on GPUs. The requirements.txt file lists every library installed in the environment used for training and testing. Not all of those libraries are strictly required, but installing the full list should ensure compatibility with all of the code.
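As a quick optional sanity check of the environment, you can confirm the TensorFlow version and GPU visibility from Python (the exact version string is an assumption based on the requirements above):

```python
# Optional environment sanity check; assumes TensorFlow 1.8.0 per requirements.txt.
import tensorflow as tf

print(tf.__version__)              # expect "1.8.0"
print(tf.test.is_gpu_available())  # expect True on a GPU machine
```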
Make sure the training-data and model-save paths below match your local setup:
```bash
export CUDA_VISIBLE_DEVICES=0
SAVEPATH="../trained_models"
RECORDPAT="../data/frame/train"

python train.py \
  --train_data_pattern="$RECORDPAT/*.tfrecord" \
  --model=NetVLADModelLF \
  --train_dir="$SAVEPATH/NetVLAD" \
  --frame_features=True \
  --feature_names="rgb,audio" \
  --feature_sizes="1024,128" \
  --batch_size=160 \
  --base_learning_rate=0.0002 \
  --netvlad_cluster_size=256 \
  --netvlad_hidden_size=1024 \
  --moe_l2=1e-6 \
  --iterations=300 \
  --learning_rate_decay=0.8 \
  --netvlad_relu=False \
  --gating=True \
  --moe_prob_gating=True \
  --lightvlad=False \
  --num_gpu 1 \
  --num_epochs=10
```
Once training is complete, evaluation is performed as follows. Point RECORDPATVAL at your validation split:

```bash
RECORDPATVAL="../data/frame/validate"

python eval.py \
  --eval_data_pattern="$RECORDPATVAL/*.tfrecord" \
  --model=NetVLADModelLF \
  --train_dir="$SAVEPATH/NetVLAD" \
  --frame_features=True \
  --feature_names="rgb,audio" \
  --feature_sizes="1024,128" \
  --batch_size=160 \
  --base_learning_rate=0.0002 \
  --netvlad_cluster_size=256 \
  --netvlad_hidden_size=1024 \
  --moe_l2=1e-6 \
  --iterations=300 \
  --learning_rate_decay=0.8 \
  --netvlad_relu=False \
  --gating=True \
  --moe_prob_gating=True \
  --lightvlad=False \
  --num_gpu 1 \
  --num_epochs=10 \
  --run_once \
  --build_only \
  --sample_all
```
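Evaluation reports the challenge metric, GAP (global average precision over each video's top-20 predictions). For intuition only, here is a hedged numpy sketch of that metric; the repository's eval code is the authoritative implementation:

```python
import numpy as np

def gap(predictions, labels, top_k=20):
    """Sketch of GAP@top_k: pool every video's top_k (score, hit) pairs,
    sort the pool by score, and compute average precision over the pool.
    predictions, labels: (num_videos, num_classes) arrays (labels in {0,1})."""
    scores, hits = [], []
    for pred, lab in zip(predictions, labels):      # one row per video
        top = np.argsort(pred)[::-1][:top_k]
        scores.extend(pred[top])
        hits.extend(lab[top])
    order = np.argsort(scores)[::-1]
    hits = np.asarray(hits, dtype=np.float64)[order]
    precision_at_i = np.cumsum(hits) / (np.arange(len(hits)) + 1.0)
    num_positives = max(float(np.sum(labels)), 1.0)
    return float(np.sum(precision_at_i * hits) / num_positives)
```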
(Optional) To apply an exponential moving average (EMA) of the weights, run a short second training pass, pointing --ema_source at the model trained above:

```bash
python train.py \
  --train_data_pattern="$RECORDPAT/*.tfrecord" \
  --model=NetVLADModelLF \
  --train_dir="$SAVEPATH/NetVLAD_ema" \
  --video_level_classifier_model="LogisticModel" \
  --frame_features \
  --feature_names="rgb,audio" \
  --feature_sizes="1024,128" \
  --batch_size=160 \
  --base_learning_rate=0.00008 \
  --lstm_cells=1024 \
  --num_epochs=2 \
  --num_gpu 1 \
  --num_readers 8 \
  --loss_lambda 0.5 \
  --ema_halflife 2000 \
  --ema_source "$SAVEPATH/NetVLAD/inference_model"
```
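For intuition, --ema_halflife controls how quickly old weights are forgotten: a halflife of 2000 steps corresponds to a per-step decay of 0.5^(1/2000). A minimal sketch of the standard EMA update, with illustrative variable names (the actual implementation lives in the training code):

```python
halflife = 2000.0                # mirrors --ema_halflife above
decay = 0.5 ** (1.0 / halflife)  # ~0.99965 per training step

def ema_update(shadow, weights):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return decay * shadow + (1.0 - decay) * weights
```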
Then evaluate the EMA model:

```bash
python eval.py \
  --eval_data_pattern="$RECORDPATVAL/*.tfrecord" \
  --model=NetVLADModelLF \
  --train_dir="$SAVEPATH/NetVLAD_ema" \
  --video_level_classifier_model="LogisticModel" \
  --frame_features \
  --feature_names="rgb,audio" \
  --feature_sizes="1024,128" \
  --batch_size=160 \
  --base_learning_rate=0.00008 \
  --lstm_cells=1024 \
  --num_epochs=2 \
  --num_gpu 1 \
  --num_readers 8 \
  --build_only \
  --run_once \
  --sample_all
```
Next, quantize the model. Change --savefile to your desired save path, and copy the model flags alongside the quantized graph:

```bash
python quantize.py \
  --transform_type quant_uniform \
  --model "$SAVEPATH/NetVLAD_ema/inference_model" \
  --savefile ../trained_models/quants/your_model/inference_model

cp "$SAVEPATH/NetVLAD_ema/model_flags.json" ../trained_models/quants/your_model
```
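The quant_uniform transform presumably maps each float32 weight tensor to 8-bit integers over its [min, max] range; here is a hedged numpy sketch of that idea, not the script's actual code:

```python
import numpy as np

def quantize_uniform(w, bits=8):
    """Sketch of uniform quantization: map floats to 2**bits levels
    spanning [w.min(), w.max()]; keep (lo, scale) to dequantize later."""
    lo, hi = float(w.min()), float(w.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover approximate float weights from the quantized tensor."""
    return q.astype(np.float32) * scale + lo
```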
graph_ensemble.py takes two or more trained models and combines them into a single graph. Note that --weights takes one value per model and the values should sum to 1. Sample usage:
```bash
python graph_ensemble.py \
  --models ../trained_models/quants/74/inference_model \
           ../trained_models/model_1/inference_model \
           ../trained_models/model_2/inference_model \
  --weights 0.3333 0.3333 0.3334 \
  --save_folder ../trained_models/your_combined_output_graph
```
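Conceptually, the combined graph runs each model on the same input and takes a weighted average of their class probabilities. A minimal numpy illustration of that averaging (the script itself stitches TensorFlow graphs together, which this sketch does not reproduce):

```python
import numpy as np

def ensemble(predictions, weights):
    """predictions: list of (num_videos, num_classes) arrays, one per model.
    weights: one float per model, summing to 1."""
    assert len(predictions) == len(weights)
    return sum(w * p for w, p in zip(weights, predictions))
```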
Run inference with the quantized model on the test split:

```bash
RECORDPATTEST="../data/frame/test"

python inference_gpu.py \
  --train_dir "../trained_models/quants/your_model" \
  --output_file="./output.csv" \
  --input_data_pattern="$RECORDPATTEST/*.tfrecord" \
  --batch_size 200 \
  --sample_all
```
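output.csv follows the Kaggle submission format: one row per video with space-separated label/confidence pairs for the top predictions. A hedged sketch of producing such a row from a score vector (the exact header and pair count should be checked against the competition spec):

```python
import numpy as np

def format_row(video_id, scores, top_k=20):
    """Render one submission row: 'VideoId,label score label score ...'."""
    top = np.argsort(scores)[::-1][:top_k]
    pairs = " ".join("%d %.6f" % (i, scores[i]) for i in top)
    return "%s,%s" % (video_id, pairs)

# e.g. format_row("abc123", np.random.rand(3862)) -> "abc123,17 0.998 ..."
```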
WARNING: large dataset creation! Building a new distillation set consumes roughly 1.4 TB of disk space, so make sure the storage is available.

```bash
python prepare_distill_dataset.py \
  --batch_size 128 \
  --file_size 512 \
  --input_data_pattern "$RECORDPATVAL/*.tfrecord" \
  --output_dir "output_folder/train_distill/" \
  --model_file "../trained_models/your_ensemble_model/inference_model"
```
Training on a distillation dataset can be done with the train_distill.py script, using the same flags as train.py.
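Distillation training typically mixes the ground-truth loss with a loss against the teacher's soft predictions, and the --loss_lambda flag above suggests a convex combination of the two. A hedged TensorFlow sketch of that idea (the exact formulation in train_distill.py may differ):

```python
import tensorflow as tf

def distill_loss(logits, labels, teacher_probs, loss_lambda=0.5):
    """Sketch: convex mix of ground-truth and teacher cross-entropies.
    loss_lambda mirrors the --loss_lambda flag and weights the two terms."""
    probs = tf.nn.sigmoid(logits)
    eps = 1e-6
    ce_true = -tf.reduce_mean(
        labels * tf.log(probs + eps) + (1 - labels) * tf.log(1 - probs + eps))
    ce_teacher = -tf.reduce_mean(
        teacher_probs * tf.log(probs + eps)
        + (1 - teacher_probs) * tf.log(1 - probs + eps))
    return loss_lambda * ce_true + (1 - loss_lambda) * ce_teacher
```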
The file model_configs.xlsx contains the architectures of the models used in this work. Trained models (as .tar.gz) can be downloaded from here. See inference.py for sample usage of a trained model. The feature_extractor folder contains information on preprocessing custom videos.
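For reference, frame-level YouTube-8M records are tf.SequenceExamples with "rgb" (1024-d) and "audio" (128-d) features stored as quantized bytes per frame. A hedged parsing sketch is below; the repository's readers.py is the authoritative version and also dequantizes the uint8 values back to their original float range, which this sketch omits:

```python
import tensorflow as tf

def parse_frame_example(serialized):
    """Parse one frame-level SequenceExample (sketch; see readers.py).
    Per-frame features arrive as quantized byte strings."""
    contexts, features = tf.parse_single_sequence_example(
        serialized,
        context_features={
            "id": tf.FixedLenFeature([], tf.string),
            "labels": tf.VarLenFeature(tf.int64),
        },
        sequence_features={
            "rgb": tf.FixedLenSequenceFeature([], tf.string),
            "audio": tf.FixedLenSequenceFeature([], tf.string),
        })
    rgb = tf.cast(tf.decode_raw(features["rgb"], tf.uint8), tf.float32)
    audio = tf.cast(tf.decode_raw(features["audio"], tf.uint8), tf.float32)
    return contexts["id"], contexts["labels"], rgb, audio
```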