Hybrid Reasoning Network for Video-based Commonsense Captioning

Introduction

This repository contains the source code for our ACM MM 2021 paper: Hybrid Reasoning Network for Video-based Commonsense Captioning.

This repo should be sufficient to replicate the results from the paper. If you have any issues getting it set up, please file a GitHub issue. Note that the paper is currently an arXiv version, so there may be further updates in the future.

Background

This repository targets the new task of video-based commonsense captioning, which aims to generate event-wise captions and, at the same time, provide multiple commonsense descriptions (e.g., attribute, effect and intention) about the underlying event in the video.
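
As a purely illustrative sketch (the field names below are hypothetical and do not come from this repository's annotation schema), the task can be thought of as mapping one video event to a caption plus three kinds of commonsense descriptions:

# Illustrative only: a hypothetical output structure for one video event.
# The field names are not the repository's actual JSON schema.
example_output = {
    "caption": "a man is lifting weights in a gym",
    "intention": ["to get stronger"],         # why the agent performs the event
    "effect": ["becomes tired and sweaty"],   # what happens as a result of the event
    "attribute": ["athletic", "determined"],  # properties describing the agent
}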

Dataset

The V2C dataset is provided in V2C_annotations.zip, which consists of:

V2C_annotations.zip
├── msrvtt_new_info.json                      # MSR-VTT captions and token dictionary.
├── v2c_info.json                             # V2C Raw, captions/CMS, and token dictionary.
├── V2C_MSR-VTT_caption.json                  # V2C Raw, captions/CMS after tokenization.
├── train_cvpr_humanRank_V2C_caption.json     # A human re-verified clean split of the V2C annotations.
└── v2cqa_v1_train.json                       # For V2C QA, consisting of captions, CMS, and CMS-related questions/answers.
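
As a minimal sketch, the annotation files can be loaded and inspected with the standard json module (the paths match the --info_json and --caption_json arguments used below; the exact schema is whatever the files contain):

import json

# Minimal inspection sketch: print the top-level structure of the annotation
# files rather than assuming a particular schema.
with open("data/v2c_info.json") as f:
    info = json.load(f)
with open("data/V2C_MSR-VTT_caption.json") as f:
    captions = json.load(f)

for name, obj in [("v2c_info", info), ("V2C_MSR-VTT_caption", captions)]:
    if isinstance(obj, dict):
        print(name, "keys:", list(obj.keys())[:10])
    else:
        print(name, "entries:", len(obj))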

Video Features

We use pre-trained ResNet152, SoundNet and I3D models to extract the appearance, audio and motion features, respectively. The video feature data can be obtained from the link.
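
As a rough sketch of the appearance branch only (it assumes frames have already been sampled from the video and is not the exact script used to produce the released features), 2048-d ResNet152 features can be extracted with torchvision, matching the --dim_vis_feat 2048 setting used below:

import torch
import torchvision.models as models
import torchvision.transforms as T

# Rough sketch: 2048-d appearance features from pre-sampled RGB frames with
# ResNet152. Illustrative only, not the pipeline used for the released features.
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(pil_frames):
    """pil_frames: list of PIL.Image frames sampled from one video."""
    batch = torch.stack([preprocess(im) for im in pil_frames])
    return resnet(batch)          # shape: (num_frames, 2048)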

Training and Evaluation

  • Environment: This implementation was developed on PyTorch 1.8.1. Some errors have been reported when newer versions of PyTorch are used; we will work on an update for that later. A quick check of the environment is sketched below.
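
A minimal sanity check of the environment (it only verifies the PyTorch version and GPU visibility, and assumes nothing beyond the dependency stated above):

import torch

# Quick environment check: the code was developed on PyTorch 1.8.1, and the
# --cuda flag in the commands below requires a visible GPU.
print("torch version:", torch.__version__)        # developed on 1.8.1
print("CUDA available:", torch.cuda.is_available())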

For example, to start training on the intention prediction task (set --cms int) with 1 RNN video encoder layer and 6 transformer decoder layers (8 attention heads, head dim 64, inner dim 1024), for 600 epochs in CUDA mode, while showing intermediate generation examples:

python train.py --cms int --batch_size 64 --epochs 600 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save \
                --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json \
                --print_loss_every 20 --cuda --show_predict   

For completion evaluations:

python test_cms.py  --cms int --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                    --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save  \
                    --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json  \
                    --load_checkpoint save/**.pth --cuda

For generation evaluations:

python test_cap2cms.py  --cms int --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 \
                        --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save  \
                        --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json \
                        --load_checkpoint save/*.pth --cuda

Model Zoo

Download MODEL_ZOO.zip for our trained captioning models for intention, effect and attribute generation.
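
As an illustrative sketch (the checkpoint layout is not documented here, so nothing about its contents is assumed), a downloaded .pth file can be inspected on CPU before passing it to --load_checkpoint:

import torch

# Illustrative only: peek inside a downloaded checkpoint before using it with
# --load_checkpoint. "save/example.pth" is a placeholder file name.
ckpt = torch.load("save/example.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("checkpoint keys:", list(ckpt.keys()))
else:
    print("checkpoint object type:", type(ckpt))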

Citations

Please consider citing this paper if you find it helpful:

@inproceedings{yu2021hybrid,
      title={Hybrid Reasoning Network for Video-based Commonsense Captioning}, 
      author={Weijiang Yu and Jian Liang and Lei Ji and Lu Li and Yuejian Fang and Nong Xiao and Nan Duan},
      booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
      year={2021}
}
