[ICME 2024] Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection
Shengyang Sun, Xiaojin Gong
An overview of the proposed framework. It includes three unimodal encoders, a multimodal fusion module, and a global encoder for multimodal feature generation. Each unimodal encoder consists of a modality-specific feature extraction backbone, a linear projection layer for tokenization, and a modality-shared transformer for context aggregation within one modality. The fusion module contains a multi-scale bottleneck transformer (MSBT) to fuse any pair of modalities and a sub-module to weight the concatenated fused features. The global encoder, implemented by a transformer, aggregates context over all modalities. Finally, the produced multimodal features are fed into a regressor to predict anomaly scores.

[2024.05.09] ⭐️ We release the Multi-scale Bottleneck Transformer (MSBT). We encourage you to integrate the MSBT module into your own framework to enhance feature fusion; the detailed implementation is in MultiScaleBottleneckTransformer.py.
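For reference, below is a minimal, hedged sketch of the bottleneck-token fusion idea that MSBT builds on, written in PyTorch. It is not the released implementation (see MultiScaleBottleneckTransformer.py for that); the layer count, number of bottleneck tokens, and token-reduction schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """One fusion step: the two modalities exchange information only through
    a small set of shared bottleneck tokens."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.enc_a = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.enc_b = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)

    def forward(self, x_a, x_b, btk):
        n = btk.size(1)
        # Modality A attends jointly over its own tokens and the bottleneck tokens.
        out_a = self.enc_a(torch.cat([x_a, btk], dim=1))
        x_a, btk = out_a[:, :-n], out_a[:, -n:]
        # Modality B sees the bottleneck tokens already updated by modality A.
        out_b = self.enc_b(torch.cat([x_b, btk], dim=1))
        x_b, btk = out_b[:, :-n], out_b[:, -n:]
        return x_a, x_b, btk


class MultiScaleBottleneckFusion(nn.Module):
    """Stacks fusion layers while shrinking the bottleneck, so deeper layers
    squeeze cross-modal information through an ever-tighter channel."""

    def __init__(self, dim, num_layers=3, num_bottleneck_tokens=8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck_tokens, dim))
        self.layers = nn.ModuleList(
            [BottleneckFusionLayer(dim) for _ in range(num_layers)]
        )

    def forward(self, x_a, x_b):
        # x_a, x_b: (batch, seq_len, dim) token sequences from two unimodal encoders.
        btk = self.bottleneck.expand(x_a.size(0), -1, -1)
        for layer in self.layers:
            x_a, x_b, btk = layer(x_a, x_b, btk)
            # "Multi-scale": halve the number of bottleneck tokens after each layer
            # (an assumed schedule, for illustration only).
            btk = btk[:, : max(1, btk.size(1) // 2)]
        # Fused unimodal features, ready to be concatenated and re-weighted.
        return x_a, x_b
```

In the full framework, such pairwise fusion is applied to the (RGB, Audio), (RGB, Flow), and (Audio, Flow) pairs, and the fused features are concatenated, re-weighted, and aggregated by the global encoder before regression.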
Method | Modality | AP (%) |
---|---|---|
MSBT (Ours) | RGB & Audio | 82.54 |
MSBT (Ours) | RGB & Flow | 80.68 |
MSBT (Ours) | Audio & Flow | 77.47 |
MSBT (Ours) | RGB & Audio & Flow | 84.32 |
Requirements:
python==3.7.13
torch==1.11.0
cuda==11.3
numpy==1.21.5
The extracted features can be downloaded from the official XD-Violence page. We use the RGB, Flow, and Audio features in this paper. Please download them and arrange the feature paths into path lists in the list/ folder, one file per modality, as follows (a small helper sketch for generating these lists appears after the mapping):
I3D/RGB -> rgb.list
I3D/RGBTest -> rgb_test.list
I3D/Flow -> flow.list
I3D/FlowTest -> flow_test.list
vggish-features/Train -> audio.list
vggish-features/Test -> audio_test.list
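A minimal sketch for writing these list files is shown below. It assumes the features are stored as .npy files under the directories above and that each list expects one feature path per line; this may differ from how main.py actually reads the lists, so adapt it as needed.

```python
import glob
import os

# Mapping from (assumed) feature directories to the list files above.
MAPPING = {
    "I3D/RGB": "list/rgb.list",
    "I3D/RGBTest": "list/rgb_test.list",
    "I3D/Flow": "list/flow.list",
    "I3D/FlowTest": "list/flow_test.list",
    "vggish-features/Train": "list/audio.list",
    "vggish-features/Test": "list/audio_test.list",
}

for feature_dir, list_path in MAPPING.items():
    # One feature path per line; the .npy extension is an assumption.
    paths = sorted(glob.glob(os.path.join(feature_dir, "*.npy")))
    with open(list_path, "w") as f:
        f.write("\n".join(paths) + "\n")
```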
To train the model:
python main.py
To evaluate with the provided checkpoint:
python main.py --eval --model_path ckpt/MSBT_best_84.32.pkl
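The AP column in the results table above is frame-level average precision. As a generic illustration (not the repository's evaluation code), such a score can be computed from per-frame anomaly predictions with scikit-learn; the arrays below are placeholders:

```python
import numpy as np
from sklearn.metrics import average_precision_score

gt = np.array([0, 0, 1, 1, 0, 1])                  # placeholder per-frame violence labels
scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9])  # placeholder predicted anomaly scores
print(f"Frame-level AP: {average_precision_score(gt, scores) * 100:.2f}%")
```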
If you find our paper useful, please star our repo and cite our paper as follows:
@inproceedings{sun2024multiscale,
title={Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection},
author={Sun, Shengyang and Gong, Xiaojin},
booktitle={2024 IEEE International Conference on Multimedia and Expo (ICME)},
pages={1--6},
year={2024},
organization={IEEE}
}
This project is released under the MIT License.
Some code is based on MACIL_SD and XDVioDet; we sincerely thank the authors for their contributions.