This repo contains the official dataset and a unified framework for multimodal intent benchmarks from the research paper MIntRec: A New Dataset for Multimodal Intent Recognition (accepted by ACM MM 2022).
In real-world conversational interactions, we usually combine information from multiple modalities (e.g., text, video, audio) to help analyze human intentions. Though intent analysis has been widely explored in the Natural Language Processing community, there is a scarcity of data for multimodal intent analysis. Thus, we provide a novel multimodal intent benchmark dataset, MIntRec, to boost research in this area. To the best of our knowledge, it is the first multimodal intent dataset from real-world conversational scenarios.
The overall process of building the MIntRec dataset is shown below:
We collect raw data from the Superstore TV series. The reasons are that it contains (1) a wealth of characters (including seven prominent and twenty recurring roles) with different identities in the superstore and (2) a mass of stories in various scenes (e.g., shopping mall, warehouse, office).
In this work, we design new hierarchical intent taxonomies for multimodal scenes. Inspired by human intention philosophy and goal-oriented intentions in artificial intelligence research, we define two coarse-grained intent categories: "Express emotions or attitudes" and "Achieve goals".
We further divide the two coarse-grained intent classes into 20 fine-grained classes by analyzing many video segments and summarizing high-frequency intent tags. They are as follows:
Express emotions or attitudes: Complain, Praise, Apologize, Thank, Criticize, Care, Agree, Taunt, Flaunt, Oppose, Joke.
Achieve goals: Inform, Advise, Arrange, Introduce, Comfort, Leave, Prevent, Greet, Ask for help.
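The two-level taxonomy can be written down as a small lookup table, which is handy when mapping fine-grained predictions back to coarse-grained ones. This is only a sketch: it assumes the two lists above correspond, in order, to "Express emotions or attitudes" and "Achieve goals", and the names `INTENT_TAXONOMY`, `FINE_TO_COARSE`, and `coarse_label` are illustrative, not part of this repo.

```python
# Hierarchical intent taxonomy: coarse-grained class -> fine-grained classes.
# Assumption: the grouping follows the order of the two lists in the README.
INTENT_TAXONOMY = {
    "Express emotions or attitudes": [
        "Complain", "Praise", "Apologize", "Thank", "Criticize", "Care",
        "Agree", "Taunt", "Flaunt", "Oppose", "Joke",
    ],
    "Achieve goals": [
        "Inform", "Advise", "Arrange", "Introduce", "Comfort", "Leave",
        "Prevent", "Greet", "Ask for help",
    ],
}

# Reverse lookup: fine-grained label -> coarse-grained label.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in INTENT_TAXONOMY.items()
    for fine in fines
}


def coarse_label(fine_label: str) -> str:
    """Return the coarse-grained class of a fine-grained intent label."""
    return FINE_TO_COARSE[fine_label]
```

For example, `coarse_label("Ask for help")` looks up the coarse-grained class for that fine-grained label.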
Five annotators label the full dataset independently. Each annotator combines text, video, and audio information and chooses the single intent label they are most confident in among the twenty fine-grained intent classes. A sample is kept only if one label receives at least three votes.
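The voting rule above can be sketched as follows. This is a minimal illustration of the "at least three of five votes" criterion, not the annotation tooling actually used; `aggregate_votes` and `MIN_VOTES` are hypothetical names.

```python
from collections import Counter
from typing import Optional, Sequence

MIN_VOTES = 3  # a sample qualifies when one label gets at least three votes


def aggregate_votes(votes: Sequence[str], min_votes: int = MIN_VOTES) -> Optional[str]:
    """Return the winning label, or None if no label reaches min_votes."""
    if not votes:
        return None
    # most_common(1) yields the (label, count) pair with the highest count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None
```

With five annotators and a threshold of three, at most one label can qualify, so no tie-breaking is needed.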
Item | Statistics |
---|---|
Number of coarse-grained intents | 2 |
Number of fine-grained intents | 20 |
Number of videos | 43 |
Number of video segments | 2,224 |
Number of words in text utterances | 15,658 |
Number of unique words in text utterances | 2,562 |
Average length of text utterances | 7.04 |
Average length of video segments (s) | 2.38 |
Maximum length of video segments (s) | 9.59 |
We propose an automatic process for annotating the visual information of speakers. It contains four main steps: (1) scene detection: distinguish different visual scenes in a video segment; (2) object and face detection: detect object bounding boxes and the corresponding faces, and establish a one-to-one mapping between them; (3) face tracking: compute the IoU of two faces in adjacent frames to judge whether they belong to the same person; and (4) audio-visual speaker detection: use TalkNet to identify the active speaker, then use the mapping from step 2 to obtain the speaker's object bounding box.
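The IoU check in step 3 can be sketched as below. This is an illustrative version only: the `(x1, y1, x2, y2)` box format, the function names, and the `0.5` threshold are assumptions, and the released code in the TalkNet_ASD directory may differ.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def same_face(prev: Box, curr: Box, threshold: float = 0.5) -> bool:
    """Treat faces in adjacent frames as one person if they overlap enough.

    The threshold value is a placeholder, not the one used by the authors.
    """
    return iou(prev, curr) >= threshold
```

A face track is then built by chaining detections in consecutive frames for which `same_face` holds.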
This process is simple yet effective: it generates over 120K keyframes in about 7 hours on a 3090 Ti GPU. We construct a test set of over 12K manually annotated keyframes. On it, the automatic speaker annotation process achieves a low missing rate (2.3%), and a high proportion (90.9%) of the predicted bounding boxes are of high quality (IoU > 0.9).
Note that we have released the full code of the automatic speaker annotation process in the TalkNet_ASD directory. Enjoy it!
The text features are extracted by a pre-trained BERT language model. The vision features are ROI features extracted by a pre-trained Faster R-CNN with a ResNet-50 backbone. The audio features are extracted by wav2vec 2.0 from audio time series (obtained with librosa at 16,000 Hz). The tools for extracting video and audio features can be found here.
The text benchmark is built by fine-tuning BERT. The multimodal intent benchmarks are adapted from three powerful multimodal fusion methods: MAG-BERT, MISA, and MulT. We also asked a separate group of two annotators to provide a human evaluation. The detailed results can be found here.
You can download the full data from Google Drive or BaiduYun Disk (code:95lo)
Dataset Description:
Contents | Description |
---|---|
audio_data/audio_feats.pkl | This directory includes the "audio_feats.pkl" file. It contains audio feature tensors in each video segment. |
video_data/video_feats.pkl | This directory includes the "video_feats.pkl" file. It contains video feature tensors for all keyframes in each video segment. |
train.tsv / dev.tsv / test.tsv | These files contain (1) video segment indexes (season, episode, clip) (2) clean text utterances (3) multimodal annotations (among 20 intent classes) for training, validation, and testing. |
raw_data | It contains the original video segment files in .mp4 format. Among directory names, "S" means season id, "E" means episode id. |
speaker_annotations | It contains 12,228 keyframes and the corresponding manually annotated bounding box information for speakers. The speaker annotations are obtained by using a pre-trained Faster R-CNN to predict "person" boxes on images and then selecting the speaker's index. |
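The split files can be read with the Python standard library. The sketch below assumes each `.tsv` file starts with a header row; the column names `season`, `episode`, and `clip` and the helper names `read_split` and `segment_id` are assumptions for illustration, so check them against the downloaded files.

```python
import csv
from typing import Dict, Iterable, List, Sequence


def read_split(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse a tab-separated split file whose first row is a header."""
    # QUOTE_NONE keeps quotation marks inside utterances untouched.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


def segment_id(
    row: Dict[str, str],
    keys: Sequence[str] = ("season", "episode", "clip"),  # assumed column names
) -> str:
    """Join the index columns into a single identifier for a video segment."""
    return "_".join(row[k] for k in keys)
```

To use it on a real file, open it with `open("train.tsv", newline="", encoding="utf-8")` and pass the file object to `read_split`.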
-
Use Anaconda to create a Python (version >= 3.6) environment
conda create --name mintrec python=3.6
conda activate mintrec
-
Install PyTorch (Cuda version 11.2)
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
-
Clone the MIntRec repository.
git clone git@github.com:thuiar/MIntRec.git
cd MIntRec
-
Install related environmental dependencies
pip install -r requirements.txt
-
Run examples (taking MAG-BERT as an example; more can be seen here)
sh examples/run_mag_bert.sh
To do: We will provide more details of this framework in the Wiki document.
If you use the dataset, code, or results in this repo, please star this repo and cite the following paper:
@inproceedings{10.1145/3503161.3547906,
author = {Zhang, Hanlei and Xu, Hua and Wang, Xin and Zhou, Qianrui and Zhao, Shaojie and Teng, Jiayan},
title = {MIntRec: A New Dataset for Multimodal Intent Recognition},
year = {2022},
doi = {10.1145/3503161.3547906},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
pages = {1688–1697},
numpages = {10}
}
Some of the code in this repo is adapted from the following repos, and we are very grateful to their authors: MMSA, TalkNet, SyncNet.
If you have any questions, please open an issue and describe your problem in as much detail as possible. If you want to integrate your method into our repo, please contact [email protected] and feel free to open a pull request!