This is the official code for the NAACL 2021 paper: *MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories*.
The slides can be found here.
Dataset | #Tokens | %Metaphor | #Sequences | Avg. seq. length |
---|---|---|---|---|
VUA-18 (train) | 116,622 | 11.2 | 6,323 | 18.4 |
VUA-18 (dev) | 38,628 | 11.6 | 1,550 | 24.9 |
VUA-18 (test) | 50,175 | 12.4 | 2,694 | 18.6 |
VUA-20 (train) | 160,154 | 12.0 | 12,109 | 15 |
VUA-20 (test) | 22,196 | 17.9 | 3,698 | 15.5 |
VUA-VERB (test) | 5,873 | 30 | 2,694 | 18.6 |
MOH-X | 647 | 48.7 | 647 | 8 |
TroFi | 3,737 | 43.5 | 3,737 | 28.3 |
We use four well-known public English datasets. The VU Amsterdam Metaphor Corpus (VUA) was released for the metaphor detection shared tasks in 2018 and 2020. We use two versions of the VUA dataset, called VUA-18 and VUA-20, where VUA-20 is an extension of VUA-18. Both VUA-18 and VUA-20 are split into training, validation, and test sets. VUA-20 includes VUA-18, and VUA-Verb (test) is a subset of both VUA-18 (test) and VUA-20 (test). We also use VUA datasets categorized by POS tags (verb, noun, adjective, and adverb) and genres (news, academic, fiction, and conversation).
We employ MOH-X and TroFi for testing only.
You can get the datasets from the following link.
The datasets are TSV-formatted files with the following format.
```
index label sentence POS w_index
a3m-fragment02 45 0 Design: Crossed lines over the toytown tram: City transport could soon be back on the right track, says Jonathan Glancey NOUN 0
a3m-fragment02 45 1 Design: Crossed lines over the toytown tram: City transport could soon be back on the right track, says Jonathan Glancey ADJ 1
a3m-fragment02 45 1 Design: Crossed lines over the toytown tram: City transport could soon be back on the right track, says Jonathan Glancey NOUN 2
```
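As a minimal sketch of reading such a file, the following Python snippet assumes the tab-separated column names from the header line above (`index`, `label`, `sentence`, `POS`, `w_index`) and a hypothetical path `VUA18/train.tsv`; the actual released files may differ slightly.

```python
import csv

# Minimal sketch: load one of the TSV files and report the label distribution.
# The path and column names below are assumptions based on the header shown above.
with open("VUA18/train.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)

# Assumes the label column uses "1" for metaphorical and "0" for literal tokens.
n_metaphor = sum(1 for r in rows if r["label"] == "1")
print(f"{len(rows)} examples, {100.0 * n_metaphor / max(len(rows), 1):.1f}% metaphorical")
```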
You can also get the original datasets from the following links:
- Change the experimental settings in `main_config.cfg`.
- Run `main.py` to train and test models.
- Command-line arguments with the same names as in the configuration file are also accepted (see the example below).
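For instance, a run that overrides a few settings on the command line might look like the following; `--model_type` and `--bert_model` appear later in this README, while the exact spelling of `--task_name` as a flag is an assumption based on the configuration naming:

```
# Illustrative invocation; --task_name is assumed to mirror the config option of the same name.
python main.py --model_type MELBERT --bert_model roberta-base --task_name vua
```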
- You can download the model checkpoint trained on the VUA-18 dataset from the link.
- Train MelBERT with a specific Hugging Face transformer model:

  ```
  python main.py --model_type MELBERT --bert_model roberta-base
  ```
- Test MelBERT by passing the path of the saved model file:

  ```
  python main.py --model_type MELBERT --bert_model {path of saves file}
  ```
- Using RoBERTa, MelBERT achieves about 78.5 and 75.7 F1 scores on the VUA-18 and VUA-Verb sets, respectively. Using model bagging techniques, we get about 79.8 and 77.1 F1 scores on the VUA-18 and VUA-Verb sets, respectively.
- The argument `task_name` indicates the name of the task: 'vua' for the VUA datasets and 'trofi' for the TroFi and MOH-X datasets. If `task_name` is 'trofi', K-fold is applied for both training and evaluation.
- The pretrained transformer model can be specified with the argument `bert_model`. Tokenizer behavior may differ across models, so be careful; this work is currently based on the RoBERTa-base model.
- The type of model can be specified with the argument `model_type`. The available types are listed below (illustrative commands follow the table).

  | Models (paper) | `model_type` |
  |---|---|
  | RoBERTa_BASE | BERT_BASE |
  | RoBERTa_SEQ | BERT_SEQ |
  | MelBERT | MELBERT |
  | MelBERT_MIP | MELBERT_MIP |
  | MelBERT_SPV | MELBERT_SPV |
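As an illustration of combining these arguments, the ablated variants or the K-fold TroFi/MOH-X setting could be run along the following lines; the flag spellings are assumed to mirror the option names above rather than taken verbatim from the repository:

```
# SPV-only ablation on the VUA task (flag names assumed from the option names above)
python main.py --model_type MELBERT_SPV --bert_model roberta-base --task_name vua

# MelBERT on TroFi, where K-fold training and evaluation is applied
python main.py --model_type MELBERT --bert_model roberta-base --task_name trofi
```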
```
python==3.7
pytorch==1.6
transformers==4.2.2
```
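A possible way to install matching dependencies, assuming a Python 3.7 interpreter is already available (the PyPI package names and exact versions below are mapped from the list above and may need adjustment, e.g. for a specific CUDA build of PyTorch):

```
# Assumed mapping of the listed requirements to PyPI packages.
pip install torch==1.6.0 transformers==4.2.2
```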
Please cite our paper:
@inproceedings{DBLP:conf/naacl/ChoiLCPLLL21,
author = {Minjin Choi and
Sunkyung Lee and
Eunseong Choi and
Heesoo Park and
Junhyuk Lee and
Dongwon Lee and
Jongwuk Lee},
title = {MelBERT: Metaphor Detection via Contextualized Late Interaction using
Metaphorical Identification Theories},
booktitle = {Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
{NAACL-HLT} 2021, Online, June 6-11, 2021},
pages = {1763--1773},
publisher = {Association for Computational Linguistics},
year = {2021},
url = {https://www.aclweb.org/anthology/2021.naacl-main.141/},
}