Skip to content

Latest commit

 

History

History
120 lines (96 loc) · 8.38 KB

README.md

File metadata and controls

120 lines (96 loc) · 8.38 KB

Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training

The goal of stance detection is to determine the viewpoint expressed in a piece of text towards a target. These viewpoints or contexts are often expressed in many different languages depending on the user and the platform, which can be a local news outlet, a social media platform, a news forum, etc. Most research in stance detection, however, has been limited to working with a single language and on a few limited targets, with little work on cross-lingual stance detection. Moreover, non-English sources of labelled data are often scarce and present additional challenges. Recently, large multilingual language models have substantially improved the performance on many non-English tasks, especially such with limited numbers of examples. This highlights the importance of model pre-training and its ability to learn from few examples. In this paper, we present the most comprehensive study of cross-lingual stance detection to date: we experiment with 15 diverse datasets in 12 languages from 6 language families, and with 6 low-resource evaluation settings each. For our experiments, we build on pattern-exploiting training, proposing the addition of a novel label encoder to simplify the verbalisation procedure. We further propose sentiment-based generation of stance data for pre-training, which shows sizeable improvement of more than 6% F1 absolute in low-shot settings compared to several strong baselines.

Setup

$ python3 -m venv ~/.virtualenvs/stance-detection
$ source ~/.virtualenvs/stance-detection/bin/activate

Updating project dependencies

# And to install the packages
$ pip install -r requirements.txt

Getting the datasets

Our splits

We release our few-shot splits (32, 64, 128, 256) in the data/fewshow folder. Moreover, we release the sentiment annotated Wiki snipets in the data/wikipedia folder. The full training, dev and test sets can be obtained from the links below.

Multilingual

  • Stance Prediction and Claim Verification: An Arabic Perspective Data (ans)
  • Integrating Stance Detection and Fact Checking in a Unified Corpus Data (arabicfc)
  • Detecting Stance in Czech News Commentaries Data (czech)
  • Stance Evolution and Twitter Interactions in an Italian Political Debate Data (conref)
  • (Danish) Joint Rumour Stance and Veracity Prediction Data (dast)
  • Multilingual stance detection in social media political debates Data (e-fra, r-ita)
  • An English-Hindi Code-Mixed Corpus: Stance Annotation and Baseline System Data (hindi)
  • Overview of NLPCC Shared Task 4: Stance Detection in Chinese Microblogs Data (nlpcc)
  • Stance and Gender Detection in Tweets on Catalan Independence@Ibereval 2017 Data (iberval)
  • Stance Prediction for Russian: Data and Analysis Data (rustance)
  • SardiStance @ EVALITA2020 Data (sardistance)
  • X-Stance: A Multilingual Multi-Target Dataset for Stance Detection Data

Some datasets may require additional steps to aquire, e.g., in order to obtain SardiStance you need to fill out a form, IberEval's test sets need to be obtained from the competition organizers.

English

  • Stance Detection Benchmark Data (arc argmin fnc1 iac1 ibmcs perspectrum scd semeval2016t6 semeval2019t7 snopes)
  • Will-They-Won't-They Data (wtwt)
  • Emergent Data (emergent)
  • Rumor has it Data (rumor)
  • Multi-Target Stance Dataset Data (mtsd)
  • Political Debates Data (poldeb)
  • VAried Stance Topics Data (vast)

We used the data splits as described in Cross-Domain Label-Adaptive Stance Detection (code).

Running the models

DATASETS=(arc argmin fnc1 iac1 ibmcs perspectrum scd semeval2016t6 semeval2019t7 snopes emergent mtsd poldeb rumor vast wtwt)
CROSS_LINGUAL_DATASETS=(conref-ita arabicfc ans nlpcc czech dast e-fra hindi iberval2017-ca iberval2017-es r-ita rustance sardistance xstance-de xstance-fr) 

python src/stancedetection/models/trainer_le.py --data_dir "data/all/" \
                                      --model_name_or_path ${MODEL_NAME} \
                                      --output_dir ${OUTPUT_DIR} \
                                      --task_names ${DATASET_NAME} \
                                      --model_type xlm-r \
                                      --replace_classification \
                                      --do_train \
                                      --do_eval \
                                      --learning_rate ${LEARNING_RATE} \
                                      --weight_decay 0.01 \
                                      --per_gpu_train_batch_size 16 \
                                      --per_gpu_eval_batch_size 128 \
                                      --num_train_epochs 50000 \
                                      --warmup_proportion ${WARMUP} \
                                      --adam_epsilon 1e-08 \
                                      --logging_steps 200 \
                                      --max_steps ${MAX_STEPS} \
                                      --max_seq_length ${MAX_SEQ_LEN} \
                                      --evaluate_during_training \
                                      --gradient_accumulation_steps 1 \
                                      --seed ${SEED} \
                                      --dataset_suffix "_${SHOTS}_${i}" \
                                      --fp16 \
                                      --cache_dir cache \
                                      --balanced \
                                      --lambda_mlm ${LAMBDA_MLM} \
                                      --positive_samples_synonyms ${POSITIVE_SAMPLES_SYNONYMS} \
                                      --negative_samples_synonyms ${NEGATIVE_SAMPLES_SYNONYMS} \
                                      --negative_samples_rand ${NEGATIVE_SAMPLES_RAND} \
                                      --p_replace_pos_label ${P_REPLACE_POS_LABEL} \
                                      --p_replace_neg_label ${P_REPLACE_NEG_LABEL} \
                                      --p_mask ${P_MASK} \
                                      --p_random ${P_RANDOM} \
                                      --p_delete 0.0 \
                                      --p_split 0.0 \
                                      --p_swap 0.0 \
                                      --p_label_cond 0.0 \
                                      --overwrite_output_dir

References

Please cite as [1]. There is also an arXiv version.

[1] Hardalov, M., Arora, A., Nakov, P., & Augenstein, I. (2022). "Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training", Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22).

@article{hardalov-etal-2022-fewshot,
	title        = {Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training},
	author       = {Hardalov, Momchil and Arora, Arnav and Nakov, Preslav and Augenstein, Isabelle},
	year         = 2022,
	month        = {Feb},
	journal      = {Proceedings of the AAAI Conference on Artificial Intelligence},
	volume       = 36
}

License

The code in this repository is licenced under the CC-BY-NC-SA 4.0. The datasets are licensed under CC-BY-SA 4.0.