Counterfactual Generative Smoothing for Imbalanced Natural Language Classification

This repository contains code for the paper "Counterfactual Generative Smoothing for Imbalanced Natural Language Classification" by Hojae Han, Seungtaek Choi, Myeongho Jeong, Jin-woo Park, and Seung-won Hwang.

Setup

$ pip install -r requirements.txt

Pre-training Cond-BART (ours: varying mask ratio)

Pre-processing

$ ./run_data_generation_gmodel.sh
$ cd revised_libs/fairseq
$ ./data_processing.sh

Example 1) Pre-train on SNIPS-step

$ fairseq-train SNIPS-step-GEN-bin \
    --checkpoint-suffix _SNIPS_step_our --dataset SNIPS --data_setting step \
    --restore-file /workspace/Imbalanced/nlp/data/model/bart.large/model.pt \
    --max-tokens 512 --task denoising --arch bart_large \
    --layernorm-embedding --share-all-embeddings --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr 5e-05 \
    --total-num-update 20000 --warmup-updates 500 --update-freq 4 \
    --skip-invalid-size-inputs-valid-test --replace-length 1 --find-unused-parameters \
    --rotate 0.0 --sample-break-mode 'eos' \
    --min_mask 0.2 --max_mask 1.0 --mask-random 0.0 --mask-length 'word' --poisson-lambda 0.0 \
    --valid-subset 'valid' --memory-efficient-fp16 --save-interval 20 --max-epoch 100
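
The `--min_mask 0.2 --max_mask 1.0 --mask-length 'word'` flags suggest that each training sentence is corrupted with a mask ratio drawn from a range rather than a fixed value. The sketch below illustrates that idea in plain Python. It assumes the ratio is sampled uniformly once per sentence and that masking operates on whole words; neither detail is confirmed by this README. `mask_words` and the `<mask>` placeholder are hypothetical names, not part of the modified fairseq in `revised_libs`.

```python
import random

MASK = "<mask>"  # placeholder; the real vocabulary symbol may differ


def mask_words(sentence, min_mask=0.2, max_mask=1.0, rng=None):
    """Mask a randomly sized subset of whole words in `sentence`.

    Assumption: the ratio is drawn uniformly from [min_mask, max_mask]
    once per sentence. This only illustrates the idea of a varying ratio.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence, 0.0
    ratio = rng.uniform(min_mask, max_mask)
    # Mask at least one word, and never more words than the sentence has.
    n_mask = min(len(words), max(1, round(ratio * len(words))))
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    masked = [MASK if i in masked_positions else w for i, w in enumerate(words)]
    return " ".join(masked), ratio
```

Calling `mask_words("play some jazz music on spotify")` returns the corrupted sentence together with the sampled ratio, so low ratios leave most of the context intact and a ratio of 1.0 masks every word.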

Data generation

Example 1) Augment SNIPS-step with CGS_d:

$ python translation.py --gpu --device 0 --dataset SNIPS --data_setting step --cmodel our --gmodel our --imbalanced_ratio 100 --source_selection cluster --use_token_importance --random_seed 7777
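
The `--data_setting` values `step` and `longtail`, together with `--imbalanced_ratio 100`, use terminology that is common in the class-imbalance literature. As a rough guide, the sketch below computes per-class training sizes under the usual definitions: exponential decay from the largest to the smallest class for long-tail, and a half/half split for step. Whether this repository builds its splits exactly this way is an assumption, and `class_sizes` is a hypothetical helper.

```python
def class_sizes(n_max, num_classes, imbalanced_ratio=100, data_setting="longtail"):
    """Return per-class sample counts, largest class first.

    Assumed conventions (not verified against this repository):
      longtail: sizes decay exponentially from n_max to n_max / ratio
      step:     the first half of the classes keep n_max,
                the remaining classes get n_max / ratio
    """
    if num_classes < 2:
        return [n_max] * num_classes
    if data_setting == "longtail":
        return [
            max(1, int(n_max * (1.0 / imbalanced_ratio) ** (i / (num_classes - 1))))
            for i in range(num_classes)
        ]
    if data_setting == "step":
        n_min = max(1, n_max // imbalanced_ratio)
        half = num_classes // 2
        return [n_max] * half + [n_min] * (num_classes - half)
    raise ValueError(f"unknown data_setting: {data_setting}")
```

Under these assumptions, `class_sizes(1000, 6, 100, "step")` gives three classes of 1000 examples and three of 10, which shows how little data the minority classes have before augmentation.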

Training Text Classification

$ ./train_text_classification.sh [dataset] [data_setting] [cmodel] [gmodel]

or

$ python train.py [with custom arguments]

Example 1) Train on TREC-longtail augmented by CGS_d:

$ ./train_text_classification.sh TREC longtail our our

Example 2) Train on TREC-step augmented by CGS_f:

$ python train.py --num_of_epoch 50 --gpu --device 0 --TMix True --dataset TREC --data_setting step --train_bert --imbalanced_ratio 100 --random_seed 7777

Example 3) Train on ATIS augmented by Cond-BART:

$ python train.py --num_of_epoch 100 --gpu --device 0 --dataset ATIS --data_setting ATIS --data_augment --train_bert --gmodel bart --imbalanced_ratio 100 --random_seed 7777

Example 4) Train on SNIPS-step augmented by LAMBADA:

$ ./train_text_classification.sh SNIPS step standard lambada

Example 5) Train on TREC-step without augmentation:

$ python train.py --num_of_epoch 100 --gpu --device 0 --dataset TREC --data_setting step --train_bert --imbalanced_ratio 100 --random_seed 7777
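
The `train.py` examples above share most of their flags, so sweeps over datasets or seeds can be scripted. The following sketch assembles the argument list from flags that appear in this README only. `build_train_command` is a hypothetical helper, and it assumes that `--data_augment` and `--gmodel` are passed together, as in Example 3. The extra seeds are placeholders, not values used in the paper.

```python
import shlex


def build_train_command(dataset, data_setting, seed=7777, gmodel=None,
                        epochs=100, imbalanced_ratio=100, device=0):
    """Build the argv list for train.py using only flags shown in this README."""
    cmd = [
        "python", "train.py",
        "--num_of_epoch", str(epochs),
        "--gpu", "--device", str(device),
        "--dataset", dataset,
        "--data_setting", data_setting,
        "--train_bert",
        "--imbalanced_ratio", str(imbalanced_ratio),
        "--random_seed", str(seed),
    ]
    if gmodel is not None:
        # Assumption: augmentation is enabled by pairing these two flags.
        cmd += ["--data_augment", "--gmodel", gmodel]
    return cmd


# Dry run: print the commands for a hypothetical three-seed sweep
# of the unaugmented TREC-step baseline instead of executing them.
for seed in (7777, 7778, 7779):
    print(shlex.join(build_train_command("TREC", "step", seed=seed)))
```

Printing the commands first makes it easy to check them against the examples before passing each list to `subprocess.run`.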

Citation

@inproceedings{han2021counterfactual,
  title={Counterfactual Generative Smoothing for Imbalanced Natural Language Classification},
  author={Han, Hojae and Choi, Seungtaek and Jeong, Myeongho and Park, Jin-woo and Hwang, Seung-won},
  booktitle={Proceedings of the 30th ACM International Conference on Information \& Knowledge Management},
  year={2021}
}
