Code and scripts for the ACL 2024 Findings paper "Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features".
The code is based on the open-source toolkit fairseq. Our model code `transformer_disentangler_and_linguistic_encoder.py` is in `fairseq/fairseq/models`, and our criterion code `label_smoothed_cross_entropy_with_disentangling.py` is in `fairseq/fairseq/criterions`.
- Python version == 3.9.12
- PyTorch version == 1.12.1
- Install fairseq:

```bash
git clone https://github.com/ictnlp/SemLing-MNMT.git
cd SemLing-MNMT
pip install --editable ./
```
We use the SentencePiece toolkit to pre-process the IWSLT2017, OPUS-7, and PC-6 datasets. For each dataset, we apply the unigram model algorithm for tokenization and learn a joint vocabulary of 32K tokens.
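For illustration, learning such a joint 32K unigram vocabulary with the SentencePiece command-line tools might look like the sketch below; the input and output file names are placeholders, not files from this repository.

```bash
# Learn a joint unigram-model vocabulary of 32K tokens over the combined
# training data of all languages. File names here are placeholders.
spm_train --input=all_languages.train \
          --model_prefix=joint_unigram \
          --vocab_size=32000 \
          --model_type=unigram

# Tokenize a raw text file with the learned model.
spm_encode --model=joint_unigram.model --output_format=piece \
           < train.raw.en > train.spm.en
```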
We provide the training and inference scripts for IWSLT2017 in the folder `scripts` as examples. Add your paths to the scripts and run them.
Here are some explanations:
- In `train.sh`, `--disentangler-lambda`, `--disentangler-reconstruction-lambda`, and `--disentangler-negative-lambda` are hyperparameters corresponding to $\lambda$, $\lambda_1$, and $\lambda_2$ in our paper, and `--linguistic-encoder-layers` controls the number of layers of the linguistic encoder (see the training sketch after this list).
- In `generate.sh` and `generate_zero_shot.sh`, we generate translations and compute BLEU scores with SacreBLEU (version == 1.5.1); see the scoring sketch after this list.
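For orientation, a minimal `fairseq-train` invocation with these flags might look like the sketch below. The `--arch` and `--criterion` names are inferred from the file names above and may differ from the registered names, and the data path, flag values, and optimizer settings are placeholder assumptions; `scripts/train.sh` is authoritative.

```bash
# Hypothetical sketch; see scripts/train.sh for the actual command and values.
# --disentangler-lambda, --disentangler-reconstruction-lambda, and
# --disentangler-negative-lambda correspond to lambda, lambda_1, and lambda_2
# in the paper; all values below are placeholders.
fairseq-train data-bin/iwslt2017 \
    --arch transformer_disentangler_and_linguistic_encoder \
    --criterion label_smoothed_cross_entropy_with_disentangling \
    --disentangler-lambda 1.0 \
    --disentangler-reconstruction-lambda 1.0 \
    --disentangler-negative-lambda 1.0 \
    --linguistic-encoder-layers 3 \
    --optimizer adam --lr 5e-4 --max-tokens 4096
```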
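Similarly, a hedged sketch of generation and scoring: `fairseq-generate` and `sacrebleu` are standard tools, but the paths and flags below are assumptions, and any multilingual task flags are omitted; use `generate.sh` and `generate_zero_shot.sh` for the real setup.

```bash
# Hypothetical sketch; see scripts/generate.sh for the actual command.
fairseq-generate data-bin/iwslt2017 \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --remove-bpe sentencepiece > gen.out

# fairseq prefixes hypotheses with "H-"; extract them in source order
# and score against the plain-text reference with SacreBLEU.
grep ^H gen.out | LC_ALL=C sort -V | cut -f3 | sacrebleu reference.txt
```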