This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Fixes all current issues with FLORES V1 #40

Open
wants to merge 11 commits into base: main

@mnoukhov mnoukhov commented Mar 31, 2022

A working branch of FLORES v1

Tested reproduce.sh neen and got 12.5 BLEU on devtest after 2 iterations of back-translation (BT), compared to the 15.9 reported in the README. I used a single RTX8000, and the full pipeline ran in ~100 hours after I raised max_tokens to 16000 so that gradient accumulation via update_freq was no longer needed.
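The max_tokens / update_freq trade-off mentioned above can be sketched with a little arithmetic. In fairseq, the number of tokens per optimizer step is roughly max_tokens × update_freq × number of GPUs, so a larger max_tokens lets update_freq drop while keeping the same effective batch. The helper name and the example numbers below are illustrative assumptions, not values taken from the FLORES scripts.

```python
import math


def required_update_freq(target_tokens_per_step: int,
                         max_tokens: int,
                         num_gpus: int = 1) -> int:
    """Smallest update_freq such that
    max_tokens * update_freq * num_gpus >= target_tokens_per_step.

    Hypothetical helper for illustration; not part of fairseq or FLORES.
    """
    tokens_per_pass = max_tokens * num_gpus
    return max(1, math.ceil(target_tokens_per_step / tokens_per_pass))


# Assumed example: a 16000-token effective batch on one GPU.
print(required_update_freq(16000, max_tokens=4000))   # 4 accumulation steps
print(required_update_freq(16000, max_tokens=16000))  # 1, no accumulation needed
```

With max_tokens equal to the target, update_freq falls to 1, which is the situation described in the comment.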

Download Issues:

Other issues:

  • fixes old fairseq-train args e.g. min-lr
  • converts old args to omegaconf cfg
  • skips inputs that are greater than 1024 tokens when doing backtranslation
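The args-to-cfg conversion in the second bullet can be illustrated with a small sketch. It assumes a loaded model object that exposes either the newer nested layout (cfg.task.data) or the older flat layout (args.data); `SimpleNamespace` stands in for fairseq's real config objects, and `get_data_path` is a hypothetical helper, not a fairseq function.

```python
from types import SimpleNamespace


def get_data_path(model_cfg):
    """Return the data path from a config-like object.

    Prefers the nested, cfg-style location (task.data) and falls back to
    the flat, args-style location (data). Hypothetical helper.
    """
    task = getattr(model_cfg, "task", None)
    if task is not None and getattr(task, "data", None) is not None:
        return task.data
    return getattr(model_cfg, "data", None)


# Stand-ins for a new-style cfg and an old-style args namespace.
new_style = SimpleNamespace(task=SimpleNamespace(data="data-bin/new"))
old_style = SimpleNamespace(data="data-bin/old")

print(get_data_path(new_style))  # data-bin/new
print(get_data_path(old_style))  # data-bin/old
```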

@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 31, 2022
mnoukhov and others added 6 commits March 31, 2022 15:42
min-lr no longer part of inverse_sqrt_lr_scheduler
update loaded fairseq model to use cfg instead of args
convert corresponding args (data) into cfg (task.data)
some of the monolingual data is > 1024 tokens in length
ignore that data when generating BT
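A minimal sketch of the length filter described in the last commit message: drop monolingual lines longer than 1024 tokens before they reach back-translation. It assumes whitespace tokenization for simplicity; the actual pipeline may count subword (BPE) tokens instead, and the function name is hypothetical.

```python
from typing import Iterable, Iterator

MAX_TOKENS_PER_LINE = 1024  # limit mentioned in the PR description


def skip_long_lines(lines: Iterable[str],
                    max_len: int = MAX_TOKENS_PER_LINE) -> Iterator[str]:
    """Yield only the lines with at most max_len whitespace-separated tokens."""
    for line in lines:
        if len(line.split()) <= max_len:
            yield line


sentences = [
    "a short sentence",
    " ".join(["tok"] * 1024),  # exactly at the limit: kept
    " ".join(["tok"] * 1025),  # one token over the limit: skipped
]
kept = list(skip_long_lines(sentences))
print(len(kept))  # 2
```

Filtering before generation avoids feeding the model inputs beyond its maximum supported length, at the cost of losing a small amount of monolingual data.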
@mnoukhov mnoukhov changed the title from "Fixes all current issues with FLORES V1 download" to "Fixes all current issues with FLORES V1" Apr 8, 2022
Contributor

@guzmanhe guzmanhe left a comment

You might need to rebase to merge

@@ -204,7 +206,7 @@ REMOVE_FILE_PATHS+=( $NE_DICT dictionaries )


# Download test sets
-download_data $DATA/wikipedia_en_ne_si_test_sets.tgz "https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz"
+download_data $DATA/wikipedia_en_ne_si_test_sets.tgz "https://github.com/facebookresearch/flores/raw/main/data/wikipedia_en_ne_si_test_sets.tgz"

this was fixed in a previous PR, it might cause merge conflicts.

@@ -11,7 +11,7 @@ ROOT=$(dirname "$0")
INDICNLP=$ROOT/indic_nlp_library
if [ ! -e $INDICNLP ]; then
echo "Cloning Indic NLP Library..."
-git -C $ROOT clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
+git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git $INDICNLP

already fixed in a previous PR

@mnoukhov
Author

@guzmanhe thanks! I rebased and merged the two previous PRs, so if they are accepted I should have no merge conflicts. Let me know if there are any issues.

5 participants