This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

WIP: Add BPE training with LF-MMI. #215

Open
csukuangfj wants to merge 3 commits into master

Conversation

@csukuangfj (Collaborator) commented Jun 19, 2021

A small vocab_size, e.g., 200, is used to avoid OOM if the bigram P is used. After removing P, it is possible to use a large vocab size, e.g., 5000.

@glynpu is doing BPE CTC training. We can use his implementation once it's ready.
This pull request is for experimental purposes.


Will add decoding code later.

--

The training is still ongoing. The TensorBoard training log is available at
https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D

A small vocab_size is used to avoid OOM.
fix_random_seed(42)
setup_dist(rank, world_size, args.master_port)

exp_dir = Path('exp-bpe-' + model_type + '-mmi-att-sa-vgg-normlayer')
Collaborator Author


This file is copied from mmi_att_transformer_train.py, with only this line being modified.

Example usage of this script:

python3 ./generate_bpe_tokens.py \
--model-file ./data/lang_bpe/bpe_unigram_500.model > data/lang_bpe/bpe_unigram_500.tokens
Contributor


Is it necessary to write the model out to disk and get the tokens separately?
I'd imagine it would be very fast to train the BPE model (isn't it some very simple LM?) and get the tokens inside the same script... this just feels a bit complicated. But it's OK for now.

Collaborator Author


Is it necessary to write the model out to disk and get the tokens separately?

No, it's not necessary. We can generate all the needed files (tokens.txt, words.txt, lexicon.txt) in one
script. I just feel that splitting them makes things easier to understand.

Yes, it is quite fast to train a BPE model; it takes less than 1 minute for the LibriSpeech dataset.
The model is only several hundred KB, so it does not occupy much disk space.
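
For illustration only, a rough sketch of how such a single script could look with the sentencepiece Python API (this is not the code in this PR; the word-list file and the output paths are made-up placeholders):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('data/lang_bpe/bpe_unigram_500.model')

# tokens.txt: one BPE piece and its integer id per line.
with open('data/lang_bpe/tokens.txt', 'w', encoding='utf-8') as f:
    for i in range(sp.get_piece_size()):
        f.write(f'{sp.id_to_piece(i)} {i}\n')

# words.txt / lexicon.txt: assumes a plain word list (one word per line)
# is available; 'word_list.txt' is a placeholder name.
with open('data/lang_bpe/word_list.txt', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

with open('data/lang_bpe/words.txt', 'w', encoding='utf-8') as f:
    for i, word in enumerate(words):
        f.write(f'{word} {i}\n')

# lexicon.txt: each word mapped to its BPE pieces.
with open('data/lang_bpe/lexicon.txt', 'w', encoding='utf-8') as f:
    for word in words:
        f.write('{} {}\n'.format(word, ' '.join(sp.encode_as_pieces(word))))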

Collaborator Author


Is it necessary to write the model out to disk

According to the documentation and examples given in
https://github.com/google/sentencepiece/blob/master/python/README.md ,
it looks like the model has to be written to disk after training. There is no option to keep the trained model
in memory without writing it to disk first.
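
For reference, the train-then-load workflow from that README looks roughly like the following; the transcript path and the options here are illustrative, not taken from this PR:

import sentencepiece as spm

# Training writes bpe_unigram_500.model (and .vocab) to disk;
# 'transcript.txt' is a placeholder for the training text.
spm.SentencePieceTrainer.train(
    input='data/lang_bpe/transcript.txt',
    model_prefix='data/lang_bpe/bpe_unigram_500',
    model_type='unigram',
    vocab_size=500,
)

# The trained model is then loaded back from disk.
sp = spm.SentencePieceProcessor()
sp.load('data/lang_bpe/bpe_unigram_500.model')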

stage=5

# settings for BPE training -- start
vocab_size=5000
Collaborator Author


@danpovey

I've removed the bigram P and it is possible to do BPE LF-MMI training with vocab_size=5000.

@danpovey
Contributor

BTW, the way I think we can solve the memory-blowup issue is:
(i) use the new, more compact CTC topo
(ii) train a bigram ARPA LM to get a compact LM, e.g. with Kaldi's make_kn_lm.py; load it into k2 as P (no disambig symbols!) and remove epsilons. k2 uses a rm-epsilon algorithm that should keep the epsilon-free LM compact, unlike OpenFst, which would cause it to blow up.
BTW, I am asking some others to add a pruning option to make_kn_lm.py.
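
A minimal sketch of the k2 side of (ii), assuming the bigram ARPA has already been converted to an FSA in k2's text format (the P.fst.txt path and the conversion step are assumptions, not part of this PR):

import k2

# Assumes P.fst.txt holds the bigram LM in k2's text format, produced
# from the ARPA file beforehand (conversion not shown here).
with open('data/lang_bpe/P.fst.txt') as f:
    P = k2.Fsa.from_str(f.read())

# ARPA back-off arcs show up as epsilon (label 0) arcs; k2's rm-epsilon
# is expected to keep the resulting epsilon-free LM compact.
P = k2.remove_epsilon(P)
P = k2.arc_sort(P)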

@csukuangfj
Collaborator Author

(i) use the new, more compact CTC topo

Yes, I am using the new CTC topo.

(ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py;

Will update the code to train a word piece bigram ARPA LM with make_kn_lm.py.
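
Something along these lines, assuming the BPE-tokenized transcripts have already been dumped to a text file; the paths below are placeholders and the flags should be double-checked against make_kn_lm.py itself:

# Train a bigram Kneser-Ney ARPA LM over word pieces.
python3 ./local/make_kn_lm.py \
  -ngram-order 2 \
  -text data/lang_bpe/train_bpe_tokens.txt \
  -lm data/lang_bpe/P.arpa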

ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        word, id = line.strip().split()
Contributor

@glynpu commented Jun 21, 2021


id is a Python built-in function; maybe change it to idx, or just _ since it's not used.
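
That is, the suggested change would look like this (words_txt is the same path as in the snippet above):

ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        word, _ = line.strip().split()  # avoid shadowing the built-in id()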
