WIP: Add BPE training with LF-MMI. #215
base: master
Conversation
A small vocab_size is used to avoid OOM.
fix_random_seed(42)
setup_dist(rank, world_size, args.master_port)

exp_dir = Path('exp-bpe-' + model_type + '-mmi-att-sa-vgg-normlayer')
This file is copied from mmi_att_transformer_train.py, with only this line being modified.
Example usage of this script:

python3 ./generate_bpe_tokens.py \
  --model-file ./data/lang_bpe/bpe_unigram_500.model > data/lang_bpe/bpe_unigram_500.tokens
Is it necessary to write the model out to disk and get the tokens separately?
I'd imagine it would be very fast to train the BPE model (isn't it some very simple LM?) and to get the tokens inside the same script; this just feels a bit complicated. But it's OK for now.
Is it necessary to write the model out to disk and get the tokens separately?
No, it's not necessary. We can generate all the needed files (tokens.txt, words.txt, lexicon.txt) in one script. I just feel that splitting them is easier to understand.
Yes, it is quite fast to train a BPE model: less than 1 minute for the LibriSpeech dataset. The model is only several hundred KB, so it does not take up much disk space.
Is it necessary to write the model out to disk
According to the documentation and examples in
https://github.com/google/sentencepiece/blob/master/python/README.md ,
it looks like the model has to be written to disk after training. There is no option to keep an in-memory representation of the model without writing it to disk.
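For concreteness, here is a minimal sketch of what doing both steps in one script could look like with the sentencepiece Python API; the transcript path and output file names are illustrative assumptions, not the ones used in this PR:

```python
# Hedged sketch (not the PR's generate_bpe_tokens.py): train the BPE model
# and dump its tokens in one pass. File paths are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data/lang_bpe/transcript.txt',          # one transcript per line (assumed path)
    model_prefix='data/lang_bpe/bpe_unigram_500',  # the trainer writes .model and .vocab to disk
    vocab_size=500,
    model_type='unigram',
)

# Reload the model that was just written to disk and enumerate its pieces.
sp = spm.SentencePieceProcessor()
sp.load('data/lang_bpe/bpe_unigram_500.model')

with open('data/lang_bpe/bpe_unigram_500.tokens', 'w') as f:
    for i in range(sp.get_piece_size()):
        f.write(f'{sp.id_to_piece(i)}\n')
```

Note that model_prefix still makes the trainer write the model to disk, which is consistent with the observation above.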
stage=5

# settings for BPE training -- start
vocab_size=5000
I've removed the bigram P and it is possible to do BPE LF-MMI training with vocab_size=5000.
BTW, the way I think we can solve the memory-blowup issue is:
Yes, I am using the new CTC topo.
Will update the code to train a word piece bigram ARPA LM with
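For context, a hedged sketch of what a CTC topology over the BPE vocabulary looks like, using the k2.ctc_topo helper available in recent k2 releases; the PR's "new CTC topo" may be constructed differently, and which variant it uses is not stated here:

```python
# Hedged sketch, assuming a recent k2 release that provides k2.ctc_topo.
import k2

vocab_size = 5000           # number of BPE pieces; blank is assumed to be id 0
max_token = vocab_size - 1  # largest token id

# Standard CTC topology: blank self-loops plus repeat-collapsing per token.
standard_topo = k2.ctc_topo(max_token, modified=False)

# Modified topology: far fewer arcs, which helps with memory at large vocab sizes.
modified_topo = k2.ctc_topo(max_token, modified=True)

print(standard_topo.num_arcs, modified_topo.num_arcs)
```

The standard topology grows roughly quadratically in the vocabulary size, which is one reason a large vocab_size becomes expensive once extra graphs such as the bigram P are composed in.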
ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        word, id = line.strip().split()
id is a Python built-in function; maybe change it to idx, or just _ since it's not used.
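Concretely, the suggestion amounts to something like the following; only the unpacking line changes, and the rest of the loop body is elided as in the diff above:

```python
ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        # Use `_` for the unused ID column so the built-in `id` is not shadowed.
        word, _ = line.strip().split()
        ...
```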
A small vocab_size, e.g., 200, is used to avoid OOM if the bigram P is used. After removing P, it is possible to use a large vocab size, e.g., 5000.
@glynpu is doing BPE CTC training. We can use his implementation once it's ready.
This pull request is for experimental purposes.
Will add decoding code later.
--
The training is still ongoing. The tensorboard training log is available at
https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D