WIP: Add BPE training with LF-MMI. #215
base: master
Conversation
A small vocab_size is used to avoid OOM.
fix_random_seed(42)
setup_dist(rank, world_size, args.master_port)

exp_dir = Path('exp-bpe-' + model_type + '-mmi-att-sa-vgg-normlayer')
This file is copied from mmi_att_transformer_train.py, with only this line being modified.
Example usage of this script:

python3 ./generate_bpe_tokens.py \
  --model-file ./data/lang_bpe/bpe_unigram_500.model > data/lang_bpe/bpe_unigram_500.tokens
Is it necessary to write the model out to disk and get the tokens separately?
I'd imagine it would be very fast to train the BPE model (isn't it some very simple LM?) and to get the tokens inside the same script; this just feels a bit complicated. But it's OK for now.
Is it necessary to write the model out to disk and get the tokens separately?
No, it's not necessary. We can generate all the needed files (tokens.txt, words.txt, lexicon.txt) in one script. I just feel that splitting them is easier to understand.
Yes, it is quite fast to train a BPE model: less than 1 minute for the LibriSpeech dataset. The model is only several hundred KB, so it does not take up much disk space.
Is it necessary to write the model out to disk
According to the documentation and examples in
https://github.com/google/sentencepiece/blob/master/python/README.md ,
it looks like the model has to be written to disk after training. There is no option to keep an in-memory representation of the model without writing it to disk.
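For concreteness, here is a minimal sketch of what doing both steps in one script could look like with the sentencepiece Python API; the transcript path and output file names are illustrative assumptions, not the ones used in this PR:

```python
# Hedged sketch (not the PR's generate_bpe_tokens.py): train the BPE model
# and dump its tokens in one pass. File paths are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data/lang_bpe/transcript.txt',          # one transcript per line (assumed path)
    model_prefix='data/lang_bpe/bpe_unigram_500',  # the trainer writes .model and .vocab to disk
    vocab_size=500,
    model_type='unigram',
)

# Reload the model that was just written to disk and enumerate its pieces.
sp = spm.SentencePieceProcessor()
sp.load('data/lang_bpe/bpe_unigram_500.model')

with open('data/lang_bpe/bpe_unigram_500.tokens', 'w') as f:
    for i in range(sp.get_piece_size()):
        f.write(f'{sp.id_to_piece(i)}\n')
```

Note that model_prefix still makes the trainer write the model to disk, which is consistent with the observation above.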
stage=5

# settings for BPE training -- start
vocab_size=5000
I've removed the bigram P and it is possible to do BPE LF-MMI training with vocab_size=5000.
BTW, the way I think we can solve the memory-blowup issue is:
Yes, I am using the new CTC topo.
Will update the code to train a word piece bigram ARPA LM with
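For context, a hedged sketch of what a CTC topology over the BPE vocabulary looks like, using the k2.ctc_topo helper available in recent k2 releases; the PR's "new CTC topo" may be constructed differently, and which variant it uses is not stated here:

```python
# Hedged sketch, assuming a recent k2 release that provides k2.ctc_topo.
import k2

vocab_size = 5000           # number of BPE pieces; blank is assumed to be id 0
max_token = vocab_size - 1  # largest token id

# Standard CTC topology: blank self-loops plus repeat-collapsing per token.
standard_topo = k2.ctc_topo(max_token, modified=False)

# Modified topology: far fewer arcs, which helps with memory at large vocab sizes.
modified_topo = k2.ctc_topo(max_token, modified=True)

print(standard_topo.num_arcs, modified_topo.num_arcs)
```

The standard topology grows roughly quadratically in the vocabulary size, which is one reason a large vocab_size becomes expensive once extra graphs such as the bigram P are composed in.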
ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        word, id = line.strip().split()
id is a Python built-in function; maybe change it to idx, or just _ since it's not used.
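Concretely, the suggestion amounts to something like the following; only the unpacking line changes, and the rest of the loop body is elided as in the diff above:

```python
ans = []
with open(words_txt, 'r', encoding='latin-1') as f:
    for line in f:
        # Use `_` for the unused ID column so the built-in `id` is not shadowed.
        word, _ = line.strip().split()
        ...
```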
A small vocab_size, e.g., 200, is used to avoid OOM if the bigram P is used. After removing P, it is possible to use a large vocab size, e.g., 5000.
@glynpu is doing BPE CTC training. We can use his implementation once it's ready.
This pull request is for experimental purposes.
Will add decoding code later.
--
The training is still ongoing. The tensorboard training log is available at
https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D