Full-librispeech training #146
Nice! I think there are two more easy wins: using tglarge for decoding (I think we’re using tgmed currently) and saving checkpoints more frequently than per epoch so we can also benefit from averaging here. |
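For reference, checkpoint averaging can be as simple as averaging the saved parameter tensors. A minimal sketch, assuming the checkpoints are plain state dicts saved at hypothetical paths like exp/epoch-8.pt (not the project's actual averaging code):

```python
import torch

def average_checkpoints(paths):
    """Average floating-point tensors across several saved state dicts;
    non-float tensors (e.g. integer counters) are copied from the first one."""
    states = [torch.load(p, map_location="cpu") for p in paths]
    avg = {}
    for name, tensor in states[0].items():
        if tensor.is_floating_point():
            avg[name] = sum(s[name] for s in states) / len(states)
        else:
            avg[name] = tensor.clone()
    return avg

# Hypothetical usage:
# model.load_state_dict(average_checkpoints(
#     ["exp/epoch-8.pt", "exp/epoch-9.pt", "exp/epoch-10.pt"]))
```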
MM yes. Fangjun is already working on LM rescoring with the 4-gram model, for which I am currently working on GPU intersection code that works for that case. |
The data augmentation setup probably needs some tuning. I ran the full libri recipe as-is, and got:
(Also, the average of epochs 2 and 3 yields an improvement: 5.37% and 11.13%.) Then I also ran it without any data augmentation (I dropped the speed-perturbed cuts, removed MUSAN and SpecAug, and increased training to 9 epochs, since each epoch is now 3x smaller, so the network sees the same total number of hours)
Such a small difference doesn't seem right, does it? |
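For reference, dropping the speed-perturbed copies mentioned above could look roughly like this in Lhotse. This is a sketch, not the exact recipe code; the manifest paths are hypothetical, and the "_sp" ID suffix is an assumption about how the perturbed cuts were named:

```python
from lhotse import CutSet

# Hypothetical manifest path; adjust to your setup.
cuts = CutSet.from_json("exp/data/cuts_train-full.json.gz")

# Keep only the original cuts, assuming speed-perturbed copies carry "_sp" in their IDs.
cuts_no_sp = cuts.filter(lambda cut: "_sp" not in cut.id)

cuts_no_sp.to_json("exp/data/cuts_train-full-nosp.json.gz")
```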
It seems plausible to me. It could be that we'd only see improvements after more epochs of training. |
OK... as long as the way the cuts are grouped into minibatches is random, so the total length of the sequence is different (for a given cut) from epoch to epoch, that should have the same effect as randomization.
…On Tue, Apr 13, 2021 at 12:14 AM Piotr Żelasko ***@***.***> wrote:
I think we have already implemented "both sides" padding, which would center the cuts. It'd look something like this (except the cuts would be concatenated first, with data augmentation applied):
[image: image]
<https://user-images.githubusercontent.com/15930688/114426826-73777c00-9b88-11eb-9fc1-046af443f817.png>
Does it make sense?
|
There is also a transform called ExtraPadding that adds a fixed number N of padding frames to the cut (N/2 on each side); I can extend it so that it is randomized. |
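A standalone sketch of what randomized both-sides padding could look like on a feature matrix. This is illustrative NumPy only, not Lhotse's actual ExtraPadding transform; the default pad value is just an example log-energy floor:

```python
import numpy as np

def pad_both_sides_random(feats: np.ndarray, max_extra_frames: int,
                          pad_value: float = -23.0) -> np.ndarray:
    """Add a random number of padding frames (up to max_extra_frames),
    split evenly between the beginning and the end of the features.

    feats has shape (num_frames, num_features)."""
    n = np.random.randint(0, max_extra_frames + 1)
    left, right = n // 2, n - n // 2
    return np.pad(feats, ((left, right), (0, 0)), constant_values=pad_value)
```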
FYI if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4 x learning rate, average last 3 epochs and use default rescoring settings (4-gram LM with lattice beam 8) we get:
|
Wow! |
Actually, I just realized it would be interesting for you to see more info; here's a WER breakdown by epoch, with numbers for different averaging settings and with/without rescoring. You can notice that the WER at epoch 3 is worse than when trained on 1 GPU, which is likely explained by the 4x smaller number of optimizer steps with 4 GPUs (partially, but not fully, counteracted by the linearly increased LR)
|
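The linear LR scaling mentioned above amounts to multiplying the single-GPU base LR by the number of GPUs; a trivial sketch with placeholder numbers, not the recipe's exact values:

```python
# Linear LR scaling heuristic: with ~4x fewer optimizer steps per epoch,
# scale the base learning rate by the number of DDP processes.
base_lr = 1e-3      # hypothetical single-GPU learning rate
world_size = 4      # number of GPUs / DDP processes
scaled_lr = base_lr * world_size
```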
Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly because I still haven't figured out how to fix the logs other than with |
Could you post the results with LM rescoring disabled? It uses the whole lattice for rescoring by default. Passing --use-lm-rescoring=0 on the command line disables LM rescoring. I just want to know what role LM rescoring plays here. |
I think I did it concurrently with your question -- only the last result in my previous message has rescoring turned on. |
Thanks! GitHub didn't show the results when I commented. |
Are there tensorboard logs in your case? Those contain timestamps. |
Good point. It's ~1h 45min per epoch. I think it can still get a bit better if we decay the LR faster; it was still quite high (~1.1e-3) at the end of training. |
@pzelasko Is it possible to share this trained model? I want to do n-best rescoring with transformer LM with it. |
@glynpu sure! You should be able to download it here: https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x LMK if that doesn't work. |
@pzelasko Thanks for sharing! Why is best_model.pt (128 MB) so much smaller than the others (384 MB)? |
I think "best_model" doesn't store the optimizer, scheduler, etc. state dicts needed for resuming training. Also, it is not necessarily the best model, since it's picked based on dev loss rather than WER (and it is not averaged). |
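That would also explain the size difference: an Adam-style optimizer keeps two extra tensors per parameter, so a checkpoint with optimizer state is roughly three copies of the weights. A minimal sketch of the two kinds of checkpoints (the exact keys the recipe saves may differ):

```python
import torch

def save_full(model, optimizer, epoch, path):
    # Weights + optimizer state (e.g. Adam's exp_avg and exp_avg_sq) + metadata:
    # roughly 3x the size of the weights alone.
    torch.save({"state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)

def save_weights_only(model, path):
    # Weights only, for inference or averaging.
    torch.save(model.state_dict(), path)
```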
Thanks. So I should average the models from epochs 8, 9, and 10 to reproduce your best result. |
@pzelasko Could you please check the shared folder? Some models are missing from it; there are only epoch-{0,1,4,5,8}.pt. https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x |
That's weird. Something went wrong when uploading. I'm pushing the missing files, you can expect them to be there in the next hour. |
Thanks!
|
Fantastic -- thanks! Can you try much less weight decay in the transformer setup? I notice 0.001, which IMO is too high. And use the Noam optimizer if you weren't already.
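For reference, a minimal sketch of the Noam schedule (as in "Attention Is All You Need"); the model_dim, warmup, and factor defaults below are placeholders, not this recipe's settings:

```python
# Noam LR schedule sketch: warm up linearly, then decay with the inverse
# square root of the step count.
def noam_lr(step: int, model_dim: int = 256, warmup: int = 25000,
            factor: float = 1.0) -> float:
    step = max(step, 1)
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Can be plugged into torch.optim.lr_scheduler.LambdaLR with the optimizer's
# base LR set to 1.0, so the lambda's value becomes the effective LR:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: noam_lr(s + 1))
```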
…On Fri, Apr 23, 2021 at 3:53 PM LIyong.Guo ***@***.***> wrote:
Thanks!
Got the models; results of transformer LM n-best rescoring are:

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline, no rescoring (Piotr's AM with full librispeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescoring (Piotr's AM with full librispeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescoring | * | * | * | * | 4.18 | 8.54 |
| transformer LM, 16 layers (model size: 72M), max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |
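For context, n-best rescoring with an external LM boils down to re-ranking hypotheses by a combined score; a rough sketch with an assumed lm_scale, not the recipe's exact scoring code:

```python
def rescore_nbest(hyps, am_scores, lm_scores, lm_scale=0.5):
    """Pick the hypothesis with the best combined score.
    Scores are log-probabilities (higher is better); lm_scale is a tunable weight."""
    best = max(range(len(hyps)),
               key=lambda i: am_scores[i] + lm_scale * lm_scores[i])
    return hyps[best]
```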
|
Guys, I ran the current setup on the full librispeech data for 3 epochs and this issue is mostly just an FYI so you can see what I got. I am thinking perhaps we can start doing experiments on full-libri for just 2 or 3 epochs, since it won't take much longer than 10 epochs on the smaller data and the results are (a) quite a bit better, especially on test-other and (b) perhaps more indicative of what we'd get in larger setups.
I think our test-clean errors are probably dominated by language modeling issues, which may explain why the improvement is only 1.5% absolute, vs. 6% absolute on test-other.
Depends what you guys think... we should probably agree on one setup that we can mostly use, for consistency.
Decoding with
python3 mmi_att_transformer_decode.py --epoch=3 --avg=1