This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Full-librispeech training #146

Open · danpovey opened this issue Apr 3, 2021 · 27 comments


danpovey commented Apr 3, 2021

Guys, I ran the current setup on the full LibriSpeech data for 3 epochs; this issue is mostly just an FYI so you can see what I got. I am thinking perhaps we can start doing experiments on full-libri for just 2 or 3 epochs, since it won't take much longer than 10 epochs on the smaller data, and the results are (a) quite a bit better, especially on test-other, and (b) perhaps more indicative of what we'd get in larger setups.

I think our test-clean errors are probably dominated by language modeling issues, which may explain why the improvement is only 1.5% absolute, vs. 6% absolute on test-other.

It depends on what you guys think... we should probably agree on one setup that we can mostly use, for consistency.

2021-04-02 09:06:11,165 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-0.pt: epoch=0, learning_rate=0, objf=0.35558393759552065, valid_objf=0.15738312575246463
2021-04-02 19:14:51,307 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-1.pt: epoch=1, learning_rate=0.00042936316818072904, objf=0.18261609444931104, valid_objf=0.14119687143085302
2021-04-03 05:22:19,287 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-2.pt: epoch=2, learning_rate=0.0003036056078123336, objf=0.16412144019593852, valid_objf=0.13489163773498417

Decoding with python3 mmi_att_transformer_decode.py --epoch=3 --avg=1

2021-04-03 14:07:21,184 INFO [common.py:356] [test-clean] %WER 5.53% [2910 / 52576, 301 ins, 287 del, 2322 sub ]
2021-04-03 14:09:29,506 INFO [common.py:356] [test-other] %WER 11.79% [6170 / 52343, 636 ins, 600 del, 4934 sub ]

pzelasko commented Apr 3, 2021

Nice! I think there are two more easy wins: using tglarge for decoding (I think we’re using tgmed currently) and saving checkpoints more frequently than per epoch so we can also benefit from averaging here.
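For reference, checkpoint averaging here just means an element-wise average of the saved models' parameters. A minimal PyTorch sketch, assuming each checkpoint stores the weights under a "state_dict" key and using illustrative file names:

```python
import torch

def average_checkpoints(paths, out_path):
    """Element-wise average of the model parameters in several checkpoints."""
    avg = None
    for path in paths:
        # Assumption: each checkpoint stores the model weights under "state_dict".
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(paths)
    torch.save({"state_dict": avg}, out_path)

# Illustrative usage: average the last three epoch checkpoints.
average_checkpoints([f"exp/epoch-{i}.pt" for i in (7, 8, 9)], "exp/averaged.pt")
```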


danpovey commented Apr 3, 2021

Mm, yes. Fangjun is already working on LM rescoring with the 4-gram model, and I am currently working on GPU intersection code that handles that case.
Perhaps there could be a checkpoints-per-epoch parameter?
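A checkpoints-per-epoch option could be as simple as saving every `len(train_loader) // N` batches. A rough, self-contained sketch with a dummy model and data, not the actual snowfall training loop:

```python
import os
import torch
import torch.nn as nn

os.makedirs("exp", exist_ok=True)
model = nn.Linear(10, 10)                                # stand-in for the acoustic model
optimizer = torch.optim.Adam(model.parameters())
train_loader = [torch.randn(8, 10) for _ in range(100)]  # stand-in for the real data loader

checkpoints_per_epoch = 4                                # hypothetical --checkpoints-per-epoch value
save_every = max(1, len(train_loader) // checkpoints_per_epoch)

for epoch in range(2):
    for batch_idx, batch in enumerate(train_loader):
        loss = model(batch).pow(2).mean()                # dummy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % save_every == 0:
            torch.save({"state_dict": model.state_dict()},
                       f"exp/epoch-{epoch}-batch-{batch_idx + 1}.pt")
```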


pzelasko commented Apr 9, 2021

The data augmentation setup probably needs some tuning. I ran the full libri recipe as-is, and got:

Epoch 3:
2021-04-07 11:42:46,500 INFO [common.py:357] [test-clean] %WER 5.53% [2909 / 52576, 301 ins, 262 del, 2346 sub ]
2021-04-07 11:44:30,129 INFO [common.py:357] [test-other] %WER 11.60% [6074 / 52343, 640 ins, 615 del, 4819 sub ]

(also, averaging epochs 2 and 3 yields an improvement: 5.37% and 11.13%)

Then I also ran it without any data augmentation: I dropped the speed-perturbed cuts, removed MUSAN and SpecAug, and increased training to 9 epochs, since each epoch is now 3x smaller, so the network sees the same number of hours.

Libri 960h no aug

Epoch 1:
2021-04-07 18:11:28,655 INFO [common.py:357] [test-clean] %WER 8.68% [4564 / 52576, 493 ins, 514 del, 3557 sub ]
2021-04-07 18:13:38,786 INFO [common.py:357] [test-other] %WER 18.84% [9863 / 52343, 852 ins, 1208 del, 7803 sub ]
Epoch 2:
2021-04-07 20:49:00,742 INFO [common.py:356] [test-clean] %WER 7.02% [3693 / 52576, 443 ins, 371 del, 2879 sub ]
2021-04-07 20:51:00,451 INFO [common.py:356] [test-other] %WER 15.77% [8253 / 52343, 905 ins, 747 del, 6601 sub ]
Epoch 3:
2021-04-07 22:07:01,294 INFO [common.py:356] [test-clean] %WER 6.33% [3326 / 52576, 357 ins, 325 del, 2644 sub ]
2021-04-07 22:08:15,296 INFO [common.py:356] [test-other] %WER 14.49% [7582 / 52343, 758 ins, 796 del, 6028 sub ]
Epoch 4:
2021-04-08 08:01:33,173 INFO [common.py:356] [test-clean] %WER 6.10% [3208 / 52576, 376 ins, 297 del, 2535 sub ]
2021-04-08 08:03:48,852 INFO [common.py:356] [test-other] %WER 13.78% [7213 / 52343, 886 ins, 618 del, 5709 sub ]
Epoch 5:
2021-04-08 08:05:21,075 INFO [common.py:356] [test-clean] %WER 5.92% [3114 / 52576, 356 ins, 265 del, 2493 sub ]
2021-04-08 08:06:29,545 INFO [common.py:356] [test-other] %WER 13.54% [7086 / 52343, 783 ins, 644 del, 5659 sub ]
Epoch 6:
2021-04-08 08:14:20,788 INFO [common.py:356] [test-clean] %WER 5.50% [2891 / 52576, 328 ins, 254 del, 2309 sub ]
2021-04-08 08:15:28,709 INFO [common.py:356] [test-other] %WER 12.85% [6726 / 52343, 784 ins, 600 del, 5342 sub ]
Epoch 7:
2021-04-08 11:37:02,413 INFO [common.py:356] [test-clean] %WER 5.62% [2956 / 52576, 312 ins, 270 del, 2374 sub ]
2021-04-08 11:38:46,792 INFO [common.py:356] [test-other] %WER 12.93% [6770 / 52343, 710 ins, 632 del, 5428 sub ]
Epoch 8:
2021-04-08 15:30:49,968 INFO [common.py:356] [test-clean] %WER 5.61% [2948 / 52576, 330 ins, 283 del, 2335 sub ]
2021-04-08 15:31:59,328 INFO [common.py:356] [test-other] %WER 12.78% [6692 / 52343, 766 ins, 584 del, 5342 sub ]
Epoch 9:
2021-04-09 08:50:07,223 INFO [common.py:356] [test-clean] %WER 5.60% [2946 / 52576, 321 ins, 309 del, 2316 sub ]
2021-04-09 08:51:54,651 INFO [common.py:356] [test-other] %WER 12.59% [6592 / 52343, 705 ins, 616 del, 5271 sub ]


Average (last 4 epochs):
2021-04-09 08:49:39,470 INFO [common.py:356] [test-clean] %WER 5.25% [2762 / 52576, 318 ins, 260 del, 2184 sub ]
2021-04-09 08:50:27,926 INFO [common.py:356] [test-other] %WER 11.72% [6136 / 52343, 708 ins, 535 del, 4893 sub ]

Such a small difference doesn't seem right, does it?

@danpovey commented:

It seems plausible to me. It could be that we'd only see improvements after more epochs of training. BTW, if it isn't already supported, can you add an option to randomize the position of cuts within minibatches? I mean, so the padding silence isn't all justified to the right but is allocated randomly. The motivation is that when we use subsampling, the output of the model isn't invariant to shifts modulo the subsampling factor (e.g. modulo 4), so the random shift acts a bit like data augmentation.
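A toy illustration of that shift sensitivity (not the actual model code): with a subsampling factor of 4, padding a couple of extra frames on the left changes which input frames survive subsampling, so a random offset shows the network slightly different views of the same data.

```python
import torch

feats = torch.arange(16).float()               # pretend these are 16 feature frames
subsample = lambda x: x[::4]                   # crude stand-in for the conv subsampling

print(subsample(feats))                        # tensor([ 0.,  4.,  8., 12.])
shifted = torch.cat([torch.zeros(2), feats])   # 2 frames of "silence" padded on the left
print(subsample(shifted)[:4])                  # tensor([ 0.,  2.,  6., 10.]) -- different frames survive
```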


pzelasko commented Apr 12, 2021

I think we have already implemented "both"-sides padding, which would center the cuts. It would look something like this (except the cuts would be concatenated first, with data augmentation applied):

[screenshot: illustration of cuts centered in the minibatch, with padding on both sides]

Does it make sense? I can make it the default behaviour.


danpovey commented Apr 12, 2021 via email

@pzelasko commented:

There is also a transform called ExtraPadding that adds a fixed number N of padding frames to the cut (N/2 on each side); I can extend it so that it is randomized.
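A sketch of what a randomized version could look like, written against a plain feature matrix rather than Lhotse cuts (so this is not the actual ExtraPadding code): draw a random left/right split of the extra frames instead of always using N/2 on each side.

```python
import torch

def random_extra_padding(feats: torch.Tensor, num_pad_frames: int) -> torch.Tensor:
    """Add `num_pad_frames` extra frames to `feats` (frames x dims), splitting them
    randomly between the left and right sides instead of evenly. A sketch only."""
    left = int(torch.randint(0, num_pad_frames + 1, ()).item())
    right = num_pad_frames - left
    pad_value = feats.min().item()   # assumption: pad with the lowest (log-)energy value
    return torch.cat([
        feats.new_full((left, feats.size(1)), pad_value),
        feats,
        feats.new_full((right, feats.size(1)), pad_value),
    ])

padded = random_extra_padding(torch.randn(100, 80), num_pad_frames=16)
print(padded.shape)   # torch.Size([116, 80])
```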


pzelasko commented Apr 20, 2021

FYI if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4 x learning rate, average last 3 epochs and use default rescoring settings (4-gram LM with lattice beam 8) we get:

2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

@csukuangfj commented:

FYI if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4 x learning rate, average last 3 epochs and use default rescoring settings (4-gram LM with lattice beam 8) we get:


2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]

2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

Wow!

@pzelasko commented:

Actually, I just realized it will be interesting for you to see more info; here's a WER breakdown by epoch, plus numbers for different averaging settings and with/without rescoring. You can notice that the WER at epoch 3 is worse than when training on 1 GPU, which is likely explained by the 4x smaller number of optimizer steps due to using 4 GPUs (partially, but not fully, counteracted by the linearly increased LR).

Epoch 1:
2021-04-18 12:01:52,735 INFO [common.py:364] [test-clean] %WER 8.08% [4249 / 52576, 557 ins, 312 del, 3380 sub ]
2021-04-18 12:02:48,774 INFO [common.py:364] [test-other] %WER 17.64% [9231 / 52343, 1148 ins, 846 del, 7237 sub ]
Epoch 2:
2021-04-18 15:17:08,612 INFO [common.py:364] [test-clean] %WER 6.33% [3329 / 52576, 395 ins, 300 del, 2634 sub ]
2021-04-18 15:18:06,073 INFO [common.py:364] [test-other] %WER 13.48% [7056 / 52343, 734 ins, 736 del, 5586 sub ]
Epoch 3:
2021-04-18 15:19:37,956 INFO [common.py:364] [test-clean] %WER 5.82% [3060 / 52576, 357 ins, 289 del, 2414 sub ]
2021-04-18 15:20:31,606 INFO [common.py:364] [test-other] %WER 12.35% [6462 / 52343, 757 ins, 586 del, 5119 sub ]
Epoch 4:
2021-04-19 10:12:39,576 INFO [common.py:364] [test-clean] %WER 5.46% [2872 / 52576, 327 ins, 247 del, 2298 sub ]
2021-04-19 10:14:33,516 INFO [common.py:364] [test-other] %WER 11.81% [6181 / 52343, 712 ins, 588 del, 4881 sub ]
Epoch 5:
2021-04-19 10:16:39,978 INFO [common.py:364] [test-clean] %WER 5.48% [2879 / 52576, 347 ins, 243 del, 2289 sub ]
2021-04-19 10:17:48,955 INFO [common.py:364] [test-other] %WER 11.35% [5943 / 52343, 727 ins, 523 del, 4693 sub ]
Epoch 6:
2021-04-19 10:19:19,325 INFO [common.py:364] [test-clean] %WER 5.14% [2703 / 52576, 290 ins, 253 del, 2160 sub ]
2021-04-19 10:20:28,177 INFO [common.py:364] [test-other] %WER 10.82% [5661 / 52343, 622 ins, 549 del, 4490 sub ]
Epoch 7:
2021-04-19 14:58:01,037 INFO [common.py:364] [test-clean] %WER 5.15% [2706 / 52576, 300 ins, 244 del, 2162 sub ]
2021-04-19 14:59:10,205 INFO [common.py:364] [test-other] %WER 10.85% [5678 / 52343, 662 ins, 528 del, 4488 sub ]
Epoch 8:
2021-04-19 16:21:46,197 INFO [common.py:364] [test-clean] %WER 4.99% [2626 / 52576, 318 ins, 211 del, 2097 sub ]
2021-04-19 16:22:52,261 INFO [common.py:364] [test-other] %WER 10.50% [5497 / 52343, 614 ins, 481 del, 4402 sub ]
Epoch 9:
2021-04-20 09:15:25,490 INFO [common.py:364] [test-clean] %WER 4.96% [2606 / 52576, 289 ins, 213 del, 2104 sub ]
2021-04-20 09:16:20,003 INFO [common.py:364] [test-other] %WER 10.49% [5492 / 52343, 646 ins, 485 del, 4361 sub ]
Epoch 10:
2021-04-20 09:17:33,124 INFO [common.py:364] [test-clean] %WER 5.14% [2702 / 52576, 287 ins, 257 del, 2158 sub ]
2021-04-20 09:18:25,427 INFO [common.py:364] [test-other] %WER 10.60% [5548 / 52343, 652 ins, 453 del, 4443 sub ]


Average (epochs 4, 5, 6):
2021-04-19 10:12:16,678 INFO [common.py:364] [test-clean] %WER 5.02% [2641 / 52576, 308 ins, 235 del, 2098 sub ]
2021-04-19 10:13:28,620 INFO [common.py:364] [test-other] %WER 10.16% [5319 / 52343, 623 ins, 463 del, 4233 sub ]

Average (epochs 8, 9, 10):
2021-04-20 09:17:24,622 INFO [common.py:364] [test-clean] %WER 4.71% [2477 / 52576, 291 ins, 201 del, 1985 sub ]
2021-04-20 09:19:01,079 INFO [common.py:364] [test-other] %WER 9.65% [5052 / 52343, 604 ins, 422 del, 4026 sub ]

Average (epochs 8, 9, 10) with rescoring:
2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

@pzelasko commented:

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly, because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅).

@csukuangfj commented:

Could you post the results with LM rescoring disabled? It's using the whole lattice for rescoring by default.

Passing --use-lm-rescoring=0 on the command line disables LM rescoring. I just want to know what role LM rescoring plays here.


pzelasko commented Apr 20, 2021

I think I did it concurrently with your question -- only the last result in my previous message has rescoring turned on.

@csukuangfj commented:

I think I did it concurrently with your question -- only the last result in my previous message has rescoring turned on.

Thanks! GitHub didn't show the results when I commented.

@csukuangfj commented:

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly, because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅).

Are there TensorBoard logs in your case? Those contain timestamps.

@pzelasko commented:

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly, because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅).

Are there TensorBoard logs in your case? Those contain timestamps.

Good point. It's ~1h 45min per epoch. I think it could still get a bit better if we decay the LR faster; it was still quite high (~1.1e-3) at the end of training.
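For context, assuming the standard Noam schedule from the Transformer paper, the LR decays only as 1/sqrt(step) after warm-up, which is why it stays high for a long time; the constants below are illustrative, not the exact values used in this run.

```python
def noam_lr(step: int, d_model: int = 256, factor: float = 1.0, warmup: int = 25000) -> float:
    """Standard Noam learning-rate schedule (illustrative constants)."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(f"{noam_lr(80_000):.2e}")   # ~2.2e-04 -- still fairly high after 80k steps
```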


glynpu commented Apr 21, 2021

FYI if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4 x learning rate, average last 3 epochs and use default rescoring settings (4-gram LM with lattice beam 8) we get:

2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

@pzelasko Is it possible to share this trained model? I want to do n-best rescoring with a transformer LM on it.
Previous experiments with n-best transformer-LM rescoring already got lower WER than the 4-gram LM, using the AM trained by Fangjun. Results are in the first comment of this conversation.

@pzelasko commented:

@glynpu sure! You should be able to download it here: https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x

LMK if that doesn't work.


glynpu commented Apr 22, 2021

@pzelasko Thanks for sharing! Why is best_model.pt (128 MB) so much smaller than the others (384 MB)?

[screenshot: shared folder listing showing best_model.pt at 128 MB and the epoch checkpoints at 384 MB]

@pzelasko commented:

I think "best_model" doesn't store the optimizer, scheduler, etc. state dicts needed for resuming training. Also, it is not necessarily the best model, since it's picked based on dev loss rather than WER (and it is not averaged).
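That would also explain the size difference: with Adam, the optimizer keeps two extra tensors per parameter, so a full resumable checkpoint is roughly 3x the model-only file, which matches 384 MB vs. 128 MB. A hedged sketch of the two save styles (not the actual snowfall checkpoint code):

```python
import torch
import torch.nn as nn

model = nn.Linear(1000, 1000)                   # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters())
model(torch.randn(4, 1000)).sum().backward()    # one step so the Adam state exists
optimizer.step()

# Model-only checkpoint (presumably what best_model.pt is):
torch.save({"state_dict": model.state_dict()}, "best_model.pt")

# Full checkpoint for resuming training -- roughly 3x larger with Adam:
torch.save({"state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "epoch-n.pt")
```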


glynpu commented Apr 22, 2021

Thanks. So I should average the models from epochs 8, 9, and 10 to reproduce your best result.
But epoch-{9,10}.pt don't seem to exist in the shared folder.

@csukuangfj commented:

@glynpu

@pzelasko counts from 1, not from 0. So you should use epoch-{7,8,9}.pt


glynpu commented Apr 22, 2021

@pzelasko Could you please check the shared folder? Some models are missing; only epoch-{0,1,4,5,8}.pt are there. https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x

@pzelasko commented:

That's weird. Something went wrong when uploading. I'm pushing the missing files, you can expect them to be there in the next hour.

@pzelasko commented:

@glynpu

@pzelasko counts from 1, not from 0. So you should use epoch-{7,8,9}.pt

We'll probably need to make the indexing consistent; different parts of the code base count from 0, others from 1...


glynpu commented Apr 23, 2021

That's weird. Something went wrong when uploading. I'm pushing the missing files, you can expect them to be there in the next hour.

Thanks!
I got the models; the results of transformer LM n-best rescoring are:

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
| --- | --- | --- | --- | --- | --- | --- |
| baseline (no rescore) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM (16 layers, model size 72M, max_norm=5) | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |
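For anyone curious how n-best rescoring combines the scores: each of the n best hypotheses gets its first-pass score plus a scaled LM score, and the best total wins. A toy sketch; the data structures and the LM scale below are made up, not the actual snowfall/k2 code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hyp:
    words: List[str]
    am_score: float      # first-pass (acoustic + lattice LM) log-score

def rescore_nbest(hyps: List[Hyp],
                  lm_score: Callable[[List[str]], float],
                  lm_scale: float = 0.5) -> Hyp:
    """Return the hypothesis maximizing am_score + lm_scale * lm_score(words)."""
    return max(hyps, key=lambda h: h.am_score + lm_scale * lm_score(h.words))

# Toy usage with a fake LM that simply prefers shorter hypotheses:
fake_lm = lambda words: -2.0 * len(words)
best = rescore_nbest([Hyp(["the", "cat", "sat"], am_score=-10.0),
                      Hyp(["the", "cat", "sat", "uh"], am_score=-9.5)],
                     fake_lm)
print(best.words)    # ['the', 'cat', 'sat']
```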


danpovey commented Apr 23, 2021 via email
