Loss becoming "nan" during codebook training? #20

Closed
jhyau opened this issue Jun 11, 2022 · 2 comments

jhyau commented Jun 11, 2022

Hello! I was running codebook training on VAS, but for some reason the loss turns into nan after the first epoch. I was wondering if I might be doing something incorrectly. I used this command:

python train.py --base configs/vas_codebook.yaml -t True --gpus 0,

Here are the nans I see:

Epoch 0: 51%|██████████████████████████████▉ | 78/154 [01:55<01:52, 1.48s/it, loss=nan, v_num=0, val/rec_loss_epoch=1.100, val/aeloss_epoch=1.140, train/aeloss_step=nan.0]
Previous Epoch counts: [530, 0, 1, 0, 0, 0, 11, 45, 212, 1, 0, 49, 5, 0, 1, 0, 0, 0, 0, 4, 1, 48, 0, 17, 5, 201, 13, 5, 38, 0, 1, 287, 1370, 6, 3, 0, 0, 1, 0, 1, 58, 1, 3, 4, 228, 123, 0, 0, 15, 0, 0, 6, 0, 0, 36, 39, 36, 1, 7, 0, 0, 4, 38, 3, 0, 1, 62, 147, 5, 0, 3, 9, 8, 0, 13, 80, 33, 40, 0, 20, 0, 104, 26, 0, 4, 14, 1, 0, 0, 129, 0, 0, 2, 4, 7, 0, 1, 1, 0, 0, 28, 33, 2, 83, 0, 0, 43, 4, 4, 0, 59, 11, 22, 17, 6, 0, 30, 219, 0, 6, 15, 4, 2, 0, 0, 2, 0, 8]
Epoch 1: 51%|▌| 78/154 [01:08<01:06, 1.15it/s, loss=nan, v_num=0, val/rec_loss_epoch=nan.0, val/aeloss_epoch=nan.0, train/aeloss_step=nan.0, train/aeloss_epoch=nan.0, val/rec_loss_step=nan.0, val/aelo
Previous Epoch counts: [41870, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
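
In case it is useful for debugging, here is a minimal sketch (plain PyTorch, with illustrative names rather than this repo's actual API) of the kind of check that can flag a non-finite loss and a collapsed codebook as early as possible:

import torch

def check_finite(name: str, value: torch.Tensor) -> None:
    # Fail fast so the exact step where the loss first becomes NaN/Inf is visible.
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} is non-finite at this training step")

def codebook_usage(indices: torch.Tensor, codebook_size: int) -> torch.Tensor:
    # Histogram of codebook indices used in a batch; a single spike
    # (like [41870, 0, 0, ...]) means the quantizer has collapsed onto one code.
    return torch.bincount(indices.flatten(), minlength=codebook_size)

# Illustrative usage inside a training step ("aeloss" and "indices" are assumed
# to come from the autoencoder loss and the vector quantizer, respectively):
# check_finite("train/aeloss", aeloss)
# print("codebook usage:", codebook_usage(indices, codebook_size=128).tolist())

torch.autograd.set_detect_anomaly(True) can also help point at the operation that first produces a NaN during the backward pass, at the cost of slower training.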

Thank you very much!

The loss is going to 'nan' when I load the correct ckpt; do you have this problem? I trained on the VAS dataset.

Originally posted by @jwliu-cc in #13 (comment)

@v-iashin (Owner)

Hi, @jhyau. Thanks a lot for letting me know about it!

I think I was able to replicate the same problem that you and jwliu-cc are having. I added a post to that issue and reset the changes, at the cost of having this nasty bug in the code base.

I will close this one because it is merely a consequence of that issue.

jhyau commented Jun 12, 2022

Sounds good, thank you so much for looking into this and the quick response!
