Loss becoming "nan" during codebook training? #20

Closed
jhyau opened this issue Jun 11, 2022 · 2 comments

jhyau commented Jun 11, 2022

Hello! I was running codebook training on VAS, but for some reason the loss turns into nan after the first epoch. I was wondering if I might be doing something incorrectly. I used this command:

python train.py --base configs/vas_codebook.yaml -t True --gpus 0,

Here are the nans I see:

Epoch 0: 51%|██████████████████████████████▉ | 78/154 [01:55<01:52, 1.48s/it, loss=nan, v_num=0, val/rec_loss_epoch=1.100, val/aeloss_epoch=1.140, train/aeloss_step=nan.0]
Previous Epoch counts: [530, 0, 1, 0, 0, 0, 11, 45, 212, 1, 0, 49, 5, 0, 1, 0, 0, 0, 0, 4, 1, 48, 0, 17, 5, 201, 13, 5, 38, 0, 1, 287, 1370, 6, 3, 0, 0, 1, 0, 1, 58, 1, 3, 4, 228, 123, 0, 0, 15, 0, 0, 6, 0, 0, 36, 39, 36, 1, 7, 0, 0, 4, 38, 3, 0, 1, 62, 147, 5, 0, 3, 9, 8, 0, 13, 80, 33, 40, 0, 20, 0, 104, 26, 0, 4, 14, 1, 0, 0, 129, 0, 0, 2, 4, 7, 0, 1, 1, 0, 0, 28, 33, 2, 83, 0, 0, 43, 4, 4, 0, 59, 11, 22, 17, 6, 0, 30, 219, 0, 6, 15, 4, 2, 0, 0, 2, 0, 8]
Epoch 1: 51%|▌| 78/154 [01:08<01:06, 1.15it/s, loss=nan, v_num=0, val/rec_loss_epoch=nan.0, val/aeloss_epoch=nan.0, train/aeloss_step=nan.0, train/aeloss_epoch=nan.0, val/rec_loss_step=nan.0, val/aelo
Previous Epoch counts: [41870, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
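
In case it is useful for debugging, here is a minimal sketch (plain PyTorch, with illustrative names rather than this repo's actual API) of the kind of check that can flag a non-finite loss and a collapsed codebook as early as possible:

import torch

def check_finite(name: str, value: torch.Tensor) -> None:
    # Fail fast so the exact step where the loss first becomes NaN/Inf is visible.
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} is non-finite at this training step")

def codebook_usage(indices: torch.Tensor, codebook_size: int) -> torch.Tensor:
    # Histogram of codebook indices used in a batch; a single spike
    # (like [41870, 0, 0, ...]) means the quantizer has collapsed onto one code.
    return torch.bincount(indices.flatten(), minlength=codebook_size)

# Illustrative usage inside a training step ("aeloss" and "indices" are assumed
# to come from the autoencoder loss and the vector quantizer, respectively):
# check_finite("train/aeloss", aeloss)
# print("codebook usage:", codebook_usage(indices, codebook_size=128).tolist())

torch.autograd.set_detect_anomaly(True) can also help point at the operation that first produces a NaN during the backward pass, at the cost of slower training.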

Thank you very much!

The loss is going to 'nan' when I load the correct ckpt; do you have this problem? I trained on the VAS dataset.

Originally posted by @jwliu-cc in #13 (comment)

@v-iashin (Owner)

Hi, @jhyau. Thanks a lot for letting me know about it!

I think I was able to replicate the same problem that you and jwliu-cc are having. I added a post to that issue and reset the changes, at the cost of having this nasty bug in the code base.

I will close this one because it is merely a consequence of that issue.

jhyau commented Jun 12, 2022

Sounds good, thank you so much for looking into this and the quick response!
