Too long data_time in training of denoise and deblur models. And too low GPU utilization #2130

Open
3 tasks done
GeLeinjust opened this issue Mar 20, 2024 · 0 comments
Labels
kind/bug something isn't working

Comments

@GeLeinjust

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmagic

Environment

pytorch 11.7
Other dependencies as listed in requirements.txt

Reproduces the problem - code sample

The problem can be reproduced by training NAFNet for denoising on SIDD or for deblurring on GoPro.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/nafnet/nafnet_c64eb2248mb12db2222_8xb8-lr1e-3-400k_sidd.py \
--work-dir ./work_dirs/naf_sidd \
--auto-scale-lr \
--amp

Reproduces the problem - error message

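This is an excerpt from the training log of NAFNet on SIDD. Note that data_time accounts for almost all of the per-iteration time during training, but not during validation.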

03/20 11:49:02 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 11:49:02 - mmengine - INFO - Iter(train) [ 40/20000] lr: 2.0000e-03 memory: 6946 data_time: 4.7464 loss: -18.6554 time: 5.1101
03/20 11:53:11 - mmengine - INFO - Iter(train) [ 100/20000] lr: 1.9999e-03 eta: 1 day, 1:03:44 time: 4.5339 data_time: 4.2340 memory: 6946 loss: -25.1152
03/20 11:59:38 - mmengine - INFO - Iter(train) [ 200/20000] lr: 1.9995e-03 eta: 23:06:41 time: 3.8703 data_time: 3.6130 memory: 6946 loss: -33.5698
03/20 12:05:44 - mmengine - INFO - Iter(train) [ 300/20000] lr: 1.9989e-03 eta: 22:00:53 time: 3.6648 data_time: 3.4076 memory: 6946 loss: -34.8641
03/20 12:12:13 - mmengine - INFO - Iter(train) [ 400/20000] lr: 1.9980e-03 eta: 21:42:56 time: 3.8854 data_time: 3.6273 memory: 6946 loss: -35.0379
03/20 12:19:02 - mmengine - INFO - Iter(train) [ 500/20000] lr: 1.9969e-03 eta: 21:42:59 time: 4.0917 data_time: 3.8339 memory: 6946 loss: -35.8696
03/20 12:25:20 - mmengine - INFO - Iter(train) [ 600/20000] lr: 1.9956e-03 eta: 21:23:50 time: 3.7777 data_time: 3.5199 memory: 6946 loss: -37.0751
03/20 12:31:35 - mmengine - INFO - Iter(train) [ 700/20000] lr: 1.9940e-03 eta: 21:06:59 time: 3.7479 data_time: 3.4884 memory: 6946 loss: -37.3975
03/20 12:37:41 - mmengine - INFO - Iter(train) [ 800/20000] lr: 1.9921e-03 eta: 20:49:35 time: 3.6678 data_time: 3.4098 memory: 6946 loss: -37.6647
03/20 12:44:41 - mmengine - INFO - Iter(train) [ 900/20000] lr: 1.9900e-03 eta: 20:53:17 time: 4.1938 data_time: 3.9359 memory: 6946 loss: -37.7504
03/20 12:51:04 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 12:51:04 - mmengine - INFO - Iter(train) [ 1000/20000] lr: 1.9877e-03 eta: 20:43:17 time: 3.8284 data_time: 3.5698 memory: 6946 loss: -38.3180
03/20 12:51:04 - mmengine - INFO - Saving checkpoint at 1000 iterations
03/20 12:57:26 - mmengine - INFO - Iter(train) [ 1100/20000] lr: 1.9851e-03 eta: 20:33:44 time: 3.8210 data_time: 3.5634 memory: 6946 loss: -38.2264
03/20 13:03:55 - mmengine - INFO - Iter(train) [ 1200/20000] lr: 1.9823e-03 eta: 20:26:41 time: 3.8970 data_time: 3.6391 memory: 6946 loss: -38.6776
03/20 13:10:27 - mmengine - INFO - Iter(train) [ 1300/20000] lr: 1.9793e-03 eta: 20:20:17 time: 3.9197 data_time: 3.6504 memory: 6946 loss: -38.9627
03/20 13:17:17 - mmengine - INFO - Iter(train) [ 1400/20000] lr: 1.9760e-03 eta: 20:17:44 time: 4.0952 data_time: 3.8379 memory: 6946 loss: -39.0869
03/20 13:23:24 - mmengine - INFO - Iter(train) [ 1500/20000] lr: 1.9724e-03 eta: 20:05:50 time: 3.6675 data_time: 3.4088 memory: 6946 loss: -39.0055
03/20 13:29:31 - mmengine - INFO - Iter(train) [ 1600/20000] lr: 1.9686e-03 eta: 19:54:41 time: 3.6693 data_time: 3.4121 memory: 6946 loss: -39.3286
03/20 13:36:10 - mmengine - INFO - Iter(train) [ 1700/20000] lr: 1.9646e-03 eta: 19:49:54 time: 3.9914 data_time: 3.7340 memory: 6946 loss: -39.4060
03/20 13:42:56 - mmengine - INFO - Iter(train) [ 1800/20000] lr: 1.9603e-03 eta: 19:46:11 time: 4.0665 data_time: 3.8090 memory: 6946 loss: -39.6298
03/20 13:48:56 - mmengine - INFO - Iter(train) [ 1900/20000] lr: 1.9558e-03 eta: 19:34:42 time: 3.5981 data_time: 3.3417 memory: 6946 loss: -39.6971
03/20 13:55:22 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 13:55:22 - mmengine - INFO - Iter(train) [ 2000/20000] lr: 1.9511e-03 eta: 19:27:38 time: 3.8557 data_time: 3.5993 memory: 6946 loss: -39.6293
03/20 13:55:22 - mmengine - INFO - Saving checkpoint at 2000 iterations
03/20 13:55:36 - mmengine - INFO - Iter(val) [ 100/1280] eta: 0:02:26 time: 0.1238 data_time: 0.0358 memory: 825
03/20 13:55:46 - mmengine - INFO - Iter(val) [ 200/1280] eta: 0:01:48 time: 0.1008 data_time: 0.0173 memory: 825
03/20 13:55:57 - mmengine - INFO - Iter(val) [ 300/1280] eta: 0:01:42 time: 0.1049 data_time: 0.0184 memory: 825
03/20 13:56:07 - mmengine - INFO - Iter(val) [ 400/1280] eta: 0:01:27 time: 0.0992 data_time: 0.0172 memory: 825
03/20 13:56:16 - mmengine - INFO - Iter(val) [ 500/1280] eta: 0:01:12 time: 0.0930 data_time: 0.0155 memory: 825
03/20 13:56:25 - mmengine - INFO - Iter(val) [ 600/1280] eta: 0:01:03 time: 0.0934 data_time: 0.0152 memory: 825
03/20 13:56:35 - mmengine - INFO - Iter(val) [ 700/1280] eta: 0:00:54 time: 0.0941 data_time: 0.0153 memory: 825
03/20 13:56:44 - mmengine - INFO - Iter(val) [ 800/1280] eta: 0:00:46 time: 0.0959 data_time: 0.0154 memory: 825
03/20 13:56:54 - mmengine - INFO - Iter(val) [ 900/1280] eta: 0:00:36 time: 0.0973 data_time: 0.0158 memory: 825
03/20 13:57:04 - mmengine - INFO - Iter(val) [1000/1280] eta: 0:00:27 time: 0.0990 data_time: 0.0160 memory: 825
03/20 13:57:14 - mmengine - INFO - Iter(val) [1100/1280] eta: 0:00:17 time: 0.0994 data_time: 0.0155 memory: 825
03/20 13:57:24 - mmengine - INFO - Iter(val) [1200/1280] eta: 0:00:07 time: 0.0991 data_time: 0.0155 memory: 825
03/20 13:57:33 - mmengine - INFO - Iter(val) [1280/1280] MAE: 0.0126 PSNR: 36.7611 SSIM: 0.8853 data_time: 0.0178 time: 0.1006
03/20 13:57:34 - mmengine - INFO - The best checkpoint with 36.7611 PSNR at 2000 iter is saved to best_PSNR_iter_2000.pth.
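
To put a number on the bottleneck, the ratio data_time / time can be computed directly from the log. The sketch below is only an illustration: the three lines in `LOG` are copied verbatim from the excerpt above, while the regular expressions and the helper name `data_fractions` are my own and are not part of MMEngine.

```python
import re

# Three train-iteration lines copied verbatim from the log excerpt in this issue.
LOG = """\
03/20 11:53:11 - mmengine - INFO - Iter(train) [ 100/20000] lr: 1.9999e-03 eta: 1 day, 1:03:44 time: 4.5339 data_time: 4.2340 memory: 6946 loss: -25.1152
03/20 12:51:04 - mmengine - INFO - Iter(train) [ 1000/20000] lr: 1.9877e-03 eta: 20:43:17 time: 3.8284 data_time: 3.5698 memory: 6946 loss: -38.3180
03/20 13:55:22 - mmengine - INFO - Iter(train) [ 2000/20000] lr: 1.9511e-03 eta: 19:27:38 time: 3.8557 data_time: 3.5993 memory: 6946 loss: -39.6293
"""

# The negative lookbehind makes `time:` match the total step time
# without also matching the tail of `data_time:`.
TIME_RE = re.compile(r"(?<!_)time: ([\d.]+)")
DATA_RE = re.compile(r"data_time: ([\d.]+)")
ITER_RE = re.compile(r"Iter\(train\) \[\s*(\d+)/")


def data_fractions(log_text):
    """Return (iteration, data_time / time) for every train line found."""
    rows = []
    for line in log_text.splitlines():
        it = ITER_RE.search(line)
        total = TIME_RE.search(line)
        data = DATA_RE.search(line)
        if it and total and data:
            fraction = float(data.group(1)) / float(total.group(1))
            rows.append((int(it.group(1)), fraction))
    return rows


for iteration, fraction in data_fractions(LOG):
    print(f"iter {iteration:>5}: {fraction:.1%} of the step is spent waiting for data")
```

For these lines the ratio comes out at roughly 93%, which leaves only about 0.26 s per iteration for the forward and backward pass. In the validation summary line the same ratio is about 18% (0.0178 / 0.1006), so the slowdown appears to be specific to the training data pipeline.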

Additional information

I would like to know whether setting 'pin_memory=True' would help.

@GeLeinjust GeLeinjust added the kind/bug something isn't working label Mar 20, 2024