Mask R-CNN opt_base_learning_rate change from 0.02 to 0.01 #473

Open
hanyunfan opened this issue Oct 23, 2021 · 1 comment
hanyunfan commented Oct 23, 2021

In the training rules, here, the opt_base_learning_rate for Mask R-CNN has been defined to be K * 0.02:

> maskrcnn | sgd | opt_base_learning_rate | 0.02 * K for any integer K | base learning rate; this should be the learning rate after warm-up and before decay
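To make the constraint concrete, here is a minimal sketch of what the 0.02 * K rule admits. The `is_allowed_base_lr` helper is hypothetical (not part of mlperf_logging); it just checks whether a value is an integer multiple of the allowed step.

```python
# Hypothetical helper, not part of mlperf_logging: checks whether a base LR
# is (approximately) an integer multiple of the allowed step size.
def is_allowed_base_lr(lr: float, step: float = 0.02, tol: float = 1e-9) -> bool:
    k = round(lr / step)
    return k >= 1 and abs(lr - k * step) < tol

# Under the current rule (step = 0.02):
print(is_allowed_base_lr(0.12))   # True  (K = 6)
print(is_allowed_base_lr(0.16))   # True  (K = 8)
print(is_allowed_base_lr(0.15))   # False (would require K = 7.5)
```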

It works well for systems with 4, 8, or 16 GPUs. However, it doesn't converge well on systems with other numbers of GPUs, e.g. 10 GPUs.

Another example is the RCP itself.
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_1.1.0/rcps_maskrcnn.json
"maskrcnn_ref_96":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 96,
"Hyperparams": {
"opt_learning_decay_steps": [12000, 16000],
"opt_base_learning_rate": 0.12,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000192,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 15, 14, 14, 14, 14, 14, 14, 14, 13,
14, 14, 15, 14, 14, 14, 14, 14, 14, 14]

},

"maskrcnn_ref_128":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 128,
"Hyperparams": {
"opt_learning_decay_steps": [9000, 12000],
"opt_base_learning_rate": 0.16,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000256,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 14, 14, 14, 14, 14, 14, 14, 14, 14]
},

The LR needs to be scaled with the global batch size, which isn't friendly for 10-GPU systems.

We ran with BS=120 and LR=0.15 on a 10-GPU system, and it converged in the same number of epochs (14) as BS=96 with LR=0.12 did; both use the same local batch size of 12. In addition, the BS=128 case defined in the same RCP also lists a converged epoch count of 14.
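The arithmetic behind that run, as a quick sketch (reusing the hypothetical `is_allowed_base_lr` helper from above; the BS/LR numbers come from the maskrcnn_ref_96 entry and our 10-GPU configuration):

```python
# Linear LR scaling from the maskrcnn_ref_96 RCP entry to a 10-GPU system.
ref_bs, ref_lr = 96, 0.12                 # BS and opt_base_learning_rate from maskrcnn_ref_96
gpus, local_bs = 10, 12                   # 10 GPUs, local batch size 12
global_bs = gpus * local_bs               # 120

scaled_lr = ref_lr * global_bs / ref_bs   # 0.15, same LR/BS ratio as the RCP entries
print(scaled_lr)

# 0.15 is not an integer multiple of 0.02 (it would need K = 7.5),
# but it is an integer multiple of 0.01 (K = 15).
print(is_allowed_base_lr(scaled_lr, step=0.02))   # False under the current rule
print(is_allowed_base_lr(scaled_lr, step=0.01))   # True under the proposed rule
```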

So, we propose that the rule on opt_base_learning_rate for Mask R-CNN be adjusted from 0.02 * K to 0.01 * K. This would be fairer to systems with different numbers of GPUs.
