In the training rules (linked here), the opt_base_learning_rate has been defined to be K*0.02.
This works well for systems with 4, 8, or 16 GPUs. However, it doesn't converge well on systems with other GPU counts, e.g. 10 GPUs.
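As an illustration (a minimal sketch, not the official compliance checker), the constraint can be read as K = global_batch_size / 16 with K a positive integer, which is an assumption inferred from the RCP entries quoted below (BS=96 -> 0.12, BS=128 -> 0.16):

# Minimal sketch; assumes K = global_batch_size / 16 and that K must be a
# positive integer, so opt_base_learning_rate = 0.02 * K.
def allowed_base_lr(global_batch_size, lr_step=0.02, bs_step=16):
    k = global_batch_size / bs_step
    if k != int(k):
        return None  # no valid opt_base_learning_rate for this batch size
    return round(lr_step * int(k), 4)

print(allowed_base_lr(96))   # 0.12 (K=6),  matches maskrcnn_ref_96
print(allowed_base_lr(128))  # 0.16 (K=8),  matches maskrcnn_ref_128
print(allowed_base_lr(120))  # None (K=7.5), the 10-GPU case with local BS=12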
Another example is the RCP itself.
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_1.1.0/rcps_maskrcnn.json
"maskrcnn_ref_96":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 96,
"Hyperparams": {
"opt_learning_decay_steps": [12000, 16000],
"opt_base_learning_rate": 0.12,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000192,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 15, 14, 14, 14, 14, 14, 14, 14, 13,
14, 14, 15, 14, 14, 14, 14, 14, 14, 14]
},
"maskrcnn_ref_128":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 128,
"Hyperparams": {
"opt_learning_decay_steps": [9000, 12000],
"opt_base_learning_rate": 0.16,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000256,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 14, 14, 14, 14, 14, 14, 14, 14, 14]
},
The LR needs to be scaled with the global batch size, which isn't friendly to 10-GPU systems.
We ran with BS=120 and LR=0.15 on a 10-GPU system, and it converged in the same number of epochs (14) as the BS=96, LR=0.12 configuration did; both use the same local BS of 12. In addition, the BS=128 case defined in the same RCP also lists 14 epochs to converge.
So we propose adjusting the rule on opt_base_learning_rate for Mask R-CNN from 0.02K to 0.01K. This would be fairer to systems with different numbers of GPUs.
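For comparison, here is a minimal sketch of the proposed 0.01K rule, under the same assumption that K must be a positive integer, with the batch-size step halved from 16 to 8 so the LR-per-image ratio stays at 0.02/16 = 0.01/8 = 0.00125:

# Same integer-K check as the sketch above, just with finer granularity
# (hypothetical helper illustrating the proposal, not the official checker).
def allowed_base_lr_proposed(global_batch_size, lr_step=0.01, bs_step=8):
    k = global_batch_size / bs_step
    if k != int(k):
        return None
    return round(lr_step * int(k), 4)

print(allowed_base_lr_proposed(96))   # 0.12 (K=12), unchanged
print(allowed_base_lr_proposed(128))  # 0.16 (K=16), unchanged
print(allowed_base_lr_proposed(120))  # 0.15 (K=15), now a legal point for 10-GPU systems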