In the training rules (linked here), the opt_base_learning_rate has been defined to be K*0.02.
This works well for systems with 4, 8, or 16 GPUs. However, it doesn't converge well on systems with other GPU counts, e.g. 10 GPUs.
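As an illustration (a minimal sketch, not the official compliance checker), the constraint can be read as K = global_batch_size / 16 with K a positive integer, which is an assumption inferred from the RCP entries quoted below (BS=96 -> 0.12, BS=128 -> 0.16):

# Minimal sketch; assumes K = global_batch_size / 16 and that K must be a
# positive integer, so opt_base_learning_rate = 0.02 * K.
def allowed_base_lr(global_batch_size, lr_step=0.02, bs_step=16):
    k = global_batch_size / bs_step
    if k != int(k):
        return None  # no valid opt_base_learning_rate for this batch size
    return round(lr_step * int(k), 4)

print(allowed_base_lr(96))   # 0.12 (K=6),  matches maskrcnn_ref_96
print(allowed_base_lr(128))  # 0.16 (K=8),  matches maskrcnn_ref_128
print(allowed_base_lr(120))  # None (K=7.5), the 10-GPU case with local BS=12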
Another example is the RCP itself.
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_1.1.0/rcps_maskrcnn.json
"maskrcnn_ref_96":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 96,
"Hyperparams": {
"opt_learning_decay_steps": [12000, 16000],
"opt_base_learning_rate": 0.12,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000192,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 15, 14, 14, 14, 14, 14, 14, 14, 13,
14, 14, 15, 14, 14, 14, 14, 14, 14, 14]
},
"maskrcnn_ref_128":
{
"Benchmark": "maskrcnn",
"Creator": "NVIDIA",
"When": "Prior to 1.0 submission",
"Platform": "TBD",
"BS": 128,
"Hyperparams": {
"opt_learning_decay_steps": [9000, 12000],
"opt_base_learning_rate": 0.16,
"num_image_candidates": 6000,
"opt_learning_rate_warmup_factor": 0.000256,
"opt_learning_rate_warmup_steps": 625
},
"Epochs to converge": [
14, 14, 14, 14, 14, 14, 14, 14, 14, 14]
},
The LR needs to be scaled with the global batch size, which isn't friendly to 10-GPU systems.
We ran with BS=120 and LR=0.15 on a 10-GPU system, and it converged in the same number of epochs (14) as the BS=96, LR=0.12 configuration did; both use the same local BS of 12. In addition, the BS=128 case defined in the same RCP also lists 14 epochs to converge.
So we propose adjusting the rule on opt_base_learning_rate for Mask R-CNN from 0.02K to 0.01K. This would be fairer to systems with different numbers of GPUs.
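For comparison, here is a minimal sketch of the proposed 0.01K rule, under the same assumption that K must be a positive integer, with the batch-size step halved from 16 to 8 so the LR-per-image ratio stays at 0.02/16 = 0.01/8 = 0.00125:

# Same integer-K check as the sketch above, just with finer granularity
# (hypothetical helper illustrating the proposal, not the official checker).
def allowed_base_lr_proposed(global_batch_size, lr_step=0.01, bs_step=8):
    k = global_batch_size / bs_step
    if k != int(k):
        return None
    return round(lr_step * int(k), 4)

print(allowed_base_lr_proposed(96))   # 0.12 (K=12), unchanged
print(allowed_base_lr_proposed(128))  # 0.16 (K=16), unchanged
print(allowed_base_lr_proposed(120))  # 0.15 (K=15), now a legal point for 10-GPU systems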