
[Bug] Strange behavior of the lr_mult #1612

Open · 2 tasks done
AlphaPlusTT opened this issue Nov 29, 2024 · 0 comments
Labels: bug (Something isn't working)

Prerequisite

Environment

mmcv==2.1.0
mmdet==3.3.0
mmdet3d==1.4.0
mmengine==0.10.5

Reproduces the problem - code sample

During training, I want the learning rate of the img_backbone to remain at 0.1 times the base learning rate, so I set the following in the configuration file:

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.1),
        }
    )
)
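
As a minimal sketch of what this paramwise_cfg does at build time (assuming a toy model with an img_backbone submodule and lr = 1e-4, both of which are illustrative assumptions), the resulting per-group learning rates can be inspected outside the runner with mmengine's build_optim_wrapper:

# Minimal sketch: build the optim_wrapper above standalone and inspect the
# per-group learning rates. ToyModel and lr = 1e-4 are assumed for illustration.
import torch.nn as nn
from mmengine.optim import build_optim_wrapper

lr = 1e-4  # assumed base learning rate

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_backbone = nn.Linear(4, 4)  # matched by the 'img_backbone' custom key
        self.head = nn.Linear(4, 2)          # falls back to the base learning rate

optim_wrapper_cfg = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(custom_keys={'img_backbone': dict(lr_mult=0.1)}))

optim_wrapper = build_optim_wrapper(ToyModel(), optim_wrapper_cfg)
# Expected: two distinct initial lrs, lr * 0.1 for the img_backbone groups and
# lr for everything else, i.e. lr_mult is baked into each group's initial lr.
print(sorted({group['lr'] for group in optim_wrapper.optimizer.param_groups}))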

The param_scheduler is set as follows:

param_scheduler = [
    # learning rate scheduler
    # During the first 8 epochs, learning rate increases from lr to lr * 100
    # during the next 12 epochs, learning rate decreases from lr * 100 to lr
    dict(
        type='CosineAnnealingLR',
        T_max=8,
        eta_min=lr * 100,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=12,
        eta_min=lr,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
    # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
    dict(
        type='CosineAnnealingMomentum',
        T_max=8,
        eta_min=0.85 / 0.95,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        T_max=12,
        eta_min=1,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True)
]
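
For illustration, here is a minimal plain-PyTorch sketch of the first scheduler above: two parameter groups start at lr and 0.1 * lr and are both annealed toward the same eta_min = lr * 100. Note that torch.optim.lr_scheduler.CosineAnnealingLR is used here as a stand-in for mmengine's scheduler, the toy parameters and lr = 1e-4 are assumptions, and the sketch does not attempt to reproduce the exact numbers in the logs below.

# Plain-PyTorch sketch: two parameter groups at lr and 0.1 * lr, both annealed
# toward the same eta_min = lr * 100 (torch's CosineAnnealingLR stands in for
# mmengine's scheduler; toy parameters and lr = 1e-4 are assumptions).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

lr = 1e-4  # assumed base learning rate
p_head = torch.nn.Parameter(torch.zeros(1))
p_backbone = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([
    {'params': [p_head], 'lr': lr},            # group at the base lr
    {'params': [p_backbone], 'lr': lr * 0.1},  # group with lr_mult=0.1 applied
])
scheduler = CosineAnnealingLR(optimizer, T_max=8, eta_min=lr * 100)

for epoch in range(8):
    head_lr, backbone_lr = (g['lr'] for g in optimizer.param_groups)
    print(f'epoch {epoch}: backbone_lr / head_lr = {backbone_lr / head_lr:.3f}')
    optimizer.step()
    scheduler.step()
# The ratio starts at 0.100 and approaches 1.000 as both groups are annealed
# toward the shared eta_min; a fixed 0.1 ratio is not preserved by the scheduler.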

Reproduces the problem - command or script

Same as the code sample above.

Reproduces the problem - error message

At the beginning, the learning rate of img_backbone is indeed 0.1 times the base learning rate:

2024/11/21 21:40:30 - mmengine - INFO - Epoch(train)  [1][ 100/3517]  base_lr: 5.0005e-05 lr: 5.0060e-06  eta: 20:29:23  time: 0.9889  data_time: 0.0563  memory: 32041  grad_norm: 52825.8107  loss: 5568.1427  task0.loss_heatmap: 49.9860  task0.loss_bbox: 0.9099  task1.loss_heatmap: 596.6443  task1.loss_bbox: 1.1417  task2.loss_heatmap: 2504.7168  task2.loss_bbox: 1.5418  task3.loss_heatmap: 620.9393  task3.loss_bbox: 0.8771  task4.loss_heatmap: 1612.4171  task4.loss_bbox: 0.9113  task5.loss_heatmap: 177.1250  task5.loss_bbox: 0.9324

However, the learning rate of img_backbone slowly caught up with the base learning rate as training progressed:

2024/11/21 23:45:30 - mmengine - INFO - Epoch(train)  [3][ 100/3517]  base_lr: 7.2556e-05 lr: 3.4323e-05  eta: 18:42:58  time: 1.0738  data_time: 0.0613  memory: 32031  grad_norm: 64.6705  loss: 14.8487  task0.loss_heatmap: 1.4181  task0.loss_bbox: 0.6505  task1.loss_heatmap: 2.0847  task1.loss_bbox: 0.7157  task2.loss_heatmap: 2.0074  task2.loss_bbox: 0.7194  task3.loss_heatmap: 1.4966  task3.loss_bbox: 0.5754  task4.loss_heatmap: 1.9814  task4.loss_bbox: 0.6894  task5.loss_heatmap: 1.8084  task5.loss_bbox: 0.7016
...
...
2024/11/22 01:50:03 - mmengine - INFO - Epoch(train)  [5][ 100/3517]  base_lr: 1.2583e-04 lr: 1.0358e-04  eta: 16:36:18  time: 1.0527  data_time: 0.0568  memory: 32069  grad_norm: 52.7803  loss: 15.1927  task0.loss_heatmap: 1.4274  task0.loss_bbox: 0.6234  task1.loss_heatmap: 2.1254  task1.loss_bbox: 0.6715  task2.loss_heatmap: 2.0836  task2.loss_bbox: 0.7248  task3.loss_heatmap: 1.9199  task3.loss_bbox: 0.6361  task4.loss_heatmap: 1.8900  task4.loss_bbox: 0.6479  task5.loss_heatmap: 1.7534  task5.loss_bbox: 0.6891
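
From these logs, lr / base_lr is roughly 0.10 at epoch 1, 0.47 at epoch 3, and 0.82 at epoch 5, so the 0.1 ratio is clearly not preserved.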

It looks like lr_mult only affects the initial learning rate. How can I make lr_mult take effect throughout the entire training process?

Additional information

I expect that, with lr_mult set, the learning rate of the image backbone should remain at 0.1 times the base learning rate for the entire training process.

AlphaPlusTT added the bug (Something isn't working) label on Nov 29, 2024