
[Bug] Strange behavior of the lr_mult #1612

Open · 2 tasks done
AlphaPlusTT opened this issue Nov 29, 2024 · 0 comments
Labels: bug (Something isn't working)

Prerequisite

Environment

mmcv==2.1.0
mmdet==3.3.0
mmdet3d==1.4.0
mmengine==0.10.5

Reproduces the problem - code sample

During training, I want the learning rate of the img_backbone to remain at 0.1 times the base learning rate, so I set the following in the configuration file:

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.1),
        }
    )
)
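
As a minimal sketch of what this paramwise_cfg does at build time (assuming a toy model with an img_backbone submodule and lr = 1e-4, both of which are illustrative assumptions), the resulting per-group learning rates can be inspected outside the runner with mmengine's build_optim_wrapper:

# Minimal sketch: build the optim_wrapper above standalone and inspect the
# per-group learning rates. ToyModel and lr = 1e-4 are assumed for illustration.
import torch.nn as nn
from mmengine.optim import build_optim_wrapper

lr = 1e-4  # assumed base learning rate

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_backbone = nn.Linear(4, 4)  # matched by the 'img_backbone' custom key
        self.head = nn.Linear(4, 2)          # falls back to the base learning rate

optim_wrapper_cfg = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(custom_keys={'img_backbone': dict(lr_mult=0.1)}))

optim_wrapper = build_optim_wrapper(ToyModel(), optim_wrapper_cfg)
# Expected: two distinct initial lrs, lr * 0.1 for the img_backbone groups and
# lr for everything else, i.e. lr_mult is baked into each group's initial lr.
print(sorted({group['lr'] for group in optim_wrapper.optimizer.param_groups}))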

The param_scheduler is set as follows:

param_scheduler = [
    # learning rate scheduler
    # During the first 8 epochs, learning rate increases from lr to lr * 100
    # during the next 12 epochs, learning rate decreases from lr * 100 to lr
    dict(
        type='CosineAnnealingLR',
        T_max=8,
        eta_min=lr * 100,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=12,
        eta_min=lr,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True),
    # momentum scheduler
    # During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
    # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
    dict(
        type='CosineAnnealingMomentum',
        T_max=8,
        eta_min=0.85 / 0.95,
        begin=0,
        end=8,
        by_epoch=True,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingMomentum',
        T_max=12,
        eta_min=1,
        begin=8,
        end=20,
        by_epoch=True,
        convert_to_iter_based=True)
]
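
For illustration, here is a minimal plain-PyTorch sketch of the first scheduler above: two parameter groups start at lr and 0.1 * lr and are both annealed toward the same eta_min = lr * 100. Note that torch.optim.lr_scheduler.CosineAnnealingLR is used here as a stand-in for mmengine's scheduler, the toy parameters and lr = 1e-4 are assumptions, and the sketch does not attempt to reproduce the exact numbers in the logs below.

# Plain-PyTorch sketch: two parameter groups at lr and 0.1 * lr, both annealed
# toward the same eta_min = lr * 100 (torch's CosineAnnealingLR stands in for
# mmengine's scheduler; toy parameters and lr = 1e-4 are assumptions).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

lr = 1e-4  # assumed base learning rate
p_head = torch.nn.Parameter(torch.zeros(1))
p_backbone = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([
    {'params': [p_head], 'lr': lr},            # group at the base lr
    {'params': [p_backbone], 'lr': lr * 0.1},  # group with lr_mult=0.1 applied
])
scheduler = CosineAnnealingLR(optimizer, T_max=8, eta_min=lr * 100)

for epoch in range(8):
    head_lr, backbone_lr = (g['lr'] for g in optimizer.param_groups)
    print(f'epoch {epoch}: backbone_lr / head_lr = {backbone_lr / head_lr:.3f}')
    optimizer.step()
    scheduler.step()
# The ratio starts at 0.100 and approaches 1.000 as both groups are annealed
# toward the shared eta_min; a fixed 0.1 ratio is not preserved by the scheduler.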

Reproduces the problem - command or script

Same as the code sample above.

Reproduces the problem - error message

At the beginning, the learning rate of img_backbone is indeed 0.1 times the base learning rate:

2024/11/21 21:40:30 - mmengine - INFO - Epoch(train)  [1][ 100/3517]  base_lr: 5.0005e-05 lr: 5.0060e-06  eta: 20:29:23  time: 0.9889  data_time: 0.0563  memory: 32041  grad_norm: 52825.8107  loss: 5568.1427  task0.loss_heatmap: 49.9860  task0.loss_bbox: 0.9099  task1.loss_heatmap: 596.6443  task1.loss_bbox: 1.1417  task2.loss_heatmap: 2504.7168  task2.loss_bbox: 1.5418  task3.loss_heatmap: 620.9393  task3.loss_bbox: 0.8771  task4.loss_heatmap: 1612.4171  task4.loss_bbox: 0.9113  task5.loss_heatmap: 177.1250  task5.loss_bbox: 0.9324

However, the learning rate of img_backbone slowly caught up with the base learning rate as training progressed:

2024/11/21 23:45:30 - mmengine - INFO - Epoch(train)  [3][ 100/3517]  base_lr: 7.2556e-05 lr: 3.4323e-05  eta: 18:42:58  time: 1.0738  data_time: 0.0613  memory: 32031  grad_norm: 64.6705  loss: 14.8487  task0.loss_heatmap: 1.4181  task0.loss_bbox: 0.6505  task1.loss_heatmap: 2.0847  task1.loss_bbox: 0.7157  task2.loss_heatmap: 2.0074  task2.loss_bbox: 0.7194  task3.loss_heatmap: 1.4966  task3.loss_bbox: 0.5754  task4.loss_heatmap: 1.9814  task4.loss_bbox: 0.6894  task5.loss_heatmap: 1.8084  task5.loss_bbox: 0.7016
...
...
2024/11/22 01:50:03 - mmengine - INFO - Epoch(train)  [5][ 100/3517]  base_lr: 1.2583e-04 lr: 1.0358e-04  eta: 16:36:18  time: 1.0527  data_time: 0.0568  memory: 32069  grad_norm: 52.7803  loss: 15.1927  task0.loss_heatmap: 1.4274  task0.loss_bbox: 0.6234  task1.loss_heatmap: 2.1254  task1.loss_bbox: 0.6715  task2.loss_heatmap: 2.0836  task2.loss_bbox: 0.7248  task3.loss_heatmap: 1.9199  task3.loss_bbox: 0.6361  task4.loss_heatmap: 1.8900  task4.loss_bbox: 0.6479  task5.loss_heatmap: 1.7534  task5.loss_bbox: 0.6891
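
From these logs, lr / base_lr is roughly 0.10 at epoch 1, 0.47 at epoch 3, and 0.82 at epoch 5, so the 0.1 ratio is clearly not preserved.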

It looks like lr_mult only affects the initial learning rate. How can I make lr_mult take effect throughout the entire training process?

Additional information

I expect that, with lr_mult set, the learning rate of the image backbone should remain at 0.1 times the base learning rate for the entire training process.

AlphaPlusTT added the bug (Something isn't working) label on Nov 29, 2024