
[UNET3D] Update RCPs due to !523 and !625 #1

Merged
17 commits merged on Jan 25, 2024

Conversation

mmarcinkiewicz (Owner) commented Apr 24, 2023

In mlcommons/training#523 the distributed sampler was enabled, which required switching from the `pad_last_batch` setting used previously to `drop_last_batch`. This changed `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly.
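
For illustration, here is a minimal sketch (not the benchmark code) of how the two settings affect `samples_per_epoch`. The training-set size of 168 is an assumption, chosen because it is consistent with the per-GBS figures below:

```python
# Minimal sketch (not the actual benchmark code): effective samples per epoch
# under the old pad_last_batch behaviour vs. the new drop_last_batch behaviour.
# The training-set size of 168 is an assumption consistent with the numbers below.
import math

def samples_per_epoch(dataset_size: int, gbs: int, drop_last: bool) -> int:
    if drop_last:
        # drop_last_batch: the trailing partial global batch is discarded
        return (dataset_size // gbs) * gbs
    # pad_last_batch: the last global batch is padded up to a full global batch
    return math.ceil(dataset_size / gbs) * gbs

for gbs in (32, 56, 64, 80, 84):
    padded  = samples_per_epoch(168, gbs, drop_last=False)  # previous behaviour
    dropped = samples_per_epoch(168, gbs, drop_last=True)   # after mlcommons/training#523
    print(f"GBS={gbs}: pad_last_batch={padded}, drop_last_batch={dropped}")
```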

Furthermore, in mlcommons/training#625 a bug was found in the implementation of the distributed sampler, which might have affected convergence.

I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate the changes. The change for GBS=32 and GBS=80 looks substantial, but keep in mind that the total number of samples processed is roughly the same, so the total time to train should remain roughly constant, as summarized in the table below:

| GBS | Old samples/epoch | New samples/epoch | Old RCP mean | New RCP mean | Ratio | Expected ratio |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 192 | 160 | 1974 | 2409 | 1.22 | 1.2 (192/160) |
| 56 | 168 | 168 | 2213 | 2300 | 1.04 | 1.00 |
| 64 | 192 | 128 | --- | 3270.77 | --- | 1.5 |
| 80 | 240 | 160 | 1618.75 | 2462.51 | 1.52 | 1.5 |
| 84 | --- | --- | 2233.25 | 2308.70 | 1.03 | 1.0 |

For GBS=64 there is no old RCP mean; the expected old RCP mean (new mean / 1.5) would be 2180.51.

The slight increase (3-4%) of the observed ratio over the expected ratio may or may not be attributable to the distributed sampler fix.
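
As a quick sanity check of the ratios above (a standalone sketch, not part of the RCP tooling): if the total number of samples needed to converge stays constant, the RCP mean in epochs should scale by old/new samples per epoch:

```python
# Sanity check: with a constant number of samples to converge, the RCP mean
# (epochs) should scale by old_samples_per_epoch / new_samples_per_epoch.
cases = {
    # GBS: (old samples/epoch, new samples/epoch, old RCP mean, new RCP mean)
    32: (192, 160, 1974, 2409),
    56: (168, 168, 2213, 2300),
    80: (240, 160, 1618.75, 2462.51),
}
for gbs, (old_spe, new_spe, old_mean, new_mean) in cases.items():
    print(f"GBS={gbs}: expected {old_spe / new_spe:.2f}, observed {new_mean / old_mean:.2f}")
```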

drcanchi commented May 4, 2023

Just to clarify, did the new RCPs include the bug fix for the distributed sampler seed as well (mlcommons/training#625)?

mmarcinkiewicz (Owner, Author) replied:
@drcanchi yes

itayhubara and others added 4 commits May 9, 2023 10:45
remove trained_samples key as we have train_samples key in the common.yaml
mmarcinkiewicz merged commit 18ee0bd into master on Jan 25, 2024