forked from mlcommons/logging
[UNET3D] Update RCPs due to !523 and !625 #1
Merged
Conversation
_ALL_RESULT_FILE_COUNTS and _ALL_ALLOWED_BENCHMARKS updated to reflect new benchmarks (DLRM_DCNv2 and GPT3). File counts for DLRM_DCNv2 set to 10, GPT3 set to 3.
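Purely as an illustration, the change might look roughly like the sketch below; the constant names come from the commit message, while the data structures and the omitted entries are assumptions, not the actual contents of mlcommons/logging.

```python
# Sketch of the constants update (names from the commit message; the exact
# shapes and the elided entries are assumptions, not the real file contents).

_ALL_ALLOWED_BENCHMARKS = [
    # ... previously allowed benchmarks ...
    "dlrm_dcnv2",  # new benchmark this round
    "gpt3",        # new benchmark this round
]

# Expected number of result files per benchmark submission.
_ALL_RESULT_FILE_COUNTS = {
    # ... previously allowed benchmarks ...
    "dlrm_dcnv2": 10,
    "gpt3": 3,
}
```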
[gpt3] add constants, compliance rules and initial RCP data
Only read the xlsx config when the flag is passed; reading it unconditionally was making the UI crash because the config file was not there.
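The fix amounts to a guard along these lines; this is only a sketch, assuming an argparse-style flag and a pandas xlsx reader, neither of which is taken from the actual mlcommons/logging code.

```python
import argparse

import pandas as pd  # assumes the xlsx config is read with pandas

parser = argparse.ArgumentParser()
parser.add_argument(
    "--xlsx-config",
    default=None,
    help="Optional path to the xlsx config file (hypothetical flag name).",
)
args = parser.parse_args()

# Read the xlsx config only when the flag is actually passed, so the UI
# no longer crashes when the file is absent.
xlsx_config = None
if args.xlsx_config is not None:
    xlsx_config = pd.read_excel(args.xlsx_config)
```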
Just to clarify, did the new RCPs include the bug fix for the distributed sampler seed as well? (mlcommons/training#625)
@drcanchi yes
remove the trained_samples key, as we already have the train_samples key in common.yaml
In mlcommons/training#523 the distributed sampler was enabled, which required switching to the `drop_last_batch` setting from the `pad_last_batch` setting used previously. This changed `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly. Furthermore, in mlcommons/training#625 a bug was found in the implementation of the distributed sampler, which might have affected convergence. I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate these changes. The change for GBS=32 and GBS=80 seems substantial, but keep in mind that the total number of samples is more or less the same, so the total time to train should remain roughly constant:

| GBS | old samples/epoch | new samples/epoch | old RCP mean | new RCP mean | ratio | expected ratio |
|-----|-------------------|-------------------|--------------|--------------|-------|----------------|
| 32  | 192 | 160 | 1974 | 2409 | 1.22 | 1.2 (192/160) |
| 56  | 168 | 168 | 2213 | 2300 | 1.04 | 1.00 |
| 64  | 192 | 128 | --- (expected: 2180.51) | 3270.77 | --- | 1.5 |
| 80  | 240 | 160 | 1618.75 | 2462.51 | 1.52 | 1.5 |
| 84  |     |     | 2233.25 | 2308.70 | 1.03 | 1.0 |

The slight increase (3-4%) in the ratio might or might not be attributable to the fix to the distributed sampler.
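The expected ratios follow directly from how `pad_last_batch` (pad the final batch up to a full GBS) and `drop_last_batch` (discard the final partial batch) determine `samples_per_epoch`. Below is a minimal sketch of that arithmetic; the 168-sample training-set size and the reading of the RCP mean as an epoch count are assumptions inferred from the numbers above, not stated in the PR.

```python
# Sketch: how drop_last_batch vs. pad_last_batch changes samples_per_epoch,
# and the RCP-mean ratio that change implies. The 168-sample training set is
# an assumption inferred from the GBS numbers in this PR.

DATASET_SIZE = 168

def samples_per_epoch(gbs: int, drop_last: bool) -> int:
    """Samples seen per epoch for a given global batch size (GBS)."""
    if drop_last:
        # drop_last_batch: the final partial batch is discarded.
        return (DATASET_SIZE // gbs) * gbs
    # pad_last_batch: the final batch is padded up to a full GBS.
    return -(-DATASET_SIZE // gbs) * gbs  # ceiling division

for gbs in (32, 56, 64, 80, 84):
    old = samples_per_epoch(gbs, drop_last=False)  # previous behaviour
    new = samples_per_epoch(gbs, drop_last=True)   # behaviour after #523
    # If the total number of samples needed to train stays constant, the RCP
    # mean (in epochs) is expected to scale by old / new.
    print(f"GBS={gbs}: {old} -> {new} samples/epoch, expected ratio {old / new:.2f}")
```

Under these assumptions the script reproduces the expected ratios in the table above (1.2, 1.0, 1.5, 1.5, 1.0).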
pgmpablo157321 force-pushed the unet3d-rcp-fix-v3.0 branch from 6ab1078 to 12d92a6 on May 9, 2023 at 17:46.