
[UNET3D] Update RCPs due to !523 and !625 #1

Merged
17 commits merged on Jan 25, 2024

Conversation

mmarcinkiewicz (Owner) commented Apr 24, 2023

In mlcommons/training#523 the distributed sampler was enabled, which required switching from the `pad_last_batch` setting used previously to `drop_last_batch`. This changed `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly.
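
For illustration, here is a minimal sketch (not the benchmark code) of how the two settings affect `samples_per_epoch`. The training-set size of 168 is an assumption, chosen because it is consistent with the per-GBS figures below:

```python
# Minimal sketch (not the actual benchmark code): effective samples per epoch
# under the old pad_last_batch behaviour vs. the new drop_last_batch behaviour.
# The training-set size of 168 is an assumption consistent with the numbers below.
import math

def samples_per_epoch(dataset_size: int, gbs: int, drop_last: bool) -> int:
    if drop_last:
        # drop_last_batch: the trailing partial global batch is discarded
        return (dataset_size // gbs) * gbs
    # pad_last_batch: the last global batch is padded up to a full global batch
    return math.ceil(dataset_size / gbs) * gbs

for gbs in (32, 56, 64, 80, 84):
    padded  = samples_per_epoch(168, gbs, drop_last=False)  # previous behaviour
    dropped = samples_per_epoch(168, gbs, drop_last=True)   # after mlcommons/training#523
    print(f"GBS={gbs}: pad_last_batch={padded}, drop_last_batch={dropped}")
```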

Furthermore, in mlcommons/training#625 a bug was found in the implementation of the distributed sampler, which might have affected convergence.

I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate the changes. The change for GBS=32 and GBS=80 looks substantial, but keep in mind that the total number of samples processed is roughly the same, so the total time to train should remain roughly constant, as summarized in the table below:

| GBS | Old samples/epoch | New samples/epoch | Old RCP mean | New RCP mean | Ratio | Expected ratio |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 192 | 160 | 1974 | 2409 | 1.22 | 1.2 (192/160) |
| 56 | 168 | 168 | 2213 | 2300 | 1.04 | 1.00 |
| 64 | 192 | 128 | --- | 3270.77 | --- | 1.5 |
| 80 | 240 | 160 | 1618.75 | 2462.51 | 1.52 | 1.5 |
| 84 | --- | --- | 2233.25 | 2308.70 | 1.03 | 1.0 |

For GBS=64 there is no old RCP mean; the expected old RCP mean (new mean / 1.5) would be 2180.51.

The slight increase (3-4%) of the observed ratio over the expected ratio may or may not be attributable to the distributed sampler fix.
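
As a quick sanity check of the ratios above (a standalone sketch, not part of the RCP tooling): if the total number of samples needed to converge stays constant, the RCP mean in epochs should scale by old/new samples per epoch:

```python
# Sanity check: with a constant number of samples to converge, the RCP mean
# (epochs) should scale by old_samples_per_epoch / new_samples_per_epoch.
cases = {
    # GBS: (old samples/epoch, new samples/epoch, old RCP mean, new RCP mean)
    32: (192, 160, 1974, 2409),
    56: (168, 168, 2213, 2300),
    80: (240, 160, 1618.75, 2462.51),
}
for gbs, (old_spe, new_spe, old_mean, new_mean) in cases.items():
    print(f"GBS={gbs}: expected {old_spe / new_spe:.2f}, observed {new_mean / old_mean:.2f}")
```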

drcanchi commented May 4, 2023

Just to clarify, did the new RCPs include the bug fix for the distributed sampler seed as well (mlcommons/training#625)?

mmarcinkiewicz (Owner, Author) replied:
@drcanchi yes

itayhubara and others added 4 commits May 9, 2023 10:45
remove trained_samples key as we have train_samples key in the common.yaml
mmarcinkiewicz merged commit 18ee0bd into master on Jan 25, 2024