
Relax RCP checking rules #206

Closed · emizan76 opened this issue Feb 17, 2022 · 3 comments

This work was initiated after discussions related to mlcommons/training_policies#451.

We have found that some RCPs do not follow the expected trend of the RCP curves.

Epochs (or training samples) to converge should increase monotonically with batch size, and the rate of increase should also go up. There are quite a few violations of that rule in the current RCPs, and we have decided it is more efficient to have the checker be aware of, handle, and possibly accept these violations instead of spending all our time re-running the references for better RCPs.
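To make the rule concrete, here is a minimal Python sketch (mine, not the checker's actual code) that flags RCPs breaking monotonicity or convexity, assuming each RCP is a (batch_size, mean_epochs_to_converge) pair:

def trend_violations(rcps):
    # Return indices of RCPs that break the expected trend.
    rcps = sorted(rcps)  # sort by batch size
    bad = set()
    # Epochs to converge must not decrease as batch size grows.
    for i in range(1, len(rcps)):
        if rcps[i][1] < rcps[i - 1][1]:
            bad.add(i)
    # The rate of increase must itself go up (convexity).
    for i in range(1, len(rcps) - 1):
        (b0, e0), (b1, e1), (b2, e2) = rcps[i - 1], rcps[i], rcps[i + 1]
        if (e1 - e0) / (b1 - b0) > (e2 - e1) / (b2 - b1):
            bad.add(i)
    return sorted(bad)

# Example: the 512-batch point converges faster than the 256-batch one.
print(trend_violations([(256, 10.0), (512, 9.5), (1024, 12.0)]))  # -> [1]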

I can see three approaches to this problem. As I think about it right now, the best is (3), but I am not sure what I am going to do yet because the devil is in the details.

(1) Prune RCPs that violate the above rules.
Step 1: Find the RCP with the fastest convergence (minimum mean epochs to converge) and prune all RCPs with a lower batch size that have slower convergence.
Step 2: Take the RCP with the largest batch size and prune all RCPs with a smaller batch size that have slower convergence.
Step 3: Go through the remaining RCPs and remove the ones that converge more slowly than the interpolation of their neighbors. Tentative algorithm for my own sanity (a Python sketch follows the pseudocode):

i = 1
while i <= N - 2:
  if RCP[i+1] has slower convergence than interpolation(RCP[i], RCP[i+2]):
    remove RCP[i+1]; N = N - 1   // stay at the same i and re-check
  else:
    i = i + 1
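As a sanity check on approach (1), here is a hedged Python sketch of steps 1 and 3 (step 2 is the mirror image of step 1 from the largest-batch end), again assuming (batch_size, mean_epochs) pairs; interpolate() and prune() are my names, not the checker's:

def interpolate(p, q, batch_size):
    # Linearly interpolate mean epochs at batch_size between RCPs p and q.
    (b0, e0), (b1, e1) = p, q
    return e0 + (e1 - e0) * (batch_size - b0) / (b1 - b0)

def prune(rcps):
    rcps = sorted(rcps)  # sort by batch size
    # Step 1: below the fastest-converging RCP, drop anything slower.
    fastest = min(range(len(rcps)), key=lambda i: rcps[i][1])
    min_epochs = rcps[fastest][1]
    rcps = [p for i, p in enumerate(rcps) if i >= fastest or p[1] <= min_epochs]
    # Step 3: drop interior RCPs that converge more slowly than the
    # straight line through their neighbors (a lower convex hull).
    i = 0
    while i + 2 < len(rcps):
        b_mid, e_mid = rcps[i + 1]
        if e_mid > interpolate(rcps[i], rcps[i + 2], b_mid):
            del rcps[i + 1]
            i = max(0, i - 1)  # neighbors changed; re-check the earlier point
        else:
            i += 1
    return rcps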

(2) Do not change the RCPs at all, but when there is an RCP failure, run all possible interpolations between existing RCPs.
Tentative algorithm for my own sanity (a Python sketch follows the pseudocode):

if RCP[i] fails:
  best_convergence = min(RCP[i], interpolation(RCP[i-1], RCP[i+1]))
  for prev = i-1 .. 1:
    for next = i+1 .. N:
      if interpolation(RCP[prev], RCP[next]) < best_convergence:
        best_convergence = interpolation(RCP[prev], RCP[next])
  if submission_convergence >= best_convergence:
    Passed  // also keep track of best_convergence so we can improve RCPs in the future
  else:
    Failed
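Under the same assumptions as the earlier sketch (and reusing its hypothetical interpolate() helper), approach (2) could look like this in Python:

def best_reference_epochs(rcps, i):
    # Fastest convergence obtainable from RCP[i] itself or from any
    # interpolation across it, evaluated at RCP[i]'s batch size.
    batch, best = rcps[i]
    for prev in range(i - 1, -1, -1):
        for nxt in range(i + 1, len(rcps)):
            best = min(best, interpolate(rcps[prev], rcps[nxt], batch))
    return best

def rcp_test(submission_mean_epochs, rcps, i):
    # Pass if the submission converges no faster than the best reference;
    # best_reference_epochs is also worth logging to improve RCPs later.
    return submission_mean_epochs >= best_reference_epochs(rcps, i)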

(3) Do the pruning, then run the convergence test on the non-pruned RCPs, and if it fails run it on the pruned RCPs. Keep track of where the submission passes the RCPs, so we can improve them in the future. A sketch of this flow follows.
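A possible driver for approach (3); check_against() stands in for the existing RCP convergence test and is an assumption, not the checker's real API:

def rcp_check(submission, original_rcps, pruned_rcps):
    # Try the full RCP set first; fall back to the pruned set on failure.
    if check_against(submission, original_rcps):
        return "passed on original RCPs"
    if check_against(submission, pruned_rcps):
        # Worth logging: passing only here suggests the original RCPs
        # around this batch size should be improved.
        return "passed on pruned RCPs"
    return "failed"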

emizan76 self-assigned this on Feb 17, 2022
emizan76 (Contributor, Author) commented Mar 9, 2022

I have a first version of the code locally.

Here is what is supported:

RCP PRUNING ALGORITHM
Step 1: Find the RCP with the fastest convergence (minimum mean epochs to converge) and prune all RCPs with a lower batch size that have slower convergence.
Step 2: Go through the remaining RCPs and prune the ones that converge more slowly than the interpolation of their neighbors. (Step 2 from the original proposal was not needed.) Pseudocode:

i = 1
while i <= N - 2:
  if RCP[i+1] has slower convergence than interpolation(RCP[i], RCP[i+2]):
    remove RCP[i+1]; N = N - 1   // stay at the same i and re-check
  else:
    i = i + 1

Original RCPs and pruned RCPs are stored in separate dictionaries.

We run the checker on the original RCPs, and if that fails we run it on the pruned RCPs.
We could also easily eliminate the run on the original RCPs altogether.

Once this is discussed, I will open a PR.

The RCP curves, with the updated points that get pruned, are here:
https://docs.google.com/spreadsheets/d/1ZYoTo5C8RIqbwfK7Fvs1h8mMJl_XD7da494QOMjnXUY/edit#gid=0

emizan76 (Contributor, Author) commented

See PR #215

emizan76 added a commit that referenced this issue Apr 8, 2022
Support for pruning RCPs. (Issue #206)
emizan76 (Contributor, Author) commented

This has been implemented; see PR #215. Resolving.
