
Relax RCP checking rules #206

Closed · emizan76 opened this issue Feb 17, 2022 · 3 comments

This work was initiated after discussions related to mlcommons/training_policies#451.

We have found that some RCPs do not follow the expected trend of the RCP curves.

Epochs (or training samples) to converge should increase monotonically with batch size, and the rate of increase should also go up. There are quite a few violations of that rule in the current RCPs, and we have decided it is more efficient to have the checker be aware of, handle, and possibly accept these violations instead of spending all our time re-running the references for better RCPs.
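To make the rule concrete, here is a minimal Python sketch (mine, not the checker's actual code) that flags RCPs breaking monotonicity or convexity, assuming each RCP is a (batch_size, mean_epochs_to_converge) pair:

def trend_violations(rcps):
    # Return indices of RCPs that break the expected trend.
    rcps = sorted(rcps)  # sort by batch size
    bad = set()
    # Epochs to converge must not decrease as batch size grows.
    for i in range(1, len(rcps)):
        if rcps[i][1] < rcps[i - 1][1]:
            bad.add(i)
    # The rate of increase must itself go up (convexity).
    for i in range(1, len(rcps) - 1):
        (b0, e0), (b1, e1), (b2, e2) = rcps[i - 1], rcps[i], rcps[i + 1]
        if (e1 - e0) / (b1 - b0) > (e2 - e1) / (b2 - b1):
            bad.add(i)
    return sorted(bad)

# Example: the 512-batch point converges faster than the 256-batch one.
print(trend_violations([(256, 10.0), (512, 9.5), (1024, 12.0)]))  # -> [1]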

I can see three approaches to this problem. As I think about it right now, the best is (3), but I am not sure what I am going to do yet because the devil is in the details.

(1) Prune RCPs that violate the above rules.
Step 1: Find the RCP with the fastest convergence (minimum mean epochs to converge) and prune all RCPs with a lower batch size that have slower convergence.
Step 2: Take the RCP with the largest batch size and prune all RCPs with a smaller batch size that have slower convergence.
Step 3: Go through the remaining RCPs and remove the ones that converge more slowly than the interpolation of their neighbors. Tentative algorithm for my own sanity (a Python sketch follows the pseudocode):

i = 1
while i <= N - 2:
  if RCP[i+1] has slower convergence than interpolation(RCP[i], RCP[i+2]):
    remove RCP[i+1]; N = N - 1   // stay at the same i and re-check
  else:
    i = i + 1
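As a sanity check on approach (1), here is a hedged Python sketch of steps 1 and 3 (step 2 is the mirror image of step 1 from the largest-batch end), again assuming (batch_size, mean_epochs) pairs; interpolate() and prune() are my names, not the checker's:

def interpolate(p, q, batch_size):
    # Linearly interpolate mean epochs at batch_size between RCPs p and q.
    (b0, e0), (b1, e1) = p, q
    return e0 + (e1 - e0) * (batch_size - b0) / (b1 - b0)

def prune(rcps):
    rcps = sorted(rcps)  # sort by batch size
    # Step 1: below the fastest-converging RCP, drop anything slower.
    fastest = min(range(len(rcps)), key=lambda i: rcps[i][1])
    min_epochs = rcps[fastest][1]
    rcps = [p for i, p in enumerate(rcps) if i >= fastest or p[1] <= min_epochs]
    # Step 3: drop interior RCPs that converge more slowly than the
    # straight line through their neighbors (a lower convex hull).
    i = 0
    while i + 2 < len(rcps):
        b_mid, e_mid = rcps[i + 1]
        if e_mid > interpolate(rcps[i], rcps[i + 2], b_mid):
            del rcps[i + 1]
            i = max(0, i - 1)  # neighbors changed; re-check the earlier point
        else:
            i += 1
    return rcps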

(2) Do not change the RCPs at all, but when there is an RCP failure, run all possible interpolations between existing RCPs.
Tentative algorithm for my own sanity (a Python sketch follows the pseudocode):

if RCP[i] fails:
  best_convergence = min(RCP[i], interpolation(RCP[i-1], RCP[i+1]))
  for prev = i-1 .. 1:
    for next = i+1 .. N:
      if interpolation(RCP[prev], RCP[next]) < best_convergence:
        best_convergence = interpolation(RCP[prev], RCP[next])
  if submission_convergence >= best_convergence:
    Passed  // also keep track of best_convergence so we can improve RCPs in the future
  else:
    Failed
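Under the same assumptions as the earlier sketch (and reusing its hypothetical interpolate() helper), approach (2) could look like this in Python:

def best_reference_epochs(rcps, i):
    # Fastest convergence obtainable from RCP[i] itself or from any
    # interpolation across it, evaluated at RCP[i]'s batch size.
    batch, best = rcps[i]
    for prev in range(i - 1, -1, -1):
        for nxt in range(i + 1, len(rcps)):
            best = min(best, interpolate(rcps[prev], rcps[nxt], batch))
    return best

def rcp_test(submission_mean_epochs, rcps, i):
    # Pass if the submission converges no faster than the best reference;
    # best_reference_epochs is also worth logging to improve RCPs later.
    return submission_mean_epochs >= best_reference_epochs(rcps, i)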

(3) Do the pruning, then run the convergence test on the non-pruned RCPs, and if it fails run it on the pruned RCPs. Keep track of where the submission passes the RCPs, so we can improve them in the future. A sketch of this flow follows.
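A possible driver for approach (3); check_against() stands in for the existing RCP convergence test and is an assumption, not the checker's real API:

def rcp_check(submission, original_rcps, pruned_rcps):
    # Try the full RCP set first; fall back to the pruned set on failure.
    if check_against(submission, original_rcps):
        return "passed on original RCPs"
    if check_against(submission, pruned_rcps):
        # Worth logging: passing only here suggests the original RCPs
        # around this batch size should be improved.
        return "passed on pruned RCPs"
    return "failed"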

emizan76 self-assigned this on Feb 17, 2022
emizan76 (Contributor, Author) commented Mar 9, 2022

I have a first version of the code locally.

Here is what is supported:

RCP PRUNING ALGORITHM
Step 1: Find the RCP with the fastest convergence (minimum mean epochs to converge) and prune all RCPs with a lower batch size that have slower convergence.
Step 2: Go through the remaining RCPs and prune the ones that converge more slowly than the interpolation of their neighbors. (Step 2 from the original proposal was not needed.) Pseudocode:

i = 1
while i <= N - 2:
  if RCP[i+1] has slower convergence than interpolation(RCP[i], RCP[i+2]):
    remove RCP[i+1]; N = N - 1   // stay at the same i and re-check
  else:
    i = i + 1

Original RCPs and pruned RCPs are stored in separate dictionaries.

We run the checker on the original RCPs, and if that fails we run it on the pruned RCPs.
We could also easily eliminate the run on the original RCPs altogether.

Once this is discussed, I will open a PR.

The RCP curves, with the updated points that get pruned, are here:
https://docs.google.com/spreadsheets/d/1ZYoTo5C8RIqbwfK7Fvs1h8mMJl_XD7da494QOMjnXUY/edit#gid=0

emizan76 (Contributor, Author) commented

See PR #215

emizan76 added a commit that referenced this issue Apr 8, 2022
Support for pruning RCPs. (Issue #206)
emizan76 (Contributor, Author) commented

This has been implemented; see PR #215. Resolving.
