Relax RCP checking rules #206
I have a first version of the code locally. Here is what is supported:
RCP PRUNING ALGORITHM
Original RCPs and pruned RCPs are stored in separate dictionaries. We run the checker on the original RCPs, and if it fails we run it on the pruned RCPs. Once this is discussed, I will open a PR. RCP curves with the updated points that are pruned are here:
See PR #215
Support for pruning RCPs. (Issue #206)
This has been implemented, see PR #215. Resolving.
This work has been initiated after discussions related to: mlcommons/training_policies#451
We have found that some RCPs do not follow the expected trend of the RCP curves.
Epochs (or training samples) to converge should increase monotonically with batch size, and the rate of increase should also go up. There are quite a few violations of that rule in the current RCPs, and we have decided it is more efficient to have the checker be aware of, handle, and possibly accept these violations instead of spending all our time running the references for better RCPs.
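To make the rule concrete, here is a small sketch that flags RCPs breaking either part of it. This is not the actual checker code; representing the RCPs as a mapping from batch size to mean epochs to converge is an assumption for illustration.

```python
def rule_violations(rcps):
    """Return batch sizes whose RCPs break the expected trend.

    rcps: dict mapping batch size -> mean epochs to converge (assumed layout).
    """
    sizes = sorted(rcps)
    bad = set()
    # Monotonicity: epochs to converge must not decrease as batch size grows.
    for lo, hi in zip(sizes, sizes[1:]):
        if rcps[hi] < rcps[lo]:
            bad.add(hi)
    # Increasing rate: the slope between consecutive points must not shrink.
    slopes = [(rcps[hi] - rcps[lo]) / (hi - lo)
              for lo, hi in zip(sizes, sizes[1:])]
    for i in range(1, len(slopes)):
        if slopes[i] < slopes[i - 1]:
            bad.add(sizes[i + 1])
    return sorted(bad)
```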
I can see 3 approaches to this problem. As I am thinking about it right now, the best is (3), but I am not sure what I am going to do yet because the devil is in the details.
(1) Prune RCPs that violate the above rules.
Step 1: Find the RCP with the fastest convergence (minimum mean epochs to converge) and prune all RCPs with a lower batch size that have slower convergence.
Step 2: Get the RCP with the largest batch size and prune all RCPs with a smaller batch size that have slower convergence.
Step 3: Go through the remaining RCPs and remove the ones that have slower convergence than if they were interpolated.
Tentative algorithm for my own sanity.
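The three steps above could be sketched roughly like this. This is a hypothetical illustration, not the implemented checker; the data layout (batch size -> mean epochs) and all names are assumptions.

```python
def prune_rcps(rcps):
    """Split RCPs into (kept, pruned) following steps 1-3 above.

    rcps: dict mapping batch size -> mean epochs to converge.
    """
    kept = dict(sorted(rcps.items()))
    pruned = {}

    def move(bs):
        pruned[bs] = kept.pop(bs)

    # Step 1: no batch size below the fastest-converging RCP may be slower.
    fastest_bs = min(kept, key=kept.get)
    for bs in [b for b in kept if b < fastest_bs and kept[b] > kept[fastest_bs]]:
        move(bs)

    # Step 2: no smaller batch size may converge slower than the largest one.
    max_bs = max(kept)
    for bs in [b for b in kept if b < max_bs and kept[b] > kept[max_bs]]:
        move(bs)

    # Step 3: interior points must lie on or below the line between their
    # neighbors (the rate of increase should itself be increasing).
    changed = True
    while changed:
        changed = False
        sizes = sorted(kept)
        for lo, mid, hi in zip(sizes, sizes[1:], sizes[2:]):
            interp = kept[lo] + (kept[hi] - kept[lo]) * (mid - lo) / (hi - lo)
            if kept[mid] > interp:
                move(mid)
                changed = True
                break

    return kept, pruned
```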
(2) Do not change RCPs at all, but when there is an RCP failure, then run all possible interpolations between existing RCPs.
Tentative algorithm for my own sanity.
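As an illustration of approach (2), generating all interpolated reference points between pairs of existing RCPs that bracket a submission's batch size might look like the sketch below. The function name and data layout are assumptions, and the actual comparison against the submission's epochs is left out.

```python
def candidate_rcps(rcps, batch_size):
    """All interpolated mean-epochs values at `batch_size`, one per pair of
    existing RCPs that brackets it.

    rcps: dict mapping batch size -> mean epochs to converge (assumed layout).
    """
    sizes = sorted(rcps)
    out = []
    for i, lo in enumerate(sizes):
        for hi in sizes[i + 1:]:
            if lo <= batch_size <= hi:
                frac = (batch_size - lo) / (hi - lo)
                out.append(rcps[lo] + frac * (rcps[hi] - rcps[lo]))
    return out
```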
(3) Do the pruning, then run the convergence test on the non-pruned RCPs, and if it fails run it on the pruned RCPs. Keep track of where the submission passes the RCPs, so we can improve the RCPs in the future.
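A minimal sketch of the fallback flow in (3), assuming some existing `rcp_check` predicate is passed in; its name and signature, and the audit-log shape, are assumptions.

```python
def check_with_fallback(submission, kept_rcps, pruned_rcps, rcp_check, audit_log):
    """Check against the kept RCPs first; on failure, fall back to the pruned
    RCPs, recording where the submission passed so the references can be
    improved later."""
    if rcp_check(submission, kept_rcps):
        audit_log.append(("kept", submission))
        return True
    if rcp_check(submission, pruned_rcps):
        audit_log.append(("pruned", submission))
        return True
    audit_log.append(("failed", submission))
    return False
```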