support partially allocated jobs across scheduler reload #6445
I think this would work, but as I look over it I find myself wondering if we might be better off either with a list of things that have been released, or actually leaving this out of hello entirely and using a partial release after the hello response from the scheduler. The main reason is that this approach sends what is allocated as R, then we have to work out what to release by taking the difference between what is allocated and what needs to be removed before we can act on it. If instead the released portion is sent, either in hello or as a partial release, we can reuse the same code we already have for partial release without introducing a new code path into what's already proven somewhat complex or fragile. That said, working through this as I write, it looks like the job-manager has the allocated resources in this format as pending. That makes me wonder if we should send that |
Duh! Why didn't I think of that? Let me explore that one and see how it goes. In general, I'm in favor of making the job manager offload work from scheduler(s). It seems like your idea would make it "just work" without any changes.
No, in the current proposal we are just sending the allocated idset (ranks) and some basic job metadata like the id in the RPC responses, then libschedutil looks up the job's original R (including JGF) from the KVS and passes it to the scheduler in its callback. |
OK, actually, it's quite the pain to send a free after the hello is finished, because the scheduler has to send the Would sending a |
Could libschedutil do the work of creating the partial R from the original R and Edit: (sorry if this is a naive suggestion, I haven't gone back to look at the actual implementation before suggesting it) |
Oh good idea, that makes sense to me. That gets around the fact that the R fragment in the job manager is missing the JGF. |
Yeah I think having the "free" idset and possibly also doing as @grondo suggested and having schedutil do the translation into a free call would work really well if it's not too difficult to factor that way. |
Great! I'll push a rework of this PR with those changes shortly. |
ok, pushed those changes. This is built on top of #6450 currently - will rebase when that gets merged. The RFC PR will need a small rework for |
Problem: RFC 27 allows the scheduler to send a partial-ok flag in the hello request, and then receive partially allocated jobs in hello responses. If the hello request includes this flag, pass it on to housekeeping. For each partially released housekeeping job, include the 'free' idset in the response per RFC 27.
Problem: libschedutil provides no way for the scheduler to indicate that the partial-ok flag should be set in the hello request. Add the SCHEDUTIL_HELLO_PARTIAL_OK flag which is passed to schedutil_create().
Problem: when processing hello responses, all schedulers now need to process R - free for partial releases. As a convenience, change the libschedutil hello callback to subtract the free idset from the R it fetched from the KVS. Note that the scheduling key, if present, remains the full object which is opaque to flux-core.
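As a rough illustration of the "R - free" step described in this commit message, the sketch below models a set of ranks as a 64-bit mask and removes the freed ranks from it. The rankset_t type and the rankset_* helpers are hypothetical stand-ins invented for this example; they are not the flux-core idset or rlist API, and the real callback operates on a full R object rather than a bare rank set.

```c
#include <stdint.h>

/* Hypothetical stand-in for an idset: bit i set means rank i is a member.
 * Limited to ranks 0-63; the real library types are not used here. */
typedef uint64_t rankset_t;

/* Build the set {lo, ..., hi}. Requires lo <= hi <= 63. */
static rankset_t rankset_range (unsigned lo, unsigned hi)
{
    rankset_t upper = (hi == 63) ? UINT64_MAX
                                 : ((UINT64_C(1) << (hi + 1)) - 1);
    rankset_t lower = (UINT64_C(1) << lo) - 1;
    return upper & ~lower;
}

/* Return the ranks of r that are not in free_ranks (that is, R - free). */
static rankset_t rankset_subtract (rankset_t r, rankset_t free_ranks)
{
    return r & ~free_ranks;
}

/* Count the members of a set. */
static unsigned rankset_count (rankset_t s)
{
    unsigned n = 0;
    while (s) {
        s &= s - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For a job whose R covers ranks 0-3 with ranks 1-2 already released by housekeeping, rankset_subtract (rankset_range (0, 3), rankset_range (1, 2)) leaves ranks 0 and 3, which is what the scheduler would be asked to re-allocate.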
Problem: sched-simple does not support partial hello responses. Set the SCHEDUTIL_HELLO_PARTIAL_OK flag. Add a 'test-hello-nopartial' module option to get the old behavior. Set test-hello-nopartial in the current test of partial housekeeping release.
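One way the module option described in this commit message could map onto the flag is sketched below. The function name hello_flags_from_args and the constant SKETCH_HELLO_PARTIAL_OK (including its numeric value) are assumptions made for the example; only the option name test-hello-nopartial comes from the commit message, and the actual sched-simple argument parsing may differ.

```c
#include <string.h>

/* Hypothetical flag value used only in this sketch; it stands in for
 * SCHEDUTIL_HELLO_PARTIAL_OK, whose real value is not assumed here. */
enum { SKETCH_HELLO_PARTIAL_OK = 2 };

/* Decide which flags to pass when creating the schedutil handle.
 * Partial hello support is on by default and is switched off when the
 * module is loaded with the 'test-hello-nopartial' option. */
static int hello_flags_from_args (int argc, char **argv)
{
    int flags = SKETCH_HELLO_PARTIAL_OK;

    for (int i = 0; i < argc; i++) {
        if (strcmp (argv[i], "test-hello-nopartial") == 0)
            flags &= ~SKETCH_HELLO_PARTIAL_OK;
    }
    return flags;
}
```

Keeping the old behavior behind a test-only option lets the existing partial housekeeping release test continue to exercise the non-partial path.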
Problem: there is no coverage of reloading the scheduler with partially released jobs in housekeeping. Add a test.
Problem: when the hello protocol cannot process a job, it logs the name of the wrong rlist function. Make the log message a little more high level.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6445      +/-   ##
==========================================
+ Coverage   83.60%   83.63%   +0.03%
==========================================
  Files         523      523
  Lines       87505    87559      +54
==========================================
+ Hits        73156    73233      +77
+ Misses      14349    14326      -23
flux-framework/rfc#433 was updated to specify the (optional) Then schedutil was modified so that the The current behavior is preserved unless the scheduler sets the SCHEDUTIL_HELLO_PARTIAL_OK flag. |
This is a proof of concept implementation of the changes to the scheduler hello protocol proposed in flux-framework/rfc#433, in which support is added for reloading the scheduler with housekeeping running and some nodes of job(s) already released.
The scheduler indicates it supports this by calling schedutil_create() with the SCHEDUTIL_HELLO_PARTIAL_OK flag. When it sets that, it agrees to parse an optional allocated key in each hello response. The allocated key is set to an idset representing the subset of ranks of R that are actually allocated. If missing, all ranks are assumed to be allocated.

We need to get some feedback from @milroy, @trws, et al to make sure this approach works for fluxion. I thought working through this with sched-simple would be helpful to illustrate the idea.
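The defaulting rule for the optional allocated key in the description above can be sketched as follows. The struct hello_job type and the hello_job_allocated function are hypothetical, rank sets are modeled as 64-bit masks rather than real idsets, and rejecting an allocated set that names ranks outside R is an assumption of this sketch, not something the description specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one hello response, with rank sets as bitmasks
 * (bit i set means rank i is a member). */
struct hello_job {
    uint64_t r_ranks;     /* all ranks named in the job's R */
    bool has_allocated;   /* was the optional 'allocated' key present? */
    uint64_t allocated;   /* only meaningful when has_allocated is true */
};

/* Store the ranks that are still allocated in *out and return 0.
 * Returns -1 if 'allocated' names a rank that is not in R
 * (a sanity check assumed for this sketch). */
static int hello_job_allocated (const struct hello_job *job, uint64_t *out)
{
    if (!job->has_allocated) {
        *out = job->r_ranks;    /* key missing: all ranks are allocated */
        return 0;
    }
    if (job->allocated & ~job->r_ranks)
        return -1;              /* not a subset of R */
    *out = job->allocated;
    return 0;
}
```

A scheduler that does not set the flag never sees the key, so it always takes the first branch and treats the whole of R as allocated.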