content/content-sqlite: what to do on ENOSPC #6236
So after our offline discussion I went into the code to review how some of this works, because everyone had forgotten. This is a bit of a scattering of notes; I admit I'm having a little trouble organizing my thoughts and figuring out the best way to proceed, so I think writing it out first and starting a discussion might get things moving. Current notes on what is going on: …
Some thoughts on issues/ideas: …
Sooooo ... some thought is needed here in general. I haven't really organized / digested all the complexities here yet. Some notes to self on possible bugs I found while looking into this: [1] it is possibly a bug that …
Maybe more things later.
This is by design - it is a "write back cache" and the only guarantee after a KVS commit completes is that the data is available from any other rank (subject to eventually consistent semantics, or explicit synchronization). Agreed, we can't fail the commit because it's not connected to the backing store operation. We can return errors to a cache flush. A checkpoint consisting of 1) flush the cache, 2) write the root ref to the backing store should stop at 1) if it gets ENOSPC, otherwise - dangling blobrefs and sadness.
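To make that ordering concrete, here is a minimal sketch (not actual flux-core code), assuming a content.flush RPC and a hypothetical checkpoint_put() helper standing in for whatever records the root ref in the backing store:

```c
#include <flux/core.h>

int checkpoint_put (flux_t *h, const char *rootref); /* hypothetical */

int safe_checkpoint (flux_t *h, const char *rootref)
{
    /* 1) flush dirty cache entries to the backing store */
    flux_future_t *f = flux_rpc (h, "content.flush", NULL, FLUX_NODEID_ANY, 0);
    if (!f)
        return -1;
    if (flux_rpc_get (f, NULL) < 0) {   /* e.g. ENOSPC surfaces here */
        flux_future_destroy (f);
        return -1;                      /* stop: do NOT write the root ref */
    }
    flux_future_destroy (f);

    /* 2) only now is it safe to record the root ref as a checkpoint */
    return checkpoint_put (h, rootref);
}
```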
Idea: what if on ENOSPC on a store, the cache stopped trying to flush dirty pages and told the KVS to go read-only? That would potentially leave the instance in a state where the KVS could be manually dumped later on, but prevent more data from piling up in the cache. IOW KVS commits would be refused with EROFS or a similar error. Need to chew on that idea a bit to figure out what is wrong with it.
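A sketch of what "go read-only" might look like on the KVS commit path, with invented names (no such flag exists in flux-core today); the point is just that commit requests get refused with EROFS instead of piling more dirty data into the cache:

```c
#include <errno.h>
#include <stdbool.h>
#include <flux/core.h>

/* Hypothetical flag, set when the content cache reports ENOSPC on a store. */
static bool content_readonly = false;

/* Hypothetical commit request handler: refuse new commits while read-only. */
static void commit_request_cb (flux_t *h, flux_msg_handler_t *mh,
                               const flux_msg_t *msg, void *arg)
{
    if (content_readonly) {
        if (flux_respond_error (h, msg, EROFS,
                                "KVS is read-only (backing store is full)") < 0)
            flux_log_error (h, "error responding to commit request");
        return;
    }
    /* ... normal commit processing would go here ... */
}
```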
Yeah, agreed. I think the lack of a …
I think that's a good idea. Similar to my "backing module failed" idea above. The issue is how to inform the KVS of this fact. We could simply do it on all subsequent failed stores, but that means the KVS won't know the last store that succeeded (and thus we don't know the last rootref that was valid). Maybe we don't care and we just accept this. It could be some "event" mechanism, and it could notify the KVS of the failed blobref it tried to store? If the KVS kept track of some recent store attempts, it could determine the last rootref that was valid and checkpoint that one? I suppose the latter point about failed blobrefs could also be handled via a response from …
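One possible shape for the "keep track of recent store attempts" part, with all names invented and the hard part (mapping a failed blobref back to the commit that produced it) reduced to a sequence number just for illustration:

```c
#include <stdio.h>
#include <string.h>

#define ROOTREF_HISTORY 16

/* Hypothetical history of recently committed root refs, newest first. */
static struct {
    char blobref[80];   /* e.g. "sha1-..." */
    int seq;            /* KVS root sequence number */
} history[ROOTREF_HISTORY];
static int history_len = 0;

/* Record a root ref after each successful commit. */
void rootref_record (const char *blobref, int seq)
{
    memmove (&history[1], &history[0],
             sizeof (history[0]) * (ROOTREF_HISTORY - 1));
    snprintf (history[0].blobref, sizeof (history[0].blobref), "%s", blobref);
    history[0].seq = seq;
    if (history_len < ROOTREF_HISTORY)
        history_len++;
}

/* Given the sequence number at which a store failed, return the newest
 * root ref known to predate it (a candidate for checkpointing), or NULL. */
const char *rootref_last_valid (int failed_seq)
{
    for (int i = 0; i < history_len; i++)
        if (history[i].seq < failed_seq)
            return history[i].blobref;
    return NULL;
}
```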
As a bit of an aside, I could have sworn that we once checkpointed the KVS after every job that completed. However, I cannot find any evidence of this in the current flux-core. Is this still going on, or did we perhaps do it at one point in time? Or perhaps we only talked about it and never implemented it. I was just thinking about mechanisms we want to employ to try and save/checkpoint the most recent valid rootref.
Propagate the error up the stack?
Having a way to set/clear a read-only flag at each level would be useful for testing. Possibly having a way to start with a read-only backing store would also be useful for recovery and such. (Still thinking about this in a brainstorming context - I'm not sure there isn't a more direct, simpler way to handle ENOSPC.)
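For the testing/recovery angle, one starting point could be a module argument on the backing store. A hypothetical sketch (the "readonly" argument does not exist), just to show where such a knob would plug in:

```c
#include <string.h>
#include <stdbool.h>
#include <flux/core.h>

static bool readonly = false;

/* Standard flux module entry point; "readonly" is a hypothetical module
 * argument, e.g. something like: flux module load content-sqlite readonly */
int mod_main (flux_t *h, int argc, char **argv)
{
    for (int i = 0; i < argc; i++) {
        if (!strcmp (argv[i], "readonly"))
            readonly = true;   /* store requests would be refused with EROFS */
    }
    /* ... open the database, register message handlers, run the reactor ... */
    return 0;
}
```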
I remember arguing that this would be likely to harm job throughput, so maybe we didn't implement it.
It could be harmful to replace the last good checkpoint with one that references nonexistent blobs. For example, you could easily lose the entire jobs directory if the hash representing a new version of that was a blob that could not be written. If we did the read-only idea then the current KVS root would probably be advanced beyond what is stored in the backing store checkpoint. That root ref held by the KVS could be valid with respect to the cache, but would be invalid to write to the backing store checkpoint unless all the dirty blobs were first successfully flushed.
Maybe we could put the instance into a safe mode like @trws was suggesting. If the admins were to simply stop flux from there, they get to restart from the last saved checkpoint. If they wanted to try a little harder, then once space is available again they could run a command to try to save the checkpoint and then restart.
Safe mode being read-only mode? Minimally, no new jobs would be able to be submitted due to ENOSPC. Would currently running jobs fail as well? I guess this is something to mess around with, as we don't quite know the fallout of such a decision.
Trying to checkpoint, meaning try to …
My intuition would be that anything that tried to persist to the KVS at that point would fail. Running jobs might be able to continue, but if they try to change state they would almost certainly error out. Even going to a safe mode would likely end up being fatal, but might let some things limp along a bit longer.
Safe mode being a higher level state for the instance that would stop/disable queues at least. Having a read-only KVS would obviously cause fallout across the board. No posting to job eventlogs, for one. Yeah, maybe this is a bad idea and we need something more draconian to happen, like a rank 0 broker shutdown. Note: we had a safe mode earlier that was entered if flux wasn't shut down cleanly. That caused confusion noted in #4861 and was removed by #4898.
Idea: in lieu of safe mode, stop the instance but have rc3 on rank 0 run a command that pauses just before unloading the KVS until some amount of disk is free in …
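The "pause until space is free" piece could be as simple as polling statvfs(3) on the relevant directory; a rough sketch, with the path and threshold left as parameters since the specifics above were cut off:

```c
#include <sys/statvfs.h>
#include <unistd.h>

/* Block until the filesystem containing 'path' has at least 'min_bytes'
 * available, rechecking every 10s so an admin can free space meanwhile. */
void wait_for_free_space (const char *path, unsigned long long min_bytes)
{
    struct statvfs st;
    while (statvfs (path, &st) != 0
           || (unsigned long long)st.f_bavail * st.f_frsize < min_bytes)
        sleep (10);
}
```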
I was assuming we would leave dirty blobs in the cache until they are successfully written. Apparently the prerequisite of a successful flush is not built in to …
I'm assuming you mean … Internally, a …
Oh duh yes. We could just do an empty commit with …
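Assuming the cut-off reference here is to the KVS commit sync flag, an empty transaction committed with FLUX_KVS_SYNC would look roughly like this; the commit should not complete until the content flush and checkpoint have succeeded (roughly what the flux kvs sync command does, if memory serves):

```c
#include <flux/core.h>

/* Commit an empty transaction with the sync flag, forcing a content flush
 * and checkpoint; returns 0 on success, -1 (with errno set) on failure. */
int empty_sync_commit (flux_t *h)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;
    int rc = -1;

    if (!(txn = flux_kvs_txn_create ()))
        return -1;
    if ((f = flux_kvs_commit (h, NULL, FLUX_KVS_SYNC, txn))
        && flux_future_get (f, NULL) == 0)
        rc = 0;
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);
    return rc;
}
```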
Thinking we effectively send out a …
No, I mean shut down at rank 0 like normal, but once all the other ranks have disconnected and rank 0 starts running its rc3, hold it there with only the kvs + content stuff loaded.
This continues the conversation from PR #6217. From @garlick:
…
and later:
…