
content/content-sqlite: what to do on ENOSPC #6236

chu11 opened this issue Aug 26, 2024 · 18 comments

chu11 commented Aug 26, 2024

This continues the conversation from PR #6217. From @garlick:

After consulting with @kkier a bit on this, we think it may be better to present admins with a choice of

  • ensure flux never runs out of space on rank 0 (whether by partition or whatever), or
  • flux goes down hard on ENOSPC and, when manually restarted, recovers what it can but doesn't stop for manual intervention

That is a trade-off they are most qualified to make based on their judgement of the relative impacts. From their perspective I think it's more of a question of whether they want to reserve some amount of disk for flux or share it among multiple consumers, rather than whether or not they want to make a partition. If this PR successfully provided a mechanism to reserve space without a partition, it wouldn't change much in that calculus.

IOW: I think we should invest some development effort in the second option for now rather than this space reservation scheme.

and later

Well it does seem like content-sqlite should just close the database after we get an ENOSPC. I think we probably need to talk through what can happen next. Ideally it minimizes the need to manually intervene.

chu11 commented Aug 26, 2024

So after offline discussion, I went into the code to review how some of this works because everyone forgot. This is sort of a scattering of notes; I admit I'm having a little trouble organizing my thoughts and the best way to proceed, but I think writing it out first and starting discussion might get things moving.

Notes on what is currently going on:

  • currently the KVS checkpoints its rootref either A) based on a configured period in the KVS config (e.g. checkpoint every 30 minutes) or B) when the KVS module is unloaded. The periodic checkpoints are done via a FLUX_KVS_SYNC (under the covers this calls content.flush followed by a content.checkpoint-put; see the sketch after this list). When unloading the module, only a content.checkpoint-put is done. [1]

  • If content-sqlite gets ENOSPC, content-sqlite just "stays alive". Every subsequent store request will also get ENOSPC. This goes on forever, which is probably not what we want.

  • ENOSPC is returned to the content module, which logs an error. BUT AFAICT the KVS only cares that store requests are "cached", so this error is never propagated to the KVS (i.e. this may conflict with the desire to "fail hard").
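
For reference, a minimal sketch of that two-step sequence driven directly over RPC from a client; the content.flush topic comes from the notes above, while the content.checkpoint-put payload is internal to flux-core and is only indicated in a comment:

```c
/*
 * Hedged sketch: drive the same two steps that a FLUX_KVS_SYNC
 * checkpoint performs, from a standalone client.
 */
#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    flux_future_t *f;

    if (!h) {
        perror ("flux_open");
        return 1;
    }
    /* Step 1: flush dirty blobs from the content cache to the backing
     * store.  If this fails (e.g. ENOSPC), stop here; checkpointing a
     * rootref whose blobs were never written means dangling blobrefs.
     */
    if (!(f = flux_rpc (h, "content.flush", NULL, FLUX_NODEID_ANY, 0))
        || flux_rpc_get (f, NULL) < 0) {
        fprintf (stderr, "content.flush failed: %s\n",
                 f ? flux_future_error_string (f) : "RPC error");
        flux_future_destroy (f);
        flux_close (h);
        return 1;
    }
    flux_future_destroy (f);

    /* Step 2: only after a successful flush, write the rootref
     * checkpoint via content.checkpoint-put (payload omitted here,
     * since its exact format is internal to flux-core).
     */
    flux_close (h);
    return 0;
}
```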

So thoughts on issues/ideas:

  • The KVS's ability to run without a backing store IMO conflicts with our desire to fail hard on ENOSPC. We probably need the backing module to return an error to the content cache, which then returns an error to the KVS.

    • A possible solution is that the content cache could go into a "backing module failed" mode under ENOSPC, so subsequent store attempts return an error. This could possibly be a module config option (i.e. enabled just for the system instance).
  • There is a racy/asynchronous issue in this region. The content cache returns "success" on a store request when the data is cached, not when the store to the backing module succeeds. So if ENOSPC is hit, "success" for that store has already been sent to the KVS module and that commit is done. No idea how to handle this issue at the moment.

  • The points above probably illustrate one concern, which is that the final rootref from the KVS may in fact not be valid. I can see a scenario where the content is cached, the KVS rootref is updated, but the behind-the-scenes content store fails with ENOSPC. That rootref could conceivably be checkpointed (ENOSPC hits for larger stores, but a checkpoint can get through because it's smaller on average).

  • Checkpoints go to the same database that store requests go to, so checkpoints can suffer from the same problems (ENOSPC).

    • A possible solution is to create a completely separate database that stores only checkpoints (rough sketch below). This database could be preallocated a small amount of space and have journaling disabled (i.e. the preallocation cost is negligible, and the performance impact is negligible because it's just for checkpoints). This would allow the checkpoint done when the KVS is unloaded to work (this assumes KVS commits that failed going to the backing store are resolved before this checkpoint commit is done).
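
A rough sketch of what that separate checkpoint database might look like; the path, table name, and one-row schema are illustrative assumptions, and only the journal_mode pragma reflects the actual point (no journaling, so a checkpoint write needs no scratch space on a full file system):

```c
/*
 * Illustrative sketch only; not the real content-sqlite schema.  A real
 * implementation would use a prepared statement rather than snprintf.
 */
#include <stdio.h>
#include <sqlite3.h>

int store_checkpoint (const char *rootref)
{
    sqlite3 *db;
    char sql[512];
    int rc = -1;

    if (sqlite3_open ("/var/lib/flux/checkpoint.sqlite", &db) != SQLITE_OK)
        return -1;
    /* No rollback journal: avoids needing extra disk space at write time. */
    if (sqlite3_exec (db, "PRAGMA journal_mode=OFF", NULL, NULL, NULL) != SQLITE_OK)
        goto done;
    /* Single-row table holding only the most recent checkpoint. */
    if (sqlite3_exec (db,
                      "CREATE TABLE IF NOT EXISTS checkpt"
                      " (id INTEGER PRIMARY KEY CHECK (id = 0), rootref TEXT)",
                      NULL, NULL, NULL) != SQLITE_OK)
        goto done;
    snprintf (sql, sizeof (sql),
              "REPLACE INTO checkpt (id, rootref) VALUES (0, '%s')", rootref);
    if (sqlite3_exec (db, sql, NULL, NULL, NULL) != SQLITE_OK)
        goto done;
    rc = 0;
done:
    sqlite3_close (db);
    return rc;
}
```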

Sooooo ... some thought is needed here in general. I haven't really organized/digested all the complexities here yet.

Some notes for me on maybe bugs I found while looking into this:

[1] It is possibly a bug that content.flush is not called before the content.checkpoint-put when the KVS module is unloaded. If ENOSPC was hit, that would inform us that we cannot just allow the rootref to be checkpointed. Should investigate.

  • I don't think there are tests to ensure the KVS behaves as intended when a backing store is not loaded. Most notably writing data with FLUX_KVS_SYNC should always fail w/ ENOSYS.

  • Tests to ensure FLUX_KVS_SYNC fails properly on ENOSPC would be good too. Not sure if it's possible, but a test in which content.flush succeeds and content.checkpoint-put fails would be good.

Maybe more things later

garlick commented Aug 26, 2024

ENOSPC is returned to the content module, which logs an error. BUT AFAICT the KVS only cares that store requests are "cached", so this error is never propagated to the KVS.

This is by design - it is a "write back cache" and the only guarantee after a KVS commit completes is that the data is available from any other rank (subject to eventually consistent semantics, or explicit synchronization). Agreed, we can't fail the commit because it's not connected to the backing store operation.

We can return errors to a cache flush. A checkpoint consisting of 1) flush the cache, 2) write the root ref to the backing store should stop at 1) if it gets ENOSPC; otherwise, dangling blobrefs and sadness.

garlick commented Aug 27, 2024

Idea: what if on ENOSPC on a store, the cache stopped trying to flush dirty pages and told the KVS to go read-only? That would potentially leave the instance in a state where the KVS could be manually dumped later on but prevent more data from piling up in the cache.

IOW KVS commits would be refused with EROFS or similar error.

Need to chew on that idea a bit to figure out what is wrong with it.

chu11 commented Aug 27, 2024

We can return errors to a cache flush. A checkpoint consisting of 1) flush the cache, 2) write the root ref to the backing store should stop at 1) if it gets ENOSPC; otherwise, dangling blobrefs and sadness.

Yeah, agreed. I think the lack of a content.flush before the checkpoint when the KVS module is unloaded is an error we need to correct.

chu11 commented Aug 27, 2024

Idea: what if on ENOSPC on a store, the cache stopped trying to flush dirty pages and told the KVS to go read-only? That would potentially leave the instance in a state where the KVS could be manually dumped later on but prevent more data from piling up in the cache.

I think that's a good idea. Similar to my "backing module failed" idea above.

The issue is how to inform the KVS of this fact. We could simply do it on all subsequent failed stores, but that means the KVS won't know the last store that succeeded (thus we don't know the last rootref that was valid). Maybe we don't care and we just accept this.

It could be some "event" mechanism, and it could notify the KVS of the failed blobref it tried to store? If the KVS kept track of some recent store attempts it could determine the last rootref that was valid and checkpoint that one? I suppose the latter point about failed blobrefs could also be handled via a response from content.flush.

chu11 commented Aug 27, 2024

As a bit of an aside, I could have sworn that we once checkpointed the KVS after every job that completed. However, I cannot find any evidence of this in the current flux-core. Is this going on, or perhaps we did this at one point in time? Or perhaps we only talked about it and never implemented it.

I was just thinking about mechanisms we want to employ to try and save/checkpoint the most recent valid rootref.

garlick commented Aug 27, 2024

The issue is how to inform the KVS of this fact.

Propagate the error up the stack?

  • backing store becomes read only and propagates ENOSPC/EROFS errors to the cache
  • cache becomes read only and propagates ENOSPC/EROFS errors to the KVS
  • KVS becomes read only and propagates ENOSPC/EROFS errors to users
  • from there we might have to work out some problems to be sure things fail in a sane way

Having a way to set/clear a read only flag at each level would be useful for testing.

Possibly having a way to start with a read-only backing store would also be useful for recovery and such.

(Still thinking about this in a brainstorming context - I'm not sure there's not a more direct, simpler way to handle ENOSPC)
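
A brainstorm-level sketch of the first bullet in that list, assuming a hypothetical backing-module store handler (the context struct, handler name, and do_store() stand-in are made up and are not the actual content-sqlite code; only flux_respond_error() reflects the usual way a broker module returns an errno to a requestor):

```c
#include <errno.h>
#include <stdbool.h>
#include <flux/core.h>

struct backing_ctx {
    bool read_only;     /* set on first ENOSPC; an operator could clear it */
};

/* Stand-in for the real database write; here it always fails with ENOSPC. */
static int do_store (struct backing_ctx *ctx, const flux_msg_t *msg)
{
    (void)ctx;
    (void)msg;
    errno = ENOSPC;
    return -1;
}

/* In a real module this would be registered with flux_msg_handler_addvec()
 * for the module's store request topic.
 */
static void store_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    struct backing_ctx *ctx = arg;
    (void)mh;

    if (ctx->read_only) {
        /* Refuse further writes so the error propagates up the stack:
         * backing store -> content cache -> KVS -> users.
         */
        flux_respond_error (h, msg, EROFS, "backing store is read-only");
        return;
    }
    if (do_store (ctx, msg) < 0) {
        if (errno == ENOSPC)
            ctx->read_only = true;  /* all later stores now get EROFS */
        flux_respond_error (h, msg, errno, NULL);
        return;
    }
    /* ... on success, respond with the blobref ... */
}
```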

garlick commented Aug 27, 2024

As a bit of an aside, I could have sworn that we once checkpointed the KVS after every job that completed. However, I cannot find any evidence of this in the current flux-core. Is this going on, or perhaps we did this at one point in time? Or perhaps we only talked about it and never implemented it.

I remember arguing that this would be likely to harm job throughput, so maybe we didn't implement it.

I was just thinking about mechanisms we want to employ to try and save/checkpoint the most recent valid rootref.

It could be harmful to replace the last good checkpoint with one that references nonexistent blobs. For example, you could easily lose the entire jobs directory if the hash representing a new version of that was a blob that could not be written.

If we did the read-only idea then the current KVS root would probably be advanced beyond what is stored in the backing store checkpoint. That root ref held by the KVS could be valid with respect to the cache, but would be invalid to write to the backing store checkpoint unless all the dirty blobs were first successfully flushed.

garlick commented Aug 27, 2024

Maybe we could put the instance into a safe mode like @trws was suggesting. If the admins were to simply stop flux from there, they get to restart from the last saved checkpoint. If they wanted to try a little harder then, once space is available again, they could run a command to try to save the checkpoint and then restart.

chu11 commented Aug 27, 2024

Maybe we could put the instance into a safe mode like @trws was suggesting. If the admins were to simply stop flux from there, they get to restart from the last saved checkpoint.

Safe mode being read-only mode? Minimally, no new jobs would be able to be submitted due to ENOSPC. Would currently running jobs fail as well? I guess this is something to mess around with, as we don't quite know the fallout of such a decision.

If they wanted to try a little harder then, once space is available again, they could run a command to try to save the checkpoint and then restart.

Trying to checkpoint meaning try to content.flush anything that ENOSPC-ed earlier and then checkpoint? It is certainly doable, but once we go into "read-only" mode, there's no way to know what KVS rootref is good for checkpointing.

trws commented Aug 27, 2024

Safe mode being read-only mode? Minimally, no new jobs would be able to be submitted due to ENOSPC. Would currently running jobs fail as well? I guess this is something to mess around with, as we don't quite know the fallout of such a decision.

My intuition would be that anything that tried to persist to the KVS at that point would fail. Running jobs might be able to continue, but if they try to change state they would almost certainly error out. Even going to a safe mode would likely end up being fatal, but might let some things limp along a bit longer.

garlick commented Aug 27, 2024

Safe mode being a higher level state for the instance that would stop/disable queues at least.

Having a read-only KVS would obviously cause fallout across the board. No posting to job eventlogs for one.

Yeah, maybe this is a bad idea and we need something more draconian to happen like rank 0 broker shutdown.
Edit: hmm, maybe somehow a partial shutdown that leaves rank 0 running so that an admin could potentially flush the content cache once there is disk space.

Note: we had a safe mode earlier that was entered if flux wasn't shut down cleanly. That caused confusion noted in #4861 and was removed by #4898.

garlick commented Aug 27, 2024

Idea: in lieu of safe mode, stop the instance but have rc3 on rank 0 run a command to pause just before unloading the KVS until some amount of disk is free in /var/lib/flux? The command could restore read-write mode and let shutdown continue.
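
For reference, a rough sketch of the kind of wait-for-space check such an rc3 command might do, assuming /var/lib/flux as the statedir and an arbitrary 5% free-space threshold (both are illustrative assumptions; a real command would probably also honor a timeout):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

/* Block until the file system containing `path` has at least
 * `min_free_fraction` of its blocks available.
 */
static int wait_for_space (const char *path, double min_free_fraction)
{
    struct statvfs st;

    for (;;) {
        if (statvfs (path, &st) < 0) {
            perror ("statvfs");
            return -1;
        }
        double free_frac = (double)st.f_bavail / (double)st.f_blocks;
        if (free_frac >= min_free_fraction)
            return 0;
        fprintf (stderr, "waiting: %s is %.1f%% free\n", path, free_frac * 100.0);
        sleep (10);   /* poll interval chosen arbitrarily for this sketch */
    }
}

int main (void)
{
    return wait_for_space ("/var/lib/flux", 0.05) < 0 ? 1 : 0;
}
```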

garlick commented Aug 27, 2024

Trying to checkpoint meaning try to content.flush anything that ENOSPC-ed earlier and then checkpoint? It is certainly doable, but once we go into "read-only" mode, there's no way to know what KVS rootref is good for checkpointing.

I was assuming we would leave dirty blobs in the cache until they are successfully written.

Apparently the prerequisite of a successful flush is not built into flux_kvs_checkpoint(). It probably should be. Or maybe I'm forgetting the reason it isn't.

chu11 commented Aug 27, 2024

Apparently the prerequisite of a successful flush is not built into flux_kvs_checkpoint(). It probably should be. Or maybe I'm forgetting the reason it isn't.

I'm assuming you mean kvs_checkpoint_commit(). I think this API (I guess being semi-internal to flux-core) was intended solely for the writing of the rootref.

Internally, a FLUX_KVS_SYNC is equivalent to content.flush + kvs_checkpoint_commit(), so I think that was used versus having an independent function for it. I guess having an independent function could be useful.

garlick commented Aug 27, 2024

Oh duh yes. We could just do an empty commit with FLUX_KVS_SYNC to accomplish that.
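
A minimal sketch of that, assuming an empty transaction is acceptable to flux_kvs_commit() with the FLUX_KVS_SYNC flag: the commit changes nothing in the namespace but still forces the flush-then-checkpoint sequence before it completes.

```c
#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    flux_kvs_txn_t *txn = NULL;
    flux_future_t *f = NULL;
    int rc = 1;

    if (!h)
        return 1;
    if (!(txn = flux_kvs_txn_create ()))
        goto done;
    /* No puts added: this commit exists only to trigger the sync. */
    if (!(f = flux_kvs_commit (h, NULL, FLUX_KVS_SYNC, txn)))
        goto done;
    if (flux_future_get (f, NULL) < 0) {
        fprintf (stderr, "sync failed: %s\n", flux_future_error_string (f));
        goto done;
    }
    rc = 0;
done:
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);
    flux_close (h);
    return rc;
}
```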

chu11 commented Aug 27, 2024

Idea: in lieu of safe mode, stop the instance but have rc3 on rank 0 run a command to pause just before unloading the KVS until some amount of disk is free in /var/lib/flux? The command could restore read-write mode and let shutdown continue.

Thinking we effectively send out a flux shutdown to all ranks > 0?

garlick commented Aug 27, 2024

No, I mean shutdown at rank 0 like normal, but once all the other ranks have disconnected and rank 0 starts running its rc3, hold it there with only the kvs + content stuff loaded.
