WIP: kvs: call content.flush before checkpoint #6240

Open
chu11 wants to merge 9 commits into master from issue6237_content_flush_before_checkpoint

Conversation

Member

@chu11 chu11 commented Aug 27, 2024

Problem: When the KVS module is unloaded, a checkpoint of the root
reference is attempted. However, a content.flush is not done
beforehand. This could result in an invalid checkpoint reference
as data is not guaranteed to be flushed to the backing store.

Solution: Call content.flush before checkpointing.

Fixes #6237

I threw in a few new tests for some extra coverage of FLUX_KVS_SYNC too.

@chu11 chu11 force-pushed the issue6237_content_flush_before_checkpoint branch 3 times, most recently from e1f7562 to b019755 Compare August 27, 2024 21:44
Comment on lines 2920 to 2928
-    if (!(f = kvs_checkpoint_commit (h, NULL, rootref, rootseq, 0, 0))
-        || flux_rpc_get (f, NULL) < 0)
+    /* first must ensure all content is flushed */
+    if (!(f1 = flux_rpc (h, "content.flush", NULL, 0, 0))
+        || flux_rpc_get (f1, NULL) < 0) {
+        /* fallthrough to kvs_checkpoint_commit(), ENOSYS may be due
+         * to no backing store, but checkpoint can still be done to
+         * content cache.
+         */
+        if (errno != ENOSYS)
+            goto error;
+    }
+
+    if (!(f2 = kvs_checkpoint_commit (h, NULL, rootref, rootseq, 0, 0))
+        || flux_rpc_get (f2, NULL) < 0)
         goto error;
Member

Why fall through to the checkpoint commit? If there is no backing store then wouldn't it also fail?

Member Author

@chu11 chu11 Aug 27, 2024

this admittedly surprised me, but you can checkpoint to the content cache without it going through to the backing store.

i.e. if there is no backing store, the checkpoint can succeed.
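
The control flow discussed in this thread can be sketched in isolation. The block below is a minimal, self-contained model, not flux-core code: `struct fake_content`, `fake_flush()`, `fake_checkpoint()` and `flush_then_checkpoint()` are hypothetical stand-ins for the content.flush RPC and `kvs_checkpoint_commit()`, and the `flush_errno` field is only there to simulate a backing store failure. It assumes, as described above, that flush fails with ENOSYS when no backing store is loaded while the checkpoint itself can still succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for broker-side state; not a flux-core type. */
struct fake_content {
    bool backing_loaded;  /* is a backing store module loaded? */
    int flush_errno;      /* nonzero: simulate a backing store failure */
    int dirty;            /* cache entries not yet written back */
    bool checkpointed;    /* has a checkpoint been recorded? */
};

/* Stand-in for the content.flush RPC (assumed behavior). */
static int fake_flush (struct fake_content *c)
{
    assert (c != NULL);
    if (!c->backing_loaded) {
        errno = ENOSYS;
        return -1;
    }
    if (c->flush_errno != 0) {
        errno = c->flush_errno;
        return -1;
    }
    c->dirty = 0;
    return 0;
}

/* Stand-in for kvs_checkpoint_commit(); models the behavior described
 * in this thread, where a checkpoint succeeds without a backing store.
 */
static int fake_checkpoint (struct fake_content *c)
{
    assert (c != NULL);
    c->checkpointed = true;
    return 0;
}

/* Flush first, then checkpoint.  ENOSYS from the flush is tolerated
 * and falls through to the checkpoint; any other error aborts.
 */
static int flush_then_checkpoint (struct fake_content *c)
{
    if (fake_flush (c) < 0 && errno != ENOSYS)
        return -1;
    return fake_checkpoint (c);
}
```

In this model a store with no backing module still records a checkpoint while its dirty entries stay unflushed, which is the corner case questioned in the replies that follow.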

Member Author

ya know, thinking about this a bit .... do we want that behavior? Do we want checkpoints to work with the content cache and w/o a backing store? It seems like a nuanced corner case. I'm not entirely sure why it was initially supported that way. This would be for a different issue I think, as this PR solves a specific content.flush before "checkpoint" corner case.

Member

If there is a hook in the content cache for the checkpoint, why don't we do the cache flush there?

Member

Sorry I've forgotten so many details here!

Member Author

@chu11 chu11 Aug 27, 2024

hmmm. supporting checkpoint w/o a backing module simply may have been a fallout from supporting the none backing module. #4492

perhaps checkpoint specifically should not be supported by the content cache.

Member

I guess it allows the kvs module to be reloaded and not lose all the data when there is no backing store?

Member Author

> I guess it allows the kvs module to be reloaded and not lose all the data when there is no backing store?

I do think that's the point. But that is sort of a nuanced use case, possibly only useful in testing?

Member Author

@chu11 chu11 Aug 27, 2024

> If there is a hook in the content cache for the checkpoint, why don't we do the cache flush there?

That's not a bad idea. Lemme look into that, although it'd be for a different PR b/c FLUX_KVS_SYNC does two calls and that should probably be collapsed down into one as well.
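
One way to picture the suggestion is a single entry point on the cache side that flushes before it records the checkpoint, so a caller makes one call instead of two. The sketch below is only an illustration of that idea under stated assumptions: `struct hook_cache`, `hook_flush()` and `hook_checkpoint_put()` are hypothetical names, not flux-core functions, and the ENOSYS error for a missing backing store is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a content cache; not a flux-core type. */
struct hook_cache {
    bool backing_loaded;  /* is a backing store module loaded? */
    int dirty;            /* cache entries not yet written back */
    int flush_calls;      /* number of successful flushes */
    int checkpoints;      /* number of checkpoints recorded */
};

/* Write back all dirty entries; assumed to need a backing store. */
static int hook_flush (struct hook_cache *c)
{
    assert (c != NULL);
    if (!c->backing_loaded) {
        errno = ENOSYS;
        return -1;
    }
    c->dirty = 0;
    c->flush_calls++;
    return 0;
}

/* Single entry point: the flush happens inside the checkpoint path,
 * so the checkpoint is never recorded ahead of unflushed content.
 */
static int hook_checkpoint_put (struct hook_cache *c)
{
    if (hook_flush (c) < 0)
        return -1;
    c->checkpoints++;
    return 0;
}
```

With this shape the caller cannot forget the flush or issue the two steps in the wrong order, at the cost of making every checkpoint pay for a flush.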

Member Author

chu11 commented Sep 3, 2024

making WIP, feel #6251 should come first now

@chu11 chu11 changed the title kvs: call content.flush before checkpoint WIP: kvs: call content.flush before checkpoint Sep 3, 2024
Problem: An accidental 'd' was added to remove, making it "removed".

Fix spelling.
Problem: A test in t0028-content-backing-none.t incorrectly
calls checkpoint_put when it should call checkpoint_get.

Fix invalid test.
Problem: The typical message unpack style is to place key names
and storage pointers on the same line, but that is not done in
several locations in the content and content backing modules.

Correct code style to be more consistent with the rest of flux-core.
Problem: A backing store is required for content.flush but it
is not required for content.checkpoint-put.  This is inconsistent
and can lead to checkpointing problems down the line.

Require content.checkpoint-put to only work if there is a backing
store available.  As a consequence, remove code that handled
"cached" checkpoints when a backing store is not available.

Fixes flux-framework#6251
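
The consistency rule in this commit message can be sketched as a shared precondition. The block below is a self-contained model under stated assumptions, not the actual change: `struct sketch_store`, `sketch_require_backing()`, `sketch_flush()` and `sketch_checkpoint_put()` are hypothetical names, and the choice of ENOSYS as the error for a missing backing store is an assumption borrowed from the flush discussion earlier in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the content module state; not a flux-core type. */
struct sketch_store {
    bool backing_loaded;   /* is a backing store module loaded? */
    char checkpoint[128];  /* last stored root reference, "" if none */
};

/* Shared precondition: both operations need a backing store. */
static int sketch_require_backing (const struct sketch_store *s)
{
    assert (s != NULL);
    if (!s->backing_loaded) {
        errno = ENOSYS;  /* assumed error code */
        return -1;
    }
    return 0;
}

/* Model of flush: nothing to write here, only the precondition. */
static int sketch_flush (struct sketch_store *s)
{
    return sketch_require_backing (s);
}

/* Model of checkpoint-put: fails the same way as flush when there is
 * no backing store, so there is no "cached" checkpoint path.
 */
static int sketch_checkpoint_put (struct sketch_store *s, const char *rootref)
{
    if (sketch_require_backing (s) < 0)
        return -1;
    if (rootref == NULL || strlen (rootref) >= sizeof (s->checkpoint)) {
        errno = EINVAL;
        return -1;
    }
    strcpy (s->checkpoint, rootref);
    return 0;
}
```

Because both operations go through the same check, a caller sees the same failure from either one when no backing store is loaded.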
Problem: Now that the content backing store is required for checkpoints,
many tests fail.

Remove tests that previously assumed that checkpointing worked without
a content backing store.  Adjust some tests that now have a new
error message.
Problem: There is no coverage to ensure that the "none" backing
store works identically to when a backing store is never loaded.

Add coverage in t0028-content-backing-none.t.
Problem: There is no coverage to ensure FLUX_KVS_SYNC fails when
there is no longer space on disk.

Add coverage to t0090-content-enospc.t.
Problem: There is no coverage to ensure FLUX_KVS_SYNC does not
work if there is no backing store.

Add coverage in t1010-kvs-commit-sync.t.
Problem: When the KVS module is unloaded, a checkpoint of the root
reference is attempted.  However, a content.flush is not done
beforehand.  This could result in an invalid checkpoint reference
as data is not guaranteed to be flushed to the backing store.

Solution: Call content.flush before checkpointing.

Fixes flux-framework#6237
@chu11 chu11 force-pushed the issue6237_content_flush_before_checkpoint branch from 76e9d9f to cb88b17 Compare September 4, 2024 21:08

codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.33%. Comparing base (a75a4d9) to head (cb88b17).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6240      +/-   ##
==========================================
- Coverage   83.36%   83.33%   -0.03%     
==========================================
  Files         521      521              
  Lines       85693    85617      -76     
==========================================
- Hits        71438    71351      -87     
- Misses      14255    14266      +11     
Files with missing lines Coverage Δ
src/modules/content-files/content-files.c 73.77% <ø> (ø)
src/modules/content-sqlite/content-sqlite.c 72.04% <ø> (-1.64%) ⬇️
src/modules/content/cache.c 85.32% <ø> (-0.03%) ⬇️
src/modules/content/checkpoint.c 74.38% <100.00%> (-4.23%) ⬇️
src/modules/kvs/kvs.c 72.06% <100.00%> (+0.02%) ⬆️

... and 17 files with indirect coverage changes

Member Author

chu11 commented Sep 5, 2024

re-pushed, PR was built on top of #6255 and re-worked logically
