
Workaround assumption that layers are unique in zstdchunk converter #1847

Open · wants to merge 1 commit into main

Conversation

@apostasie (Contributor) commented Nov 1, 2024

Hey,

(Hopefully) fixes #1842 and fixes containerd/nerdctl#3623

The zstdchunked convert function assumes that layers are unique and cannot be repeated:

	defer func() {
		if err := cs.Delete(ctx, uncompressedDesc.Digest); err != nil {
			log.G(ctx).WithError(err).WithField("uncompressedDesc", uncompressedDesc).Warn("failed to remove tmp uncompressed layer")
		}
	}()

When that is not the case, the deferred delete of the uncompressed version removes it while other competing goroutines may still need it, leading to the conversion failing.

This PR offers one possible solution that is minimally intrusive and does not change the overall design: use a mutex when uncompressing a specific layer to ensure there cannot be concurrent processing of the same desc. A sketch of the idea is below.
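For illustration, a minimal self-contained sketch of the per-digest locking idea; lockDigest and the map layout are illustrative names, not the exact diff (the actual change uses an unMap struct, quoted further down):

```go
// Minimal sketch, not the exact diff: serialize uncompression per layer
// digest so two goroutines handling the same blob never race on the
// temporary uncompressed content.
package zstdchunked

import (
	"sync"

	"github.com/opencontainers/go-digest"
)

var (
	mu            sync.Mutex // guards uncompressMap
	uncompressMap = map[digest.Digest]*sync.Mutex{}
)

// lockDigest acquires (creating on first use) the mutex for one digest.
// The caller unlocks it once the uncompressed blob is no longer needed.
func lockDigest(d digest.Digest) *sync.Mutex {
	mu.Lock()
	l, ok := uncompressMap[d]
	if !ok {
		l = &sync.Mutex{}
		uncompressMap[d] = l
	}
	mu.Unlock()
	l.Lock()
	return l
}
```

In the convert function, the uncompress-and-delete section would then be bracketed with something like `defer lockDigest(desc.Digest).Unlock()`, so two conversions of the same layer can never overlap.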

Note that if the layers are already uncompressed, no locking is enforced, so the same class of issue may manifest elsewhere in the code (this PR does not try to address that, and it is unclear whether such a situation is a problem or not).

This is also PR-ed on nerdctl (containerd/nerdctl#3628); that one should be closed over there if this can be merged and released here.

Finally, there may be more issues involved with large images like the ones in the initial reports, and this PR might not be a silver bullet, though it did successfully convert the nvidia image locally.

Thanks.

cc @AkihiroSuda @ktock

@GrigoryEvko commented

Hi @apostasie, did you check this PR with the ctr-remote command, for example? Unfortunately, this change works for me in nerdctl but doesn't in ctr-remote image optimize :(

@apostasie (Contributor, Author) commented Nov 2, 2024

Hey @GrigoryEvko

Yep.
You have to pass --no-optimize to ctr-remote.

If you do not specify it, then because of the extra opts coming from Analyze, the converter function can no longer acquire a lock, and the calls are parallelized again.

Clearly there are deep-seated issues in the codebase making assumptions about images that are not true:
https://github.com/containerd/stargz-snapshotter/blob/main/nativeconverter/zstdchunked/zstdchunked.go#L82-L83
(developer's famous last words: "this should not happen" :-))

Bottom line: with this patch, use nerdctl, or use ctr-remote with --no-optimize.

If the maintainers here find this direction acceptable, we can also patch LayerConvertWithLayerOptsFuncWithCompressionLevel, but I increasingly feel we have to rethink the entire thing.

None of this appears to be safe to use concurrently (IMO the uncompress/defer delete step on the content store is problematic).

@GrigoryEvko commented

> You have to pass --no-optimize to ctr-remote.

Oh no, that was the whole point of using the optimize command 😄

For today, if I'd like an optimized zstdchunked image, would ctr-remote image optimize to estargz and then nerdctl image convert to zstdchunked work? I'm not sure about all the metadata: would it be preserved and still relevant after the conversion?

It's interesting that converting to estargz with gzip doesn't have all these issues; I thought it was mostly a compression algorithm change, but it looks like it's implemented differently.

Thanks again!

@apostasie (Contributor, Author) commented

Oh, stop making those puppy eyes, sir! 😂

Here.

The latest version works for both (note that I hate the updated version even more than the previous one, because of the global vars).

Enjoy ;-).

@apostasie (Contributor, Author) commented

AFK for most of the day; will be slow to answer.

@apostasie (Contributor, Author) commented

@AkihiroSuda @ktock gentle nudge

Comment on lines +114 to +121
	mu.Lock()
	if _, ok := uncompressMap[desc.Digest]; !ok {
		uncompressMap[desc.Digest] = &unMap{}
	}
	uncompressMap[desc.Digest].mu.Lock()
	mu.Unlock()
	defer uncompressMap[desc.Digest].mu.Unlock()

Member commented

Sorry for the slow review. I think we can just remove cs.Delete(ctx, uncompressedDesc.Digest) in this function; containerd garbage-collects that blob when nobody refers to it.
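For illustration, the suggestion would reduce the layer handling to something like the sketch below; ensureUncompressed is a hypothetical helper, and the real function context in nativeconverter/zstdchunked/zstdchunked.go is assumed rather than shown:

```go
// Sketch of the suggested simplification (assumed surrounding code, not a
// verified diff): produce the temporary uncompressed blob but never delete
// it explicitly; once nothing references it, containerd's garbage
// collector reclaims it, so concurrent conversions of the same layer no
// longer race on shared temporary content.
package zstdchunked

import (
	"context"

	"github.com/containerd/containerd/content"
	"github.com/containerd/containerd/images/converter/uncompress"
	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// ensureUncompressed returns an uncompressed descriptor for desc, without
// the deferred cs.Delete that caused the race.
func ensureUncompressed(ctx context.Context, cs content.Store, desc ocispec.Descriptor) (*ocispec.Descriptor, error) {
	if uncompress.IsUncompressedType(desc.MediaType) {
		return &desc, nil
	}
	// The temporary blob stays in the content store; containerd's GC
	// removes it once no resource refers to it.
	return uncompress.LayerConvertFunc(ctx, cs, desc)
}
```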

Development

Successfully merging this pull request may close these issues.

- zstdchunked error deleting temp layers
- zstdchunked error trying to delete uncompressed big layers