Add shared library/tool for managing backing store files #125
I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage, and the best approach I've come up with is this. Suppose you have a shared backing-file dir containing all files from all images, and then a per-image directory containing only the files for that image, with the backing files hardlinked between the two. Then, to remove an image you delete the image file and the per-image directory, and make a pass over the shared dir deleting any files with n_link==1. Updating a structure like this can be atomic, I believe: you start by hardlinking from the shared dir into the per-image dir; on ENOFILE you create the new files, sync, and then try to hardlink the new files back into the shared dir. On EEXIST failure, start over with the smaller set of failed files.
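A minimal Go sketch of the removal/GC side of this scheme; the directory layout and the removeImage helper are invented purely for illustration, not an existing composefs API:

```go
// Illustrative layout for the scheme above (all names invented):
//
//   store/shared/<digest>          backing files from all images
//   store/images/<image>/<digest>  hardlinks to the files one image uses
package store

import (
	"os"
	"path/filepath"
	"syscall"
)

// removeImage deletes an image's composefs file and its per-image directory,
// then prunes shared entries whose link count has dropped back to 1.
func removeImage(imageFile, imageDir, sharedDir string) error {
	if err := os.Remove(imageFile); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := os.RemoveAll(imageDir); err != nil {
		return err
	}
	entries, err := os.ReadDir(sharedDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue // raced with another remover; skip
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok && st.Nlink == 1 {
			// No per-image directory links to this object any more.
			os.Remove(filepath.Join(sharedDir, e.Name()))
		}
	}
	return nil
}
```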
But yeah, it would be cool to have a global, namespaced version of this, because then we could easily get sharing between the ostree rootfs and container images.
I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement, because it makes containerization more of a zero-cost thing (e.g. assuming that your glibc is shared between the host and base images).
The scheme you describe makes sense to me offhand. The simplicity is very appealing; there's no explicit locking.

s/EEXIST/ENOENT/, right? i.e. on a failure to link. Although actually, I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
And basically once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). And that then significantly reduces the logic in ostree core and I think simplifies the composefs integration.
That is what I mean with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.
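Putting those two refinements together (create the object in the per-image dir first so it starts with a link count of 2, and on EEXIST when publishing fall back to linking from the shared copy), a hedged Go sketch of the add path might look like this; all names are made up for illustration:

```go
package store

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// addObject makes the object named by digest available in imageDir, publishing
// it into sharedDir if it is not there yet. writeContent fills in the file
// body when we have to create it ourselves.
func addObject(sharedDir, imageDir, digest string, writeContent func(*os.File) error) error {
	shared := filepath.Join(sharedDir, digest)
	perImage := filepath.Join(imageDir, digest)

	for {
		// Fast path: the object already exists in the shared dir; just link it.
		err := os.Link(shared, perImage)
		if err == nil || errors.Is(err, fs.ErrExist) {
			return nil
		}
		if !errors.Is(err, fs.ErrNotExist) {
			return err
		}

		// Not in the shared dir yet: create it in the per-image dir first
		// (so it is never an unreferenced shared file), sync, then publish.
		f, err := os.OpenFile(perImage, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
		if err != nil {
			return err
		}
		if werr := writeContent(f); werr != nil {
			f.Close()
			os.Remove(perImage)
			return werr
		}
		if serr := f.Sync(); serr != nil {
			f.Close()
			os.Remove(perImage)
			return serr
		}
		f.Close()

		err = os.Link(perImage, shared)
		if err == nil {
			return nil
		}
		if errors.Is(err, fs.ErrExist) {
			// Lost a race: someone else published the same object first.
			// Drop our copy and start over, linking from the shared one.
			os.Remove(perImage)
			continue
		}
		return err
	}
}
```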
I think we can optimize this by scanning the composefs image that we're removing instead, and then only unlinking the objects it referenced whose n_link in the shared dir has dropped to 1.
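For example, a hedged sketch where listObjects stands in for whatever enumerates the digests referenced by the composefs image being removed:

```go
package store

import (
	"os"
	"path/filepath"
	"syscall"
)

// pruneReferenced checks only the objects the just-removed image referred to,
// rather than walking the entire shared dir, and unlinks the ones whose
// shared link count has dropped to 1.
func pruneReferenced(sharedDir string, digests []string) {
	for _, d := range digests {
		p := filepath.Join(sharedDir, d)
		var st syscall.Stat_t
		if err := syscall.Stat(p, &st); err == nil && st.Nlink == 1 {
			os.Remove(p)
		}
	}
}
```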
Parallel to the above, flatpak stores a The general approach for remove in flatpak is:
This way we can atomically remove things, yet still keep running instances. We can maybe do the same, but just lock the image file.
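A rough sketch of the "just lock the image file" idea, assuming running instances hold a shared flock on the composefs image while it is mounted; the names and locking policy are assumptions, not existing flatpak or composefs behavior:

```go
package store

import (
	"errors"
	"os"
	"syscall"
)

// tryRemoveImage deletes an image file only if no running instance holds a
// shared lock on it. An unlinked-but-still-open image keeps working until the
// last user goes away, so removal is atomic for new users while existing
// instances are unaffected.
func tryRemoveImage(imagePath string) (removed bool, err error) {
	f, err := os.Open(imagePath)
	if err != nil {
		return false, err
	}
	defer f.Close()

	// LOCK_EX|LOCK_NB fails while any instance holds a shared lock.
	err = syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	if errors.Is(err, syscall.EWOULDBLOCK) {
		return false, nil // still in use; try again on the next GC pass
	}
	if err != nil {
		return false, err
	}
	return true, os.Remove(imagePath)
}
```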
I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage. In a future where we share tooling between host updates and containers there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing etc.

But...what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and generally ostree systems) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this - it more strongly decouples OS updates from "system state". But today that "system state" includes

And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing. I guess ultimately we'd need to detect this situation when

EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from
This one also relates a bit to containers/bootc#128
Thanks for the simple idea of how to manage that. I am working on a small OS using composefs and will implement your idea in Go. If you're interested in using that code, I could release some parts. Basically it's a complete stack managing the digest store, with GC and synchronization over a secure channel, with an A/B update process.
I've been thinking about doing something similar in RAUC in the future, where we'd have two A/B RO rootfs images (i.e. erofs on dm-verity) +

By creating an object store tree in the rootfs image (as hardlinks to the normal files), we could gain page cache sharing simply by using the rootfs object store as an overlayfs data layer above the RW object store in

If we want to avoid duplicate storage space use (between rootfs and
Yeah, this is an interesting aspect. We could just treat the two object dirs (containers / OS) as completely separate at "install time" (possibly even on different disks), but then force-merge them at runtime, thus automatically achieving page-cache sharing, but not disk space sharing. This will possibly cause problems when deleting objects though: existing overlayfs mounts may have resolved dentries to the middle layer in the dcache, and if they go away there it could cause ENOENT issues even if the right file still exists in the lowermost object store.
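For illustration only, the overlayfs data-only lower layer syntax ("::"-separated) could express that runtime merge, roughly like the sketch below; the paths, mount options, and kernel support are assumptions, not a tested setup:

```go
package store

import "syscall"

// mountMerged mounts a composefs image (its EROFS metadata already mounted at
// metadataMnt) with two data-only lower layers: the RO rootfs object store and
// the RW object store. Either store can then supply the backing file, giving
// page-cache sharing without copying objects between the two. The "::"
// separator marks data-only layers; everything here is an assumption.
func mountMerged(metadataMnt, rootfsObjects, dataObjects, target string) error {
	opts := "metacopy=on,redirect_dir=on,lowerdir=" +
		metadataMnt + "::" + rootfsObjects + "::" + dataObjects
	return syscall.Mount("overlay", target, "overlay", syscall.MS_RDONLY, opts)
}
```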
One thing I was thinking about related to this topic is how composefs wants by default to identify objects by their fsverity digest, versus how many other tools (e.g. ostree, estargz, and containers' zstd:chunked) identify them by sha256. Ultimately a big point of composefs is the on-demand kernel verification of file content with fsverity. So the question then is: can we do anything with that externally provided sha256? I think the answer is basically just "no" - at least for untrusted images. I could imagine, for example, that we help maintain a database/index that maps from "full sha256" ➡️ cfs-digest and skip fetching/processing objects that we already have, as ostree and zstd:chunked do. But that would mean that an untrusted image could include the sha256 of a file whose contents it doesn't know (along with some dummy content) and use it to "leak" the real contents into the running image. Is this a real problem? I'm not totally sure, but I don't want to dismiss it.
@giuseppe thoughts re ⬆️
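For concreteness, a hypothetical shape for the index described above, with the trust boundary made explicit in the API; nothing here exists in composefs today:

```go
package store

// digestIndex is a hypothetical lookaside table from "full sha256 of the file
// content" to the fsverity digest composefs actually stores objects under.
// Entries are only recorded after we have hashed the content ourselves (or it
// came from a trusted/signed source); otherwise an untrusted image could claim
// a sha256 it does not know the contents of and have the real file appear in
// its rootfs.
type digestIndex struct {
	sha256ToVerity map[string]string
}

// lookup lets a fetcher skip downloading content we already have.
func (ix *digestIndex) lookup(sha256Hex string) (verityDigest string, ok bool) {
	verityDigest, ok = ix.sha256ToVerity[sha256Hex]
	return
}

// recordVerified is the only way entries get added: the caller attests it
// computed both digests over the same bytes.
func (ix *digestIndex) recordVerified(sha256Hex, verityDigest string) {
	if ix.sha256ToVerity == nil {
		ix.sha256ToVerity = make(map[string]string)
	}
	ix.sha256ToVerity[sha256Hex] = verityDigest
}
```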
I've been thinking more about the ostree/composefs integration and longer term, I think composefs should have its own opinionated management tooling for backing store files and checkouts.
Basically we move the "GC problem" from higher level tools into a shared composefs layer - and that will greatly "thin out" what ostree needs to do, and the same for container/storage type things. And more generally, it would help drive unifying these two things which I think we do want long term. Related to this, a mounted composefs shouldn't have backing store files deleted underneath it.
Maybe we could get away with having this just be a directory, e.g. /composefs (like /ostree/repo) or perhaps /usr/.composefs. Call this $composefsdir.

Vaguely thinking perhaps we could then have $composefsdir/roots.d with namespaced subdirectories, like $composefsdir/roots.d/ostree and $composefsdir/roots.d/containers. Finally there'd be $composefsdir/files which would hold the regular files.

Then we'd have a CLI tool like /usr/libexec/composefsctl --root /composefs gc that would iterate over all composefs filesystems and GC any unreferenced regular files. In order to ensure GC doesn't race with addition we'd also need "API" operations like /usr/libexec/composefsctl add container/foo.composefs that did locking. And a corresponding composefsctl delete.
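A rough Go sketch of what such a gc operation could do under that proposed layout; listObjects (reading the object digests out of a composefs image) and the ".lock" file name are placeholders, not an existing interface:

```go
package store

import (
	"os"
	"path/filepath"
	"syscall"
)

// gc walks $composefsdir/roots.d/<namespace>/*, collects every object digest
// still referenced, and removes unreferenced entries from $composefsdir/files.
func gc(composefsdir string, listObjects func(imagePath string) ([]string, error)) error {
	// Serialize against concurrent "add"/"delete" operations.
	lock, err := os.OpenFile(filepath.Join(composefsdir, ".lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer lock.Close()
	if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}

	referenced := map[string]bool{}
	rootsDir := filepath.Join(composefsdir, "roots.d")
	namespaces, err := os.ReadDir(rootsDir)
	if err != nil {
		return err
	}
	for _, ns := range namespaces {
		images, err := os.ReadDir(filepath.Join(rootsDir, ns.Name()))
		if err != nil {
			return err
		}
		for _, img := range images {
			digests, err := listObjects(filepath.Join(rootsDir, ns.Name(), img.Name()))
			if err != nil {
				return err
			}
			for _, d := range digests {
				referenced[d] = true
			}
		}
	}

	filesDir := filepath.Join(composefsdir, "files")
	files, err := os.ReadDir(filesDir)
	if err != nil {
		return err
	}
	for _, f := range files {
		if !referenced[f.Name()] {
			os.Remove(filepath.Join(filesDir, f.Name()))
		}
	}
	return nil
}
```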