How to make multiscale image parsing more deterministic #200
Comments
The way that I've been thinking about this access pattern is that you don't do |
I agree that treating the group with the But I think that's the only access pattern that can deterministically resolve the spatial metadata for a single scale level, in the current version of OME-NGFF. So we should make sure that we fix future versions of the spec so that single scale levels can be accessed with deterministic metadata retrieval. |
I think accessing an array in isolation is semantically different from accessing a single scale level of a multiscale dataset, in a way that would make it sensible to have a different API. In terms of implementation, it would be useful to have an object representing the multiscale group and then access scale levels from there; if you need to read the multiscale dataset metadata anyway in order to fully understand the scale array, I think that indirection mirrors this kind of access pattern. Denormalising scale metadata by duplicating it feels brittle to me. A way to simplify it further would be to specify that all scale arrays must be children of the multiscale group (i.e. the path from multiscale group to scale array cannot include a "/"), and that multiscale groups can only contain one multiscale dataset. If you need to store several related multiscale datasets, why not have a super-group a level above? This way, each scale array can only be part of one multiscale dataset, and knows exactly where to find its metadata. |
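To make the proposed constraint concrete, here is a minimal sketch of a check for it. It assumes the group attributes have already been loaded into a plain dict; the function name `check_flat_multiscale_group` is hypothetical and not part of any existing tool.

```python
def check_flat_multiscale_group(attrs: dict) -> list:
    """Return a list of violations of the proposed (not yet specified) layout rule."""
    problems = []
    multiscales = attrs.get("multiscales", [])
    # Proposed rule 1: a multiscale group holds exactly one multiscale dataset.
    if len(multiscales) != 1:
        problems.append(
            f"expected exactly 1 multiscales entry, found {len(multiscales)}"
        )
    # Proposed rule 2: every scale array is a direct child of the group,
    # so its relative path must not contain a "/".
    for multiscale in multiscales:
        for dataset in multiscale.get("datasets", []):
            if "/" in dataset["path"]:
                problems.append(f"path {dataset['path']!r} is not a direct child")
    return problems


# Hypothetical attributes for illustration only.
flat_attrs = {"multiscales": [{"datasets": [{"path": "s0"}, {"path": "s1"}]}]}
nested_attrs = {"multiscales": [{"datasets": [{"path": "intermediate_group/s0"}]}]}

flat_problems = check_flat_multiscale_group(flat_attrs)
nested_problems = check_flat_multiscale_group(nested_attrs)
```

With a rule like this, the metadata for any scale array would always live in its immediate parent group.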
Here we disagree. If I want to open a zarr array as a numpy array (because I don't care about chunking), can just call
I agree. A more radical idea I floated in the past was to remove all coordinate metadata from
I think something along these lines would be good, although there are tools right now (ngff-zarr) that would go out of spec if we did the exact thing you suggest. |
Can't make a specification omelette without breaking a few implementation eggs... |
Agree. My preference is for (c).
Enough that part of that idea is in this PR, located here.
but it is not clear enough and does not go far enough. At a minimum:
Also useful:
|
After playing around with various implementations of the 0.4 specification, I noticed some issues with multiscale images that I think we should address in the later versions of the spec.
Resolving metadata for an array requires a search
I noticed the following issue when using @thewtex's excellent `ngff-zarr` tool. This tool saves multiscale images using the `foo.zarr/multiscale_group/intermediate_group/scale0` layout, i.e. it adds an additional zarr group between the multiscale group and each array with image data.
In the current `multiscales` spec, there is no hard constraint on where the individual arrays are stored relative to the zarr group that contains the metadata for those arrays. The spec states:
Each dictionary in "datasets" MUST contain the field "path", whose value contains the path to the array for this resolution relative to the current zarr group.
The example hierarchy in the spec has the arrays stored directly in the group with the `multiscales` metadata, but as demonstrated by `ngff-zarr`, the arrays could be anywhere in the hierarchy below that `multiscales` group -- a hierarchy like `foo.zarr/multiscale_group/a/b/c/d/e/s0`, where `s0` is the first resolution level array, is compliant with the spec. Now suppose a program opens the array `s0`, and we want the program to access the metadata for it. The location of the metadata relative to the array is not defined, so the program must search the hierarchy in reverse, checking the metadata of each zarr group for the `multiscales` metadata, and then parsing it. This is onerous, and we should amend the spec to remove this burden.
Consolidated metadata would only solve part of the problem, because we would still need to search the consolidated metadata hierarchy. Instead of requiring a search, it would be much better if the metadata for an image had a fixed location relative to the array data.
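The reverse search described above can be sketched roughly as follows. This is an illustration only: it models the hierarchy as a plain dict mapping group paths to their attributes rather than using a zarr library, and the function name `find_multiscales_group` and the example attributes are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def find_multiscales_group(attrs_by_group: dict, array_path: str) -> Optional[str]:
    """Walk up from the array and return the first ancestor group whose
    multiscales metadata references it, or None if no group does."""
    array = PurePosixPath(array_path)
    # `parents` yields the closest ancestor first, then moves towards the root.
    for group in array.parents:
        attrs = attrs_by_group.get(str(group), {})
        for multiscale in attrs.get("multiscales", []):
            for dataset in multiscale.get("datasets", []):
                # "path" is relative to the group that holds the metadata.
                if group / dataset["path"] == array:
                    return str(group)
    return None


# Hypothetical hierarchy: the array sits five groups below its metadata.
attrs_by_group = {
    "foo.zarr/multiscale_group": {
        "multiscales": [{"datasets": [{"path": "a/b/c/d/e/s0"}]}]
    },
    "foo.zarr/multiscale_group/a": {},
}

found = find_multiscales_group(
    attrs_by_group, "foo.zarr/multiscale_group/a/b/c/d/e/s0"
)
missing = find_multiscales_group(attrs_by_group, "foo.zarr/other_group/s0")
```

Every ancestor group has to be inspected until a match is found, and with a real store each of those lookups is a separate metadata read.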
Coordinate metadata for an array may be indeterminate
The spec allows the coordinate metadata of a single array to be indeterminate via two mechanisms:
first, the same array may be referenced by two elements of the same `multiscales` JSON array.
second, because the spec allows nested multiscale groups (a side effect of the spec being insufficiently prescriptive about the allowed zarr hierarchy), the same array may be referenced by `multiscales` metadata in two different groups.
How to address these issues
a) Specify that a single array must be consistently described by all instances of `multiscales`. This would address the indeterminate coordinates problem vis a vis the spec, but would require an extra check from parsers. A better solution would be to structure the coordinate metadata so that indeterminate coordinates cannot be represented by the structure of the spec. I think c) is one way to do this.
b) Mandate the exact structure of the zarr hierarchy for a multiscale image, e.g. specify exactly where the zarr arrays must be relative to the group with multiscale metadata. We should also specify that a zarr array from The layout adopted by `ngff-zarr` seems fine to me, so I would propose we just codify that in the spec. If we do this, then applications consuming a single scale array will know exactly where to look to find the coordinate metadata for that array.
c) Put coordinate metadata inside `.zattrs` for each array, and duplicate this in the multiscales group `.zattrs` (or duplicate it at the top level of the hierarchy using consolidated metadata functionality). This is the simplest way to ensure that the coordinates for an array can always be resolved.
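The extra check that option a) would require from parsers might look something like the sketch below. It is a simplification under stated assumptions: it only compares the per-dataset `coordinateTransformations` within one group's `multiscales` list, ignores any transformations declared at the multiscale level, and does not resolve paths across nested groups. The function name `find_conflicting_arrays` is hypothetical.

```python
import json
from collections import defaultdict


def find_conflicting_arrays(multiscales: list) -> dict:
    """Return {path: [serialized transforms]} for every array that is
    described by more than one distinct set of coordinate transformations."""
    seen = defaultdict(set)
    for multiscale in multiscales:
        for dataset in multiscale.get("datasets", []):
            # Serialize with sorted keys so equal metadata compares equal.
            key = json.dumps(dataset.get("coordinateTransformations"), sort_keys=True)
            seen[dataset["path"]].add(key)
    return {path: sorted(keys) for path, keys in seen.items() if len(keys) > 1}


# Hypothetical metadata: the same array "s0" is given two different scales.
multiscales = [
    {"datasets": [{"path": "s0", "coordinateTransformations": [
        {"type": "scale", "scale": [1.0, 1.0]}]}]},
    {"datasets": [{"path": "s0", "coordinateTransformations": [
        {"type": "scale", "scale": [2.0, 2.0]}]}]},
]

conflicts = find_conflicting_arrays(multiscales)
```

A parser following option a) would have to run a check of this kind on every read, which is the cost that options b) and c) aim to avoid.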
I'd be happy either writing new PRs for these, or trying to add them to #138, since this arguably falls under the scope of that effort.