Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to make multiscale image parsing more deterministic #200

Open
d-v-b opened this issue Jul 2, 2023 · 6 comments
Open

How to make multiscale image parsing more deterministic #200

d-v-b opened this issue Jul 2, 2023 · 6 comments

Comments

@d-v-b
Copy link
Contributor

d-v-b commented Jul 2, 2023

After playing around with various implementations of the 0.4 specification, I noticed some issues with multiscale images that I think we should address in the later versions of the spec.

Resolving metadata for an array requires a search

I noticed the following issue when using @thewtex's excellent ngff-zarr tool. This tool saves multiscale images using the foo.zarr/mulitscale_group/intermediate_group/scale0 layout, i.e. it adds an additional zarr group between the multiscale group and each array with image data.

In the current multiscales spec, there is no hard constraint on where the individual arrays are stored relative to the zarr group that contains the metadata for those arrays. The spec states:

Each dictionary in "datasets" MUST contain the field "path", whose value contains the path to the array for this resolution relative to the current zarr group.

The example hierarchy in the spec has the arrays stored directly in the group with the mulitscales metadata, but as demonstrated by ngff-zarr, the arrays could be anywhere in the hierarchy below that multiscales group -- a hierarchy like foo.zarr/multiscale_group/a/b/c/d/e/s0, where s0 is the first resolution level array, is compliant with the spec. Now suppose a program opens the array s0, and we want the program to access the metadata for it. The location of the metadata relative to the array is not defined, so the program must search the hierarchy in reverse, checking the metadata of each zarr group for the multiscales metadata, and then parsing it. This is onerous, and we should amend the spec to remove this burden.

Consolidated metadata would only solve part of the problem, because we would still need to search the consolidated metadata hierarchy. Instead of requiring a search, it would be a much better if the metadata for an image had a fixed location relative to the array data.

Coordinate metadata for an array may be indeterminate

The spec allows the coordinate metadata of a single array to be indeterminate via two mechanisms:

first, the same array may be referenced by two elements of the same multiscales JSON array, e.g.

example

{
    "multiscales": [
        {
            "version": "0.4",
            "name": "example_1",
            "axes": [
                {"name": "y", "type": "space", "unit": "micrometer"},
                {"name": "x", "type": "space", "unit": "micrometer"}
            ],
            "datasets": [
                {
                    "path": "0",
                    "coordinateTransformations": [{
                        "type": "scale",
                        "scale": [1.0, 1.0]
                    }]
                },
            ],
            "type": "gaussian"
            },
        {
            "version": "0.4",
            "name": "inconsistent_example",
            "axes": [
                {"name": "t", "type": "time", "unit": "second"},
                {"name": "x", "type": "space", "unit": "micrometer"}
            ],
            "datasets": [
                {
                    "path": "0",
                    "coordinateTransformations": [{
                        "type": "scale",
                        "scale": [3.0, 3.0]
                    }]
                },
            ],
            "type": "oops"
            }
    ]
}

second, because the spec allows nested multiscale groups (a side effect of the spec being insufficiently prescriptive about the allowed zarr hierarchy), the same array may be referenced by multiscales metadata in two different groups.

How to address these issues

a) Specify that a single array must be consistently described by all instances of multiscales. This would address the indeterminate coordinates problem vis a vis the spec, but would require an extra check from parsers. A better solution would be to structure the coordinate metadata so that indeterminate coordinates cannot be represented by the structure of the spec. I think c) is one way to do this.

b) mandate the exact structure of the zarr hierarchy for a multiscale image, e.g. specify exactly where the zarr arrays must be relative to the group with multiscale metadata. We should also specify that a zarr array
from The layout adopted by ngff-zarr seems fine to me, so I would propose we just codify that in the spec. If we do this, then applications consuming a single scale array will know exactly where to look to find the coordinate metadata for that array.

c) put coordinate metadata inside .zattrs for each array, and duplicate this in the multiscales group .zattrs (or duplicate it at the top-level of the hierarchy using consolidated metadata functionality). This is the simplest way to ensure that the coordinates for an array can always be resolved.

I'd be happy either writing new PRs for these, or trying to add stuff to #138 , since this arguably falls under the scope of that effort.

@clbarnes
Copy link

clbarnes commented Jul 5, 2023

The way that I've been thinking about this access pattern is that you don't do open(scale_array), you do open(group, dataset_idx, scale_idx), so the scale array is always accessed via the containing group. More complicated for point-and-click file browser type of access, certainly, but sidesteps these issues.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jul 8, 2023

I agree that treating the group with the multiscales metadata as the entrypoint is a solution, but it's not a very good outcome. The Zarr API, at least in zarr-python, looks like open(store, path_to_array), i.e. you get resource (an array) by supplying a path to that resource. Now imagine that you could only access zarr arrays via the groups that contained them, e.g. open(store, path_to_group, name_of_array). This API is needlessly indirect, and it's good that zarr-python doesn't work this way. The exact same criticism holds for the access pattern you suggest: open(group, dataset_idx, scale_idx).

But I think that's the only access pattern that can deterministically resolve the spatial metadata for a single scale level, in the current version of OME-NGFF. So we should make sure that we fix future versions of the spec so that single scale levels can be accessed with deterministic metadata retrieval.

@clbarnes
Copy link

clbarnes commented Jul 8, 2023

imagine that you could only access zarr arrays via the groups that contained them, e.g. open(store, path_to_group, name_of_array). This API is needlessly indirect, and it's good that zarr-python doesn't work this way.

I think accessing an array in isolation is semantically different to accessing a single scale level of a multiscale dataset, in a way would make it sensible to have a different API. In terms of implementation, it would be useful to have an object representing the multiscale group and then access scale levels from there; if you need to read the multiscale dataset metadata anyway in order to fully understand the scale array, I think that indirection mirrors this kind of access pattern.

Denormalising scale metadata by duplicating it feels brittle, to me.

A way to further simplify it would be to specify that all scale arrays must be children of the multiscale group (i.e. the path from multiscale group to scale array cannot include a "/"); and that multiscale groups can only contain one multiscale dataset. If you need to store several related multiscale datasets, why not have a super-group a level above? This way, all scale arrays can only be part of one multiscale datasets, and know exactly where to find its metadata.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jul 8, 2023

I think accessing an array in isolation is semantically different to accessing a single scale level of a multiscale dataset, in a way would make it sensible to have a different API.

Here we disagree. If I want to open a zarr array as a numpy array (because I don't care about chunking), can just call np.from_array(zarr_array). If I want to open a zarr array as a dask array (because I care about chunking), it's trivial to write a function with the exact same signature -- zarr_to_dask(zarr_array). Now suppose I want to open a zarr array as a chunked xarray.DataArray (because I care about chunking, coordinates, and metadata). ome_ngff_zarr_to_xarray(zarr_array) doesn't work deterministically. At least for myself, this access pattern is extremely common -- I very often want to open a single zarr array with awareness of its coordinates.

Denormalising scale metadata by duplicating it feels brittle, to me.

I agree. A more radical idea I floated in the past was to remove all coordinate metadata from multiscales, and stick it instead in array attrs (which the current ngff spec leaves completely empty). Under this scheme, multiscales.datasets would just be a list of array names, nothing more. This was opposed because loading the array metadata was a performance bottleneck for some applications.

A way to further simplify it would be to specify that all scale arrays must be children of the multiscale group (i.e. the path from multiscale group to scale array cannot include a "/")

I think something along these lines would be good, although there are tools right now (ngff-zarr) that would go out of spec if we did the exact thing you suggest.

@clbarnes
Copy link

clbarnes commented Jul 8, 2023

Can't make a specification omelette without breaking a few implementation eggs...

@bogovicj
Copy link
Contributor

bogovicj commented Jul 10, 2023

@d-v-b

Agree.

My preference is for (c).

put coordinate metadata inside .zattrs for each array, and duplicate this in the multiscales group .zattrs (or duplicate it at the top-level of the hierarchy using consolidated metadata functionality).

Enough that part of that idea is in this PR, located here.

Coordinate transformations from array to physical coordinates MUST be stored in multiscales ([[#multiscale-md]]),
and MUST be duplicated in the attributes of the zarr array.

but is not clear enough or go far enough. At a minimum:

  1. All arrays to be descendants of the multiscales group to which they belong
  2. All arrays of a multiscales are siblings ( so s0, a/s1 is disallowed).

Also useful:

  1. Them to be named in a sane way:
    a) All names for arrays end in a unique non-negative integer.
    b) The set of suffixes for paths in a multiscales is a contiguous (so the set [s0, s3] would be disallowed), with the smallest equal to zero.
    c) 0 is the highest resolution, (i+1) is lower resolution than (i)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants