Draft: Add mode to read consolidated ZARR datasets #2992

Open
wants to merge 1 commit into
base: main

Conversation

mannreis
Contributor

This change adds a mode option, mode=consolidated (perhaps it would be best to make this the default when reading, and fall back if it fails), that fetches a possibly existing .zmetadata file from the root of the dataset. That file could serve as a unified representation to be used whenever group or variable metadata is needed further down the code path.

This is a WIP motivated by #2987 and lacks (at least):

  • Unit testing
  • Functional testing
  • Support Zarr V3
  • Robustness when consolidated metadata is not available
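
For illustration only (this is not code from the PR), the intended read path could be sketched in Python as follows. The `store` dict is a hypothetical in-memory stand-in for a Zarr v2 dataset, `read_metadata` is an invented name, and the `.zarray` content is abbreviated. The layout of `.zmetadata` (a `metadata` object keyed by the path of each metadata file) follows what zarr-python writes for Zarr v2.

```python
import json

METADATA_FILES = (".zgroup", ".zarray", ".zattrs")


def read_metadata(store, consolidated=True):
    """Return a dict mapping metadata keys to parsed JSON documents.

    `store` is a hypothetical mapping of object key -> raw bytes.
    """
    if consolidated and ".zmetadata" in store:
        # One read: every per-object document sits under "metadata".
        return dict(json.loads(store[".zmetadata"])["metadata"])
    # Fallback: parse each per-object metadata file individually.
    # Iterating the store here stands in for a listing operation.
    return {
        key: json.loads(raw)
        for key, raw in store.items()
        if key.rsplit("/", 1)[-1] in METADATA_FILES
    }


# Abbreviated example dataset (a real .zarray carries more fields).
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "temp/.zarray": b'{"zarr_format": 2, "shape": [4], "chunks": [2]}',
    "temp/.zattrs": b'{"units": "K"}',
    "temp/0": b"\x00" * 8,
}
unconsolidated = read_metadata(store)

# Add a consolidated document and read again through the fast path.
store[".zmetadata"] = json.dumps(
    {"zarr_consolidated_format": 1, "metadata": unconsolidated}
).encode()
consolidated = read_metadata(store)
```

Both paths yield the same mapping, which is what would let the rest of the code stay unaware of where the metadata came from.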

@DennisHeimbigner
Collaborator

The way I planned to do the consolidated metadata (aside: I would like a shorter term than "consolidated")
for netcdf is to create another dispatch layer for accessing the various metadata pieces.
So for v2, this would wrap read/write of .zgroup, .zarray, and .zattrs.
For v3, this would wrap access to zarr.json.

@florianziemen

How about using csd as a shorthand for consolidated (maybe even make both variants legal options)?

Personally, I would prefer to make consolidated the default, and fall back to unconsolidated, if no .zmetadata file is available (or the user explicitly asks for unconsolidated), but I would also understand if you prefer not to change existing behavior of libnetcdf...

@DennisHeimbigner
Collaborator

It occurs to me to ask: why is the consolidated metadata in a separate .zmetadata file rather than in the root group's zarr.json?

@florianziemen

No idea why, but it is handled that way in zarr python for zarr2 ...

See your question here: zarr-developers/zarr-python#720

@mannreis
Contributor Author

> The way I planned to do the consolidated metadata (aside: I would like a shorter term than "consolidated") for netcdf is to create another dispatch layer for accessing the various metadata pieces. So for v2, this would wrap read/write of .zgroup, .zarray, and .zattrs. For v3, this would wrap access to zarr.json.

You mean adding a block of function pointers to NC_Dispatch (https://github.com/Unidata/netcdf-c/blob/main/include/netcdf_dispatch.h.in#L34) that would handle the metadata(-file) operations for Zarr? I was picturing something internal to the NCZ_* layer, but I don't have a really good overview of the code design.

@DennisHeimbigner
Collaborator

> You mean adding a block of function pointers to NC_Dispatch

No, I was thinking of an internal dispatch table. When I added support
for Zarr version 3, I created a dispatch table discriminated on the version.
I then constructed some code to look at the URL and the Zarr dataset
to infer which version to use. I would do the same for the metadata
dispatcher but discriminating on consolidated or not.
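
A rough sketch of such an internal table, written in Python for brevity rather than in the C of the library. Every name here (`MetadataDispatch`, `DISPATCH`, `infer_dispatch`) is hypothetical and does not come from netcdf-c; the store is again a plain key-to-bytes mapping, and the `.zarray` content is abbreviated.

```python
import json
from typing import Callable, Mapping, NamedTuple, Optional

Store = Mapping[str, bytes]


class MetadataDispatch(NamedTuple):
    """One row of the (hypothetical) internal dispatch table."""
    name: str
    read_object: Callable[[Store, str], dict]


def _read_plain(store: Store, key: str) -> dict:
    # Unconsolidated: each metadata file is its own object.
    return json.loads(store[key])


def _read_consolidated(store: Store, key: str) -> dict:
    # Consolidated: look the key up inside .zmetadata.
    # (A real implementation would parse and cache this once.)
    return json.loads(store[".zmetadata"])["metadata"][key]


DISPATCH = {
    "plain": MetadataDispatch("plain", _read_plain),
    "consolidated": MetadataDispatch("consolidated", _read_consolidated),
}


def infer_dispatch(store: Store, requested: Optional[str] = None) -> MetadataDispatch:
    """Honor an explicit request, else infer the mode from the dataset."""
    if requested is not None:
        return DISPATCH[requested]
    return DISPATCH["consolidated" if ".zmetadata" in store else "plain"]


zarray = {"zarr_format": 2, "shape": [4], "chunks": [2]}
plain_store = {"temp/.zarray": json.dumps(zarray).encode()}
csd_store = {
    ".zmetadata": json.dumps(
        {"zarr_consolidated_format": 1, "metadata": {"temp/.zarray": zarray}}
    ).encode()
}

plain = infer_dispatch(plain_store)
csd = infer_dispatch(csd_store)
```

Callers only ever go through `read_object`, so adding a v3 variant would mean adding a row to the table rather than touching call sites.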

@mannreis
Contributor Author

I finally understood that you're referring to the implementation in the branches of your fork! I'm taking a look at zarrv3b.tmp; is this the branch you envision merging?
Just to clarify, what we'd like is that, when opening a consolidated dataset (without authentication), one could point to a "vanilla HTTP" server. This means that HTTP-S3-specific requests like method=list-bucket-v2 would be avoided (when the data is consolidated) or delayed (when it is not). Is this a sensible requirement?
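
To make the point about avoiding list requests concrete, here is a small Python sketch, assuming the Zarr v2 consolidated layout written by zarr-python. `list_hierarchy` is an invented helper and the metadata documents are abbreviated. The group and array names fall out of the keys of a single `.zmetadata` document, so no bucket listing is needed to discover them.

```python
import json


def list_hierarchy(zmetadata_bytes):
    """Derive group and array paths from one consolidated document."""
    meta = json.loads(zmetadata_bytes)["metadata"]
    groups, arrays = [], []
    for key in meta:
        # Split "grp/temp/.zarray" into parent "grp/temp" and leaf ".zarray".
        parent, _, leaf = key.rpartition("/")
        if leaf == ".zgroup":
            groups.append(parent or "/")  # empty parent means the root group
        elif leaf == ".zarray":
            arrays.append(parent)
    return sorted(groups), sorted(arrays)


# The bytes a plain HTTP GET of <dataset>/.zmetadata might return.
zmetadata = json.dumps(
    {
        "zarr_consolidated_format": 1,
        "metadata": {
            ".zgroup": {"zarr_format": 2},
            "grp/.zgroup": {"zarr_format": 2},
            "grp/temp/.zarray": {"zarr_format": 2, "shape": [4]},
            "grp/temp/.zattrs": {"units": "K"},
        },
    }
).encode()

groups, arrays = list_hierarchy(zmetadata)
```

After that, chunk keys can be computed from each array's shape and chunk sizes, so the remaining requests are ordinary GETs for known paths.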

3 participants