S3 byte address for chunks; was API access to NetCDF/HDF chunk index #2754

bnlawrence · 2023-09-26T07:54:20Z

Hi Folks

We are in the process of building computational storage support for NetCDF data (repo). To do this, we need access to the NetCDF chunk index [1]. Currently we do this via kerchunk and the Zarr indexer (but not by using Zarr itself, as we do not want either Zarr or NetCDF to load the chunk into memory; that would defeat the point of computational storage). The JSON format of the kerchunk index is a huge overhead, but it must be the case that the NetCDF machinery effectively does the same thing.

Is there some underlying library interface we could get to that does this, or can someone point us to the relevant code? (I wouldn't know where to start in the C library ... sadly).

Cheers
Bryan

[1] We are not doing classic mode, so I believe this is a b-tree index which must exist somewhere. Once we've found it, we want to be able to go from slice notation to the elements we need, then use the index to tell the storage where the relevant chunks lie.
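The first half of that footnote (going from slice notation to the chunks that hold the requested elements) can be sketched in a few lines. This is a minimal illustration, not code from netCDF, HDF5, Zarr or kerchunk: the function name `chunks_for_selection` and the example shapes are hypothetical, and it assumes one positive-step slice per dimension on a regular chunk grid.

```python
from itertools import product


def chunks_for_selection(selection, shape, chunk_shape):
    """Return the chunk-grid coordinates touched by a tuple of slices.

    Hypothetical helper: assumes one slice per dimension, positive steps,
    and a regular chunk grid of size chunk_shape over an array of size shape.
    """
    per_dim = []
    for sl, n, c in zip(selection, shape, chunk_shape):
        # slice.indices() resolves None and negative bounds against length n.
        start, stop, step = sl.indices(n)
        if step <= 0:
            raise ValueError("only positive steps are handled in this sketch")
        # Element i lives in chunk i // c along this dimension. Walking the
        # elements is simple rather than fast, which is fine for a sketch.
        per_dim.append(sorted({i // c for i in range(start, stop, step)}))
    # The touched chunks are the Cartesian product of the per-dimension lists.
    return list(product(*per_dim))


# A 100 x 100 variable stored in 10 x 25 chunks, reading [5:25, 30:60].
print(chunks_for_selection((slice(5, 25), slice(30, 60)), (100, 100), (10, 25)))
```

Each coordinate tuple returned is what would then be looked up in the chunk index to find where that chunk sits in the file.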

edwardhartnett · 2023-09-26T08:49:57Z

HDF5 no doubt provides this, but I don't know how. If we could identify how, we could consider adding it to the netCDF API. An alternative might be to add a netCDF API call that can return the hid_t ID of the open file (i.e. the HDF5 file ID). Then the HDF5 API could be used directly, without trouble...

bnlawrence · 2023-09-26T08:57:01Z

Author

Thanks Ed. Sounds like we need to explore the HDF options for doing this, assuming we have the hid_t ID, and if we can do what we need to do there, come back here and make a request. Will keep this discussion updated with progress.

2 replies

edwardhartnett Sep 26, 2023

I wish I could be more help, but I'm only a part-time, occasional netCDF contributor these days; my real work is the NOAA GRIB libraries now. ;-)

bnlawrence Sep 26, 2023
Author

Ah, no drama, this particular rabbit hole was always going to be labyrinthacious :-).

DennisHeimbigner · 2023-09-26T18:38:05Z

Collaborator

There is an HDF5 API for accessing chunks directly called H5Dread_chunk.
It is documented here: https://docs.hdfgroup.org/hdf5/v1_12/group___h5_d.html#title26
We (netcdf-c) use it for testing our Zarr chunking code.
This is via a program netcdf-c/nczarr_test/ncdumpchunks.c.

bnlawrence · 2023-09-27T08:54:15Z

Author

Ah, that's helpful, because we can go chasing its use in HDF code. The issue for us is not so much reading the chunk (we can do that); the issue is finding the chunk address from the chunk index. If you're using Zarr, you use their multi indexer (which would likely exploit a kerchunk index for data that is not Zarr-native). @valeriupredoi can you please chase this up?
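To make "finding the chunk address from the chunk index" concrete, here is a minimal sketch of that lookup against a kerchunk-style reference set, where each chunk key maps to a `[url, offset, length]` entry. It is an illustration only: the bucket, file and variable names and every offset and size are invented, `chunk_location` is a hypothetical helper, and it assumes the default `.` separator in chunk keys.

```python
# Hypothetical kerchunk-style (version 1) reference set. The bucket, file and
# variable names, and every offset and size, are invented for illustration.
REFS = {
    "version": 1,
    "refs": {
        "tas/0.0": ["s3://example-bucket/example.nc", 8192, 1530],
        "tas/0.1": ["s3://example-bucket/example.nc", 9722, 1498],
    },
}


def chunk_location(refs, variable, chunk_coords):
    """Return (url, byte_offset, n_bytes) for one chunk, or None if absent."""
    # Chunk keys look like "<variable>/<i>.<j>..." (assumed "." separator).
    key = variable + "/" + ".".join(str(i) for i in chunk_coords)
    entry = refs["refs"].get(key)
    if entry is None:
        return None  # no reference stored for this chunk
    if not isinstance(entry, list) or len(entry) != 3:
        # Inline data or a whole-file reference carries no byte range.
        raise ValueError(f"{key} has no byte range")
    url, offset, size = entry
    return url, offset, size


print(chunk_location(REFS, "tas", (0, 1)))
```

The offset and size returned are for the chunk as stored, so for a compressed variable the size is the compressed size.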

bnlawrence · 2023-10-05T11:29:38Z

Author

After reading a bit of the netCDF C code (*), I realise I asked the wrong question, so it is probably worth putting it differently. If we assume we know how to get to your NC_get_var with all the right arguments, what we know (think?) happens next is that:

(for a local HDF5 file) you dispatch that to an HDF call to load the data from disk. I can't find the exact method you call (I'm sorry, I got lost in the discussion of dispatch tables). Can someone please point me to that?
(for a remote S3 instance of an HDF5 file) you must (?) somehow work out where that chunk is and do a range get, so you have a filename, a byte address, and a size in bytes for what you need to unpack (yourselves, presumably by getting those bytes and then passing them through the HDF file). For our purposes, it's that byte address and size in bytes (the size will depend on compression) that we really want. Is this really how you do it for S3?
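For reference, this is roughly what a "range get" built from a byte address and a size looks like at the HTTP level. It is a sketch of the general mechanism only, not a description of how netcdf-c or HDF5 talks to S3; the URL, offset and size are invented, and `range_request` is a hypothetical helper that builds the request without sending it.

```python
from urllib.request import Request


def range_request(url, offset, size):
    """Build (but do not send) an HTTP GET for `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # The end position in an HTTP Range header is inclusive, hence the -1.
    last = offset + size - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Invented example values: a 1530-byte chunk stored at byte 8192.
req = range_request("https://example-bucket.s3.amazonaws.com/example.nc", 8192, 1530)
print(req.get_header("Range"))
```

The bytes that come back are the stored chunk, so any compression or other filters still have to be undone by whoever receives them.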

(*) For "reading C code", imagine me in some foreign country reading a menu, ordering something, and hoping I get something vaguely edible and vaguely related to what I thought I was ordering.
