Virtual Dataset Workflow Tracking Issue #197

Open
4 of 5 tasks
mpiannucci opened this issue Oct 12, 2024 · 8 comments
Labels
virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references

Comments

@mpiannucci
Contributor

mpiannucci commented Oct 12, 2024

To create and use virtual datasets with Python, users will want to use kerchunk and VirtualiZarr. Both are just starting down the path to zarr 3 and icechunk compatibility. This issue will be used to track progress and relevant PRs:

All of this can be installed with pip. However, we need to install in three steps for now to avoid version conflicts:

pip install icechunk xarray VirtualiZarr

pip install git+https://github.com/mpiannucci/kerchunk@v3

This assumes also having fsspec and s3fs installed:

pip install fsspec s3fs
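
Because the install is split into steps to avoid version conflicts, it can help to confirm what actually ended up in the environment before running the example. The following is a minimal sketch using only the standard library; the helper name installed_versions and the package list are illustrative assumptions, not part of any of these projects.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration only.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the pip commands above (assumed distribution names).
    wanted = ["icechunk", "xarray", "virtualizarr", "kerchunk", "fsspec", "s3fs"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A None entry means pip did not install that distribution into the active environment.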

With all of this installed, HDF5 virtual datasets currently work like this:

import xarray as xr
from virtualizarr import open_virtual_dataset

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20221001T0000Z/20221001T0000Z-PT0000H00M-CAPE_mixed_layer_lowest_500m.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create xarray dataset
ds = open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
storage = StorageConfig.filesystem(str('ukmet'))
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='eu-west-2'),
))

# use virtualizarr to write the dataset to icechunk
ds.virtualize.to_icechunk(store)

# commit to save progress
store.commit(message="Initial commit")

# open it back up
ds = xr.open_zarr(store, zarr_version=3, consolidated=False)

# plot!
ds.atmosphere_convective_available_potential_energy.plot()

[Image: plot output]

Updated 11/13/2024
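
For readers new to the term, a virtual chunk reference records where a chunk's bytes live inside an existing file (a location, a byte offset, and a length) instead of copying the data. The sketch below illustrates that idea with a local temporary file. The ChunkRef class and read_chunk function are hypothetical names, not the kerchunk, VirtualiZarr, or icechunk API, and a real store would issue S3 range requests and then decode the bytes with the array's codecs.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for a virtual chunk reference."""
    path: str    # where the original file lives
    offset: int  # byte offset of the chunk within that file
    length: int  # number of bytes in the chunk


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch only the referenced byte range from the original file."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Demo: a fake "archive" file with a 6-byte header followed by two 7-byte chunks.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"HEADER" + b"chunk-0" + b"chunk-1")
    refs = {
        "0": ChunkRef(str(source), offset=6, length=7),
        "1": ChunkRef(str(source), offset=13, length=7),
    }
    demo_chunks = {key: read_chunk(ref) for key, ref in refs.items()}
```

The original file is never rewritten; only the small table of references needs to be stored and committed.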

@maxrjones

maxrjones commented Oct 17, 2024

This is so awesome, thank you for open sourcing your work and the impressive documentation/issue tracking!

Just wanted to share the snippet below that works for me, since there have been some changes on those branches since this code was posted. In particular, only dataset_to_icechunk is available, and storage_options is required for successful execution. Also, if it's helpful for anyone working on JupyterHubs, quay.io/developmentseed/warp-resample-profiling:eac145edd638 has all the dependencies installed in the order you specified.

import xarray as xr
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20221001T0000Z/20221001T0000Z-PT0000H00M-CAPE_mixed_layer_lowest_500m.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create xarray dataset
ds = open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
storage = StorageConfig.filesystem(str('ukmet'))
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='eu-west-2'),
))

# use virtualizarr to write the dataset to icechunk
dataset_to_icechunk(ds, store)

# commit to save progress
store.commit(message="Initial commit")

# open it back up
ds = xr.open_zarr(store, zarr_version=3, consolidated=False)

# plot!
ds.atmosphere_convective_available_potential_energy.plot()

@mpiannucci
Contributor Author

Thanks @maxrjones!! I updated the code sample up top to match, just to make sure it's all on the same page.

@mpiannucci
Contributor Author

mpiannucci commented Oct 22, 2024

Icechunk support was merged to VirtualiZarr main! zarr-developers/VirtualiZarr#256

I updated the top post with the latest instructions

Edit: And released!! https://virtualizarr.readthedocs.io/en/latest/generated/virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk.html#virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk

@mpiannucci
Contributor Author

mpiannucci commented Oct 23, 2024

I listed out a current breakdown of the work to be done in kerchunk here, if anyone is interested in helping to drive this effort forward!

@martindurant

I wonder, do we have examples of supermassive iced datasets yet, with millions of references? I wanted to see how the msgpack format stacks up against kerchunk's parquet format, particularly the ability to only load partitions of the reference data.
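
To make the point about loading only partitions of the reference data concrete, here is a toy model of partitioned lookup. It is a sketch of the general idea only, not kerchunk's parquet layout or icechunk's msgpack format; the LazyRefs class, the RECORD_SIZE value, and the bucket URL are all hypothetical.

```python
from collections import defaultdict

RECORD_SIZE = 4  # hypothetical partition size; a real store would use far more


def partition_of(chunk_index, record_size=RECORD_SIZE):
    """Return the partition number that holds a given chunk index."""
    return chunk_index // record_size


def build_partitions(refs, record_size=RECORD_SIZE):
    """Group a flat {chunk_index: reference} mapping into partitions."""
    partitions = defaultdict(dict)
    for index, ref in refs.items():
        partitions[partition_of(index, record_size)][index] = ref
    return dict(partitions)


class LazyRefs:
    """Loads a partition only when one of its chunks is first requested."""

    def __init__(self, partitions, record_size=RECORD_SIZE):
        self._stored = partitions  # stands in for partition files on disk
        self._record_size = record_size
        self._loaded = {}

    def __getitem__(self, chunk_index):
        part = partition_of(chunk_index, self._record_size)
        if part not in self._loaded:
            # Simulates reading one partition file instead of the whole table.
            self._loaded[part] = self._stored[part]
        return self._loaded[part][chunk_index]

    @property
    def loaded_partitions(self):
        return sorted(self._loaded)


# Ten fake references of the form (url, offset, length); the URL is made up.
refs = {i: ("s3://example-bucket/source.nc", i * 100, 100) for i in range(10)}
lazy = LazyRefs(build_partitions(refs))
first = lazy[9]
```

After the single lookup, only partition 2 has been loaded; the other partitions stay untouched until a chunk in them is requested.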

@TomNicholas TomNicholas added the virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references label Nov 7, 2024
@mpiannucci
Contributor Author

mpiannucci commented Nov 13, 2024

numcodecs 0.14.0 is out with included support for zarr 3 codecs using the numcodecs. prefix. I have updated the installation instructions in the op.

The last piece of this puzzle is getting kerchunk fully working with zarr 3 stores, which is a work in progress.

@TomNicholas
Contributor

> numcodecs 0.14.0 is out with included support for zarr 3 codecs using the numcodecs. prefix. I have updated the installation instructions in the op.

Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?

@TomNicholas
Contributor

> I wonder, do we have examples of supermassive iced datasets yet, with millions of references?

I tried 100 million virtual references in #401, which kind of already works. (Which is surprising, given that no effort has gone into optimizing anything yet!)

> Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?

(This was done in zarr-developers/VirtualiZarr#301)
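
For a sense of how reference counts reach the millions discussed above: each chunk of each array needs one virtual reference, so the count for one array is the product of the per-axis chunk counts. The sketch below is a back-of-the-envelope helper; the function name count_chunks and the example shape and chunking are assumptions for illustration, not figures from #401.

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks (and so virtual references) needed for one array."""
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of axes")
    # Ceiling division per axis, since a partial chunk at the edge still counts.
    return math.prod(-(-size // chunk) for size, chunk in zip(shape, chunks))


# Hypothetical array: hourly steps for a year on a 1920 x 2560 grid,
# stored one time step per chunk with the grid split into four tiles.
hourly_refs = count_chunks((8760, 1920, 2560), (1, 960, 1280))
```

Here hourly_refs is 35040 for a single variable, so many variables over many years add up quickly.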
