Explore Zarr V3 sharding to reduce file numbers of large OME-Zarr files #294
For the record, I made some statistics on our current file numbers using Zarr v2 and chunking at the size of the original fields of view, see below.

Analysis
For regular cases (using our 23 well example), we are very close to the target of averaging around 5 MB per file for 3D. The average file size is slightly lower in the MIP case (probably because there are more of the metadata files like .zattrs / .zarray vs. fewer individual chunks).
For the extreme case of large plates with small wells (only a few fields of view), we have proportionally more metadata files and we need more pyramid layers (9 layers instead of 5). Those pyramid layers start becoming smaller earlier. Thus, the average file size decreases to around 3.5 MB per file. It is even lower for MIPs, but MIPs are only 10% of the files. The overall average file size for the extreme case is 3.3 MB.

Conclusion
Sharding will be helpful, but it's not super urgent that we have it now. We should aim for at least 5 MB average file size; we're pretty close in regular cases and not too far off in some of the extremes.

The data
For the 23 well plate (MIP and 3D breakdown not captured here): overall average 4.6 MB.
HCA 384 well plate (MIP and 3D breakdown not captured here): overall average 3.3 MB.
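The statistics above were measured on actual plates; as a minimal sketch of how such numbers can be gathered, a stdlib-only walk over an OME-Zarr directory tree suffices (the function name `file_stats` is hypothetical, not part of any project code):

```python
import os


def file_stats(root: str) -> tuple[int, float]:
    """Return (file count, average file size in MB) for all files under root.

    A minimal sketch for gathering statistics like the ones above; counts
    every file, including metadata files such as .zattrs / .zarray, which
    is what pulls the average down for metadata-heavy plates.
    """
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    if not sizes:
        return 0, 0.0
    return len(sizes), sum(sizes) / len(sizes) / 1e6


# Usage (path is illustrative):
# n_files, avg_mb = file_stats("20200812-plate.zarr")
```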
Sharding should be part of zarr-python as of June: zarr-developers/zarr-python#1777. That will already be enough to start tests and exploration. Beyond exploration, a broader discussion will concern support for OME-NGFF v0.5 (which will likely adopt Zarr v3, see https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617).
Zarr sharding has been merged into the Python Zarr v3 code: zarr-developers/zarr-python#1111 (comment)
This should also help address the concerns raised in #54.
We should investigate how easy it will be to implement sharding for us and whether it interferes with any of our processing.
A thing to keep in mind: We can read individual chunks in parallel from a shard, but we can't write in parallel to a shard.
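The reason for that asymmetry: in Zarr v3 sharding, a shard is a single storage object that packs a grid of inner chunks plus an index of their byte offsets, so readers can fetch individual chunks with byte-range requests, while a writer has to update the shard object as a whole. A minimal sketch of the coordinate arithmetic, assuming a regular chunks-per-shard grid (the function name `shard_of` is hypothetical, not the zarr-python API):

```python
def shard_of(
    chunk_coords: tuple[int, ...], chunks_per_shard: tuple[int, ...]
) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Map an inner-chunk grid coordinate to (shard coordinate, position within shard).

    Chunks that map to the same shard coordinate live in one storage object:
    they can be read concurrently via byte ranges into that object, but any
    write to one of them rewrites the shared object, so writes to the same
    shard must be serialized.
    """
    shard = tuple(c // s for c, s in zip(chunk_coords, chunks_per_shard))
    within = tuple(c % s for c, s in zip(chunk_coords, chunks_per_shard))
    return shard, within
```

For our processing this means tasks that write in parallel (e.g. per field of view) would need to be aligned so that no two tasks write into the same shard.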