
Explore Zarr V3 sharding to reduce file numbers of large OME-Zarr files #294

jluethi opened this issue Feb 15, 2023 · 2 comments

jluethi commented Feb 15, 2023

Zarr sharding has been merged into the Python Zarr v3 code: zarr-developers/zarr-python#1111 (comment)

To improve on the concerns here: #54
We should investigate how easy it will be to implement sharding for us and whether it interferes with any of our processing.

A thing to keep in mind: We can read individual chunks in parallel from a shard, but we can't write in parallel to a shard.
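As a rough sketch of why sharding cuts file counts: with one file per chunk, the file count is the product of the per-axis chunk-grid sizes; grouping chunks into shards replaces that with the (much smaller) shard grid. The shapes below are hypothetical, chosen only to resemble a single-channel 3D well:

```python
import math

def n_files(shape, grid):
    """Number of storage objects when `shape` is partitioned into `grid`-sized blocks."""
    return math.prod(math.ceil(s / g) for s, g in zip(shape, grid))

# Hypothetical well: 19 z-planes, 9 x 8 fields of view of 2160 x 2560 pixels.
shape = (19, 9 * 2160, 8 * 2560)
chunks = (1, 2160, 2560)    # one chunk per z-plane per field of view
shards = (19, 2160, 2560)   # one shard bundles the full z-stack of a field

files_without_sharding = n_files(shape, chunks)  # one file per chunk
files_with_sharding = n_files(shape, shards)     # one file per shard

print(files_without_sharding)  # 1368
print(files_with_sharding)     # 72
```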

jluethi commented Oct 18, 2023

For the record, I made some statistics on our current file numbers using Zarr v2 and the chunking at the size of the original field of views, see below.

Analysis

For regular cases (using our 23-well example), we are very close to the target of averaging around 5 MB per file for 3D. The average file size is slightly lower in the MIP case (probably because metadata files like .zattrs / .zarray make up a larger share relative to the fewer individual chunks).
The MIP is less relevant overall, as it only accounts for a fraction of the file numbers (& storage size). The overall average file size is 4.6 MB.

For the extreme case of large plates with small wells (only a few fields of view each), we have proportionally more metadata files and we need more pyramid layers (9 layers instead of 5). Those pyramid layers become small earlier, so the average file size decreases to around 3.5 MB per file. It is even lower for MIPs, but MIPs are only 10% of the files. The overall average file size for the extreme case is 3.3 MB.

Conclusion

Sharding will be helpful, but it's not super urgent that we have it now. We should aim for at least 5 MB average file size; we're pretty close in regular cases and not too far off in some of the extremes.


The data

For the 23-well plate:

MIP:
Files: 9637
Size: 37 GB
⇒ Average: 3.8 MB

3D:
Files: 230996
Size: 1.1 TB
⇒ Average: 4.76 MB

Overall average: 4.6 MB

HCA 384 well plate:

MIP:
Files: 89815
Size: 172 GB
⇒ Average: 1.92 MB

3D:
Files: 721926
Size: 2.5 TB
⇒ Average: 3.46 MB

Overall average: 3.3 MB
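The averages above can be reproduced directly from the file counts and sizes (a small sanity-check script using decimal megabytes; note that with these units the 23-well overall comes out near 4.7 MB rather than the quoted 4.6 MB, likely a rounding or GB-vs-GiB difference):

```python
MB = 1e6  # decimal megabytes

# (number of files, total size in bytes), from the counts reported above
datasets = {
    "23-well MIP": (9637, 37e9),
    "23-well 3D": (230996, 1.1e12),
    "384-well MIP": (89815, 172e9),
    "384-well 3D": (721926, 2.5e12),
}

# Per-dataset average file size
averages = {name: size / files / MB for name, (files, size) in datasets.items()}

# Overall average per plate = total size / total file count
overall = {}
for plate in ("23-well", "384-well"):
    files = sum(f for name, (f, _) in datasets.items() if name.startswith(plate))
    size = sum(s for name, (_, s) in datasets.items() if name.startswith(plate))
    overall[plate] = size / files / MB

for name, avg in {**averages, **overall}.items():
    print(f"{name}: {avg:.2f} MB/file")
```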

@tcompa tcompa changed the title Explore Zarr sharding to reduce file numbers of large OME-Zarr files Explore Zarr V3 sharding to reduce file numbers of large OME-Zarr files May 2, 2024
tcompa commented May 2, 2024

Sharding should be part of zarr-python as of June: zarr-developers/zarr-python#1777. That will already be enough to start tests and exploration.

On top of that exploration, a broader discussion will concern support for OME-NGFF v0.5 (which will likely adopt Zarr v3; see https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617).
