
Explore Zarr V3 sharding to reduce file numbers of large OME-Zarr files #294

jluethi opened this issue Feb 15, 2023 · 2 comments

jluethi commented Feb 15, 2023

Zarr sharding has been merged into the Python Zarr v3 code: zarr-developers/zarr-python#1111 (comment)

To improve on the concerns here: #54
We should investigate how easy it will be to implement sharding for us and whether it interferes with any of our processing.

A thing to keep in mind: We can read individual chunks in parallel from a shard, but we can't write in parallel to a shard.
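As a rough sketch of why sharding cuts file counts: with one file per chunk, the file count is the product of the per-axis chunk-grid sizes; grouping chunks into shards replaces that with the (much smaller) shard grid. The shapes below are hypothetical, chosen only to resemble a single-channel 3D well:

```python
import math

def n_files(shape, grid):
    """Number of storage objects when `shape` is partitioned into `grid`-sized blocks."""
    return math.prod(math.ceil(s / g) for s, g in zip(shape, grid))

# Hypothetical well: 19 z-planes, 9 x 8 fields of view of 2160 x 2560 pixels.
shape = (19, 9 * 2160, 8 * 2560)
chunks = (1, 2160, 2560)    # one chunk per z-plane per field of view
shards = (19, 2160, 2560)   # one shard bundles the full z-stack of a field

files_without_sharding = n_files(shape, chunks)  # one file per chunk
files_with_sharding = n_files(shape, shards)     # one file per shard

print(files_without_sharding)  # 1368
print(files_with_sharding)     # 72
```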

jluethi commented Oct 18, 2023

For the record, I made some statistics on our current file numbers using Zarr v2 and the chunking at the size of the original field of views, see below.

Analysis

For regular cases (using our 23-well example), we are very close to the target of averaging around 5 MB per file for 3D. The average file size is slightly lower in the MIP case (probably because metadata files like .zattrs / .zarray make up a larger share relative to the fewer individual chunks).
The MIP is less relevant overall, as it only accounts for a fraction of the file numbers (& storage size). The overall average file size is 4.6 MB.

For the extreme case of large plates with small wells (only a few fields of view each), we have proportionally more metadata files and we need more pyramid layers (9 layers instead of 5). Those pyramid layers become small earlier, so the average file size decreases to around 3.5 MB per file. It is even lower for MIPs, but MIPs are only 10% of the files. The overall average file size for the extreme case is 3.3 MB.

Conclusion

Sharding will be helpful, but it's not super urgent that we have it now. We should aim for at least 5 MB average file size; we're pretty close in regular cases and not too far off in some of the extremes.


The data

For the 23-well plate:

MIP:
Files: 9637
Size: 37 GB
⇒ Average: 3.8 MB

3D:
Files: 230996
Size: 1.1 TB
⇒ Average: 4.76 MB

Overall average: 4.6 MB

HCA 384 well plate:

MIP:
Files: 89815
Size: 172 GB
⇒ Average: 1.92 MB

3D:
Files: 721926
Size: 2.5 TB
⇒ Average: 3.46 MB

Overall average: 3.3 MB
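The averages above can be reproduced directly from the file counts and sizes (a small sanity-check script using decimal megabytes; note that with these units the 23-well overall comes out near 4.7 MB rather than the quoted 4.6 MB, likely a rounding or GB-vs-GiB difference):

```python
MB = 1e6  # decimal megabytes

# (number of files, total size in bytes), from the counts reported above
datasets = {
    "23-well MIP": (9637, 37e9),
    "23-well 3D": (230996, 1.1e12),
    "384-well MIP": (89815, 172e9),
    "384-well 3D": (721926, 2.5e12),
}

# Per-dataset average file size
averages = {name: size / files / MB for name, (files, size) in datasets.items()}

# Overall average per plate = total size / total file count
overall = {}
for plate in ("23-well", "384-well"):
    files = sum(f for name, (f, _) in datasets.items() if name.startswith(plate))
    size = sum(s for name, (_, s) in datasets.items() if name.startswith(plate))
    overall[plate] = size / files / MB

for name, avg in {**averages, **overall}.items():
    print(f"{name}: {avg:.2f} MB/file")
```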

@tcompa tcompa changed the title Explore Zarr sharding to reduce file numbers of large OME-Zarr files Explore Zarr V3 sharding to reduce file numbers of large OME-Zarr files May 2, 2024
tcompa commented May 2, 2024

Sharding should be part of zarr-python as of June: zarr-developers/zarr-python#1777. That will already be enough to start tests and exploration.

On top of that exploration, a broader discussion will concern support for OME-NGFF v0.5 (which will likely adopt Zarr v3; see https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617).
