blosc performance investigation #9

melissalinkert · 2024-09-13T13:39:24Z

As discussed in the most recent NGFF challenge call, I tried to reproduce the behavior noted in zarr-developers/zarr-python#2171 here.

The short version is that I have not so far been able to reproduce this exact problem. blosc with cname=zstd produces effectively the same total output size, with some small discrepancies in individual chunks. cc @joshmoore

I tested with these 3 public datasets:

https://downloads.openmicroscopy.org/images/PerkinElmer-Operetta/omer/ (plate)
https://downloads.openmicroscopy.org/images/Hamamatsu-NDPI/openslide/CMU-1/ (slide)
stack&sizeX=8192&sizeY=8192&sizeZ=4000&pixelType=uint16.fake (simulating dimensions of an EM volume)

With bioformats2raw 0.9.4:

$ bin/bioformats2raw -p $SLIDE/CMU-1.ndpi ~/data/blosc-test/v2/slide.zarr --compression blosc --compression-properties "cname=zstd"
$ bin/bioformats2raw -p $PLATE/006P_M3/006P__2017-08-19T12_42_59-Measurement\ 3/Images/Index.idx.xml ~/data/blosc-test/v2/plate.zarr --compression blosc --compression-properties "cname=zstd"
$ bin/bioformats2raw -p "test&sizeX=8192&sizeY=8192&sizeZ=4000.fake" ~/data/blosc-test/v2/zstack.zarr --compression blosc --compression-properties "cname=zstd"

Note that omitting --compression-properties "cname=zstd" will use the lz4 cname instead; see https://github.com/zarr-developers/jzarr/blob/533c8bb4197f57bd664edc6c62b3f5cd0de262ba/src/main/java/com/bc/zarr/CompressorFactory.java#L220. For the slide test, the default lz4 cname results in an output dataset approximately twice the size of the one produced with the zstd cname.

Converting the generated v2 datasets to v3, using the current state of #8:

$ bin/zarr2zarr ~/data/blosc-test/v2/slide.zarr/ ~/data/blosc-test/v3/slide.zarr --compression blosc
$ bin/zarr2zarr ~/data/blosc-test/v2/plate.zarr/ ~/data/blosc-test/v3/plate.zarr --compression blosc
$ bin/zarr2zarr ~/data/blosc-test/v2/zstack.zarr/ ~/data/blosc-test/v3/zstack.zarr --compression blosc

Double-checking that the compression options match:

$ cat v2/slide.zarr/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : "|u1",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 3, 1, 38144, 51200 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/slide.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1,38144,51200],"data_type":"uint8","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":1,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
$ cat v2/plate.zarr/A/2/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 6, 18, 1080, 1080 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/plate.zarr/A/2/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,6,18,1080,1080],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":2,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
$ cat v2/zstack.zarr/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 1, 4000, 8192, 8192 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/zstack.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,1,4000,8192,8192],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":2,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
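
The same check can be scripted rather than read by eye. Below is a minimal Python sketch, not part of bioformats2raw or zarr2zarr: the function names are hypothetical, and the mapping of v2 shuffle codes (0, 1, 2) to the v3 names (noshuffle, shuffle, bitshuffle) assumes the usual Blosc numbering. Each function takes the text of a metadata file, so the inputs would be the contents of a .zarray and the matching zarr.json.

```python
import json

# Blosc shuffle codes as stored in Zarr v2 metadata, mapped to the
# string names used by the Zarr v3 blosc codec (assumed numbering).
V2_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}
COMPARED_KEYS = ("cname", "clevel", "shuffle", "blocksize")


def v2_blosc_settings(zarray_text):
    """Extract blosc settings from the text of a v2 .zarray file."""
    compressor = json.loads(zarray_text)["compressor"]
    settings = {key: compressor[key] for key in COMPARED_KEYS}
    # Normalize the integer shuffle code to the v3 string form.
    settings["shuffle"] = V2_SHUFFLE_NAMES.get(
        settings["shuffle"], settings["shuffle"]
    )
    return settings


def v3_blosc_settings(zarr_json_text):
    """Extract blosc settings from the text of a v3 zarr.json file."""
    for codec in json.loads(zarr_json_text)["codecs"]:
        if codec["name"] == "blosc":
            return {key: codec["configuration"][key] for key in COMPARED_KEYS}
    raise ValueError("no blosc codec found")


def diff_settings(v2, v3):
    """Return {key: (v2 value, v3 value)} for every setting that differs."""
    return {key: (v2[key], v3[key]) for key in COMPARED_KEYS if v2[key] != v3[key]}
```

Applied to the listings as pasted above, a check like this would report that cname, clevel and blocksize agree, and would flag shuffle, which is 1 in the v2 metadata and "noshuffle" in the v3 metadata.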

and then comparing total sizes in bytes:

$ du -sb v2/slide.zarr/
2059804567	v2/slide.zarr/
$ du -sb v3/slide.zarr/
2059798307	v3/slide.zarr/
$ du -sb v2/plate.zarr/
5256606095	v2/plate.zarr/
$ du -sb v3/plate.zarr/
5254587700	v3/plate.zarr/
$ du -sb v2/zstack.zarr/
4504698672	v2/zstack.zarr/
$ du -sb v3/zstack.zarr/
4504697401	v3/zstack.zarr/
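
To put those totals in proportion, here is a small Python sketch that computes the difference for each dataset, both in bytes and as a percentage of the v2 size. The numbers are copied from the du -sb output above; the function name is hypothetical. Note that du -sb totals include metadata files and directory entries, not only chunk data.

```python
# Total sizes in bytes, copied from the du -sb output: (v2, v3).
SIZES = {
    "slide": (2059804567, 2059798307),
    "plate": (5256606095, 5254587700),
    "zstack": (4504698672, 4504697401),
}


def size_change(v2_bytes, v3_bytes):
    """Return (difference in bytes, difference as a percentage of v2)."""
    delta = v3_bytes - v2_bytes
    return delta, 100.0 * delta / v2_bytes


for name, (v2_bytes, v3_bytes) in SIZES.items():
    delta, percent = size_change(v2_bytes, v3_bytes)
    print(f"{name}: {delta:+d} bytes ({percent:+.5f}%)")
```

In all three cases the v3 output is slightly smaller, and the largest relative difference (plate) is under 0.04%.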

and a few selected chunk files:

$ ls -lgG v2/slide.zarr/0/0/0/0/0/0/10
-rwxrwxrwx 1 145067 Sep 12 12:45 v2/slide.zarr/0/0/0/0/0/0/10
$ ls -lgG v3/slide.zarr/0/0/c/0/0/0/0/10
-rwxrwxrwx 1 145067 Sep 12 13:03 v3/slide.zarr/0/0/c/0/0/0/0/10
$ ls -lgG v2/plate.zarr/A/2/0/0/0/0/0/0/
total 1196
-rwxrwxrwx 1 1147095 Sep 12 12:58 0
-rwxrwxrwx 1   70851 Sep 12 12:59 1
$ ls -lgG v3/plate.zarr/A/2/0/0/c/0/0/0/0/
total 1192
-rwxrwxrwx 1 1146668 Sep 12 13:45 0
-rwxrwxrwx 1   71242 Sep 12 13:45 1
$ ls -lgG v2/zstack.zarr/0/0/0/0/3000/0/
total 128
-rwxrwxrwx 1 12743 Sep 12 14:01 0
-rwxrwxrwx 1 12728 Sep 12 14:07 1
-rwxrwxrwx 1 12728 Sep 12 14:12 2
-rwxrwxrwx 1 12728 Sep 12 14:18 3
-rwxrwxrwx 1 12728 Sep 12 14:24 4
-rwxrwxrwx 1 12728 Sep 12 14:29 5
-rwxrwxrwx 1 12728 Sep 12 14:35 6
-rwxrwxrwx 1 12728 Sep 12 14:39 7
$ ls -lgG v3/zstack.zarr/0/0/c/0/0/3000/0/
total 128
-rwxrwxrwx 1 12744 Sep 12 22:50 0
-rwxrwxrwx 1 12728 Sep 12 22:50 1
-rwxrwxrwx 1 12728 Sep 12 22:50 2
-rwxrwxrwx 1 12728 Sep 12 22:50 3
-rwxrwxrwx 1 12728 Sep 12 22:50 4
-rwxrwxrwx 1 12728 Sep 12 22:50 5
-rwxrwxrwx 1 12728 Sep 12 22:50 6
-rwxrwxrwx 1 12728 Sep 12 22:50 7
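
Spot checks like these could be extended to every chunk of an array. The following Python sketch is one way to do that, under the assumption (taken from the paths above) that a v2 chunk at <array>/0/0/0/0/10 corresponds to the v3 chunk at <array>/c/0/0/0/0/10. The function names are hypothetical, and only file sizes are compared, not contents.

```python
import os


def chunk_sizes(array_dir, prefix=""):
    """Map chunk key (e.g. '0/0/0/0/10') to file size for one array directory.

    For v3 arrays, pass prefix="c" so that keys line up with the v2 layout.
    """
    start = os.path.join(array_dir, prefix) if prefix else array_dir
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(start):
        for filename in filenames:
            if filename.startswith(".") or filename == "zarr.json":
                continue  # skip metadata such as .zarray and zarr.json
            path = os.path.join(dirpath, filename)
            key = os.path.relpath(path, start).replace(os.sep, "/")
            sizes[key] = os.path.getsize(path)
    return sizes


def compare_array(v2_array_dir, v3_array_dir):
    """Return {chunk key: (v2 size, v3 size)} for chunks whose sizes differ.

    A size of None means the chunk is missing on that side.
    """
    v2 = chunk_sizes(v2_array_dir)
    v3 = chunk_sizes(v3_array_dir, prefix="c")
    return {
        key: (v2.get(key), v3.get(key))
        for key in sorted(v2.keys() | v3.keys())
        if v2.get(key) != v3.get(key)
    }
```

For example, compare_array("v2/zstack.zarr/0/0", "v3/zstack.zarr/0/0") would list only the chunks whose sizes differ between the two trees.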

Some things we may want to improve, either here or as part of future work in bioformats2raw (in which case this issue can be moved):

make the blosc cname and other extended options configurable when writing v3
make it more obvious in the documentation that changing the cname is something worth trying if size on disk matters
change the default cname in bioformats2raw to zstd?


mkitti · 2024-09-30T11:47:36Z

Do you specify the typesize and/or dtype when initializing the Blosc compressor?

melissalinkert · 2024-09-30T15:42:15Z

If I understand correctly, using CodecBuilder.withBlosc() as we do here:

https://github.com/glencoesoftware/zarr2zarr/blob/master/src/main/java/com/glencoesoftware/zarr/Convert.java#L369

means that lines 30-66 of https://github.com/zarr-developers/zarr-java/blob/main/src/main/java/dev/zarr/zarrjava/v3/codec/CodecBuilder.java (in particular line 48) will automatically set typesize correctly based on the data type.
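
The written metadata can be used to confirm this independently of the library code. As a rough Python sketch (the function name and the table of byte widths are assumptions for illustration, covering only a few Zarr v3 data type names), the check below reads the text of a zarr.json and tests whether the blosc typesize equals the byte width of data_type:

```python
import json

# Byte widths for a few Zarr v3 data type names; extend as needed.
DATA_TYPE_BYTES = {
    "int8": 1, "uint8": 1,
    "int16": 2, "uint16": 2,
    "int32": 4, "uint32": 4,
    "float32": 4, "float64": 8,
}


def typesize_matches(zarr_json_text):
    """Return True if the blosc typesize equals the width of data_type."""
    meta = json.loads(zarr_json_text)
    expected = DATA_TYPE_BYTES[meta["data_type"]]
    for codec in meta["codecs"]:
        if codec["name"] == "blosc":
            return codec["configuration"]["typesize"] == expected
    raise ValueError("no blosc codec found")
```

The zarr.json listings earlier in this issue are consistent with this: typesize is 1 for the uint8 slide and 2 for the uint16 plate and zstack.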

melissalinkert mentioned this issue Oct 17, 2024

Update readme to include section on compression options glencoesoftware/bioformats2raw#262

Merged
