blosc performance investigation #9

Open
melissalinkert opened this issue Sep 13, 2024 · 2 comments

@melissalinkert
Member

As discussed in the most recent NGFF challenge call, I tried to reproduce the behavior noted in zarr-developers/zarr-python#2171 here.

The short version is that I have not so far been able to reproduce this exact problem. blosc with cname=zstd produces effectively the same total output size, with some small discrepancies in individual chunks. cc @joshmoore

I tested with these 3 public datasets:

With bioformats2raw 0.9.4:

$ bin/bioformats2raw -p $SLIDE/CMU-1.ndpi ~/data/blosc-test/v2/slide.zarr --compression blosc --compression-properties "cname=zstd"
$ bin/bioformats2raw -p $PLATE/006P_M3/006P__2017-08-19T12_42_59-Measurement\ 3/Images/Index.idx.xml ~/data/blosc-test/v2/plate.zarr --compression blosc --compression-properties "cname=zstd"
$ bin/bioformats2raw -p "test&sizeX=8192&sizeY=8192&sizeZ=4000.fake" ~/data/blosc-test/v2/zstack.zarr --compression blosc --compression-properties "cname=zstd"

Note that omitting --compression-properties "cname=zstd" will use the lz4 cname instead; see https://github.com/zarr-developers/jzarr/blob/533c8bb4197f57bd664edc6c62b3f5cd0de262ba/src/main/java/com/bc/zarr/CompressorFactory.java#L220. For the slide test, the default lz4 cname results in an output dataset that is approximately twice the size of the one produced with the alternate zstd cname.

Converting the generated v2 datasets to v3, using the current state of #8:

$ bin/zarr2zarr ~/data/blosc-test/v2/slide.zarr/ ~/data/blosc-test/v3/slide.zarr --compression blosc
$ bin/zarr2zarr ~/data/blosc-test/v2/plate.zarr/ ~/data/blosc-test/v3/plate.zarr --compression blosc
$ bin/zarr2zarr ~/data/blosc-test/v2/zstack.zarr/ ~/data/blosc-test/v3/zstack.zarr --compression blosc

Double-checking that the compression options match:

$ cat v2/slide.zarr/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : "|u1",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 3, 1, 38144, 51200 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/slide.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1,38144,51200],"data_type":"uint8","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":1,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
$ cat v2/plate.zarr/A/2/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 6, 18, 1080, 1080 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/plate.zarr/A/2/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,6,18,1080,1080],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":2,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
$ cat v2/zstack.zarr/0/0/.zarray 
{
  "chunks" : [ 1, 1, 1, 1024, 1024 ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "zstd",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [ 1, 1, 4000, 8192, 8192 ],
  "dimension_separator" : "/",
  "zarr_format" : 2
}
$ cat v3/zstack.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,1,4000,8192,8192],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,1024,1024]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"blosc","configuration":{"cname":"zstd","shuffle":"noshuffle","clevel":5,"typesize":2,"blocksize":0},"name":"blosc"}],"dimension_names":null,"attributes":{}}
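The same check can be done programmatically rather than by eye. The following is a minimal sketch, not part of zarr2zarr or bioformats2raw: the function names are hypothetical, and it assumes the numcodecs convention for v2 (`shuffle` stored as an integer, with 0, 1 and 2 meaning no shuffle, byte shuffle and bit shuffle) and string values for the v3 blosc codec. The sample dictionaries are abridged from the slide metadata quoted above.

```python
# numcodecs (Zarr v2) stores blosc shuffle as an integer; the Zarr v3 blosc
# codec uses a string. This mapping is an assumption based on those conventions.
V2_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}

COMPARED_KEYS = ("cname", "clevel", "blocksize", "shuffle")


def v2_blosc_settings(zarray: dict) -> dict:
    """Extract the blosc settings from parsed v2 .zarray metadata."""
    comp = zarray["compressor"]
    settings = {key: comp[key] for key in COMPARED_KEYS}
    # Normalize the integer shuffle code to the v3 string form.
    settings["shuffle"] = V2_SHUFFLE_NAMES.get(comp["shuffle"], comp["shuffle"])
    return settings


def v3_blosc_settings(zarr_json: dict) -> dict:
    """Extract the blosc settings from parsed v3 zarr.json metadata."""
    codec = next(c for c in zarr_json["codecs"] if c["name"] == "blosc")
    return {key: codec["configuration"][key] for key in COMPARED_KEYS}


def differing_settings(zarray: dict, zarr_json: dict) -> dict:
    """Return {setting: (v2 value, v3 value)} for every setting that differs."""
    v2, v3 = v2_blosc_settings(zarray), v3_blosc_settings(zarr_json)
    return {key: (v2[key], v3[key]) for key in COMPARED_KEYS if v2[key] != v3[key]}


# Abridged from v2/slide.zarr/0/0/.zarray and v3/slide.zarr/0/0/zarr.json.
SLIDE_V2 = {"compressor": {"clevel": 5, "blocksize": 0, "shuffle": 1,
                           "cname": "zstd", "id": "blosc"}}
SLIDE_V3 = {"codecs": [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "blosc", "configuration": {"cname": "zstd", "shuffle": "noshuffle",
                                        "clevel": 5, "typesize": 1, "blocksize": 0}},
]}

print(differing_settings(SLIDE_V2, SLIDE_V3))
```

On the metadata quoted above, cname, clevel and blocksize agree, and the only setting this reports is `shuffle` (`1` in v2, `"noshuffle"` in v3), which may be worth a closer look for the 16-bit datasets.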

Then comparing total sizes in bytes:

$ du -sb v2/slide.zarr/
2059804567	v2/slide.zarr/
$ du -sb v3/slide.zarr/
2059798307	v3/slide.zarr/
$ du -sb v2/plate.zarr/
5256606095	v2/plate.zarr/
$ du -sb v3/plate.zarr/
5254587700	v3/plate.zarr/
$ du -sb v2/zstack.zarr/
4504698672	v2/zstack.zarr/
$ du -sb v3/zstack.zarr/
4504697401	v3/zstack.zarr/
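To put these totals in proportion, the sketch below computes the absolute and relative difference for each dataset from the `du -sb` figures reported above. The dictionary and function names are illustrative only.

```python
# Total sizes in bytes as reported by `du -sb` above: (v2, v3).
REPORTED_SIZES = {
    "slide": (2059804567, 2059798307),
    "plate": (5256606095, 5254587700),
    "zstack": (4504698672, 4504697401),
}


def size_difference(v2_bytes: int, v3_bytes: int) -> tuple[int, float]:
    """Return (absolute difference in bytes, difference relative to the v2 size)."""
    diff = v2_bytes - v3_bytes
    return diff, diff / v2_bytes


for name, (v2_bytes, v3_bytes) in REPORTED_SIZES.items():
    diff, rel = size_difference(v2_bytes, v3_bytes)
    print(f"{name}: v3 is {diff} bytes smaller ({rel:.4%} of v2)")
```

The largest gap is the plate dataset, at about 2 MB out of roughly 5.26 GB (under 0.04%). These totals include metadata files as well as chunks, so they are not a pure measure of compressed chunk size.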

Comparing a few selected chunk files:

$ ls -lgG v2/slide.zarr/0/0/0/0/0/0/10
-rwxrwxrwx 1 145067 Sep 12 12:45 v2/slide.zarr/0/0/0/0/0/0/10
$ ls -lgG v3/slide.zarr/0/0/c/0/0/0/0/10
-rwxrwxrwx 1 145067 Sep 12 13:03 v3/slide.zarr/0/0/c/0/0/0/0/10
$ ls -lgG v2/plate.zarr/A/2/0/0/0/0/0/0/
total 1196
-rwxrwxrwx 1 1147095 Sep 12 12:58 0
-rwxrwxrwx 1   70851 Sep 12 12:59 1
$ ls -lgG v3/plate.zarr/A/2/0/0/c/0/0/0/0/
total 1192
-rwxrwxrwx 1 1146668 Sep 12 13:45 0
-rwxrwxrwx 1   71242 Sep 12 13:45 1
$ ls -lgG v2/zstack.zarr/0/0/0/0/3000/0/
total 128
-rwxrwxrwx 1 12743 Sep 12 14:01 0
-rwxrwxrwx 1 12728 Sep 12 14:07 1
-rwxrwxrwx 1 12728 Sep 12 14:12 2
-rwxrwxrwx 1 12728 Sep 12 14:18 3
-rwxrwxrwx 1 12728 Sep 12 14:24 4
-rwxrwxrwx 1 12728 Sep 12 14:29 5
-rwxrwxrwx 1 12728 Sep 12 14:35 6
-rwxrwxrwx 1 12728 Sep 12 14:39 7
$ ls -lgG v3/zstack.zarr/0/0/c/0/0/3000/0/
total 128
-rwxrwxrwx 1 12744 Sep 12 22:50 0
-rwxrwxrwx 1 12728 Sep 12 22:50 1
-rwxrwxrwx 1 12728 Sep 12 22:50 2
-rwxrwxrwx 1 12728 Sep 12 22:50 3
-rwxrwxrwx 1 12728 Sep 12 22:50 4
-rwxrwxrwx 1 12728 Sep 12 22:50 5
-rwxrwxrwx 1 12728 Sep 12 22:50 6
-rwxrwxrwx 1 12728 Sep 12 22:50 7
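Spot checks like these could be extended to every chunk in an array. The following is a rough sketch with a hypothetical function name; it assumes, as in the listings above, that each v3 chunk lives under a `c/` directory inside the array directory and otherwise uses the same `/`-separated key as its v2 counterpart.

```python
from pathlib import Path


def chunk_size_mismatches(v2_array: Path, v3_array: Path):
    """Yield (chunk_key, v2_size, v3_size) for chunks whose byte sizes differ.

    v3_size is None when the corresponding v3 chunk file does not exist.
    """
    for v2_chunk in sorted(v2_array.rglob("*")):
        # Skip directories and v2 metadata files such as .zarray and .zattrs.
        if not v2_chunk.is_file() or v2_chunk.name.startswith("."):
            continue
        key = v2_chunk.relative_to(v2_array)
        # Assumed layout: v3 chunk keys are the v2 keys prefixed with "c/".
        v3_chunk = v3_array / "c" / key
        v2_size = v2_chunk.stat().st_size
        v3_size = v3_chunk.stat().st_size if v3_chunk.is_file() else None
        if v2_size != v3_size:
            yield key.as_posix(), v2_size, v3_size
```

For example, `list(chunk_size_mismatches(Path("v2/plate.zarr/A/2/0/0"), Path("v3/plate.zarr/A/2/0/0")))` would be expected to include the two plate chunks listed above, since their sizes differ between v2 and v3.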

Some things we may want to improve, either here or as part of future work in bioformats2raw (in which case this issue can be moved):

  • make the blosc cname and other extended options configurable when writing v3
  • make it more obvious in the documentation that changing the cname is something worth trying if size on disk matters
  • change the default cname in bioformats2raw to zstd?
@mkitti

mkitti commented Sep 30, 2024

Do you specify the typesize and/or dtype when initializing the Blosc compressor?

@melissalinkert
Member Author

If I understand correctly, using CodecBuilder.withBlosc() as we do here:

https://github.com/glencoesoftware/zarr2zarr/blob/master/src/main/java/com/glencoesoftware/zarr/Convert.java#L369

means that lines 30-66 of https://github.com/zarr-developers/zarr-java/blob/main/src/main/java/dev/zarr/zarrjava/v3/codec/CodecBuilder.java (in particular line 48) will automatically set typesize correctly based on the data type.
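One way to confirm this from the output is to check the written metadata directly. This is only a sketch: the item-size table and function name are hypothetical and do not come from zarr-java, and the sample dictionaries are abridged from the zarr.json files quoted earlier in this issue.

```python
# Byte widths for common Zarr v3 data type names (illustrative, not exhaustive).
ITEM_SIZES = {
    "bool": 1, "int8": 1, "uint8": 1,
    "int16": 2, "uint16": 2,
    "int32": 4, "uint32": 4, "float32": 4,
    "int64": 8, "uint64": 8, "float64": 8,
}


def blosc_typesize_matches(zarr_json: dict) -> bool:
    """Check that the blosc codec's typesize equals the array's item size."""
    codec = next(c for c in zarr_json["codecs"] if c["name"] == "blosc")
    return codec["configuration"]["typesize"] == ITEM_SIZES[zarr_json["data_type"]]


# Abridged from v3/slide.zarr/0/0/zarr.json and v3/plate.zarr/A/2/0/0/zarr.json.
SLIDE = {"data_type": "uint8",
         "codecs": [{"name": "blosc", "configuration": {"typesize": 1}}]}
PLATE = {"data_type": "uint16",
         "codecs": [{"name": "blosc", "configuration": {"typesize": 2}}]}

print(blosc_typesize_matches(SLIDE), blosc_typesize_matches(PLATE))
```

Both values match the metadata shown above: typesize 1 for the uint8 slide and typesize 2 for the uint16 plate and zstack arrays.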
