NetCDF4: Size difference between single write and iterative writes #2862
Comments
@edwardhartnett: You are using HDF5 1.10, which is old. Can you try with the most recent release? Also, can you turn compression off and see what happens? You are using a deflate level of 9, which is never a good idea; use 1 instead. Turning on the shuffle filter will also make compression work better.

If I read correctly, you find that writing a 1000000 x 30 field produces different file sizes for different write patterns, but that if you adjust the chunk sizes it works better. Is that correct? That makes sense. Consider that this is a very long, thin array, and the defaults are set up to handle more square-shaped arrays, like lat/lon arrays, where the dimension sizes are of similar magnitude. Whenever you have long, thin arrays, I would not be surprised to see the default chunk sizes behave badly.
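For reference, a minimal sketch of the suggested settings (shuffle on, deflate level 1, explicit chunk sizes) with the netCDF-C API. The dimension and variable names are placeholders, not taken from this issue, and error checking is omitted:

```c
/* Sketch: define a 1000000 x 30 float variable with shuffle + deflate level 1
 * and explicit chunk sizes. The caller is assumed to have created a NC_NETCDF4
 * file that is still in define mode. Names are illustrative only. */
#include <netcdf.h>

int define_var(int ncid, int *varidp)
{
    int dimids[2];
    size_t chunks[2] = {10000, 30};   /* explicit chunking instead of the defaults */

    nc_def_dim(ncid, "row", 1000000, &dimids[0]);
    nc_def_dim(ncid, "col", 30, &dimids[1]);
    nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, varidp);

    /* shuffle = NC_SHUFFLE, deflate on, deflate_level = 1 */
    nc_def_var_deflate(ncid, *varidp, NC_SHUFFLE, 1, 1);
    nc_def_var_chunking(ncid, *varidp, NC_CHUNKED, chunks);
    return NC_NOERR;
}
```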
Hi @edwardhartnett, unfortunately it is not straightforward for us to update HDF5 to a more recent version at this time because of some internal bugs. If you have it set up already, can you try executing the repro code from my original report to see whether you can reproduce the same behavior?

Without compression, both files are the same size: 117,194 KB. I changed the compression parameters to turn shuffle on and use a deflate level of 1. I also modified the code so that, for the second case, the nc_open, nc_inq_varid, and nc_close calls are outside the for loop. In that case, both files are the same size.
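A minimal sketch of that modification, with nc_open, nc_inq_varid, and nc_close hoisted out of the loop. This is not the author's actual repro code; the file and variable names are assumptions and error handling is omitted:

```c
/* Sketch: write a 1000000 x 30 variable in 10000-row slabs, opening the file
 * only once. "file2.nc" and the variable name "data" are placeholders. */
#include <netcdf.h>

void write_slabs(const float *buf /* one 10000 x 30 slab of values */)
{
    int ncid, varid;
    size_t start[2] = {0, 0};
    size_t count[2] = {10000, 30};

    nc_open("file2.nc", NC_WRITE, &ncid);   /* open once, before the loop */
    nc_inq_varid(ncid, "data", &varid);

    for (size_t row = 0; row < 1000000; row += 10000) {
        start[0] = row;
        nc_put_vara_float(ncid, varid, start, count, buf);
    }

    nc_close(ncid);                         /* close once, after the loop */
}
```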
Hi @edwardhartnett, Thanks,
@edwardhartnett: What I'm seeing is that you get different file sizes with different write patterns. This is expected in HDF5. If you think this is a bug in HDF5, then you should construct an HDF5 test case and submit it to the HDF5 team. I don't think this is a netCDF bug.
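For anyone who wants to follow up on that suggestion, a rough sketch of what a pure-HDF5 reproducer of the per-iteration open/write/close pattern might look like is shown below. The file and dataset names, the chunk shape, and the missing error checks are all assumptions, not code from this issue:

```c
/* Sketch of an HDF5-only reproducer: a chunked, deflate-compressed 1000000 x 30
 * float dataset written in 10000-row slabs, reopening the file for every slab.
 * NOTE: to mirror the netCDF default-chunking case, chunk[] would have to be
 * set to the chunk sizes netCDF actually chose; {10000, 30} is a placeholder. */
#include <hdf5.h>

#define NROWS 1000000
#define NCOLS 30
#define SLAB  10000

int main(void)
{
    hsize_t dims[2]  = {NROWS, NCOLS};
    hsize_t chunk[2] = {SLAB, NCOLS};
    static float buf[SLAB][NCOLS];   /* one slab of data, all 1s */

    for (int i = 0; i < SLAB; i++)
        for (int j = 0; j < NCOLS; j++)
            buf[i][j] = 1.0f;

    /* Create the file and the chunked, deflate-level-9 dataset. */
    hid_t file  = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 9);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);

    /* Reopen the file for every slab, as in the netCDF loop above. */
    for (hsize_t row = 0; row < NROWS; row += SLAB) {
        hsize_t start[2] = {row, 0};
        hsize_t count[2] = {SLAB, NCOLS};

        file = H5Fopen("repro.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        dset = H5Dopen2(file, "data", H5P_DEFAULT);

        hid_t filespace = H5Dget_space(dset);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(2, count, NULL);

        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, H5P_DEFAULT, buf);

        H5Sclose(memspace);
        H5Sclose(filespace);
        H5Dclose(dset);
        H5Fclose(file);
    }
    return 0;
}
```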
Original issue description:

Hello all,

NetCDF version: 4.9.2
HDF5 version: 1.10.11
OS: Linux
I create two netCDF4 files (file1.nc and file2.nc) with the same properties:
Dimensions: 1000000 x 30
Datatype: NC_FLOAT
Compression: Deflate, Level 9
Chunking: Default
The data I write is also the same (all values = 1).
However, I create them in two different ways (you can find my repro code below):
file1.nc :
Written using nc_put_var_float (single call) to write the whole dataset (1000000 x 30) at once.
Size of file = 123 KB.
file2.nc :
Written by calling nc_put_vara_float in a for-loop.
10000 x 30 elements are written in each iteration (the file is opened and closed during every for-loop iteration).
Size of the file = 935 KB.
** If the chunk size is {10000, 30}, file1.nc is 195 KB whereas file2.nc is 201 KB.
** If we open and close file2.nc only once (before and after the for loop), file2.nc is 123 KB (the same as file1.nc).
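The author's attached repro code is not part of this excerpt; the sketch below is a reconstruction of the two write patterns as described above (file and dimension names are assumptions, and error checking is omitted). The chunk sizes netCDF actually chose for each file can be inspected afterwards with `ncdump -hs`.

```c
/* Sketch of the two write patterns described above: file1.nc written with a
 * single nc_put_var_float call, file2.nc written in 10000 x 30 slabs with the
 * file opened and closed in every iteration. Deflate level 9, default chunking. */
#include <netcdf.h>
#include <stdlib.h>

#define NROWS 1000000
#define NCOLS 30
#define SLAB  10000

static int create_file(const char *path, int *ncidp, int *varidp)
{
    int dimids[2];
    nc_create(path, NC_CLOBBER | NC_NETCDF4, ncidp);
    nc_def_dim(*ncidp, "row", NROWS, &dimids[0]);
    nc_def_dim(*ncidp, "col", NCOLS, &dimids[1]);
    nc_def_var(*ncidp, "data", NC_FLOAT, 2, dimids, varidp);
    nc_def_var_deflate(*ncidp, *varidp, 0, 1, 9);   /* no shuffle, deflate level 9 */
    return nc_enddef(*ncidp);                       /* default chunking: no nc_def_var_chunking */
}

int main(void)
{
    int ncid, varid;
    float *all  = malloc((size_t)NROWS * NCOLS * sizeof(float));
    float *slab = malloc((size_t)SLAB * NCOLS * sizeof(float));
    for (size_t i = 0; i < (size_t)NROWS * NCOLS; i++) all[i] = 1.0f;
    for (size_t i = 0; i < (size_t)SLAB * NCOLS; i++) slab[i] = 1.0f;

    /* file1.nc: one nc_put_var_float call for the whole 1000000 x 30 array. */
    create_file("file1.nc", &ncid, &varid);
    nc_put_var_float(ncid, varid, all);
    nc_close(ncid);

    /* file2.nc: 10000 x 30 slabs, opening and closing the file in every iteration. */
    create_file("file2.nc", &ncid, &varid);
    nc_close(ncid);
    for (size_t row = 0; row < NROWS; row += SLAB) {
        size_t start[2] = {row, 0};
        size_t count[2] = {SLAB, NCOLS};
        nc_open("file2.nc", NC_WRITE, &ncid);
        nc_inq_varid(ncid, "data", &varid);
        nc_put_vara_float(ncid, varid, start, count, slab);
        nc_close(ncid);
    }

    free(all);
    free(slab);
    return 0;
}
```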
I verified that the contents and dimensions of the variables are identical in both files.
file1.nc (default chunking) -> 123 KB
file1.nc ({10000, 30} chunking) -> 195 KB
file2.nc (default chunking) -> 935 KB
file2.nc ({10000, 30} chunking) -> 201 KB
I know that (2) is not the ideal way to write the variable, but can someone help me understand why there is such a big difference in file sizes even though the content is the same? If I change the data to random values instead of all 1s, file1.nc is around 100 MB whereas file2.nc bloats to almost 1.5 GB.
Why is the file size the same when, for case (2), the file is opened and closed only once?
Why does the size of file2.nc drop from 935 KB to 201 KB just by changing the default chunking parameters to {10000, 30}?
I am attaching my code below. Let me know if you have any questions.
Thanks.