HDF5 pre-built filter shared libraries.. #2495
-
A few thoughts:
-
Users are already exploring some of the advanced capabilities offered by filters. I was surprised (indeed: shocked) by the blithe way that users in our EGU session described their workflows, involving HDF5 filters that we have not yet touched upon in our discussions. (Frequently LZF.) No concern for portability or interoperability was expressed by anyone! ;-)

I suggest that, as with zstandard, we move decisively to add support for filters that are well-maintained, provide an important feature for netCDF users, and can be supported in netcdf-Java. The HDF5 team have already done a lot of the work by selecting and supporting these filters: BZIP2, JPEG, LZF, BLOSC, MAFISC, LZ4, Bitshuffle, and ZFP. By supporting a curated set of filters (and ideally we would support the full list of HDF5 pre-built filters), we can give users the full functionality available in the world of filters, and avoid a "Wild West" scenario, where many different filters, or versions of filters, become embedded in important Earth science workflows, causing downstream trouble for users (and us), or where filters make data impossible to read with native Java code.

By well-supported, I mean the same level of support we give to zlib and other currently-supported filters, like the recently-added zstandard support. This generally consists of two new functions (for example, nc_def_var_zstandard()/nc_inq_var_zstandard()) which call the filter code appropriately. These functions can then appear in the documentation, and users can use them without really caring that they are filters instead of built-in features. This support can be provided in the netcdf-c/netcdf-fortran libraries, or in the CCR C/Fortran libraries.

This is a little extra work, but we can rejoice in the fact that we are leveraging the work of Dennis, the HDF5 team, the filter code writer, and (usually) the library used by the filter, to get advanced features into user hands with remarkably little effort. ;-)

LZF and BLOSC are next on the list for CCR, as far as I am concerned. It would be useful to understand how they will impact netcdf-java. LZF is a compression library; BLOSC is something else, and itself uses different compression libraries. But BLOSC is about HPC I/O performance, and I'm not sure whether or how the resulting data files need to care about it. @czender knows more...

Our recent efforts have really improved the compression situation, and started to catch us up to the current state of the art, but there is much more to be done...
-
At this point I know of at least three independent sources of filter implementations:
So I have no idea where this is all headed, but Ed's wild west scenario is looking likely for the foreseeable future.
-
Yes, some of the wrappers differ, but AFAIK the underlying filter library (e.g. libzstd)
That seems reasonable. I must confess that I am surprised that zstd is not in the HDF5 prebuilt
-
It occurs to me that if we rely on the HDF5 prebuilts, then I will need some
-
I am constructing a PR to add support for programmatic plugin path manipulation.
==================================================
re: #2753

This is a draft PR that needs community feedback before it gets committed.

As suggested by Ed Hartnett, this PR extends the netcdf.h API to support programmatic control over the search path used to locate plugins. Below, I provide a discussion about the decisions I made in designing the API. I am open to suggestions for alternate designs.

The API proposed in this PR looks like this (from netcdf-c/include/netcdf_filter.h).
As a rule, the initialize and finalize functions do not need to be called because they are called as part of nc_initialize()/nc_finalize(). The sync function pushes the current global plugin list state down to one or more of the dispatchers that use plugin paths: currently libhdf5 and libnczarr.
==================================================
There were several constraints that I had to consider in designing the API.
We also have to consider this constraint as desirable but not completely necessary:
Ideally, it would be desirable to stick the API into the existing struct NC_Dispatch structure so every dispatcher could handle it as desired. Unfortunately, this approach requires that the user already has a valid netcdf file descriptor -- an ncid -- so as to use the proper dispatch table. This ostensibly violates constraint 3 above. But for the sake of argument, suppose we violate that constraint and require the user to open a file in order to manipulate the plugin path sequence. This would generally require the following set of netcdf API calls:
It is not clear to me that any plugin path changes in step 2 would apply to the step 3 operations. It may be that they will only apply to actions after step 4. Note also that this approach supports constraint 4 above. I personally dislike this solution.

The alternative I have chosen keeps constraints 1-3 and partially violates constraint 4. This solution is embodied in the API described at the beginning of this document. The basic idea is that the netcdf library keeps a global plugin path sequence that is distinct from that of any of the dispatchers. Before any files are opened, it is possible to set up an arbitrary plugin path sequence using the proposed API.

The critical thing to note is the nc_plugin_path_sync() API function. This function pushes the current plugin path sequence down to either all dispatchers that use plugin paths, or to one specific dispatcher. So the typical usage pattern is as follows:
If it is desired to set the plugin path sequence for a specific dispatcher, then change step 2 to call nc_plugin_path_sync(n), where n specifies the dispatcher to change. The legal values for n are NC_FORMATX_NC_HDF5, NC_FORMATX_NCZARR, or 0 (zero). The zero case pushes the current plugin path sequence to all dispatchers that use plugin paths.
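The global-list-plus-sync idea can be sketched with a small stand-alone model. This is not the netcdf-c implementation: the names pp_append, pp_clear, pp_sync, pp_current and the PP_* constants are hypothetical, and the model simply assumes that a sync copies the whole global list to the selected dispatcher(s) and leaves the global list itself untouched.

```c
#include <string.h>

/* Hypothetical dispatcher identifiers for this model only; the real
   library uses NC_FORMATX_NC_HDF5 and NC_FORMATX_NCZARR. */
enum { PP_ALL = 0, PP_HDF5 = 1, PP_NCZARR = 2 };

#define PP_MAXLEN 1024

/* Global plugin path, kept separate from any dispatcher's copy. */
static char pp_global[PP_MAXLEN] = "";

/* Per-dispatcher copies, indexed by the identifiers above (slot 0 unused). */
static char pp_dispatch[3][PP_MAXLEN] = { "", "", "" };

/* Append one directory to the global list.
   Returns 0 on success, -1 if the result would not fit. */
int pp_append(const char *dir)
{
    size_t need = strlen(pp_global) + strlen(dir) + 2; /* ':' plus NUL */
    if (need > PP_MAXLEN) return -1;
    if (pp_global[0] != '\0') strcat(pp_global, ":");
    strcat(pp_global, dir);
    return 0;
}

/* Empty the global list; dispatcher copies change only on the next sync. */
void pp_clear(void)
{
    pp_global[0] = '\0';
}

/* Push the global list to one dispatcher, or to all when which == PP_ALL.
   Returns 0 on success, -1 for an unknown dispatcher. */
int pp_sync(int which)
{
    if (which < PP_ALL || which > PP_NCZARR) return -1;
    for (int d = PP_HDF5; d <= PP_NCZARR; d++) {
        if (which == PP_ALL || which == d)
            strcpy(pp_dispatch[d], pp_global);
    }
    return 0;
}

/* Return the path a dispatcher currently sees, or NULL if unknown. */
const char *pp_current(int which)
{
    if (which < PP_HDF5 || which > PP_NCZARR) return NULL;
    return pp_dispatch[which];
}
```

Under these assumptions, appending "foo", syncing only the HDF5 copy, appending "bar", and then syncing only the NCZarr copy leaves the HDF5 copy at "foo" and the NCZarr copy at "foo:bar", because the global list keeps accumulating until it is explicitly cleared.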
-
The proposal sounds very useful. I am wondering if you have considered to make the
int nc_plugin_path_append(int formatx,const char* path);
I guess that
But this is just a minor suggestion. I am quite certain that the proposed API will work for the Julia community. Can you maybe clarify if
nc_plugin_path_append("foo");
nc_plugin_path_sync(NC_FORMATX_NC_HDF5);
nc_plugin_path_append("bar");
nc_plugin_path_sync(NC_FORMATX_NCZARR); // path of NcZarr is just "bar" or "foo:bar" ?
-
Adding the formatx argument to all of the manipulation functions is a reasonable alternative.
-
Also WRT:
It does not clear all changes, so one would have to add a call to nc_plugin_path_load(NULL) to clear the plugin path.
-
I am happy to see this being addressed, thanks @DennisHeimbigner.
-
It has occurred to me that the simplest possible interface is one that has two operations:
This puts the burden on the client to construct the path list they want, but also gives them
The only reason I can see for using a more complex interface is if the dispatcher need to figure
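One way to picture such a minimal interface is sketched below. It assumes (this is a guess for illustration, not something stated here) that the two operations are a getter and a setter for the entire path list. The PathList type and the pl_set, pl_get, and pl_free names are hypothetical and are not part of netcdf-c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list type: a count plus an array of directory strings. */
typedef struct PathList {
    size_t ndirs;
    char **dirs;
} PathList;

/* Library-side state in this model: the current plugin path list. */
static PathList current = { 0, NULL };

/* Free the strings held by a list and reset it to empty. */
void pl_free(PathList *pl)
{
    for (size_t i = 0; i < pl->ndirs; i++) free(pl->dirs[i]);
    free(pl->dirs);
    pl->ndirs = 0;
    pl->dirs = NULL;
}

/* Deep-copy src into an empty dst.
   Returns 0 on success, -1 on allocation failure (dst is left empty). */
static int pl_copy(PathList *dst, const PathList *src)
{
    if (src->ndirs == 0) return 0;
    dst->dirs = calloc(src->ndirs, sizeof(char *));
    if (dst->dirs == NULL) return -1;
    for (size_t i = 0; i < src->ndirs; i++) {
        size_t len = strlen(src->dirs[i]) + 1;
        dst->dirs[i] = malloc(len);
        if (dst->dirs[i] == NULL) {
            dst->ndirs = i;          /* only the first i entries are valid */
            pl_free(dst);
            return -1;
        }
        memcpy(dst->dirs[i], src->dirs[i], len);
    }
    dst->ndirs = src->ndirs;
    return 0;
}

/* Replace the whole current list with a copy of the caller's list. */
int pl_set(const PathList *newlist)
{
    PathList tmp = { 0, NULL };
    if (pl_copy(&tmp, newlist) != 0) return -1;
    pl_free(&current);
    current = tmp;
    return 0;
}

/* Hand the caller a private copy of the current list (release with pl_free). */
int pl_get(PathList *out)
{
    out->ndirs = 0;
    out->dirs = NULL;
    return pl_copy(out, &current);
}
```

With this shape, a client that wants to append or remove one directory would call pl_get, edit its private copy, and pass the result back through pl_set. All list manipulation then lives on the client side, and the library only ever sees complete lists.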
-
I agree! Simpler is better.
-
Further reflection: the simple API might be good in a multithreaded environment.
-
I see that HDF5 is now providing pre-built filter plugin shared libraries. See here:
How, if at all, should we consider making use of these?