Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an hts_crc32 function to use zlib or libdeflate. #1850

Merged
merged 1 commit into from
Oct 31, 2024

Conversation

jkbonfield
Copy link
Contributor

This follows on from the hts_md5* functions which wrap up either OpenSSL or our own implementation. Libdeflate's crc32 function is considerably faster than the native zlib, so we want to use it in (for example) the new "samtools checksum" code, but we do not wish to add baggage of looking for libdeflate in the configure script.

This follows on from the hts_md5* functions which wrap up either
OpenSSL or our own implementation.  Libdeflate's crc32 function is
considerably faster than the native zlib, so we want to use it in (for
example) the new "samtools checksum" code, but we do not wish to add
baggage of looking for libdeflate in the configure script.
@jkbonfield
Copy link
Contributor Author

Updated bgzf_encode_level0_func to use hts_crc32 to remove an ifdef. (Thanks to @vasudeva8 for idea.)

We should probably consider at some stage exposing bgzf_compress and bgzf_uncompress as these too could very well be useful elsewhere for people who want to link against htslib but don't want the faff of doing their own libdeflate checking and if/else type clauses. Plus it's a MUCH simpler interface than zlib's due to the non-streaming nature.

However that's a different PR IMO and more nuanced with BGZF vs generic Deflate.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Oct 31, 2024

Continuing our slack conversation here, for anyone who is curious over the differences between the hts_crc32 and the old Zlib way of doing things with regards to initialising CRCs, you may notice a subtle difference in the old crc ifdefs:

#ifdef HAVE_LIBDEFLATE
    crc = libdeflate_crc32(0, j->comp_data + BLOCK_HEADER_LENGTH + 5,
                           j->uncomp_len);
#else
    crc = crc32(crc32(0L, NULL, 0L),
                (Bytef*)j->comp_data + BLOCK_HEADER_LENGTH + 5, j->uncomp_len);
#endif

The zlib version is crc32(crc32(0L, NULL, 0L), <data>, <length>) while libdeflates is not. This is because zlib.h states:

   If buf is Z_NULL, this function returns the required initial value for the crc.

   Usage example:

     uLong crc = crc32(0L, Z_NULL, 0);

     while (read_buffer(buffer, length) != EOF) {
       crc = crc32(crc, buffer, length);
     }
     if (crc != original_crc) error();

However, crc32(0L, NULL, 0L) returns zero. This can be seen in the zlib source code, but more importantly is enshrined in the Gzip RFC 1952 which states "Update a running crc with the bytes buf[0..len-1] and return the updated crc. The crc should be initialized to zero." It even has example usage just using plain zero:

      /* Return the CRC of the bytes buf[0..len-1]. */
      unsigned long crc(unsigned char *buf, int len)
      {
        return update_crc(0L, buf, len);
      }

I have no idea why Zlib documentation encourages people to do a NOP, but I'm assuming at some point in the zlib implementation they were using another checksum and initialisation mattered, but it's now just become dead code. It can never change due to the nature of it being a long-standing RFC so I went with the less code approach.

@vasudeva8 vasudeva8 merged commit 1d2e493 into samtools:develop Oct 31, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants