Make use of posix_memalign for hfile buffer.

On AMD EPYC 7713, aligning the buffer to cache line boundaries makes a very significant difference to fp->backend->read performance in the kernel. A modern Intel CPU did not demonstrate this difference.

x86 CPUs typically have a cache line size of 64 bytes, and Apple Arm chips 128 bytes. I haven't tested whether Arm benefits from alignment during read calls, but the line size can be checked with sysconf(_SC_LEVEL1_DCACHE_LINESIZE). However, to avoid additional autoconfery I just picked 256, as it gives us headroom and is simple.

Speed ups on the AMD EPYC:

    time bash -c 'for i in `seq 1 30`;do cat < ~/lustre/enwik9| ./bgzip -l5 -@32 > /dev/null;done'

    Unaligned
    real    0m45.012s
    user    10m7.661s
    sys     0m58.770s

    Aligned
    real    0m30.717s
    user    11m14.004s
    sys     0m32.921s

It is likely this could improve other bits of code too.
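The sysconf query mentioned above could look roughly like the sketch below. This is not what the commit does (it hard-codes 256); cache_line_size and FALLBACK_ALIGN are hypothetical names used only for illustration, and _SC_LEVEL1_DCACHE_LINESIZE is a non-portable extension, so the call is guarded and falls back to the fixed value.

```c
#include <stddef.h>
#include <unistd.h>

/* Fallback alignment; 256 matches the constant chosen in this commit. */
#define FALLBACK_ALIGN 256

/* Hypothetical helper (not part of htslib): return the L1 data cache
 * line size in bytes, or FALLBACK_ALIGN when it cannot be determined. */
static size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    /* sysconf may return -1 or 0 when the value is unknown. Only accept
     * a power of two that is at least sizeof(void *), since that is what
     * posix_memalign requires of its alignment argument. */
    if (sz > 0 && (sz & (sz - 1)) == 0 && (size_t) sz >= sizeof(void *))
        return (size_t) sz;
#endif
    return FALLBACK_ALIGN;
}
```

A caller would pass cache_line_size() as the alignment argument in place of the literal 256. The fixed constant avoids the runtime query and the extra configure checks, at the cost of over-aligning on machines with 64-byte lines.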
jkbonfield · Nov 18, 2024 · d1be3c2
1 parent 186d21b
Showing 2 changed files with 7 additions and 1 deletion.
diff --git a/configure.ac b/configure.ac
@@ -326,7 +326,7 @@ HTS_HIDE_DYNAMIC_SYMBOLS
 
 dnl FIXME This pulls in dozens of standard header checks
 AC_FUNC_MMAP
-AC_CHECK_FUNCS([gmtime_r fsync drand48 srand48_deterministic getauxval elf_aux_info])
+AC_CHECK_FUNCS([gmtime_r fsync drand48 srand48_deterministic getauxval elf_aux_info posix_memalign])
 
 # Darwin has a dubious fdatasync() symbol, but no declaration in <unistd.h>
 AC_CHECK_DECL([fdatasync(int)], [AC_CHECK_FUNCS(fdatasync)])

diff --git a/hfile.c b/hfile.c
@@ -113,8 +113,14 @@ hFILE *hfile_init(size_t struct_size, const char *mode, size_t capacity)
     // FIXME For now, clamp input buffer sizes so mpileup doesn't eat memory
     if (strchr(mode, 'r') && capacity > maxcap) capacity = maxcap;
 
+#ifdef HAVE_POSIX_MEMALIGN
+    fp->buffer = NULL;
+    if (posix_memalign((void **)&fp->buffer, 256, capacity) != 0)
+        goto error;
+#else
     fp->buffer = (char *) malloc(capacity);
     if (fp->buffer == NULL) goto error;
+#endif
 
     fp->begin = fp->end = fp->buffer;
     fp->limit = &fp->buffer[capacity];
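
posix_memalign reports failure through its return value (0 on success, a positive error number such as EINVAL or ENOMEM otherwise) rather than through errno, so a caller would normally test for a non-zero result. A minimal sketch of that pattern, assuming hypothetical helpers alloc_aligned and is_aligned that are not part of htslib:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std modes */
#endif

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `capacity` bytes aligned to `align`.
 * `align` must be a power of two and a multiple of sizeof(void *).
 * Returns NULL and sets errno on failure; release with free(). */
static char *alloc_aligned(size_t align, size_t capacity) {
    void *p = NULL;
    /* 0 means success; any other value is an error number. */
    int err = posix_memalign(&p, align, capacity);
    if (err != 0) {
        errno = err;  /* propagate the error number to the caller */
        return NULL;
    }
    return (char *) p;
}

/* Return non-zero if `p` is a multiple of `align`. */
static int is_aligned(const void *p, size_t align) {
    return ((uintptr_t) p % align) == 0;
}
```

With this helper, a buffer obtained from alloc_aligned(256, capacity) starts on a 256-byte boundary, which also satisfies the 64-byte and 128-byte cache line sizes mentioned in the commit message.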