
Use _mm512_popcnt_epi64 to speed up Hamming distance evaluation. #4020

Open · wants to merge 1 commit into main

Conversation

@mulugetam (Contributor) commented Nov 6, 2024

The _mm512_popcnt_epi64 intrinsic is used to accelerate Hamming distance calculations in HammingComputerDefault and HammingComputer64.

Benchmarking with bench_hamming_computer on an AWS r7i instance shows a performance improvement of up to 30% compared to AVX2.

This PR depends on PR #4025.
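The core of the change can be sketched roughly as follows (an illustrative sketch of the technique, not the actual Faiss patch; the function name and loop structure are my own, and a scalar fallback is included so the sketch compiles even without VPOPCNTDQ):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hamming distance over n64 64-bit code words. With -mavx512f
// -mavx512vpopcntdq, XOR the codes and count set bits per 64-bit lane
// using _mm512_popcnt_epi64; otherwise fall back to scalar popcount.
inline uint64_t hamming_distance(const uint64_t* a, const uint64_t* b, size_t n64) {
    uint64_t dist = 0;
    size_t i = 0;
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
    __m512i acc = _mm512_setzero_si512();
    for (; i + 8 <= n64; i += 8) {
        __m512i x = _mm512_xor_si512(
            _mm512_loadu_si512((const void*)(a + i)),
            _mm512_loadu_si512((const void*)(b + i)));
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(x));
    }
    dist = (uint64_t)_mm512_reduce_add_epi64(acc);
#endif
    // Scalar tail (also the full path when VPOPCNTDQ is unavailable).
    for (; i < n64; ++i)
        dist += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return dist;
}
```

The XOR-then-popcount pattern is the standard binary-code distance; the AVX-512 path simply replaces eight scalar popcounts per iteration with one vector instruction.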

@alexanderguzhva (Contributor)

@mulugetam Sure, _mm512_popcnt_epi64() requires VPOPCNTDQ (available on Intel Ice Lake CPUs and later), but the Faiss AVX512 build for Linux defaults to the F, CD, VL, DQ, and BW extensions (https://github.com/facebookresearch/faiss/blob/main/faiss/CMakeLists.txt). So the performance gain from this PR may be seen only if someone builds Faiss themselves using an MSVC compiler.

@mengdilin (Contributor)

@mulugetam FYI, we are also working with other Intel collaborators (@guangzegu and @xtangxtang) to incorporate AMX into Faiss (see #3266). @asadoughi suggested that perhaps we should have a meeting to sync up on the collaboration with Intel.

@mengdilin (Contributor)

Thanks for your contributions :) Question: What is the performance for Hamming distance calculations when VPOPCNTDQ is not supported compared to AVX2?

@mulugetam (Contributor, Author)

@alexanderguzhva Any downside to using the scheme below? I check whether the compiler and the underlying machine support VPOPCNTDQ, and then add the -mavx512vpopcntdq flag to target_compile_options based on that.

function(check_avx512vpopcntdq_support result_var)
    set(TEST_PROGRAM "${CMAKE_BINARY_DIR}/check_avx512vpopcntdq.cpp")

    file(WRITE ${TEST_PROGRAM} "
    int main() {
        #ifdef __has_builtin
            #if __has_builtin(__builtin_cpu_supports)
                if (__builtin_cpu_supports(\"avx512vpopcntdq\")) {
                    return 0;
                }
            #endif
        #endif
        return 1;
    }
    ")

    try_run(RUN_RESULT COMPILE_RESULT
            ${CMAKE_BINARY_DIR}
            ${TEST_PROGRAM})

    if(RUN_RESULT EQUAL 0)
        set(${result_var} TRUE PARENT_SCOPE)
    else()
        set(${result_var} FALSE PARENT_SCOPE)
    endif()
endfunction()

include(CheckCXXCompilerFlag)
check_cxx_compiler_flag("-mavx512vpopcntdq" COMPILER_SUPPORTS_AVX512VPOPCNTDQ)

if(COMPILER_SUPPORTS_AVX512VPOPCNTDQ)
    check_avx512vpopcntdq_support(AVX512VPOPCNTDQ_SUPPORT)
    if(AVX512VPOPCNTDQ_SUPPORT)
        # Set the variable and define the macro only when the runtime check passes
        set(HAS_AVX512VPOPCNTDQ_SUPPORT TRUE)
        add_definitions(-DHAS_AVX512VPOPCNTDQ_SUPPORT)
    endif()
endif()

# Line 259 gets modified
if(HAS_AVX512VPOPCNTDQ_SUPPORT)
    target_compile_options(faiss_avx512 PRIVATE
        $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt -mavx512vpopcntdq>)
else()
    target_compile_options(faiss_avx512 PRIVATE
        $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt>)
endif()

@mulugetam (Contributor, Author)

> Thanks for your contributions :) Question: What is the performance for Hamming distance calculations when VPOPCNTDQ is not supported compared to AVX2?

@mengdilin If I benchmark the AVX-512 build without VPOPCNTDQ and compare it to the AVX2 build, I’d see about a 1% - 5% speedup (depending on the code_size) for AVX-512.

@mulugetam (Contributor, Author)

> @mulugetam FYI, we are also working with other Intel collaborators (@guangzegu and @xtangxtang) to incorporate AMX into Faiss (see #3266). @asadoughi suggested that perhaps we should have a meeting to sync up on the collaboration with Intel.

@mengdilin
That's nice. I'm approaching this from an OpenSearch perspective, where FAISS is used as one of the engines. We've observed significant performance improvements in vector indexing and search with our previous PR, which added AVX-512 support for ScalarQuantizer.

@alexanderguzhva (Contributor) commented Nov 7, 2024

@mulugetam Technically, a CMake-based check for VPOPCNTDQ may lead to the following situation: say a build is performed on a machine with the most recent CPU generation, and the built Python package is then published to a conda / pip repository. This package spreads across the world and causes problems for everyone who has an Intel Cascade Lake CPU.
A preferable solution at this stage would be to add a new architecture mode (currently there are 'generic', 'avx2' and 'avx512'); it is up to the Faiss team to decide whether to accept another arch.
Also, it is possible to have dynamic kernel selection via function pointers, but the kernel is tiny (the function-pointer overhead would be huge), and Faiss does not currently support such an approach.
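For illustration, the function-pointer dispatch alternative mentioned here could look roughly like this (a hypothetical sketch, not existing Faiss code; popcnt_fn, select_kernel, and popcnt_impl are made-up names, and the AVX-512 kernel itself is omitted). The indirect call through the pointer is the overhead being discussed, which for a kernel this small can dominate:

```cpp
#include <cstdint>
#include <cstddef>

// Signature shared by all popcount-sum kernels.
using popcnt_fn = uint64_t (*)(const uint64_t*, size_t);

// Portable scalar kernel: sum of set bits across n 64-bit words.
static uint64_t popcnt_scalar(const uint64_t* v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += (uint64_t)__builtin_popcountll(v[i]);
    return s;
}

// Pick a kernel once at startup based on a runtime CPUID check.
static popcnt_fn select_kernel() {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512vpopcntdq")) {
        // A VPOPCNTDQ-based kernel would be returned here; omitted in this sketch.
    }
#endif
    return popcnt_scalar;
}

// All call sites go through this pointer, paying one indirect call each time.
static const popcnt_fn popcnt_impl = select_kernel();
```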

@mulugetam (Contributor, Author)

> @mulugetam Technically, a CMake-based check for VPOPCNTDQ may lead to the following situation: say a build is performed on a machine with the most recent CPU generation, and the built Python package is then published to a conda / pip repository. This package spreads across the world and causes problems for everyone who has an Intel Cascade Lake CPU. A preferable solution at this stage would be to add a new architecture mode (currently there are 'generic', 'avx2' and 'avx512'); it is up to the Faiss team to decide whether to accept another arch. Also, it is possible to have dynamic kernel selection via function pointers, but the kernel is tiny (the function-pointer overhead would be huge), and Faiss does not currently support such an approach.

@alexanderguzhva Thanks! In addition to VPOPCNTDQ, we plan to make additional PRs to speed up the scalar quantizer with FP16 instructions. I will modify this PR and introduce an 'avx512_advanced' mode. Hopefully, the team will accept the changes, given that it offers a performance boost.

@alexanderguzhva (Contributor)

@mulugetam FP16 means avx512fp16 or f16c?

@alexanderguzhva (Contributor)

@mulugetam Well, there are two more problems then.

  1. fp16 is not supported on AMD, so the proposed arch should be called something like avx512-sr. That would also be beneficial if Faiss eventually gets AMX support, because AMX also requires SR as a minimum.
  2. fp16 might lack the needed precision range. As of now, the FP16 version of the Faiss ScalarQuantizer assumes that data is stored in fp16 but computations are performed in fp32. Performing computations purely in fp16 (say, via _mm512_fmadd_ph()) might produce out-of-range problems in the final result. So I'd recommend adding a specific flag to the ScalarQuantizer class that lets it opt into pure fp16-based computations.

Nevertheless, please write and benchmark the code, and then we'll decide how to integrate it into Faiss properly. :) Thanks

@mulugetam (Contributor, Author)

@alexanderguzhva @mengdilin I have created a new PR, #4025, which adds the 'avx512-sr' architecture mode, and have marked this PR as dependent on it.
