Faiss GPU: bfloat16 brute-force kNN support (#4014) #4018

Closed
wants to merge 1 commit

Conversation

wickedfoo
Contributor

Summary:

This diff adds support for bfloat16 vector/query data types with the GPU brute-force k-nearest neighbor function (`bfKnn`).

The change is largely just plumbing the new data type through the template hierarchy (so distances can be computed in bfloat16).

Of note, by design, all final distance results are produced in float32 regardless of the input data type (float32, float16, bfloat16). This is because the true nearest neighbors in many data sets can often differ by only ~1000 float32 ULPs in terms of distance, which can result in false equivalences. This seems to be one area where lossy compression/quantization throughout does not work as well (and is also why `CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION` is set in `StandardGpuResources.cpp`). However, given that there is native bf16 x bf16 = fp32 tensor core support on Ampere+ architectures, the matrix multiplication itself should use them.
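
For context, a minimal sketch of how a caller would select the new type through the public `GpuDistanceParams` API (the `DistanceDataType::BF16` enumerator name and the `__nv_bfloat16` device buffers are assumptions based on this description, not the literal diff):

```cpp
#include <faiss/gpu/GpuDistance.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <cuda_bf16.h> // __nv_bfloat16

// Sketch: brute-force kNN over bf16 data already resident on the GPU.
// Distances come back in float32 regardless of the input type, by design.
void searchBf16(
        const __nv_bfloat16* db,      // [numVectors x dims], row-major
        const __nv_bfloat16* queries, // [numQueries x dims], row-major
        int64_t numVectors,
        int64_t numQueries,
        int dims,
        int k,
        float* outDistances,  // [numQueries x k], always fp32
        int64_t* outIndices) {
    faiss::gpu::StandardGpuResources res;

    faiss::gpu::GpuDistanceParams args;
    args.metric = faiss::METRIC_L2;
    args.k = k;
    args.dims = dims;
    args.vectors = db;
    args.vectorType = faiss::gpu::DistanceDataType::BF16; // assumed new enumerator
    args.numVectors = numVectors;
    args.queries = queries;
    args.queryType = faiss::gpu::DistanceDataType::BF16;
    args.numQueries = numQueries;
    args.outDistances = outDistances;
    args.outIndices = outIndices;

    faiss::gpu::bfKnn(&res, args);
}
```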

WARNING: The one thing this diff does not yet handle properly is header inclusion / compilation for GPUs older than Ampere. This will need to be fixed before landing, so that compiling with an older CUDA SDK, or compiling for the Volta architecture, errors out cleanly at runtime with a lack-of-support message instead of failing to compile.
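
One plausible shape for that guard (a sketch only; the `FAISS_HAS_BF16` macro and its wiring are hypothetical, while `CUDA_VERSION` and compute capability 8.0 for Ampere are standard CUDA facts):

```cpp
#include <cuda.h>         // CUDA_VERSION
#include <cuda_runtime.h> // cudaDeviceProp, cudaGetDeviceProperties

// Compile the bf16 path only when the SDK ships the header (CUDA 11+).
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000
#define FAISS_HAS_BF16 1 // hypothetical feature macro
#include <cuda_bf16.h>
#endif

// Runtime check: report support only on Ampere (SM 8.0) and newer, so older
// SDKs/architectures fail cleanly at runtime rather than at compile time.
bool bf16Supported(int device) {
#ifdef FAISS_HAS_BF16
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.major >= 8;
#else
    (void)device;
    return false;
#endif
}
```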

Differential Revision: D65459723

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65459723

wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 6, 2024

@gtwang01
Contributor

gtwang01 commented Nov 7, 2024

I noticed this PR is very similar to #4014. Is this a duplicate of that?

wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 12, 2024

wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 18, 2024

wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 19, 2024
Summary:
Pull Request resolved: facebookresearch#4018

Pull Request resolved: facebookresearch#4014

This diff adds support for bfloat16 vector/query data types with the GPU brute-force k-nearest neighbor function (`bfKnn`).

The change is largely just plumbing the new data type through the template hierarchy (so distances can be computed in bfloat16).

Of note, by design, all final distance results are produced in float32 regardless of the input data type (float32, float16, bfloat16). This is because the true nearest neighbors in many data sets can often differ by only ~1000 float32 ULPs in terms of distance, which can result in false equivalences. This seems to be one area where lossy compression/quantization throughout does not work as well (and is also why `CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION` is set in `StandardGpuResources.cpp`). However, given that there is native bf16 x bf16 = fp32 tensor core support on Ampere+ architectures, the matrix multiplication itself should use them.
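
For reference, disallowing reduced-precision reduction is a one-call cuBLAS setting; a sketch of how it is typically applied to a handle (the actual call site in `StandardGpuResources.cpp` may differ):

```cpp
#include <cublas_v2.h>

// Force fp16/bf16 GEMMs to accumulate in float32 rather than reduced
// precision, so small distance differences are not lost in the reduction.
void disallowReducedPrecisionReduction(cublasHandle_t handle) {
    cublasSetMathMode(
            handle,
            static_cast<cublasMath_t>(
                    CUBLAS_DEFAULT_MATH |
                    CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION));
}
```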

As bfloat16 support is quite lacking on AMD/ROCm (see [here](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Device_API_supported_by_HIP.html); very few bf16 functions are implemented), bf16 functionality is completely disabled / not compiled for AMD ROCm.
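
A hypothetical illustration of that gating (treating `USE_AMD_ROCM` as the build define and `FAISS_THROW_MSG` as the error path is an assumption about how the diff is wired):

```cpp
#include <faiss/impl/FaissAssert.h> // FAISS_THROW_MSG

// Hypothetical dispatch: bf16 kernels are compiled only for CUDA builds,
// and ROCm builds reject bf16 inputs with a clear runtime error.
void dispatchBf16(/* ... tensor arguments ... */) {
#ifdef USE_AMD_ROCM
    FAISS_THROW_MSG("bfloat16 inputs are not supported on AMD ROCm builds");
#else
    // ... instantiate / call the bf16 implementation ...
#endif
}
```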

Reviewed By: mdouze

Differential Revision: D65459723

@facebook-github-bot
Contributor

This pull request has been merged in eaab46c.
