fbgemm_gpu test fail #2977

Open · ywangwxd opened this issue Aug 13, 2024 · 7 comments
ywangwxd commented Aug 13, 2024

I am using CUDA 11.8 and installed the 0.8 binary. When I run the test program batched_unary_embeddings_test.py, I get the following error:

ERROR: test_gpu (__main__.TableBatchedEmbeddingsTest)

Traceback (most recent call last):
File "/repo/fbgemm/fbgemm_gpu/test/batched_unary_embeddings_test.py", line 240, in test_gpu
self._test_main(gpu_infer=True)
File "/y/repo/fbgemm/fbgemm_gpu/test/batched_unary_embeddings_test.py", line 152, in _test_main
offsets_tensor[1:] = torch.ops.fbgemm.asynchronous_inclusive_cumsum(
File "//.conda/envs/torchrec/lib/python3.10/site-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When I tried compiling the package myself, it told me that nvcc does not support C++20. It is strange: if the release binary supports CUDA 11.8, why can't my nvcc (CUDA 11.8) even recognize C++20?

q10 (Contributor) commented Aug 13, 2024

Hi @ywangwxd

batched_unary_embeddings_test.py is known to be broken at the moment (see #1559) and we haven't had the bandwidth to fix it yet. It is currently ignored in the CI runs for this reason.

As for C++20 support, we are able to support C++20 with CUDA 11.8 because we add the -allow-unsupported-compiler flag to NVCC_PREPEND_FLAGS when invoking nvcc (see here).

ywangwxd (Author) commented Aug 13, 2024


1. Then how should I validate that the FBGEMM installation succeeded?
2. If you can do that, why do I get complaints about C++20 in my environment? Can I fix it by changing some option? I used
export NVCC_APPEND_FLAGS='-allow-unsupported-compiler'
before building, but it does not work.

q10 (Contributor) commented Aug 13, 2024

Hi @ywangwxd

Generally, the installation of FBGEMM can be validated by running python -c "import fbgemm_gpu" after installation. Our CI already runs the test suites for each PR across the supported CUDA versions, Python versions, etc.

It should be NVCC_PREPEND_FLAGS, not NVCC_APPEND_FLAGS (which is a separate, also valid environment variable).
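
For reference, a slightly more verbose version of that import check might look like the following. This is only a sketch; the __version__ attribute is assumed rather than guaranteed, hence the getattr fallback.

```python
# Post-install sanity check (sketch): the import should succeed and register
# the torch.ops.fbgemm operators with PyTorch.
import torch
import fbgemm_gpu

print("torch:", torch.__version__)
print("fbgemm_gpu:", getattr(fbgemm_gpu, "__version__", "unknown"))
# Looking up an op raises if the fbgemm operator library failed to load.
print(torch.ops.fbgemm.asynchronous_inclusive_cumsum)
```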

ywangwxd (Author) commented:


Is there any way to validate the runtime, not just the import? I encountered a problem when using fbgemm with torchrec.
I suspect it is a problem with fbgemm, but I want to test fbgemm on its own. I used pip to install it this time.

[screenshot1: error output]
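
As an aside, a runtime smoke test along these lines would exercise an actual CUDA kernel rather than just the import. It is only a sketch built around the asynchronous_inclusive_cumsum op from the traceback above, not an official validation script.

```python
# Runtime smoke test (sketch): run one fbgemm CUDA op end to end. This is the
# same op that failed in the traceback above, so errors such as
# "invalid device function" would surface here.
import torch
import fbgemm_gpu  # noqa: F401  -- the import registers torch.ops.fbgemm.*

lengths = torch.tensor([1, 2, 3, 4], dtype=torch.int64, device="cuda")
offsets = torch.ops.fbgemm.asynchronous_inclusive_cumsum(lengths)
torch.cuda.synchronize()  # force any asynchronous CUDA launch errors to surface
print(offsets)  # expected: tensor([ 1,  3,  6, 10], device='cuda:0')
```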

q10 (Contributor) commented Aug 13, 2024

Hi @ywangwxd

This error usually indicates that you're running on a CUDA hardware model for which we did not compile the FBGEMM code. We generally compile FBGEMM for SM 7.0, 8.0, 9.0, and 9.0a. What is the hardware model you are running the code on?
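
A quick way to check which SM version the local GPU reports, using standard PyTorch calls (the threshold below simply reflects the architectures listed above):

```python
# Report the local GPU's compute capability (SM version). The prebuilt
# fbgemm_gpu binaries target SM 7.0 / 8.0 / 9.0 per the comment above, so
# anything older than SM 7.0 will not be covered.
import torch

assert torch.cuda.is_available(), "no CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM {major}.{minor}")
if major < 7:
    print("This GPU predates SM 7.0 and is not covered by the prebuilt binaries.")
```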

spcyppt (Contributor) commented Aug 13, 2024

Hi @ywangwxd, can you show the result from running nvidia-smi?

ywangwxd (Author) commented:


OK, this is the problem. I am using a P100 card, which is SM 6.0.
I suggest specifying this in the requirements; they only mention CUDA versions, which mine matches.
