Error and hang on v100 #1041

Open
samnordmann opened this issue Oct 28, 2024 · 1 comment

samnordmann commented Oct 28, 2024

  • Setup: DGX 8*V100 32GB, CUDA 12.4, node "dgx1v-loki-23" in dlcluster
  • 1 node, two processes
  • reproducer:
docker run \
  --rm --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 \
  --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
  --cap-add=SYS_PTRACE --privileged \
  --device=/dev/infiniband \
  --gpus all \
  gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
  /bin/bash -c 'mpirun -np 2 build/test_multidevice --gtest_filter=*Gather/UCC*'
  • Error: [1730117228.178968] [dgx1v-loki-23:3000 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7f65a8000000 len:16777216 err:201
  • or sometimes it just segfaults
  • UCX version: # API headers version: 1.18.0, Git branch 'master', revision 9da106a
  • UCC version=1.4.0 revision 2bb2b73
samnordmann changed the title from "Error and hang in TL/CUDA on v100" to "Error and hang on v100" on Oct 28, 2024
samnordmann reopened this on Oct 28, 2024
samnordmann commented

I just figured out that adding -x UCC_CL_BASIC_TLS=^mlx5 to the mpirun command avoids the bug. In the debug log, the non-master rank 1 prints:

[1730120424.285612] [dgx1v-loki-23:26352:0] ucc_context.c:817 UCC DEBUG ctx create epilog for mlx5 failed: Unhandled error

Rank 1 then enters context cleanup, which contains a barrier, while rank 0 initializes mlx5 successfully and never reaches that barrier. Hence the hang.
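For anyone who wants to try the workaround, here is how it might be folded into the reproducer's mpirun line. This is a minimal sketch, assuming Open MPI's mpirun (whose -x flag exports an environment variable to the launched ranks) and the same container and binary path as in the reproducer above. The variable names and the RUN guard are illustrative only and are not part of the original command.

```shell
# Workaround reported above: exclude the mlx5 TL from UCC's CL/BASIC.
WORKAROUND='UCC_CL_BASIC_TLS=^mlx5'

# Same test invocation as the reproducer, with the extra -x export added.
TEST_CMD="mpirun -np 2 -x ${WORKAROUND} build/test_multidevice --gtest_filter=*Gather/UCC*"

# Print the command for inspection; nothing is launched unless RUN=1 is set.
# RUN is a hypothetical guard variable used only in this sketch.
echo "$TEST_CMD"
if [ "${RUN:-0}" = "1" ]; then
  # Intended to be run inside the same container as the reproducer.
  /bin/bash -c "$TEST_CMD"
fi
```

To launch it, set RUN=1 in the environment of the script and run it inside the container, in place of the /bin/bash -c '...' line of the reproducer.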
