I just figured out that adding -x UCC_CL_BASIC_TLS=^mlx5 works around the bug. In the debug log, the non-master rank 1 prints
[1730120424.285612] [dgx1v-loki-23:26352:0] ucc_context.c:817 UCC DEBUG ctx create epilog for mlx5 failed: Unhandled error
and then enters context cleanup, which contains a barrier. Rank 0, meanwhile, initializes mlx5 successfully and so does not enter that cleanup path, hence the hang.
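As a rough sketch of how the workaround might be applied to the reproducer from this issue, the snippet below only builds the mpirun command line with the extra flag and prints it; it does not launch anything. The -x option is Open MPI's way of exporting an environment variable to all ranks, and the leading ^ is understood here to exclude mlx5 from the TL list. The shell variable names are illustrative; the binary path and gtest filter are copied from the reproducer.

```shell
#!/bin/sh
# Dry-run sketch: print the mpirun command with the workaround applied.
# Assumption: UCC_CL_BASIC_TLS=^mlx5 keeps the mlx5 TL out of UCC's basic CL.
UCC_WORKAROUND='UCC_CL_BASIC_TLS=^mlx5'   # value taken from the comment above
TEST_BIN='build/test_multidevice'         # binary path from the reproducer
GTEST_FILTER='*Gather/UCC*'               # test filter from the reproducer

# -x exports the variable to every rank started by Open MPI's mpirun.
CMD="mpirun -np 2 -x ${UCC_WORKAROUND} ${TEST_BIN} --gtest_filter='${GTEST_FILTER}'"
echo "${CMD}"
```

The printed line can be pasted in place of the mpirun invocation inside the container's /bin/bash -c string (adjusting the quoting around the filter as needed).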
docker run \
  --rm --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 \
  --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
  --cap-add=SYS_PTRACE --privileged \
  --device=/dev/infiniband \
  --gpus all \
  gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
  /bin/bash -c 'mpirun -np 2 build/test_multidevice --gtest_filter=*Gather/UCC*'
[1730117228.178968] [dgx1v-loki-23:3000 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7f65a8000000 len:16777216 err:201
# API headers version: 1.18.0, Git branch 'master', revision 9da106a