This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

cuda_ipc: Couldn't find GPU with UUID xxx #412

Open
tscmoo opened this issue Sep 28, 2021 · 4 comments

Comments

@tscmoo

tscmoo commented Sep 28, 2021

In globalIdxForDevice at tensorpipe/channel/cuda_ipc/context_impl.cc:102 "iter == globalUuids.end()Couldn't find GPU with UUID 9b967b0b-fa75-3a89-0770-e950eea546c5"

Two SLURM jobs were scheduled on the same node and allocated two different GPUs; each process cannot see the other's GPU.
The failure occurs in cuda_ipc's canCommunicateWithRemote.

I don't know if CUDA IPC can be made to work when the remote device is not visible from the current process.
This may be expected to fail, but canCommunicateWithRemote should probably just return false in that case instead of throwing an exception (though that would also leave cuda_ipc unusable for my use-case).
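For illustration, a minimal sketch of what a non-throwing visibility check could look like (hypothetical names, not TensorPipe's actual code):

```cpp
// Hypothetical sketch: treat an unknown remote UUID as "cannot communicate"
// instead of asserting that it must be present in the local NVML list.
#include <algorithm>
#include <string>
#include <vector>

bool canCommunicateWithRemote(
    const std::vector<std::string>& localUuids, // UUIDs visible via NVML
    const std::string& remoteUuid) {            // UUID advertised by the peer
  auto iter = std::find(localUuids.begin(), localUuids.end(), remoteUuid);
  // Returning false lets the caller skip the cuda_ipc channel instead of
  // aborting with "Couldn't find GPU with UUID ...".
  return iter != localUuids.end();
}
```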

@lw
Contributor

lw commented Sep 29, 2021

Interesting. I must admit I had seen this before, but then it went away and I didn't think much of it since then.

TensorPipe should be able to circumvent limits on GPU visibility if they are imposed through CUDA_VISIBLE_DEVICES, because it uses NVML to detect GPUs, and NVML doesn't honor that environment variable. This is the same behavior as nvidia-smi (e.g., try running CUDA_VISIBLE_DEVICES=0 nvidia-smi on a multi-GPU machine).
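As a standalone illustration of that difference (a minimal sketch, not TensorPipe code), the CUDA runtime honors CUDA_VISIBLE_DEVICES while NVML enumerates every GPU the driver exposes:

```cpp
// Sketch: compare device counts seen by the CUDA runtime vs. NVML.
// Build with e.g. nvcc visibility.cc -lnvidia-ml
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main() {
  int cudaCount = 0;
  cudaGetDeviceCount(&cudaCount);  // filtered by CUDA_VISIBLE_DEVICES
  std::printf("CUDA runtime sees %d device(s)\n", cudaCount);

  if (nvmlInit() == NVML_SUCCESS) {
    unsigned int nvmlCount = 0;
    nvmlDeviceGetCount(&nvmlCount);  // NOT filtered by CUDA_VISIBLE_DEVICES
    std::printf("NVML sees %u device(s)\n", nvmlCount);
    for (unsigned int i = 0; i < nvmlCount; ++i) {
      nvmlDevice_t dev;
      char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
      if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
          nvmlDeviceGetUUID(dev, uuid, sizeof(uuid)) == NVML_SUCCESS) {
        std::printf("  GPU %u: %s\n", i, uuid);
      }
    }
    nvmlShutdown();
  }
  return 0;
}
```

With the SLURM isolation described in this thread, NVML apparently sees only the allocated GPU as well, which is why the remote UUID lookup fails.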

So it seems that your two jobs are isolated from each other to a point where even NVML isn't able to see the devices. I'm not sure how exactly SLURM is doing that. If you have any insight into that I'd love to hear it, otherwise I'll have to look into it...

@tscmoo
Author

tscmoo commented Sep 29, 2021

I don't know how it does it, but indeed it's not through CUDA_VISIBLE_DEVICES.
nvidia-smi, nvtop, etc. also show only the one allocated GPU (and always as GPU 0).

@lw
Contributor

lw commented Oct 29, 2021

I looked more into this and indeed it seems there's an NVIDIA Docker integration of sorts that limits the container to seeing only some of the GPUs, probably by limiting which device nodes are visible within the container (https://nvidia.github.io/nvidia-container-runtime/). It also seems that SLURM uses this extension, since I saw similar mentions in their docs.

So this means that, at a minimum, we need to relax the assumption in our code that two processes running on the "same kernel" must be able to see the same set of devices.

However, by doing that we would simply end up refusing to use NVLink between two GPUs on the same machine whenever the processes are in different containers, which is not what we want. I don't see any method in the CUDA or NVML APIs that allows us to check whether peer-to-peer support is enabled between a GPU in the container and one outside of it, and we need to check that in order to determine in advance whether we can use IPC. (This is because TensorPipe's handshake is "eager", so we can't just try to use IPC and "see".)
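For reference, the standard peer-to-peer check looks like the sketch below (not TensorPipe code); since cudaDeviceCanAccessPeer takes only local device ordinals, there is no way to point it at a GPU that has no ordinal inside this container:

```cpp
// Sketch: the usual CUDA peer-to-peer capability check between two devices
// that are both visible to the calling process.
#include <cuda_runtime.h>

bool canUsePeerAccess(int localDevice, int peerDevice) {
  int canAccess = 0;
  if (cudaDeviceCanAccessPeer(&canAccess, localDevice, peerDevice) != cudaSuccess) {
    return false;  // treat query failure as "no peer access"
  }
  return canAccess != 0;
}
```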

What I did see, however, is a method in NVML to list the NVLinks of a given device (in the container) and get the PCI address of the other endpoint of each NVLink. I haven't yet checked whether this works if the remote is in another container, but if it does, it could allow us to basically "reimplement" the peer-to-peer check on our own. I'm not super excited about that, since I don't know if we can really perfectly recreate CUDA's answer, but it's the only option I see...
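A rough sketch of those NVML calls (my reading of the API, not TensorPipe code); whether the remote endpoint is still reported when it sits in another container is exactly the open question:

```cpp
// Sketch: walk the NVLinks of a local device and print the PCI bus ID of
// the remote endpoint of each active link.
#include <cstdio>
#include <nvml.h>

void listNvLinkPeers(nvmlDevice_t device) {
  for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
    nvmlEnableState_t active;
    if (nvmlDeviceGetNvLinkState(device, link, &active) != NVML_SUCCESS ||
        active != NVML_FEATURE_ENABLED) {
      continue;  // link not present or not up
    }
    nvmlPciInfo_t pci;
    if (nvmlDeviceGetNvLinkRemotePciInfo(device, link, &pci) == NVML_SUCCESS) {
      std::printf("link %u -> remote PCI %s\n", link, pci.busId);
    }
  }
}
```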

@lw
Contributor

lw commented Oct 29, 2021

BTW I was wondering what you were trying to achieve when you hit this problem. Were you separating processes on the same machine using containers on purpose? What's the reason to do this instead of using a single container? Or were you scheduling smaller jobs which just happened to end up on the same machine?
