Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s #455
Comments
It is speculated that the version of MLNX_OFED may be too low, and I am working on this.
After troubleshooting, I finally determined the cause of this problem. On the k8s training cluster in our lab, each compute node is equipped with two IB network cards, but only one of them is in use; the other is not cabled. The ibstatus command shows the following status:
It can be seen that the status of the network card "mlx5_1" is down, but tensorpipe simply selects the first device in the device list. So I made a slight modification here, so that it traverses the deviceList and only selects devices whose port status is active, and I am very happy to find that tensorpipe now works perfectly.

However, this strategy may lead to inconsistent device selection on the two sides of the communication. For example, A has two IB network cards that both work well, while B has one good IB card and one bad IB card; this may still cause errors (if the two IB cards have different network configurations). Beyond that, the order of devices in the device list is another potential source of inconsistency.

Tensorpipe is the best RDMA-enabled RPC library I've seen, and our projects use it heavily. I very much hope that I can do my best to contribute to it. I will open a PR; feel free to make further changes based on review comments. Thanks :)
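For illustration, here is a minimal sketch of that port-state filter written directly against the ibverbs C API, not the actual tensorpipe patch; the `activeIbDevices` helper name and overall structure are mine:

```cpp
// Sketch: keep only IB devices that have at least one ACTIVE port.
// This mirrors the idea of the fix, not tensorpipe's actual implementation.
#include <infiniband/verbs.h>
#include <string>
#include <vector>

std::vector<std::string> activeIbDevices() {
  std::vector<std::string> result;
  int numDevices = 0;
  ibv_device** deviceList = ibv_get_device_list(&numDevices);
  if (deviceList == nullptr) {
    return result;
  }
  for (int i = 0; i < numDevices; ++i) {
    ibv_context* ctx = ibv_open_device(deviceList[i]);
    if (ctx == nullptr) {
      continue;
    }
    ibv_device_attr devAttr{};
    if (ibv_query_device(ctx, &devAttr) == 0) {
      // Ports are numbered starting from 1.
      for (uint8_t port = 1; port <= devAttr.phys_port_cnt; ++port) {
        ibv_port_attr portAttr{};
        if (ibv_query_port(ctx, port, &portAttr) == 0 &&
            portAttr.state == IBV_PORT_ACTIVE) {
          // Keep devices like mlx5_0 (ACTIVE) and skip mlx5_1 (DOWN).
          result.push_back(ibv_get_device_name(deviceList[i]));
          break;
        }
      }
    }
    ibv_close_device(ctx);
  }
  ibv_free_device_list(deviceList);
  return result;
}
```

Even with such a filter, the caveat above still applies: both peers have to end up selecting HCAs that can actually reach each other.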
Love to hear that you're finding TensorPipe useful (and I would be interested in knowing more about what you're doing with it). And you're right that the IB device selection logic is quite naive; I'll review your PR and gladly land it. Unfortunately we've stopped actively investing in improving TensorPipe, hence we won't be able to do any of the above changes ourselves. But if you're willing to contribute more fixes I'll happily accept them.
One assumption we made, which I believe will be very hard to relax, is that all the IB cards on all the nodes can talk to each other. This is basically always true for the TCP/IP network, and thus we thought we could impose it also for the IB network. If we wanted to support cases where it isn't so, then we'd need a way to detect if two HCAs are in the same network, and AFAIK this isn't easily doable (except if one has UMAD access, but it tends to require root privileges). Do you have any ideas?
Hi Luca @lw, great to hear back from you.
The main scenario where we use tensorpipe is reinforcement learning (RL) training, which mainly consists of two different kinds of processes: the collector process (which may contain multiple subprocesses) is responsible for interacting with the RL environment (using CPU) and performing model inference (using GPU); the learner process is responsible for model training (using GPU). The data flow between collector and learner is:
In most cases, there is one learner and multiple collectors. The number of collectors may dynamically increase or decrease based on demand. In our k8s environment, these workers are encapsulated in Pods and scheduled to run on physical nodes. Pods use containers to isolate processes, so even if two Pods are scheduled to the same physical node, communication methods such as shm cannot be used without additional configuration (such as mounting /dev/shm to the container). So we hope that the ideal communication library has the following capabilities:
I tried to use nccl's send/recv API to achieve the above goals, but I found to my dismay that nccl is not very suitable for our scenario:
Finally I discovered tensorpipe, and I was very excited to find it worked well for most of our needs. Hence the above story.
I would like to ask you for advice. I read the content of issue#405 and learned about
I regret that pytorch no longer continues to support tensorpipe. The heterogeneous data (CPU data / GPU data) and the dynamic set of communicating entities in RL training are difficult to handle with the existing APIs, and tensorpipe solves this problem perfectly. I'd love to continue contributing to tensorpipe and hope this awesome project will continue to be active.
Sorry, I don't have any more ideas right now; I think I need to read more documentation. But the good news is that tensorpipe can now run perfectly on our cluster. I'll update if I have any ideas.
Hello, my code is running in a k8s environment. I started pytorch in two pods and tried to use torchrpc, but I encountered an error in the torch.distributed.rpc.init_rpc function. I hope to get some advice or inspiration.
Code
Error
I set TP_VERBOSE_LOGGING so that it prints tensorpipe debug info.
It can be seen that the fatal error is IBV_WC_RETRY_EXC_ERR: Transport Retry Counter Exceeded. This means that the remote side didn't send any ACK or NACK. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. However, the other process does not actually report an error; it just waits until init_rpc times out.
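For readers unfamiliar with ibverbs, this is roughly where such a status surfaces: it is reported on a work completion polled from the completion queue, after the QP has retransmitted its configured retry count without receiving any ACK/NAK. A generic sketch, not tensorpipe's code; `drainCompletions` and `cq` are placeholders:

```cpp
// Sketch: where IBV_WC_RETRY_EXC_ERR shows up when draining a completion queue.
// cq is assumed to be an ibv_cq* obtained elsewhere; this is not tensorpipe code.
#include <infiniband/verbs.h>
#include <cstdio>

void drainCompletions(ibv_cq* cq) {
  ibv_wc wc{};
  while (ibv_poll_cq(cq, 1, &wc) > 0) {
    if (wc.status != IBV_WC_SUCCESS) {
      // IBV_WC_RETRY_EXC_ERR ("transport retry counter exceeded") means the QP
      // retransmitted retry_cnt times without ever getting an ACK/NAK back.
      std::fprintf(stderr, "work completion failed: %s (wr_id=%llu)\n",
                   ibv_wc_status_str(wc.status),
                   static_cast<unsigned long long>(wc.wr_id));
    }
  }
}
```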
Here is the rank 0 log, name is B:
Here is the rank 1 log, name is A:
I found issue #452, which is very similar to my problem, but my network environment is InfiniBand, not RoCE. I also checked the LID and GID of the running environment, as follows:
//ibstat
//show_gids
Because the LID (Local Identifier) is a layer-2 attribute in the InfiniBand protocol stack, it does not need to be set in a RoCE network; when such a port is queried, this field is 0. I also read the tensorpipe source code: if the LID is not 0, the GID index has no effect on qp_attr. So I think my error is probably not caused by kGlobalIdentifierIndex.
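To illustrate that LID/GID distinction, here is a generic ibverbs sketch of how the address-handle attributes of an RC QP are typically filled in; it is not tensorpipe's actual code, and `remote_lid`, `remote_gid`, and `gid_index` are placeholders supplied by whatever peer exchange is in use:

```cpp
// Sketch: how LID vs. GID typically enters the QP's address handle attributes.
#include <infiniband/verbs.h>
#include <cstdint>

ibv_ah_attr makeAhAttr(uint16_t remote_lid, const ibv_gid& remote_gid,
                       uint8_t gid_index, uint8_t port_num) {
  ibv_ah_attr ah{};
  ah.port_num = port_num;
  ah.dlid = remote_lid;  // layer-2 LID; non-zero on a plain InfiniBand fabric
  if (remote_lid == 0) {
    // RoCE (or routed IB): no LID available, so address the peer by GID instead.
    // Only in this branch does the GID index matter (cf. kGlobalIdentifierIndex).
    ah.is_global = 1;
    ah.grh.dgid = remote_gid;
    ah.grh.sgid_index = gid_index;
    ah.grh.hop_limit = 1;
  }
  return ah;
}
```

Under that reading, on a pure InfiniBand fabric the LID is non-zero, the GRH branch is not taken, and the GID index indeed has no effect, which supports the conclusion above.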
In addition, I used pytorch's NCCL init_process_group API in the same environment and did not reproduce the above error.
And if I start two processes in the same Pod and run the same torch rpc code, everything works fine.
Container Env Versions
pytorch 1.12.1 py3.8_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 0.12.1 py38_cu113 pytorch
torchelastic 0.2.1 pypi_0 pypi
torchvision 0.13.1 py38_cu113 pytorch
MLNX_OFED_LINUX-4.6-1.0.1.1
ubuntu 18.04