This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s #455

Open
SolenoidWGT opened this issue Sep 21, 2022 · 5 comments · May be fixed by #456

Comments

@SolenoidWGT

SolenoidWGT commented Sep 21, 2022

Hello, my code is running in a k8s environment. I started PyTorch in two pods and tried to use torch RPC, but I ran into an error in the torch.distributed.rpc.init_rpc function. I hope to get some advice or inspiration.

Code

import argparse
import logging

import torch
import torch.distributed.rpc as rpc

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Test torch rpc')
    parser.add_argument('--rank', type=int)
    parser.add_argument('--init_method', type=str)
    args = parser.parse_args()
    rank = args.rank
    init_method = args.init_method
    world_size = 2
    rpc_backend_options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, init_method=init_method)
    if rank == 1:
        rpc_backend_options.set_device_map("B", {0: 0})
        name = "A"
    elif rank == 0:
        rpc_backend_options.set_device_map("A", {0: 0})
        name = "B"

    logging.debug(torch.__version__)
    logging.debug("name:{}, rank:{}, init_method:{}".format(name, rank, init_method))
    rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)

Error

I set TP_VERBOSE_LOGGING so that tensorpipe prints its debug info.
The fatal error is IBV_WC_RETRY_EXC_ERR: Transport Retry Counter Exceeded, which means the remote side didn't send any Ack or Nack. If this happens after sending the first message, it usually means the remote QP isn't available anymore. In reality, though, the other process does not report an error; it just waits until init_rpc times out.

Here is the rank 0 log, name is B:

[09-21 07:24:27] DEBUG    1.12.1                                                                                                                                func.py:67
[09-21 07:24:27] DEBUG    name:B, rank:0, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516                                      func.py:67
V0921 07:24:29.916767    89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 07:24:29.916805    89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as B
V0921 07:24:29.930452    89 tensorpipe/core/context_impl.cc:81] Context B is registering transport ibv
V0921 07:24:29.930539    89 tensorpipe/core/context_impl.cc:81] Context B is registering transport uv
V0921 07:24:29.932656    89 tensorpipe/core/context_impl.cc:81] Context B is registering transport shm
V0921 07:24:29.932819    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_basic
V0921 07:24:29.932829    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic
V0921 07:24:29.932836    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic.cpu
V0921 07:24:29.935908    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_xth
V0921 07:24:29.935921    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_xth
V0921 07:24:29.935981    89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 07:24:29.936031    89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 07:24:29.936044    89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:6783c3f0-b9c5-408b-80b7-9fd2287768d8_1000d9_f0001d4c_1000d9_effffffd_0_0_a80425fb
V0921 07:24:29.936110    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cma
V0921 07:24:29.936117    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cma
V0921 07:24:29.964395    89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 53dd857c-914a-4f99-7bcc-0af7e3edc566
V0921 07:24:29.964409    89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 07:24:29.964435    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_ipc
V0921 07:24:29.964445    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_ipc
V0921 07:24:29.964457    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel basic
V0921 07:24:29.964463    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_basic
V0921 07:24:29.995572    89 tensorpipe/core/context_impl.cc:104] Context B is registering channel mpt_uv
V0921 07:24:29.995585    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_mpt_uv
V0921 07:24:29.995705    89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 07:24:29.995721    89 tensorpipe/core/context_impl.cc:116] Context B is opening listener B[l0]
V0921 07:24:29.996051    89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport shm
V0921 07:24:29.996061    89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport ibv
V0921 07:24:29.996065    89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport uv
V0921 07:24:30.330784    89 tensorpipe/core/listener_impl.cc:102] Listener B[l0] received an accept request (#0)
V0921 07:24:30.331219    89 tensorpipe/core/context_impl.cc:128] Context B is opening pipe B.p0
V0921 07:24:30.331241    89 tensorpipe/core/context_impl.cc:132] Pipe B.p0 aliased as B_to_A
V0921 07:24:30.331300    89 tensorpipe/core/pipe_impl.cc:286] Pipe B_to_A is writing nop object (spontaneous connection)
V0921 07:24:30.331342    89 tensorpipe/core/pipe_impl.cc:311] Pipe B_to_A is writing nop object (brochure)
V0921 07:24:30.331349    89 tensorpipe/core/pipe_impl.cc:319] Pipe B_to_A is reading nop object (brochure answer)
V0921 07:24:30.331715    89 tensorpipe/core/pipe_impl.cc:290] Pipe B_to_A done writing nop object (spontaneous connection)
V0921 07:24:30.331771    89 tensorpipe/core/pipe_impl.cc:314] Pipe B_to_A done writing nop object (brochure)
V0921 07:24:30.332378    89 tensorpipe/core/pipe_impl.cc:322] Pipe B_to_A done reading nop object (brochure answer)
V0921 07:24:30.332393    89 tensorpipe/core/pipe_impl.cc:1113] Pipe B_to_A is opening connection (descriptor, as replacement)
V0921 07:24:30.332434    89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332448    89 tensorpipe/core/pipe_impl.cc:1131] Pipe B_to_A is opening connection (descriptor_reply)
V0921 07:24:30.332456    89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332527    89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/2 (for channel cuda_basic)
V0921 07:24:30.332538    89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332542    89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 1/2 (for channel cuda_basic)
V0921 07:24:30.332550    89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332565    89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic.cpu is opening channel B.ch_cuda_basic.cpu.c0
V0921 07:24:30.332585    89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic is opening channel B.ch_cuda_basic.c0
V0921 07:24:30.332608    89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.c0 was renamed to B_to_A.ch_cuda_basic
V0921 07:24:30.332622    89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.cpu.c0 was renamed to B_to_A.ch_cuda_basic.cpu
V0921 07:24:30.332631    89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/1 (for channel mpt_uv)
V0921 07:24:30.332638    89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332649    89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_mpt_uv is opening channel B.ch_mpt_uv.c0
V0921 07:24:30.332680    89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_mpt_uv.c0 was renamed to B_to_A.ch_mpt_uv
V0921 07:24:30.343321    89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343622    89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343782    89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343939    89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.345877    89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:31.525355    89 tensorpipe/core/pipe_impl.cc:550] Pipe B_to_A received a write request (#0, contaning 4 payloads and 0 tensors)
V0921 07:24:31.525398    89 tensorpipe/core/pipe_impl.cc:925] Pipe B_to_A is writing nop object (message descriptor #0)
V0921 07:24:31.525412    89 tensorpipe/core/pipe_impl.cc:942] Pipe B_to_A is writing payloads of message #0
V0921 07:24:31.525415    89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.0
V0921 07:24:31.525421    89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.1
V0921 07:24:31.525425    89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.2
V0921 07:24:31.525428    89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.3
V0921 07:24:31.525434    89 tensorpipe/core/pipe_impl.cc:880] Pipe B_to_A is sending tensors of message #0
V0921 07:24:31.525495    89 tensorpipe/core/pipe_impl.cc:931] Pipe B_to_A done writing nop object (message descriptor #0)
V0921 07:24:31.525575    89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.0
V0921 07:24:31.525598    89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.1
V0921 07:24:31.525622    89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.2
V0921 07:24:31.525634    89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.3
V0921 07:24:31.525640    89 tensorpipe/core/pipe_impl.cc:557] Pipe B_to_A is calling a write callback (#0)
V0921 07:24:31.525671    89 tensorpipe/core/pipe_impl.cc:560] Pipe B_to_A done calling a write callback (#0)
V0921 07:24:31.525692    89 tensorpipe/core/pipe_impl.cc:376] Pipe B_to_A received a readDescriptor request (#0)
V0921 07:24:31.525707    89 tensorpipe/core/pipe_impl.cc:845] Pipe B_to_A is reading nop object (message descriptor #0)
V0921 07:24:34.575529    89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_mpt_uv is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576221    89 tensorpipe/core/pipe_impl.cc:634] Pipe B_to_A is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576272    89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_mpt_uv is closing
V0921 07:24:34.576319    89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic is closing
V0921 07:24:34.576328    89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576354    89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic.cpu is closing
V0921 07:24:34.576362    89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic.cpu is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576381    89 tensorpipe/core/pipe_impl.cc:849] Pipe B_to_A done reading nop object (message descriptor #0)
V0921 07:24:34.576391    89 tensorpipe/core/pipe_impl.cc:382] Pipe B_to_A is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:940] RPC agent for B encountered error when reading incoming response from A: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576442    89 tensorpipe/core/pipe_impl.cc:385] Pipe B_to_A done calling a readDescriptor callback (#0)
Traceback (most recent call last):
  File "test_dist.py", line 470, in <module>
    setup(int(args.rank), 2, False, str(args.init_method))
  File "test_dist.py", line 425, in setup
    rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_backend_options=rpc_backend_options,
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 359, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 226, in _all_gather
    timeout=rpc_timeout,
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 800, in rpc_sync
    return fut.wait()
RuntimeError: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.744256    89 tensorpipe/core/context_impl.cc:220] Context B is closing
V0921 07:24:34.744301    89 tensorpipe/core/context_impl.cc:238] Context B is handling error context closed (this error originated at tensorpipe/core/context_impl.cc:221)

Here is the rank 1 log, name is A:

tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516
[09-21 09:01:44] DEBUG    not set NCCL_IB_GID_INDEX                                                                                                             func.py:67
[09-21 09:01:44] DEBUG    1.12.1                                                                                                                                func.py:67
[09-21 09:01:44] DEBUG    name:A, rank:1, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516                                      func.py:67
V0921 09:01:44.481227    89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 09:01:44.481273    89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as A
V0921 09:01:44.494768    89 tensorpipe/core/context_impl.cc:81] Context A is registering transport ibv
V0921 09:01:44.494844    89 tensorpipe/core/context_impl.cc:81] Context A is registering transport uv
V0921 09:01:44.497126    89 tensorpipe/core/context_impl.cc:81] Context A is registering transport shm
V0921 09:01:44.497258    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_basic
V0921 09:01:44.497267    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic
V0921 09:01:44.497273    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic.cpu
V0921 09:01:44.502026    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_xth
V0921 09:01:44.502036    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_xth
V0921 09:01:44.502157    89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 09:01:44.502223    89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 09:01:44.502234    89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:3d1e04ad-0931-4975-b822-467e2b3b1cee_100085_f000197b_100085_effffffd_0_0_a80425fb
V0921 09:01:44.502283    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cma
V0921 09:01:44.502290    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cma
V0921 09:01:44.538451    89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 19071872-e206-4660-2142-9651e3f148a8
V0921 09:01:44.538464    89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 09:01:44.538490    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_ipc
V0921 09:01:44.538497    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_ipc
V0921 09:01:44.538509    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel basic
V0921 09:01:44.538515    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_basic
V0921 09:01:44.580634    89 tensorpipe/core/context_impl.cc:104] Context A is registering channel mpt_uv
V0921 09:01:44.580645    89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_mpt_uv
V0921 09:01:44.580795    89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 09:01:44.580809    89 tensorpipe/core/context_impl.cc:116] Context A is opening listener A[l0]
V0921 09:01:44.581122    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport shm
V0921 09:01:44.581132    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:44.581136    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.754984    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:44.755039    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:44.755054    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.755066    89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#0)
V0921 09:01:44.755142    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:44.755171    89 tensorpipe/core/listener_impl.cc:293] Listener A[l0] got spontaneous connection
V0921 09:01:44.755259    89 tensorpipe/core/listener_impl.cc:295] Listener A[l0] is opening pipe A[l0].p0
V0921 09:01:44.755266    89 tensorpipe/core/listener_impl.cc:299] Pipe A[l0].p0 aliased as A[l0]_from_B
V0921 09:01:44.755295    89 tensorpipe/core/pipe_impl.cc:332] Pipe A[l0]_from_B is reading nop object (brochure)
V0921 09:01:44.755305    89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#0)
V0921 09:01:44.755323    89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#0)
V0921 09:01:44.755331    89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#1)
V0921 09:01:44.755341    89 tensorpipe/core/pipe_impl.cc:376] Pipe A[l0]_from_B received a readDescriptor request (#0)
V0921 09:01:44.755433    89 tensorpipe/core/pipe_impl.cc:335] Pipe A[l0]_from_B done reading nop object (brochure)
V0921 09:01:44.755483    89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755557    89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#0)
V0921 09:01:44.755566    89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755571    89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#1)
V0921 09:01:44.755706    89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:44.755761    89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#2)
V0921 09:01:44.755765    89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:44.755768    89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#3)
V0921 09:01:44.755774    89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:44.755777    89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#4)
V0921 09:01:44.755781    89 tensorpipe/core/pipe_impl.cc:1028] Pipe A[l0]_from_B is writing nop object (brochure answer)
V0921 09:01:44.755965    89 tensorpipe/core/pipe_impl.cc:1031] Pipe A[l0]_from_B done writing nop object (brochure answer)
V0921 09:01:45.056188    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.056226    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.056237    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.256277    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.256319    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.256326    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.259347    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.259359    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.259363    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.261365    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.261375    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.261379    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.263591    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.263600    89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.263604    89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:49.328515    89 tensorpipe/core/listener_impl.cc:207] Listener A[l0] is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328582    89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#1)
[W tensorpipe_agent.cpp:530] RPC agent for A encountered error when accepting incoming pipe: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328604    89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#1)
V0921 09:01:49.328610    89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#4)
V0921 09:01:49.328617    89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#4)
V0921 09:01:49.328621    89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#3)
V0921 09:01:49.328624    89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#3)
V0921 09:01:49.328628    89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#2)
V0921 09:01:49.328631    89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#2)
V0921 09:01:49.328634    89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#0)
V0921 09:01:49.328639    89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#0)
V0921 09:01:49.328642    89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#1)
V0921 09:01:49.328658    89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#1)
V0921 09:01:49.328689    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.328698    89 tensorpipe/core/pipe_impl.cc:634] Pipe A[l0]_from_B is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328707    89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#1)
V0921 09:01:49.328712    89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#0)
V0921 09:01:49.328716    89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#4)
V0921 09:01:49.328720    89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#2)
V0921 09:01:49.328726    89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#3)
V0921 09:01:49.328733    89 tensorpipe/core/pipe_impl.cc:382] Pipe A[l0]_from_B is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:726] RPC agent for A encountered error when reading incoming request from B: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328756    89 tensorpipe/core/pipe_impl.cc:385] Pipe A[l0]_from_B done calling a readDescriptor callback (#0)
V0921 09:01:49.328765    89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:49.328770    89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:49.328776    89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:49.328780    89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328787    89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328790    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:49.328797    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport shm
V0921 09:01:49.328802    89 tensorpipe/core/pipe_impl.cc:355] Pipe A[l0]_from_B is closing
V0921 09:01:49.329226    89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:49.329238    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329491    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329795    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.330076    89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)

I found issue #452 very similar to my problem, but my network environment is InfiniBand, not RoCE. I also checked the LID and GID of the running environment, as follows:
//ibstat

CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.28.2006
	Hardware version: 0
	Node GUID: 0x043f720300fc827c
	System image GUID: 0x043f720300fc827c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 141
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x043f720300fc827c
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.28.2006
	Hardware version: 0
	Node GUID: 0x043f720300fc8288
	System image GUID: 0x043f720300fc8288
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0x2651e848
		Port GUID: 0x043f720300fc8288
		Link layer: InfiniBand

//show_gids

DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
mlx5_0	1	0	fe80:0000:0000:0000:043f:7203:00fc:827c			v1
mlx5_1	1	0	fe80:0000:0000:0000:043f:7203:00fc:8288			v1

Because the LID (Local ID) is a layer-2 attribute of the InfiniBand protocol stack, it does not need to be set on a RoCE network, and this field is 0 when such a port is queried. Reading the tensorpipe source code, if the LID is not 0 the GID index has no effect on qp_attr, so I think my error is probably not caused by kGlobalIdentifierIndex.

tensorpipe/common/ibv.cc:117
  // Global routing is only set up as far as needed to support RoCE.
  attrMask |= IbvLib::QP_AV;
  if (destinationInfo.localIdentifier != 0) {
    attr.ah_attr.is_global = 0;
    attr.ah_attr.dlid = destinationInfo.localIdentifier;
  } else {
    attr.ah_attr.is_global = 1;
    attr.ah_attr.grh.dgid = destinationInfo.globalIdentifier;
    attr.ah_attr.grh.sgid_index = selfAddr.globalIdentifierIndex;
    attr.ah_attr.grh.hop_limit = 1;
  }

In addition, I used PyTorch's NCCL init_process_group API in the same environment and did not reproduce the above error. And if I start two processes in the same pod and run the same torch RPC code, everything works fine.

Container Env Versions

pytorch 1.12.1 py3.8_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 0.12.1 py38_cu113 pytorch
torchelastic 0.2.1 pypi_0 pypi
torchvision 0.13.1 py38_cu113 pytorch

MLNX_OFED_LINUX-4.6-1.0.1.1
ubuntu 18.04

@SolenoidWGT SolenoidWGT changed the title Error transport retry counter exceeded Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s Sep 21, 2022
@SolenoidWGT
Author

I suspect the MLNX_OFED version may be too old; I am looking into this.

@SolenoidWGT
Author

After troubleshooting, I finally determined the cause of this problem. On the k8s training cluster in our lab, each compute node is equipped with two IB network cards, but only one of them is usable; the other is not cabled. The ibstat command shows the following status:

CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.28.2006
        Hardware version: 0
        Node GUID: 0x043f720300e71090
        System image GUID: 0x043f720300e71090
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 147
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x043f720300e71090
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.28.2006
        Hardware version: 0
        Node GUID: 0x043f720300e71098
        System image GUID: 0x043f720300e71098
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0x043f720300e71098
                Link layer: InfiniBand

As can be seen, the status of the NIC "mlx5_1" is down. However, when selecting the IB device, tensorpipe by default picks the first device returned by ibv_get_device_list (the code location is here), and unfortunately on our nodes that device is "mlx5_1". I confirmed this by debugging tensorpipe and tracing it with gdb.

So I made a slight modification here, so that it traverses the device list and only selects devices whose port state is active (roughly as sketched below), and I was very happy to find that tensorpipe then works perfectly.
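
The idea, roughly, is the following. This is only a sketch against raw libibverbs with a made-up helper name; the actual change goes through tensorpipe's IbvLib wrapper and its error-checking helpers:

// Sketch: pick the first IB device whose port 1 is ACTIVE instead of
// blindly taking deviceList[0].
#include <infiniband/verbs.h>

ibv_device* pickActiveDevice() {
  int numDevices = 0;
  ibv_device** deviceList = ibv_get_device_list(&numDevices);
  if (deviceList == nullptr) {
    return nullptr;
  }
  ibv_device* chosen = nullptr;
  for (int i = 0; i < numDevices; ++i) {
    ibv_context* ctx = ibv_open_device(deviceList[i]);
    if (ctx == nullptr) {
      continue;
    }
    ibv_port_attr portAttr;
    // Port numbers are 1-based; this sketch only checks port 1.
    if (ibv_query_port(ctx, 1, &portAttr) == 0 &&
        portAttr.state == IBV_PORT_ACTIVE) {
      chosen = deviceList[i];
    }
    ibv_close_device(ctx);
    if (chosen != nullptr) {
      break;
    }
  }
  // The device pointers stay valid only while deviceList is alive, so the
  // caller must call ibv_free_device_list only after it is done with chosen.
  return chosen;
}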

However, this strategy may lead to inconsistent device selection on the two sides of the communication. For example, A may have two IB cards that both work well, while B has one good IB card and one bad one, which may still cause errors (if the two IB cards have different network configurations). In addition, the order of devices returned by ibv_get_device_list may be indeterminate. Therefore, if the two communicating parties could negotiate each other's IB devices and ports in advance, over sockets or some other method, it should make for a more robust implementation; a rough sketch of what I mean follows.
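
Purely to illustrate the negotiation idea (the IbEndpointInfo struct, the negotiate function and the sort-by-name rule below are hypothetical, not tensorpipe APIs): each side would advertise only its devices with an ACTIVE port, exchange those lists out of band, and both sides would apply the same deterministic rule so they agree on a pair:

#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// What each side would advertise for every ACTIVE port it owns.
struct IbEndpointInfo {
  std::string deviceName; // e.g. "mlx5_0"
  uint8_t portNum;        // e.g. 1
  uint16_t lid;           // 0 on RoCE
};

// Both peers run this on the two exchanged lists and get the same answer.
std::optional<std::pair<IbEndpointInfo, IbEndpointInfo>> negotiate(
    std::vector<IbEndpointInfo> local, std::vector<IbEndpointInfo> remote) {
  auto byName = [](const IbEndpointInfo& a, const IbEndpointInfo& b) {
    return a.deviceName < b.deviceName;
  };
  // Sort by device name so the result does not depend on the (possibly
  // indeterminate) order in which ibv_get_device_list returns devices.
  std::sort(local.begin(), local.end(), byName);
  std::sort(remote.begin(), remote.end(), byName);
  if (local.empty() || remote.empty()) {
    return std::nullopt; // no usable IB device on one side: fall back to uv
  }
  // Simplest possible rule: pair the first active device on each side. A
  // smarter rule could also check that the two ports are on the same fabric.
  return std::make_pair(local.front(), remote.front());
}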

tensorpipe is the best RDMA-enabled RPC library I've seen, and our projects use it heavily. I very much hope I can contribute to it. I have opened a PR and am happy to make further changes based on review comments. Thanks :)

@SolenoidWGT SolenoidWGT linked a pull request Sep 28, 2022 that will close this issue
@lw
Contributor

lw commented Oct 7, 2022

Love to hear that you're finding TensorPipe useful (and would be interested in knowing more about what you're doing with it).

And you're right that the IB device selection of the ibv transport is very primitive, unlike that of the gdr channel (which tries to pair GPUs with their nearest NIC). We didn't deem it important because the ibv transport is only used for small control messages thus its performance isn't super critical, but we didn't expect correctness issues like the one you faced. Some features that could be implemented to improve this are status detection, a device selection flag, potentially opening multiple devices and multiplexing over them, incorporating NUMA awareness, etc.

I'll review your PR and gladly land it. Unfortunately we've stopped actively investing in improving TensorPipe, hence we won't be able to do any of the above changes ourselves. But if you're willing to contribute more fixes I'll happily accept them.

if the two IB cards have different network configurations

One assumption we made, which I believe will be very hard to relax, is that all the IB cards on all the nodes can talk to each other. This is basically always true for the TCP/IP network, and thus we thought we could impose it also for the IB network. If we wanted to support cases where it isn't so, then we'd need a way to detect if two HCAs are in the same network, and AFAIK this isn't easily doable (except if one has UMAD access, but it tends to require root privileges). Do you have any ideas?

@SolenoidWGT
Copy link
Author

SolenoidWGT commented Oct 10, 2022

Hi Luca @lw, great to hear from you.

and would be interested in knowing more about what you're doing with it

The main scenario where we use tensorpipe is reinforcement learning (RL) training, which mainly involves two kinds of processes: the collector process (which may contain multiple subprocesses) is responsible for interacting with the RL environment (on the CPU) and performing model inference (on the GPU); the learner is responsible for model training (on the GPU).

The data flow between the collector and the learner is:

  1. The learner needs to transfer the model located in the GPU memory to the collector's GPU memory.
  2. The collector needs to transfer the observation data of its own environment (located in the host memory) to the GPU memory of the learner.

In most cases, there is one learner and multiple collectors. The number of collectors may dynamically increase or decrease based on demand.

In our k8s environment, these workers are encapsulated in pods and scheduled onto physical nodes. Pods use containers to isolate processes, so even if two pods are scheduled onto the same physical node, communication methods such as shm cannot be used without additional configuration (such as mounting /dev/shm into the container).

So we hope that the ideal communication library has the following capabilities:

  1. Support a variety of different P2P communication methods (GDR, IB, TCP, shm, cuda-ipc), and choose the best method according to different scenarios.
  2. Support RPC agents dynamically joining and leaving.
  3. Container awareness.

I tried to use NCCL's send/recv API to achieve the above goals, but found to my dismay that NCCL is not very suitable for our scenario:

  1. PyTorch's NCCL API currently does not support specifying different CUDA streams for NCCL send/recv, and it is easy to deadlock if I don't carefully match the calling order of send and recv.
  2. NCCL send/recv involves CUDA kernels; data located in host memory is not supported, so an additional copy into CUDA memory is required.
  3. When NCCL establishes a communication group, it needs to build a topology graph over all the processes, so having a new process join or quit is troublesome: we have to stop the work of all processes and then create a new communication group.

Finally I discovered tensorpipe, and I was very excited to find it worked well for most of our needs. Hence the above story.

ibv transport is only used for small control messages thus its performance isn't super critical

I would like to ask for your advice. I read issue #405 and learned that a transport transmits control information while a channel transmits heavy payloads, and that a channel uses one of the transports (uv, ibv, shm) for the actual communication. But in PyTorch's RPC backend code I only found a channel implemented with uv (see here). I'm curious, since IB has better performance than TCP in both bandwidth and latency, why there is no channel implemented by ibv?

I'll review your PR and gladly land it. Unfortunately we've stopped actively investing in improving TensorPipe, hence we won't be able to do any of the above changes ourselves. But if you're willing to contribute more fixes I'll happily accept them.

I'm sorry that PyTorch no longer continues to invest in tensorpipe. The heterogeneous data (CPU data/GPU data) and the dynamic set of communicating entities in RL training are difficult to implement with existing APIs, and tensorpipe solves this problem perfectly. I'd love to continue contributing to tensorpipe and hope this awesome project will continue to be active.

then we'd need a way to detect if two HCAs are in the same network, and AFAIK this isn't easily doable (except if one has UMAD access, but it tends to require root privileges). Do you have any ideas?

Sorry, I don't have any more ideas right now; I think I need to read more documentation. But the good news is that tensorpipe can now run perfectly on our cluster. I'll update here if I come up with anything.

@lw
Contributor

lw commented Oct 10, 2022

why there is no channel implemented by ibv?

  • There is one channel, called cuda_gdr, which also uses IB, but is independent from the ibv transport. Currently cuda_gdr only supports CUDA memory, but we designed it so that it can easily start handling CPU memory too (we just need to implement it).
  • There is the basic channel, which is simply an adapter that allows any transport to be used as a channel. It automatically selects the best transport, therefore you won't find an explicit instantiation of basic with ibv anywhere in the codebase. We consider this our "fallback" channel because it's a channel that will always work, although it isn't particularly specialized and might perform worse than ad-hoc channels.
  • For TCP based connections, the basic channel isn't able to achieve a very high bandwidth, because the data movement (from RAM to NIC) is operated in software by the kernel on the CPU, and a single thread isn't fast enough to drive the NICs at their full speed. That's why we added a "multiplexing" channel that bundles multiple uv transports, and multiple threads, in order to load-balance the bandwidth across them. In IB the copy is handled by the NIC directly, thus we don't have this limitation, and this is why we don't need a multiplexing version of ibv.
