I am training my RL model on SLURM with multiple nodes. torch.distributed.rpc.init_rpc fails when the number of RPC agents exceeds some threshold. I then switched to running this test script with different ways of obtaining MASTER_ADDR and MASTER_PORT, but it still failed.
The error seems to originate from tensorpipe, so I am posting it here. Apologies if this should go to the PyTorch repo instead.
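Test script (a minimal sketch of what I am running; the rank bookkeeping, environment variables such as WORKERS_PER_NODE, and the default values below are illustrative assumptions rather than the exact code):

import os

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def run_worker(local_rank, node_rank, workers_per_node, world_size):
    # Global rank of this worker across all nodes.
    rank = node_rank * workers_per_node + local_rank
    rpc.init_rpc(
        name=f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
            # MASTER_ADDR / MASTER_PORT are exported by the sbatch script
            # (e.g. derived from `scontrol show hostnames $SLURM_JOB_NODELIST`).
            init_method="env://",
        ),
    )
    rpc.shutdown()


if __name__ == "__main__":
    # Hypothetical environment variables; the real script may derive these
    # from SLURM_NODEID / SLURM_NNODES or take them as command-line arguments.
    node_rank = int(os.environ.get("SLURM_NODEID", "0"))
    workers_per_node = int(os.environ.get("WORKERS_PER_NODE", "13"))
    world_size = int(os.environ.get("WORLD_SIZE", "39"))
    mp.spawn(
        run_worker,
        args=(node_rank, workers_per_node, world_size),
        nprocs=workers_per_node,
    )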
On SLURM with 3 nodes, each with 5 sockets, 1 core per socket, and 1 thread per core (15 cores in total), this script fails with the following error when the world size is larger than 37 or 38 (not deterministic). I have grouped similar messages for readability:
[W tensorpipe_agent.cpp:863] RPC agent for 5 encountered error when sending outgoing request #0 to 0: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
... (similar messages to the above, all sending to 0)
[W tensorpipe_agent.cpp:492] RPC agent for 0 encountered error when accepting incoming pipe: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
[W tensorpipe_agent.cpp:682] RPC agent for 0 encountered error when reading incoming request from 10: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
... (similar messages to the above, all 0 reading incoming request)
[W tensorpipe_agent.cpp:863] RPC agent for 7 encountered error when sending outgoing request #0 to 0: async error on socket: Connection reset by peer (this error originated at tensorpipe/transport/shm/connection_impl.cc:187)
... (similar messages to the above, all sending to 0)
Traceback (most recent call last):
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/hpc/home/zg93/test/test_rpc.py", line 21, in run_worker
    rpc.init_rpc(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 190, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 224, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 97, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 305, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 204, in _all_gather
    rpc_sync(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 767, in rpc_sync
    return fut.wait()
RuntimeError: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
Thank you for reading and helping me with the issue!