This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Resource temporarily unavailable when initializing RPC in multi-node training #441

Open
gongziyida opened this issue Apr 26, 2022 · 1 comment

Comments

@gongziyida

I am training my RL model under SLURM with multiple nodes. torch.distributed.rpc.init_rpc fails when the number of RPC agents exceeds some threshold. I then switched to running the test script below with several different ways of setting MASTER_ADDR and MASTER_PORT, but it still failed.

The error seems to originate from tensorpipe, so I am posting it here. Apologies if this belongs in the PyTorch repo instead.

Test Script:

import os
import socket  # only used by the commented-out MASTER_ADDR variants

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    ## Different ways of resolving MASTER_ADDR -- all fail the same way
    # os.environ['MASTER_ADDR'] = socket.gethostname()
    # os.environ['MASTER_ADDR'] = socket.gethostbyname(socket.gethostname())
    # os.environ['MASTER_ADDR'] = os.environ['SLURM_SUBMIT_HOST']
    os.environ['MASTER_ADDR'] = os.environ['SLURMD_NODENAME']
    os.environ['MASTER_PORT'] = '49153'
    print('master addr %d/%d' % (rank, world_size), os.environ['MASTER_ADDR'])
    # options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)

    rpc.init_rpc(
        str(rank),  # worker name
        rank=rank,
        world_size=world_size,
        # rpc_backend_options=options
    )

    print('rank:', rank)

    rpc.shutdown()

if __name__ == '__main__':
    world_size = 40
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)

On Slurm with 3 nodes, each with 5 sockets, 1 core per socket, and 1 thread per core (15 cores in total), this script fails with the following error once the world size exceeds roughly 37 or 38 (the exact threshold is not deterministic). I have grouped similar messages for readability:

[W tensorpipe_agent.cpp:863] RPC agent for 5 encountered error when sending outgoing request #0 to 0: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
... (similar messages to the above, all sending to 0)
[W tensorpipe_agent.cpp:492] RPC agent for 0 encountered error when accepting incoming pipe: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
[W tensorpipe_agent.cpp:682] RPC agent for 0 encountered error when reading incoming request from 10: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
... (similar messages to the above, all 0 reading incoming request)
[W tensorpipe_agent.cpp:863] RPC agent for 7 encountered error when sending outgoing request #0 to 0: async error on socket: Connection reset by peer (this error originated at tensorpipe/transport/shm/connection_impl.cc:187)
... (similar messages to the above, all sending to 0)
Traceback (most recent call last):
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/hpc/home/zg93/test/test_rpc.py", line 21, in run_worker
    rpc.init_rpc(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 190, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 224, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 97, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 305, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 204, in _all_gather
    rpc_sync(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 767, in rpc_sync
    return fut.wait()
RuntimeError: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
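For context on the error string: "Resource temporarily unavailable" is strerror(EAGAIN), and on Linux a non-blocking connect() to a Unix-domain socket whose listen backlog is full fails with exactly that errno. That would fit the pattern above, where rank 0 receives ~39 near-simultaneous shm connections. A minimal stdlib-only reproduction, independent of PyTorch (the tiny backlog here is just to force the condition):

```python
import errno
import os
import socket
import tempfile

# Server side: Unix-domain socket with a deliberately tiny listen backlog.
path = os.path.join(tempfile.mkdtemp(), "demo.sock")
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)  # backlog of 1; nothing ever calls accept()

# Client side: issue non-blocking connects until one fails.
caught = None
clients = []
for _ in range(16):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)
    try:
        c.connect(path)
        clients.append(c)
    except OSError as e:
        caught = e.errno
        c.close()
        break

print(caught == errno.EAGAIN)     # True on Linux
print(os.strerror(errno.EAGAIN))  # the same "Resource temporarily unavailable"
```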

Thank you for reading and helping me with the issue!
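In case it helps anyone hitting this: since EAGAIN is transient by definition, one workaround I am experimenting with is wrapping the init call in a retry-with-backoff loop (and/or staggering startup by rank to spread the connection burst at rank 0). A sketch below, with a made-up stand-in function (`flaky_init`) instead of rpc.init_rpc so the retry logic is self-contained; `retry_on_eagain` is an illustrative name, not an existing API:

```python
import errno
import time

def retry_on_eagain(fn, attempts=5, base_delay=0.5):
    """Call fn(); on EAGAIN, back off exponentially and retry."""
    for i in range(attempts):
        try:
            return fn()
        except OSError as e:
            if e.errno != errno.EAGAIN or i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Stand-in for the init call: fails with EAGAIN twice, then succeeds.
calls = {"n": 0}
def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
    return "initialized"

result = retry_on_eagain(flaky_init, base_delay=0.01)
print(result, calls["n"])  # initialized 3
```

In the real script the stand-in would be a lambda wrapping rpc.init_rpc(...); note that, per the traceback above, init_rpc surfaces this condition as a RuntimeError rather than an OSError, so the except clause would need to match on the error message instead of the errno.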

@RigCor7

RigCor7 commented May 13, 2022

I also have the same issue
