This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Resource temporarily unavailable when initializing RPC in multi-node training #441

Open
gongziyida opened this issue Apr 26, 2022 · 1 comment

Comments

@gongziyida

I am training my RL model under SLURM with multiple nodes. torch.distributed.rpc.init_rpc fails when the number of RPC agents exceeds some threshold. I then switched to running the test script below with several different ways of setting MASTER_ADDR and MASTER_PORT, but it still failed.

The error seems to originate from tensorpipe, so I am posting it here. Apologies if this belongs in the PyTorch repo instead.

Test Script:

import os
import socket  # only used by the commented-out MASTER_ADDR variants

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    ## Different ways of resolving MASTER_ADDR -- all fail the same way
    # os.environ['MASTER_ADDR'] = socket.gethostname()
    # os.environ['MASTER_ADDR'] = socket.gethostbyname(socket.gethostname())
    # os.environ['MASTER_ADDR'] = os.environ['SLURM_SUBMIT_HOST']
    os.environ['MASTER_ADDR'] = os.environ['SLURMD_NODENAME']
    os.environ['MASTER_PORT'] = '49153'
    print('master addr %d/%d' % (rank, world_size), os.environ['MASTER_ADDR'])
    # options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)

    rpc.init_rpc(
        str(rank),  # worker name
        rank=rank,
        world_size=world_size,
        # rpc_backend_options=options
    )

    print('rank:', rank)

    rpc.shutdown()

if __name__ == '__main__':
    world_size = 40
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)

On Slurm with 3 nodes, each with 5 sockets, 1 core per socket, and 1 thread per core (15 cores in total), this script fails with the following error once the world size exceeds roughly 37 or 38 (the exact threshold is not deterministic). I have grouped similar messages for readability:

[W tensorpipe_agent.cpp:863] RPC agent for 5 encountered error when sending outgoing request #0 to 0: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
... (similar messages to the above, all sending to 0)
[W tensorpipe_agent.cpp:492] RPC agent for 0 encountered error when accepting incoming pipe: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
[W tensorpipe_agent.cpp:682] RPC agent for 0 encountered error when reading incoming request from 10: sendmsg: Broken pipe (this error originated at tensorpipe/common/socket.h:105)
... (similar messages to the above, all 0 reading incoming request)
[W tensorpipe_agent.cpp:863] RPC agent for 7 encountered error when sending outgoing request #0 to 0: async error on socket: Connection reset by peer (this error originated at tensorpipe/transport/shm/connection_impl.cc:187)
... (similar messages to the above, all sending to 0)
Traceback (most recent call last):
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/hpc/home/zg93/test/test_rpc.py", line 21, in run_worker
    rpc.init_rpc(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 190, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 224, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 97, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 305, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 204, in _all_gather
    rpc_sync(
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/hpc/home/zg93/virtual-rodent/dm_control/lib/python3.8/site-packages/torch/distributed/rpc/api.py", line 767, in rpc_sync
    return fut.wait()
RuntimeError: connect: Resource temporarily unavailable (this error originated at tensorpipe/common/socket.cc:114)
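For context on the error string: "Resource temporarily unavailable" is strerror(EAGAIN), and on Linux a non-blocking connect() to a Unix-domain socket whose listen backlog is full fails with exactly that errno. That would fit the pattern above, where rank 0 receives ~39 near-simultaneous shm connections. A minimal stdlib-only reproduction, independent of PyTorch (the tiny backlog here is just to force the condition):

```python
import errno
import os
import socket
import tempfile

# Server side: Unix-domain socket with a deliberately tiny listen backlog.
path = os.path.join(tempfile.mkdtemp(), "demo.sock")
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)  # backlog of 1; nothing ever calls accept()

# Client side: issue non-blocking connects until one fails.
caught = None
clients = []
for _ in range(16):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)
    try:
        c.connect(path)
        clients.append(c)
    except OSError as e:
        caught = e.errno
        c.close()
        break

print(caught == errno.EAGAIN)     # True on Linux
print(os.strerror(errno.EAGAIN))  # the same "Resource temporarily unavailable"
```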

Thank you for reading and helping me with the issue!
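In case it helps anyone hitting this: since EAGAIN is transient by definition, one workaround I am experimenting with is wrapping the init call in a retry-with-backoff loop (and/or staggering startup by rank to spread the connection burst at rank 0). A sketch below, with a made-up stand-in function (`flaky_init`) instead of rpc.init_rpc so the retry logic is self-contained; `retry_on_eagain` is an illustrative name, not an existing API:

```python
import errno
import time

def retry_on_eagain(fn, attempts=5, base_delay=0.5):
    """Call fn(); on EAGAIN, back off exponentially and retry."""
    for i in range(attempts):
        try:
            return fn()
        except OSError as e:
            if e.errno != errno.EAGAIN or i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Stand-in for the init call: fails with EAGAIN twice, then succeeds.
calls = {"n": 0}
def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
    return "initialized"

result = retry_on_eagain(flaky_init, base_delay=0.01)
print(result, calls["n"])  # initialized 3
```

In the real script the stand-in would be a lambda wrapping rpc.init_rpc(...); note that, per the traceback above, init_rpc surfaces this condition as a RuntimeError rather than an OSError, so the except clause would need to match on the error message instead of the errno.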

@RigCor7

RigCor7 commented May 13, 2022

I also have the same issue
