This repository has been archived by the owner on Jul 1, 2023. It is now read-only.
I'm trying to use torch RPC for distributed training in a parameter server architecture. With fewer than 20 workers everything works fine, but once I increase the number of workers to 20 or more, I get the following runtime error:
terminate called after throwing an instance of 'std::runtime_error' what(): In connectFromLoop at tensorpipe/transport/uv/uv.h:297 "rv < 0: too many open files"
followed by:
[W tensorpipe_agent.cpp:726] RPC agent for worker:2 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:8 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:3 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:9 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:18 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:13 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:0 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:4 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:5 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:16 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:1 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:6 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:7 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:17 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:10 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:14 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:12 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:11 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:15 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:19 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
I call the init_rpc with these arguments:
rpc.init_rpc('worker:{}'.format(rank-num_ps), rank=rank, world_size=world_size, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method='env://', _transports=["uv"],))
I'm using PyTorch 1.13 with CUDA toolkit 11.7, but I previously experienced a similar issue with PyTorch 1.8.1 and CUDA 10.2 as well.
Running
cat /proc/sys/fs/file-max
gives me 9223372036854775807, and by logging the number of open files I can confirm that this limit is never reached. I'm curious where the issue might be coming from and how it should be fixed.
Thank you!
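One thing that may be worth checking: /proc/sys/fs/file-max is the system-wide ceiling, whereas a "too many open files" error usually points at the per-process limit (RLIMIT_NOFILE, the value shown by ulimit -n), which is often far lower. The sketch below is not taken from the issue. It uses only Python's standard resource and os modules, and the helper names (nofile_limits, open_fd_count, raise_soft_limit) are hypothetical. It assumes a Unix-like system and would be called in each process before rpc.init_rpc. Whether raising the limit resolves this particular failure is an assumption, not something confirmed here.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process, or -1 if unknown."""
    # /proc/self/fd exists on Linux; /dev/fd is a common fallback elsewhere.
    for path in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(path):
            return len(os.listdir(path))
    return -1


def raise_soft_limit(target):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        return new_soft
    return soft


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("RLIMIT_NOFILE soft/hard:", soft, hard)
    print("descriptors open now:", open_fd_count())
    # 65536 is an arbitrary illustrative target, not a recommended value.
    print("soft limit after raise:", raise_soft_limit(65536))
```

Comparing open_fd_count() against the soft limit on the ps:0 process as workers join would show whether the parameter server is the one hitting the ceiling. If the hard limit itself is too low, it has to be raised outside the process (for example through the shell or the system's limits configuration).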