This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Does benchmark_pipe support ibv transport and cuda channel? #452

Open
baoleai opened this issue Jul 21, 2022 · 9 comments

Comments

@baoleai

baoleai commented Jul 21, 2022

It seems that benchmark_pipe only supports the [shm|uv] transports and the [basic] channel. Does benchmark_pipe also support the ibv transport and the cuda channels? Is there a complete example of the different transport and channel combinations, covering both CPU-CPU and GPU-GPU communication?

@lw
Contributor

lw commented Jul 21, 2022

It should support all combinations of everything, don't pay attention to the help string. :)
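
As a rough illustration of driving both sides of the benchmark, the flags that appear later in this thread can be assembled programmatically. The helper name `benchmark_pipe_cmd` and the `ibv://HOST:PORT` placeholder are made up for this sketch; the flag names are copied from the commands posted below, and which transport and channel values a given build accepts depends on how TensorPipe was compiled.

```python
import shlex


def benchmark_pipe_cmd(mode, transport, channel, address, tensor_type="cpu",
                       num_round_trips=1, num_payloads=100, payload_size=100,
                       num_tensors=100, tensor_size=100, metadata_size=100,
                       binary="./benchmark_pipe"):
    """Return the argv list for one side of a benchmark_pipe run.

    Hypothetical helper: only the flag names come from this thread.
    """
    if mode not in ("listen", "connect"):
        raise ValueError(f"mode must be 'listen' or 'connect', got {mode!r}")
    return [
        binary,
        f"--mode={mode}",
        f"--transport={transport}",
        f"--channel={channel}",
        f"--address={address}",
        f"--num-round-trips={num_round_trips}",
        f"--tensor-size={tensor_size}",
        f"--num-tensors={num_tensors}",
        f"--num-payloads={num_payloads}",
        f"--payload-size={payload_size}",
        f"--metadata-size={metadata_size}",
        f"--tensor-type={tensor_type}",
    ]


# Placeholder address: substitute a real host and port before running.
ADDRESS = "ibv://HOST:PORT"

# Print the server command first, then the client command.
for mode in ("listen", "connect"):
    argv = benchmark_pipe_cmd(mode, "ibv", "basic", ADDRESS)
    print("TP_VERBOSE_LOGGING=9", shlex.join(argv))
```

The listener has to be started before the connecting side, and both sides should be given the same transport, channel and sizes.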

@baoleai
Author

baoleai commented Jul 21, 2022

When I use transport=ibv, it fails with this error:

  what():  In getTransport at tensorpipe/core/context_impl.cc:147 "unsupported transport ibv"
Aborted

@lw
Contributor

lw commented Jul 21, 2022

Perhaps you didn't build TensorPipe with InfiniBand support, or the InfiniBand support didn't detect the right hardware/software requirements on your machine and decided to turn itself off. You can check the latter by launching with TP_VERBOSE_LOGGING=9.

@baoleai
Author

baoleai commented Jul 21, 2022

After adding RDMA support and setting TP_VERBOSE_LOGGING=9, the error is now:

mode = connect
transport = ibv
channel = basic
address = ibv://xx.xx.xx.xx
num_round_trips = 1
num_payloads = 0
payload_size = 0
num_tensors = 0
tensor_size = 0
tensor_type = cpu
metadata_size = 0
V0722 00:51:41.536990   136 tensorpipe/core/context_impl.cc:53] Context 136:c0 created
V0722 00:51:41.537856   136 tensorpipe/common/ibv_lib.h:650] Found shared library libibverbs.so.1 at /usr/lib/x86_64-linux-gnu/libibverbs.so.1.8.28.0
terminate called after throwing an instance of 'std::system_error'
  what():  In operator() at tensorpipe/common/ibv.h:109 "": No such file or directory
Aborted (core dumped)

I think this may be because I am running in a Docker container (the host supports RDMA) and have not configured the RDMA NIC driver correctly inside it.
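
One way to test that theory from inside the container is to check whether any RDMA device nodes are visible at all. The sketch below assumes the usual Linux locations, /dev/infiniband for the uverbs character devices and /sys/class/infiniband for the device list. The function name is hypothetical, and it does not call into TensorPipe or libibverbs.

```python
from pathlib import Path


def rdma_visibility(dev_root="/dev/infiniband", sys_root="/sys/class/infiniband"):
    """Report which RDMA device nodes and sysfs devices this process can see.

    Hypothetical helper; both default paths are assumptions about a typical
    Linux RDMA setup.
    """
    dev = Path(dev_root)
    sysfs = Path(sys_root)
    # uverbsN nodes are what user-space verbs libraries open.
    uverbs = sorted(p.name for p in dev.glob("uverbs*")) if dev.is_dir() else []
    # One sysfs directory per RDMA device (for example mlx5_0).
    devices = sorted(p.name for p in sysfs.iterdir()) if sysfs.is_dir() else []
    return {"uverbs_nodes": uverbs, "devices": devices}


if __name__ == "__main__":
    info = rdma_visibility()
    print("uverbs nodes:", info["uverbs_nodes"] or "none found")
    print("sysfs devices:", info["devices"] or "none found")
```

If both lists are empty inside the container but populated on the host, the devices are probably not being passed through to the container (for example via Docker's --device option).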

@baoleai
Author

baoleai commented Jul 22, 2022

Hi @lw,
After adding RDMA support, I get the following error on the server side:

TP_VERBOSE_LOGGING=9 ./benchmark_pipe --mode=listen --transport=ibv --channel=basic --address=ibv://xx.xx.xx.xx:xx --num-round-trips=1 --tensor-size=100 --num-tensors=100 --num-payloads=100 --payload-size=100 --metadata-size=100 --tensor-type=cpu


V0722 15:40:06.427386 262423 tensorpipe/transport/listener_impl_boilerplate.h:164] Listener 262423:c0[l0].tr_ibv received an accept request (#1)
V0722 15:40:06.427431 262423 tensorpipe/transport/ibv/connection_impl.cc:218] Connection 262423:c0[l0].tr_ibv.c0 is handling an event on its socket (OUT)
V0722 15:40:06.427651 262423 tensorpipe/transport/ibv/connection_impl.cc:218] Connection 262423:c0[l0].tr_ibv.c0 is handling an event on its socket (IN)
V0722 15:40:21.811300 262423 tensorpipe/transport/ibv/connection_impl.cc:218] Connection 262423:c0[l0].tr_ibv.c0 is handling an event on its socket (IN)
V0722 15:40:21.811322 262423 tensorpipe/transport/connection_impl_boilerplate.h:453] Connection 262423:c0[l0].tr_ibv.c0 is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0722 15:40:21.811348 262423 tensorpipe/transport/connection_impl_boilerplate.h:223] Connection 262423:c0[l0].tr_ibv.c0 is calling a nop object read callback (#0)
V0722 15:40:21.811361 262423 tensorpipe/core/listener_impl.cc:207] Listener 262423:c0[l0] is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0722 15:40:21.811374 262423 tensorpipe/core/listener_impl.cc:108] Listener 262423:c0[l0] is calling an accept callback (#0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  In operator() at tensorpipe/benchmark/benchmark_pipe.cc:396 "erroreof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)"
Aborted

On the client side:

TP_VERBOSE_LOGGING=9 ./benchmark_pipe --mode=connect --transport=ibv --channel=basic --address=ibv://xx.xx.xx.xx:xx --num-round-trips=1 --tensor-size=100 --num-tensors=100 --num-payloads=100 --payload-size=100 --metadata-size=100 --tensor-type=cpu

V0722 15:40:06.428367 148941 tensorpipe/transport/ibv/reactor.cc:189] Transport context N/A posting RDMA write for QP 4627
V0722 15:40:21.806042 148941 tensorpipe/transport/ibv/reactor.cc:95] Transport context N/A got work completion for request 1 for QP 4627 with status transport retry counter exceeded and opcode RDMA_WRITE (byte length: 0, immediate data: 0)
V0722 15:40:21.806073 148941 tensorpipe/transport/connection_impl_boilerplate.h:453] Connection 148941:c0.p0.d.tr_ibv is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:479)
V0722 15:40:21.806087 148941 tensorpipe/transport/connection_impl_boilerplate.h:223] Connection 148941:c0.p0.d.tr_ibv is calling a nop object read callback (#0)
V0722 15:40:21.806097 148941 tensorpipe/core/pipe_impl.cc:634] Pipe 148941:c0.p0 is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:479)
V0722 15:40:21.806114 148941 tensorpipe/core/pipe_impl.cc:557] Pipe 148941:c0.p0 is calling a write callback (#0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  In operator() at tensorpipe/benchmark/benchmark_pipe.cc:507 "errortransport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:479)"
Aborted
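
With TP_VERBOSE_LOGGING=9 the interesting line is easy to lose among the others. A small sketch of pulling the first reported error and its origin out of a log in the format shown above; the regular expressions are inferred from these pasted lines only, and `first_error` is a hypothetical helper rather than anything TensorPipe ships.

```python
import re

# Matches lines such as:
# V0722 15:40:21.806073 148941 path/to/file.h:453] message
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<pid>\d+) "
    r"(?P<file>[^\]\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)
# Matches the "is handling error X (this error originated at Y)" part.
ERROR = re.compile(
    r"is handling error (?P<error>.+?) "
    r"\(this error originated at (?P<origin>[^)]+)\)"
)


def first_error(lines):
    """Return details of the first 'is handling error' line, or None."""
    for raw in lines:
        line_match = LOG_LINE.match(raw.strip())
        if not line_match:
            continue  # not a verbose-log line (e.g. "Aborted")
        error_match = ERROR.search(line_match["message"])
        if error_match:
            return {
                "time": line_match["time"],
                "reported_at": f'{line_match["file"]}:{line_match["line"]}',
                "error": error_match["error"],
                "origin": error_match["origin"],
            }
    return None


if __name__ == "__main__":
    import sys
    print(first_error(sys.stdin))
```

Comparing the timestamps of the first error on each side shows which peer failed first; in the logs above the client reports its error a few milliseconds before the server sees eof.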

@lw
Contributor

lw commented Jul 22, 2022

To me this looks like your network isn't set up correctly. I can't help you with that. You should check this yourself or with your administrator; there are standard diagnostic tools that can help you.

@baoleai
Author

baoleai commented Jul 22, 2022

Thanks. My RDMA (RoCE) network is actually connected; I tested the connectivity with ib_send_bw.

@baoleai
Author

baoleai commented Jul 24, 2022

Hi @lw, I solved this ibv transport error by changing the default kGlobalIdentifierIndex to 3, since GID index 3 is the one available in my environment. It would be great if TensorPipe could automatically detect an available GID or let users set it.
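
For anyone hitting the same error, the populated GID indices can be read from sysfs before touching kGlobalIdentifierIndex. This is a minimal sketch, assuming the standard Linux layout /sys/class/infiniband/<device>/ports/<port>/gids and gid_attrs/types; `list_gids` is a hypothetical helper, not part of TensorPipe.

```python
from pathlib import Path

# sysfs prints unpopulated GID table entries as all zeros.
ZERO_GID = ":".join(["0000"] * 8)


def list_gids(device, port=1, sys_root="/sys/class/infiniband"):
    """Return [(index, gid, gid_type)] for the populated GID entries of a port."""
    port_dir = Path(sys_root) / device / "ports" / str(port)
    gid_dir = port_dir / "gids"
    if not gid_dir.is_dir():
        return []
    entries = []
    gid_files = [p for p in gid_dir.iterdir() if p.name.isdigit()]
    for gid_file in sorted(gid_files, key=lambda p: int(p.name)):
        try:
            gid = gid_file.read_text().strip()
        except OSError:
            continue  # entry not readable
        if not gid or gid == ZERO_GID:
            continue  # empty slot in the GID table
        type_file = port_dir / "gid_attrs" / "types" / gid_file.name
        try:
            gid_type = type_file.read_text().strip()  # e.g. "RoCE v2"
        except OSError:
            gid_type = "unknown"
        entries.append((int(gid_file.name), gid, gid_type))
    return entries


if __name__ == "__main__":
    root = Path("/sys/class/infiniband")
    for dev in (sorted(root.iterdir()) if root.is_dir() else []):
        for index, gid, gid_type in list_gids(dev.name):
            print(f"{dev.name} port 1 index {index}: {gid} ({gid_type})")
```

The index printed next to the entry that matches the interface used for RDMA traffic is the value to try for kGlobalIdentifierIndex.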

@lw
Contributor

lw commented Jul 25, 2022

Glad to see you figured out the issue! I'm not familiar enough with ibverbs to know how to auto-determine the GID. If you know how and would like to submit a patch, that'd be great!
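
One possible selection policy, sketched in Python rather than as a TensorPipe patch: honour an explicit user override first, otherwise prefer a RoCE v2 entry with an IPv4-mapped GID, and fall back to a default index when the table is empty. The environment variable name `TP_IBV_GID_INDEX` is invented for this sketch and is not something TensorPipe reads; the preference order is a heuristic, not a rule from the ibverbs documentation, and the example table is illustrative rather than taken from this thread.

```python
import os


def choose_gid_index(entries, env=None, env_var="TP_IBV_GID_INDEX", default=0):
    """Pick a GID index from [(index, gid, gid_type)] entries.

    Hypothetical policy: env override > RoCE v2 + IPv4-mapped > RoCE v2 >
    any populated entry > default.
    """
    env = os.environ if env is None else env
    override = env.get(env_var)
    if override is not None:
        return int(override)  # raises ValueError on a non-numeric override

    def is_ipv4_mapped(gid):
        # IPv4-mapped GIDs look like 0000:...:0000:ffff:c0a8:0101 in sysfs.
        return gid.lower().startswith("0000:0000:0000:0000:0000:ffff:")

    def rank(entry):
        index, gid, gid_type = entry
        # False sorts before True, so preferred properties come first.
        return (gid_type != "RoCE v2", not is_ipv4_mapped(gid), index)

    return min(entries, key=rank)[0] if entries else default


# Illustrative table in the (index, gid, type) shape used above.
example = [
    (0, "fe80:0000:0000:0000:0202:c9ff:fe12:3456", "IB/RoCE v1"),
    (1, "fe80:0000:0000:0000:0202:c9ff:fe12:3456", "RoCE v2"),
    (2, "0000:0000:0000:0000:0000:ffff:c0a8:0101", "IB/RoCE v1"),
    (3, "0000:0000:0000:0000:0000:ffff:c0a8:0101", "RoCE v2"),
]
print(choose_gid_index(example, env={}))
```

A real patch would need to query the table through libibverbs inside TensorPipe itself and decide how both peers agree on compatible GID types.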
