Add a Herring-like benchmark #410

lw · 2021-09-22T11:58:05Z

Summary:
This diff adds two benchmarks that perform an allreduce operation, internally using a technique inspired by Amazon's Herring paper. The two versions, one for TCP and one for InfiniBand interconnects, operate similarly. They spawn a set of clients (one per device, with multiple devices per machine, across multiple machines). Each allreduce call first performs an NCCL reduce or reduce_scatter step within each machine, followed by a shared send to a set of servers (one per machine, or one per device). The servers then perform the aggregation and send the result back to the clients, which do a final NCCL broadcast or all_gather step. Multiple allreduce calls, one per "bucket", are launched in parallel during each "epoch". There is also a baseline script that performs these allreduces using NCCL alone.
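
To make the data flow described above easier to follow, here is a minimal pure-Python simulation of the one-server-per-device variant (reduce_scatter, send to servers, aggregate, send back, all_gather). It is only a sketch and not the code in this diff: plain lists stand in for GPU buffers, ordinary function calls stand in for NCCL collectives and TensorPipe transfers, and every function name below is hypothetical.

```python
from typing import List

Buffer = List[float]


def reduce_scatter(device_bufs: List[Buffer]) -> List[Buffer]:
    """Stand-in for the intra-machine NCCL reduce_scatter step.

    Sums the buffers of all devices on one machine elementwise and leaves
    shard d of the sum on device d.
    """
    num_devices = len(device_bufs)
    length = len(device_bufs[0])
    assert length % num_devices == 0, "buffer must split evenly across devices"
    shard_len = length // num_devices
    summed = [sum(vals) for vals in zip(*device_bufs)]
    return [summed[d * shard_len:(d + 1) * shard_len] for d in range(num_devices)]


def server_aggregate(shards: List[Buffer]) -> Buffer:
    """Stand-in for a server summing the shards it received from every machine."""
    return [sum(vals) for vals in zip(*shards)]


def all_gather(shards: List[Buffer]) -> List[Buffer]:
    """Stand-in for the final intra-machine NCCL all_gather step."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def simulated_allreduce(cluster: List[List[Buffer]]) -> List[List[Buffer]]:
    """cluster[m][d] is the buffer held by device d on machine m."""
    num_machines = len(cluster)
    num_devices = len(cluster[0])
    # Step 1: each machine reduces locally, one shard per device.
    local = [reduce_scatter(machine) for machine in cluster]
    # Step 2: device d of every machine sends its shard to server d,
    # which aggregates across machines.
    aggregated = [
        server_aggregate([local[m][d] for m in range(num_machines)])
        for d in range(num_devices)
    ]
    # Step 3: servers send the results back, and each machine all_gathers them.
    return [all_gather(aggregated) for _ in range(num_machines)]


if __name__ == "__main__":
    # Two machines with two devices each, one 4-element bucket per device.
    cluster = [
        [[1, 2, 3, 4], [10, 20, 30, 40]],
        [[100, 200, 300, 400], [1000, 2000, 3000, 4000]],
    ]
    for machine in simulated_allreduce(cluster):
        for buf in machine:
            print(buf)
```

In this simulation every device ends up holding the elementwise sum of all input buffers, which is the allreduce contract. The one-server-per-machine variant would replace the reduce_scatter and all_gather stand-ins with reduce and broadcast.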

Since the benchmarks depend on PyTorch, they are somewhat awkward to build and install from within the TensorPipe repo. Instead of integrating them into our CMake system, I opted to build them as a separate PyTorch extension with its own setup.py file. The installation steps are therefore: first install TensorPipe on its own, then NCCL, then PyTorch, and finally the benchmarks.

Differential Revision: D30220559

fbshipit-source-id: 43ab5bc440271942698313e6b06b39104a0e44de

facebook-github-bot · 2021-09-22T11:58:25Z

This pull request was exported from Phabricator. Differential Revision: D30220559

facebook-github-bot added the "cla signed" and "fb-exported" labels · Sep 22, 2021
