This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Add a Herring-like benchmark #410

Open
wants to merge 1 commit into
base: main

@lw lw commented Sep 22, 2021

Summary:
This diff adds two benchmarks that perform an allreduce operation, internally using a technique inspired by Amazon's Herring paper. The two versions, one for TCP and one for InfiniBand interconnects, operate similarly. They spawn a set of clients (one per device, with multiple devices per machine and multiple machines). Each allreduce call first performs an NCCL reduce or reduce_scatter step within each machine, then a shared send to a set of servers (one per machine or one per device). The servers perform the aggregation and send the result back to the clients, which then do a final NCCL broadcast or all_gather step. Multiple allreduce calls, one per "bucket", are launched in parallel during each "epoch". There's also a baseline script that performs these allreduces using NCCL alone.
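For readers who want to follow the data flow, here is a minimal pure-Python sketch of the steps described above. It is not the benchmark code: plain lists stand in for device tensors, no NCCL or TensorPipe calls are made, and all function names (`split_shards`, `herring_like_allreduce`, `run_epoch`) are hypothetical. It assumes the reduce_scatter / all_gather variant with one server per device, where server `d` aggregates shard `d` from every machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_shards(vec, n):
    """Split vec into n contiguous shards whose sizes differ by at most one."""
    base, extra = divmod(len(vec), n)
    shards, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        shards.append(vec[start:end])
        start = end
    return shards


def add(a, b):
    """Element-wise sum of two equally sized lists."""
    return [x + y for x, y in zip(a, b)]


def herring_like_allreduce(grads):
    """Simulate one allreduce call; grads[m][d] is the vector on device d of machine m."""
    num_devices = len(grads[0])

    # Step 1 (stand-in for NCCL reduce_scatter): sum within each machine,
    # then give device d shard d of that local sum.
    local = []
    for machine in grads:
        total = machine[0]
        for vec in machine[1:]:
            total = add(total, vec)
        local.append(split_shards(total, num_devices))

    # Steps 2 and 3: device d of every machine sends its shard to server d,
    # which aggregates across machines (assumed layout: one server per device).
    servers = []
    for d in range(num_devices):
        agg = local[0][d]
        for m in range(1, len(local)):
            agg = add(agg, local[m][d])
        servers.append(agg)

    # Step 4: servers send shards back. Step 5 (stand-in for NCCL all_gather):
    # every device reassembles the full reduced vector.
    full = [x for shard in servers for x in shard]
    return [[list(full) for _ in range(num_devices)] for _ in grads]


def run_epoch(buckets):
    """Launch one allreduce per bucket in parallel, as in one benchmark epoch."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(herring_like_allreduce, buckets))


# Two machines with two devices each, one bucket of four elements.
grads = [
    [[1, 2, 3, 4], [10, 20, 30, 40]],
    [[100, 200, 300, 400], [1000, 2000, 3000, 4000]],
]
result = herring_like_allreduce(grads)
print(result[0][0])  # [1111, 2222, 3333, 4444]
```

Every device ends up with the same vector, which equals the element-wise sum over all devices on all machines. This is also what the NCCL-only baseline would produce.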

Since the benchmarks depend on PyTorch, they are a bit "awkward" to build and install from within the TensorPipe repo. So, instead of integrating them into our CMake system, I opted to build them as a separate PyTorch extension with its own setup.py file. The installation steps are thus: first install TensorPipe on its own, then NCCL, then PyTorch, and finally the benchmarks.

Differential Revision: D30220559

fbshipit-source-id: 43ab5bc440271942698313e6b06b39104a0e44de
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D30220559
