
Adding the XCCL DPU team, and DPU daemon #106

Open
wants to merge 2 commits into base: master
Conversation


@janjust janjust commented Jan 6, 2021

This PR adds the new DPU team as well as a contrib directory with the accompanying DPU daemon app.

This is a first but comprehensive attempt that successfully runs the PyTorch param-comms benchmark.
Tested on 32 BlueField-enabled nodes.

There are several configuration options to keep in mind when running.

new configure option:
--with-dpu=yes
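For illustration, a build with the new option might look like the following. The surrounding autogen/configure/make steps are assumptions based on a typical autotools workflow; only --with-dpu=yes is taken from this PR.

```shell
# Hypothetical build sequence; only --with-dpu=yes comes from the PR.
./autogen.sh
./configure --prefix=$HOME/xccl-install --with-dpu=yes
make -j install
```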

client/host side:
two new environment variables, plus an additional dpu value for TORCH_UCC_TLS:
-x TORCH_UCC_TLS=dpu
-x XCCL_TEAM_DPU_ENABLE=1
-x XCCL_TEAM_DPU_HOST_DPU_LIST=

The host_dpu_list file is a one-to-one host mapping file that the DPU team uses to identify the IP address of each host's DPU.
e.g.:
host1 dpu1
host2 dpu2
etc.
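For illustration, the two-column mapping format can be exercised from the shell. lookup_dpu below is a hypothetical helper for eyeballing the file, not part of XCCL (the DPU team parses the file itself):

```shell
# Create a sample host_dpu_list: one "host dpu" pair per line.
cat > host_dpu_list <<'EOF'
host1 dpu1
host2 dpu2
EOF

# Hypothetical helper: print the DPU paired with a given host.
lookup_dpu() { awk -v h="$1" '$1 == h { print $2 }' host_dpu_list; }

lookup_dpu host2   # prints: dpu2
```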

dpu side:
-x DPU_DATA_BUFFER_SIZE=$((16 * 1024 * 1024))
An environment variable that sets the buffer size available on the DPU.
If not provided, the default is 16MB.
./dpu_server <threads (int)>; by default it will use a single thread.

e.g.:
mpirun -np 4 --map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_0:1 -x XCCL_TEST_TLS=ucx --bind-to none --report-bindings --tag-output -hostfile file.dpus -x LD_LIBRARY_PATH  ./dpu_server 4
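As a quick sanity check on the default above, the arithmetic in the -x flag expands to 16 MiB in bytes:

```shell
# 16 * 1024 * 1024 bytes = 16 MiB, the documented default.
DPU_DATA_BUFFER_SIZE=$((16 * 1024 * 1024))
echo "$DPU_DATA_BUFFER_SIZE"   # prints 16777216
```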

Signed-off-by: Tomislavj Janjusic [email protected]

Co-authored-by: Artem Polyakov [email protected]
Sergey Lebedev [email protected]


janjust commented Jan 6, 2021

@manjugv @Sergei-Lebedev @vspetrov
Hey guys - this is the PR that adds the DPU team, developed during the hackathon by @artpol84 @Sergei-Lebedev and me.

It's the first attempt that successfully runs, but it obviously needs strong vetting.
We did preliminary data checks with the xccl allreduce tests, which seem to pass, and it successfully runs the PyTorch param/comms benchmark.

@artpol84

artpol84 commented Jan 7, 2021

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <[email protected]>
Co-authored-by: Sergey Lebedev <[email protected]>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

@artpol84

artpol84 commented Jan 7, 2021

I tried it out of curiosity and it works as expected:
artpol84@91a6466

Signed-off-by: Tomislavj Janjusic <[email protected]>

Co-authored-by: Artem Polyakov <[email protected]>
Co-authored-by: Sergey Lebedev <[email protected]>

janjust commented Jan 7, 2021

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <[email protected]>
Co-authored-by: Sergey Lebedev <[email protected]>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

done

Signed-off-by: Tomislavj Janjusic <[email protected]>

Co-authored-by: Artem Polyakov <[email protected]>
Co-authored-by: Sergey Lebedev <[email protected]>