Replies: 1 comment
-
I think the allgather dispatcher first gathers all tokens, i.e. [s * ep] tokens, onto the local rank, and then uses a mask to identify the tokens needed by the local experts.
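A minimal sketch of that flow, assuming top-1 routing, a flattened [num_local_tokens, H] token layout, and an expert-parallel group handle `ep_group` (the function name, arguments, and group are illustrative, not the actual Megatron-LM internals):

```python
import torch
import torch.distributed as dist

def allgather_dispatch(hidden_states, expert_indices, local_expert_id, ep_group):
    """Gather everyone's tokens, then keep only those routed to the local expert.

    hidden_states:  [num_local_tokens, H] tokens held by this rank
    expert_indices: [num_local_tokens] top-1 expert id per local token
    """
    ep_size = dist.get_world_size(group=ep_group)

    # 1) All-gather tokens and routing decisions from every rank in the group,
    #    so each rank now holds ep_size * num_local_tokens tokens.
    gathered_states = [torch.empty_like(hidden_states) for _ in range(ep_size)]
    gathered_indices = [torch.empty_like(expert_indices) for _ in range(ep_size)]
    dist.all_gather(gathered_states, hidden_states, group=ep_group)
    dist.all_gather(gathered_indices, expert_indices, group=ep_group)
    global_states = torch.cat(gathered_states, dim=0)
    global_indices = torch.cat(gathered_indices, dim=0)

    # 2) Mask: keep only the tokens routed to the expert hosted on this rank.
    mask = global_indices == local_expert_id
    return global_states[mask]
```

Every rank first materializes all of the gathered tokens and only afterwards drops the ones its local expert does not own.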
-
I have a node with 8 GPUs. The model has 8 experts and I use TP=1, EP=8, with sequence parallelism on, so I expect each GPU to hold one expert. I use the `MoEAllGatherTokenDispatcher`. The size of the hidden_states passed to `token_permutation` is [S/TP, B, H], which is actually [S, B, H] because TP=1. Then why do we still need `tensor_parallel.gather_from_sequence_parallel_region_to_moe` to gather and form a `global_hidden_states`, whose size becomes [S * EP, B, H]? In my view, each rank already has a copy of the [S, B, H] hidden_states, so why is there still a need for an all-gather? There are only B * S tokens to compute, but now each rank has B * S * EP tokens, because we use `get_tensor_and_expert_parallel_group()` and its size is TP * EP = 8.
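For what it's worth, here is a tiny shape-bookkeeping sketch of my understanding (the sizes are made up and the variable names are mine, not Megatron-LM's; it only reproduces the TP * EP arithmetic above):

```python
# Hypothetical sizes; only the TP/EP arithmetic matters.
TP, EP = 1, 8
S, B, H = 2048, 4, 4096

local_shape = (S // TP, B, H)   # input to token_permutation: [S/TP, B, H] = [S, B, H] when TP=1
group_size = TP * EP            # get_tensor_and_expert_parallel_group() spans TP * EP = 8 ranks

# gather_from_sequence_parallel_region_to_moe concatenates along the sequence
# dimension over that group, so (assuming each rank feeds the same copy, as
# described above) the result stacks group_size copies together:
global_shape = (local_shape[0] * group_size, B, H)   # [S * EP, B, H] = [16384, 4, 4096]
print(local_shape, "->", global_shape)
```

So each rank ends up holding EP identical copies of the same B * S tokens, which is the redundancy I am asking about.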