Replies: 1 comment
-
I think the allgather dispatcher first gathers all tokens, i.e. [s * ep] tokens, onto the local rank, and then uses a mask to identify the tokens needed by the local experts.
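A minimal sketch of that flow, assuming top-1 routing, a flattened [num_local_tokens, H] token layout, and an expert-parallel group handle `ep_group` (the function name, arguments, and group are illustrative, not the actual Megatron-LM internals):

```python
import torch
import torch.distributed as dist

def allgather_dispatch(hidden_states, expert_indices, local_expert_id, ep_group):
    """Gather everyone's tokens, then keep only those routed to the local expert.

    hidden_states:  [num_local_tokens, H] tokens held by this rank
    expert_indices: [num_local_tokens] top-1 expert id per local token
    """
    ep_size = dist.get_world_size(group=ep_group)

    # 1) All-gather tokens and routing decisions from every rank in the group,
    #    so each rank now holds ep_size * num_local_tokens tokens.
    gathered_states = [torch.empty_like(hidden_states) for _ in range(ep_size)]
    gathered_indices = [torch.empty_like(expert_indices) for _ in range(ep_size)]
    dist.all_gather(gathered_states, hidden_states, group=ep_group)
    dist.all_gather(gathered_indices, expert_indices, group=ep_group)
    global_states = torch.cat(gathered_states, dim=0)
    global_indices = torch.cat(gathered_indices, dim=0)

    # 2) Mask: keep only the tokens routed to the expert hosted on this rank.
    mask = global_indices == local_expert_id
    return global_states[mask]
```

Every rank first materializes all of the gathered tokens and only afterwards drops the ones its local expert does not own.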
-
I have a node with 8 GPUs. The model has 8 experts and I use TP=1, EP=8, with sequence parallelism on, so I expect each GPU to hold one expert. I use the `MoEAllGatherTokenDispatcher`. The size of the hidden_states passed to `token_permutation` is [S/TP, B, H], which is actually [S, B, H] because TP=1. Then why do we still need `tensor_parallel.gather_from_sequence_parallel_region_to_moe` to gather and form a `global_hidden_states`, whose size becomes [S * EP, B, H]? In my view, each rank already has a copy of the [S, B, H] hidden_states, so why is there still a need for an all-gather? There are only B * S tokens to compute, but now each rank has B * S * EP tokens, because we use `get_tensor_and_expert_parallel_group()` and its size is TP * EP = 8.
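For what it's worth, here is a tiny shape-bookkeeping sketch of my understanding (the sizes are made up and the variable names are mine, not Megatron-LM's; it only reproduces the TP * EP arithmetic above):

```python
# Hypothetical sizes; only the TP/EP arithmetic matters.
TP, EP = 1, 8
S, B, H = 2048, 4, 4096

local_shape = (S // TP, B, H)   # input to token_permutation: [S/TP, B, H] = [S, B, H] when TP=1
group_size = TP * EP            # get_tensor_and_expert_parallel_group() spans TP * EP = 8 ranks

# gather_from_sequence_parallel_region_to_moe concatenates along the sequence
# dimension over that group, so (assuming each rank feeds the same copy, as
# described above) the result stacks group_size copies together:
global_shape = (local_shape[0] * group_size, B, H)   # [S * EP, B, H] = [16384, 4, 4096]
print(local_shape, "->", global_shape)
```

So each rank ends up holding EP identical copies of the same B * S tokens, which is the redundancy I am asking about.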