The Megatron-LM GitHub documentation says:
> The uniform method uniformly divides the transformer layers into groups of layers (each group of size --recompute-num-layers) and stores the input activations of each group in memory. The baseline group size is 1 and, in this case, the input activation of each transformer layer is stored. When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained. For example, when --recompute-num-layers is set to 4, only the input activation of each group of 4 transformer layers is stored.
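As I understand it, this corresponds roughly to the pattern below (a minimal PyTorch sketch using `checkpoint_sequential`, not Megatron-LM's actual code; the toy layer stack and sizes are made up):

```python
# Minimal sketch of "uniform" activation recompute over groups of layers.
# Only the input of each group is kept during the forward pass; everything
# inside a group is recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

class ToyTransformerStack(nn.Module):
    def __init__(self, num_layers=8, hidden=512, recompute_num_layers=4):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        # Number of groups, analogous to num_layers / --recompute-num-layers.
        self.num_groups = num_layers // recompute_num_layers

    def forward(self, x):
        # checkpoint_sequential stores only each segment's input activation
        # and recomputes the segment's internals during backward.
        return checkpoint_sequential(self.layers, self.num_groups, x,
                                     use_reentrant=False)
```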
I agree that with a bigger group size, fewer input activations are stored, which takes less GPU memory. But with a bigger group size, the recompute pass also needs more GPU memory to hold the temporary activations of the group currently being recomputed.
So the peak activation memory during training should be stored_input_activations_memory + M / num_groups, where M is the activation memory of the model with no recomputation.
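To make the estimate concrete, here is roughly the calculation I have in mind (the per-layer sizes are made-up placeholders, not measured values):

```python
# Back-of-the-envelope peak-memory estimate under the reasoning above.
def estimated_peak_gb(per_layer_act_gb, per_layer_input_gb,
                      num_layers, recompute_num_layers):
    num_groups = num_layers // recompute_num_layers
    stored_inputs = num_groups * per_layer_input_gb   # one input kept per group
    full_model_acts = num_layers * per_layer_act_gb   # M: no-recompute activations
    recompute_live = full_model_acts / num_groups     # one group live at a time
    return stored_inputs + recompute_live

# Example: 32 layers, 2 GB of activations and 0.1 GB of layer input per layer.
for group_size in (1, 4, 8, 32):
    print(group_size, estimated_peak_gb(2.0, 0.1, 32, group_size))
```

Under this estimate, larger groups trade fewer stored inputs for a larger live recompute working set, so the peak does not obviously go down.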
But the official doc says "When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained." Is there some other technique applied to make this hold?
By the way, I have tested setting --recompute-num-layers equal to the number of layers of the model with the uniform recompute method and found that it actually saved some activation memory. But I am not clear how this is achieved, since with all layers in one group, all activations have to be materialized during the recompute pass, which should be the same amount as the activation memory of a run with no recomputation.
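For reference, this is roughly how the peak can be measured and compared between settings (a simplified sketch, not the exact Megatron-LM launch; `model` and `batch` are placeholders):

```python
# Rough peak-memory comparison between recompute settings on a single GPU.
import torch

def measure_peak_gib(model, batch):
    torch.cuda.reset_peak_memory_stats()
    loss = model(batch).float().mean()   # placeholder loss
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30
```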