The Megatron-LM GitHub documentation says:
> The uniform method uniformly divides the transformer layers into groups of layers (each group of size --recompute-num-layers) and stores the input activations of each group in memory. The baseline group size is 1 and, in this case, the input activation of each transformer layer is stored. When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained. For example, when --recompute-num-layers is set to 4, only the input activation of each group of 4 transformer layers is stored.
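As I understand it, this corresponds roughly to the pattern below (a minimal PyTorch sketch using `checkpoint_sequential`, not Megatron-LM's actual code; the toy layer stack and sizes are made up):

```python
# Minimal sketch of "uniform" activation recompute over groups of layers.
# Only the input of each group is kept during the forward pass; everything
# inside a group is recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

class ToyTransformerStack(nn.Module):
    def __init__(self, num_layers=8, hidden=512, recompute_num_layers=4):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        # Number of groups, analogous to num_layers / --recompute-num-layers.
        self.num_groups = num_layers // recompute_num_layers

    def forward(self, x):
        # checkpoint_sequential stores only each segment's input activation
        # and recomputes the segment's internals during backward.
        return checkpoint_sequential(self.layers, self.num_groups, x,
                                     use_reentrant=False)
```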
I agree that with a bigger group size, fewer input activations are stored, which takes less GPU memory. But with a bigger group size, the recompute pass also needs more GPU memory to hold the temporary activations of the group currently being recomputed.
So the peak activation memory during training should be stored_input_activations_memory + M / num_groups, where M is the activation memory of the model with no recomputation.
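To make the estimate concrete, here is roughly the calculation I have in mind (the per-layer sizes are made-up placeholders, not measured values):

```python
# Back-of-the-envelope peak-memory estimate under the reasoning above.
def estimated_peak_gb(per_layer_act_gb, per_layer_input_gb,
                      num_layers, recompute_num_layers):
    num_groups = num_layers // recompute_num_layers
    stored_inputs = num_groups * per_layer_input_gb   # one input kept per group
    full_model_acts = num_layers * per_layer_act_gb   # M: no-recompute activations
    recompute_live = full_model_acts / num_groups     # one group live at a time
    return stored_inputs + recompute_live

# Example: 32 layers, 2 GB of activations and 0.1 GB of layer input per layer.
for group_size in (1, 4, 8, 32):
    print(group_size, estimated_peak_gb(2.0, 0.1, 32, group_size))
```

Under this estimate, larger groups trade fewer stored inputs for a larger live recompute working set, so the peak does not obviously go down.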
But the official doc says "When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained." Is there some other technique applied to make this hold?
By the way, I have tested setting --recompute-num-layers equal to the number of layers of the model with the uniform recompute method and found that it actually saved some activation memory. But I am not clear how this is achieved, since with all layers in one group, all activations have to be materialized during the recompute pass, which should be the same amount as the activation memory of a run with no recomputation.
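For reference, this is roughly how the peak can be measured and compared between settings (a simplified sketch, not the exact Megatron-LM launch; `model` and `batch` are placeholders):

```python
# Rough peak-memory comparison between recompute settings on a single GPU.
import torch

def measure_peak_gib(model, batch):
    torch.cuda.reset_peak_memory_stats()
    loss = model(batch).float().mean()   # placeholder loss
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30
```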