[QUESTION] how to control GPU memory layout for 70B LLM model? #1074
Unanswered · wangdaw2023 asked this question in Q&A
I am training a 70B Megatron LLM on a 32-node A800 cluster; each node has 8 × A800 GPUs and 4 × 200 Gb/s RoCE NICs. The 70B run reaches only 20% MFU, much lower than the 47% MFU I get with a 32B model. I also see that GPU memory usage is about 70 GB on some nodes but only about 50 GB on others. I would like to bring memory usage to the same level on every rank so that I can use a bigger micro batch size and improve MFU. That means controlling which LLM layers are placed on which rank. Is there any documentation on this topic?

32B LLM, TP=8, PP=1, MFU=47%
70B LLM, TP=8, PP=2, MFU=20%
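For a rough picture of where the imbalance can come from: with PP=2, the ranks on the first pipeline stage hold the input embedding and, under the 1F1B schedule, keep more microbatches of activations in flight than the last stage, so the nodes hosting stage 0 naturally sit at higher memory. The sketch below is a back-of-envelope estimate only; the model shape (80 layers, hidden 8192, FFN 28672, 64 heads / 8 KV heads, 32k vocab), sequence length 4096, micro batch 1, and the activation constant are all assumptions, not values taken from this thread.

```python
"""Back-of-envelope per-pipeline-stage memory for the 70B run above.

Everything here is an assumption for illustration: a Llama-style shape,
bf16 weights, sequence length 4096, micro batch 1, and a 1F1B pipeline
schedule. Plug in the real training config to get meaningful numbers.
"""

GB = 1024 ** 3

layers, hidden, ffn, vocab = 80, 8192, 28672, 32_000
heads, kv_heads = 64, 8
seq_len, micro_batch = 4096, 1
bytes_per_elem = 2            # bf16

tp, pp = 8, 2                 # the layout reported in the question


def layer_params() -> int:
    """Parameters of one transformer layer (GQA attention + SwiGLU MLP)."""
    head_dim = hidden // heads
    attn = 2 * hidden * hidden                    # Q and O projections
    attn += 2 * hidden * kv_heads * head_dim      # K and V projections
    mlp = 3 * hidden * ffn                        # gate, up, down projections
    return attn + mlp


for stage in range(pp):
    n_layers = layers // pp
    params = n_layers * layer_params()
    if stage == 0:
        params += vocab * hidden                  # input embedding
    if stage == pp - 1:
        params += vocab * hidden                  # untied output projection
    weight_gb = params * bytes_per_elem / tp / GB

    # Under 1F1B, stage i keeps (pp - i) microbatches of activations alive,
    # so earlier stages need more activation memory than later ones.
    in_flight = pp - stage
    # Very rough per-layer activation footprint (~34 * s * b * h bytes with
    # selective recomputation); treat the constant as tunable, not exact.
    act_per_microbatch = 34 * seq_len * micro_batch * hidden * n_layers / tp
    act_gb = in_flight * act_per_microbatch / GB

    print(f"stage {stage}: ~{weight_gb:.1f} GB weights/rank, "
          f"~{act_gb:.1f} GB activations/rank "
          f"({in_flight} microbatches in flight)")
```

Gradient and optimizer-state memory come on top of this and scale with the per-rank parameter count, so the stages holding the embedding and output layer stay the heavy ones; evening out the layer split, or lowering TP and raising PP as suggested in the reply below, is what lets every rank run the same, larger micro batch size.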
Replies: 1 comment
-
Maybe try setting a smaller TP and a larger PP (e.g. TP=4, PP=4 or TP=4, PP=8) for the 70B case.
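As a quick sanity check of what those settings imply on the cluster described above (32 nodes × 8 GPUs = 256 GPUs), the snippet below prints the resulting data-parallel size and layers per pipeline stage. The 80-layer count is an assumption for a Llama-style 70B model, and the printed flag names reflect recent Megatron-LM; verify them against the version in use.

```python
"""What the suggested layouts imply on a 32-node x 8-GPU (256 GPU) cluster.

Assumptions: 80 transformer layers for a Llama-style 70B model and evenly
divided pipeline stages. Flag names follow recent Megatron-LM; double-check
them against the version you are running.
"""

world_size = 32 * 8   # 256 GPUs total
num_layers = 80       # assumed layer count for a 70B model

for tp, pp in [(8, 2), (4, 4), (4, 8)]:
    model_parallel_size = tp * pp
    dp = world_size // model_parallel_size      # data-parallel replicas
    layers_per_stage = num_layers // pp         # layers on each pipeline stage
    print(f"TP={tp} PP={pp}: DP={dp}, {layers_per_stage} layers/stage")
    print(f"  --tensor-model-parallel-size {tp}"
          f" --pipeline-model-parallel-size {pp}"
          f" --sequence-parallel --use-distributed-optimizer")
```

The trade-off: a smaller TP reduces per-layer tensor-parallel communication and gives larger per-GPU GEMMs, while a larger PP spreads layers and activation memory over more stages but adds a pipeline bubble that shrinks only with enough microbatches per step (or an interleaved virtual-pipeline schedule).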