Replies: 2 comments
- A possible reason is that the local mcore model does not support flash-attn.
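For context, a minimal sketch of where flash attention is toggled on each path, assuming the flag names in recent Megatron-LM (check arguments.py on your commit); $COMMON_ARGS stands in for the rest of the training arguments:

```bash
# Assumed flags based on recent Megatron-LM versions; verify against
# arguments.py on your commit. $COMMON_ARGS is a placeholder for the
# rest of your training arguments.

# Legacy model path: flash attention is enabled explicitly.
torchrun --nproc_per_node 8 pretrain_gpt.py $COMMON_ARGS --use-flash-attn

# mcore path: fused/flash attention comes via Transformer Engine,
# not the local implementation.
torchrun --nproc_per_node 8 pretrain_gpt.py $COMMON_ARGS \
    --use-mcore-models --transformer-impl transformer_engine
```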
- Marking as stale. No activity in 60 days.
Your question
I ran pretrain_gpt on the same architecture, data, training hyperparameters, and hardware, with and without using megatron_core when building the model.
I noticed clearly worse wall-clock time and memory usage:
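For reference, here is a minimal sketch of the comparison I mean. The --use-mcore-models flag is an assumption based on recent Megatron-LM versions (it selects the megatron.core GPTModel build, while omitting it uses the legacy model), and the architecture/training arguments below are illustrative placeholders, not my exact settings:

```bash
# Illustrative settings only; substitute your own architecture and data.
COMMON_ARGS="--num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --micro-batch-size 4 --global-batch-size 64 --seq-length 1024 \
    --max-position-embeddings 1024 --train-iters 5000 \
    --data-path $DATA_PATH --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt --lr 1.5e-4 --fp16"

# Run 1: legacy model build.
torchrun --nproc_per_node 8 pretrain_gpt.py $COMMON_ARGS

# Run 2: megatron.core model build (assumed flag; see arguments.py).
torchrun --nproc_per_node 8 pretrain_gpt.py $COMMON_ARGS --use-mcore-models
```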
Environment:
For the data, I use the c4_en dataset from Hugging Face, tokenized with the GPT-2 tokenizer. I use the first 3.6e7 documents (the first 10%) for the experiments.
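The tokenization step, sketched with the tools/preprocess_data.py interface documented in the Megatron-LM README (older commits may spell --vocab-file as --vocab; the input file name is a hypothetical placeholder):

```bash
# Sketch of the preprocessing step. c4_en_subset.json stands for a
# JSON-lines export of the first ~3.6e7 c4_en documents.
python tools/preprocess_data.py \
    --input c4_en_subset.json \
    --output-prefix c4-en-gpt2 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
```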
To Reproduce
megatron-lm commit hash: 9de386d
I customized a script based on pretrain_gpt_distributed.sh and renamed it pretrain_gpt_cli.sh.
To reproduce the experiment, please run the following bash command:
Is there a known reason for this difference?