
fix NAN loss of rope long context training #399

Open
wants to merge 1 commit into main
Conversation


@inkcherry inkcherry commented Jun 5, 2024

Based on (#392), we got NaN loss during long-context training using DeepSpeed sequence parallelism (Ulysses) for a LLaMA-style model.

We found that this issue is caused by a precision problem: computing the RoPE position/frequency tensors in half precision loses accuracy at long sequence lengths. A similar modification has also been applied in transformers:
https://github.com/huggingface/transformers/blob/63fb253df0d976b95d9b4b9a7b0012e5f8a37896/src/transformers/models/llama/modeling_llama.py#L111
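For context, the fix amounts to keeping the rotary frequency/position math in float32 and only casting the resulting cos/sin tables back to the training dtype. A minimal sketch of that pattern, not the exact patch (`build_rope_cache` and its arguments are illustrative):

```python
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0, dtype=torch.bfloat16, device="cpu"):
    # Inverse frequencies are computed in float32; doing this in bf16/fp16
    # collapses nearby angles at large position indices, which is where the
    # long-context NaNs come from.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device, dtype=torch.float32) / head_dim))
    # Positions are also kept in float32 for the outer product.
    positions = torch.arange(seq_len, device=device, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)      # [seq_len, head_dim // 2]
    emb = torch.cat((freqs, freqs), dim=-1)       # [seq_len, head_dim]
    # Only the final cos/sin tables are cast back to the training dtype.
    return emb.cos().to(dtype), emb.sin().to(dtype)
```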

@shrutiramesh1988

Even with this fix, I'm still facing loss=nan issues when trying to run Llama-2 pre-training on single/multiple nodes with BF16, ZeRO stage 1, --use-rotary-position-embeddings, and a sequence length of 4096. Could you kindly help?
