forked from pytorch/torchtitan
Sync torch titan #7

Draft: philippguevorguian wants to merge 12 commits into main from sync_torch_titan.
ghstack-source-id: 28a28926bec3c1a6671a18b403deab0fc096d218
Pull Request resolved: pytorch#538

ghstack-source-id: ceb4fa54121be241633daf06a0ca2eb407667274
Pull Request resolved: pytorch#535
closes: pytorch#548

> NVIDIA Ada Lovelace GPUs (e.g., RTX 4090, L20, L40) with the SM89 architecture also support FP8 MMA, so it is recommended to relax the CUDA architecture limitations to enable FP8 training on a broader range of devices.

The [CUDA 12.0 announcement](https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/) confirms Lovelace support:

> *CUDA 12.0 exposes programmable functionality for many features of the NVIDIA Hopper and NVIDIA Ada Lovelace architectures: ...32x Ultra xMMA (including FP8 and FP16)*

References:
- https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/
- https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html
- https://github.com/NVIDIA/cutlass/blob/c4e3e122e266644c61b4af33d0cc09f4c391a64b/include/cutlass/arch/mma_sm89.h#L57

![image](https://github.com/user-attachments/assets/3c11736c-2e84-4bd6-a49c-5af8b0e3e6ac)

After relaxing the CUDA architecture limitations for FP8, my environment with **4 x L40 GPUs (SM89)** can still successfully train Llama under float8 precision.

![image](https://github.com/user-attachments/assets/1337e041-0d0d-49b5-8c11-00e67f4df41f)

---------

Co-authored-by: Andrew Gu <[email protected]>
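A minimal sketch of what such a relaxed capability gate might look like; the helper name and structure here are illustrative, not the actual torchtitan code:

```python
import torch

def _is_sm89_or_later() -> bool:
    # FP8 MMA is available not only on Hopper (SM90) but also on Ada
    # Lovelace (SM89), so the gate is relaxed from (9, 0) to (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

if not _is_sm89_or_later():
    raise RuntimeError("float8 training requires a GPU with compute capability >= 8.9")
```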
ghstack-source-id: ab6a7cec6ba4f4690f5834d22bc16d8d9f2bdba8
Pull Request resolved: pytorch#555
In this PR, we mostly measured the performance and loss curves for the 405B model with some optimization techniques we recently developed. We also want to log the actual peak TFLOPs used for the MFU calculation, for cross-validation. In addition, we should get device information from the system rather than from the device name, because the device name does not always contain "NVL" or "SXM". <img width="496" alt="image" src="https://github.com/user-attachments/assets/ba822de5-cf23-4ecd-b29c-70f9aac38290">

As the title says: we have updated the peak FLOPs for H100, so we need to use the correct number here.
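A hedged sketch of the MFU calculation being cross-validated; the function and values below are illustrative, not torchtitan's implementation:

```python
def mfu(flops_per_iter: float, iter_time_s: float, peak_tflops: float) -> float:
    # Model FLOPs utilization: achieved TFLOPs per second divided by the
    # hardware peak. The peak must match the actual part (H100 SXM and
    # H100 NVL have different peak BF16 TFLOPs), which is why the device
    # should be identified from the system rather than its name string.
    achieved_tflops = flops_per_iter / iter_time_s / 1e12
    return achieved_tflops / peak_tflops

# Example (made-up numbers): ~0.77 means 77% of the device's peak is used.
# print(mfu(flops_per_iter=2.3e15, iter_time_s=3.0, peak_tflops=989.0))
```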
The `lspci` command is part of the `pciutils` package, which provides tools for listing and querying PCI devices, but somehow `pciutils` is not installed on the CI machines. This PR first unblocks the CI failure; we can then decide whether to make `pciutils` a requirement for Titan.
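A minimal sketch of the kind of guarded lookup this implies, assuming a fallback to torch's device name when `lspci` is unavailable (the helper is hypothetical):

```python
import shutil
import subprocess

import torch

def gpu_device_info() -> str:
    # Prefer lspci (provided by pciutils) for the exact board identity;
    # fall back to torch's device name if the tool is missing, as on the
    # CI machines described above.
    if shutil.which("lspci") is not None:
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        nvidia_lines = [line for line in out.splitlines() if "NVIDIA" in line]
        if nvidia_lines:
            return nvidia_lines[0]
    return torch.cuda.get_device_name()
```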
Somehow, when rebasing, the legacy float8 enabling flag stayed in the 405B toml; let's remove it. This does not affect the perf numbers we obtained, because the old flag is just a no-op after the rebase.
ghstack-source-id: 3ece57ae6d8dbf7ff66e3c41f1804ddb08078ba4
Pull Request resolved: pytorch#525
Latest torch titan changes