Sync torch titan #7

Draft · philippguevorguian wants to merge 12 commits into main

Conversation

philippguevorguian (Collaborator)

No description provided.

awgu and others added 12 commits August 20, 2024 11:06
ghstack-source-id: 28a28926bec3c1a6671a18b403deab0fc096d218
Pull Request resolved: pytorch#538
ghstack-source-id: ceb4fa54121be241633daf06a0ca2eb407667274
Pull Request resolved: pytorch#535
closes: pytorch#548

> Nvidia Ada Lovelace GPUs (e.g., RTX 4090, L20, L40) with SM89 also
support FP8 MMA, so it is recommended to relax the CUDA architecture
limitations to enable FP8 training on a broader range of devices.
>
> and the [CUDA 12.0 announcement](https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/) says that it supports the Lovelace architecture:
> '*CUDA 12.0 exposes programmable functionality for many features of
the NVIDIA Hopper and NVIDIA Ada Lovelace architectures: ...32x Ultra
xMMA (including FP8 and FP16)*'
> 
> - https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/
> - https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html
> - https://github.com/NVIDIA/cutlass/blob/c4e3e122e266644c61b4af33d0cc09f4c391a64b/include/cutlass/arch/mma_sm89.h#L57
> 
>
![image](https://github.com/user-attachments/assets/3c11736c-2e84-4bd6-a49c-5af8b0e3e6ac)

After relaxing the CUDA architecture limitations for FP8, my environment
with **4 x L40 GPUs (SM89)** can successfully train Llama in float8
precision.


![image](https://github.com/user-attachments/assets/1337e041-0d0d-49b5-8c11-00e67f4df41f)
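For context, a minimal sketch of what the relaxed capability check amounts to; the helper name `fp8_mma_supported` is made up here, and this is not torchtitan's exact guard:

```python
import torch

def fp8_mma_supported() -> bool:
    """Whether the current CUDA device exposes FP8 MMA instructions.

    Hopper (SM90) and Ada Lovelace (SM89, e.g. RTX 4090 / L20 / L40) both
    support FP8 MMA, so the guard accepts SM89 and newer instead of
    requiring SM90 only.
    """
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)
```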

---------

Co-authored-by: Andrew Gu <[email protected]>
ghstack-source-id: ab6a7cec6ba4f4690f5834d22bc16d8d9f2bdba8
Pull Request resolved: pytorch#555
In this PR, we mostly measured the performance and loss curves for the
405B model with some optimization techniques we recently developed. We
also want to log the actual peak TFLOPs used for the MFU calculation,
for cross-validation. We should also get the device information from the
system rather than from the device name, because the device name does
not contain "NVL" or "SXM".

<img width="496" alt="image"
src="https://github.com/user-attachments/assets/ba822de5-cf23-4ecd-b29c-70f9aac38290">
As titled. We have updated the peak FLOPs for H100, so we need to use the
correct number here.
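As a rough illustration of why the peak figure and the device variant matter for MFU (the table values and the helper below are assumptions for illustration, not torchtitan's actual constants):

```python
# Peak dense BF16 throughput in TFLOPS. Placeholder values: the real peak
# differs between H100 variants (SXM vs. NVL vs. PCIe), which is why the
# variant must be detected from the system rather than guessed from a
# bare device name.
PEAK_BF16_TFLOPS = {
    "NVIDIA H100 SXM": 989.0,
    "NVIDIA A100": 312.0,
}

def mfu(model_flops_per_sec: float, device_variant: str) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over the device peak."""
    peak_flops = PEAK_BF16_TFLOPS[device_variant] * 1e12
    return model_flops_per_sec / peak_flops
```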
The lspci command is part of the `pciutils` package, which provides
tools for listing and querying PCI devices, but somehow `pciutils` is
not installed on the CI machines. This PR first unblocks the CI failure;
we can then decide whether to make `pciutils` a requirement for Titan.
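A rough sketch of the fallback pattern described above; the helper name and the lspci parsing are illustrative only:

```python
import shutil
import subprocess

import torch

def device_description() -> str:
    """Best-effort GPU identification.

    Prefer the PCI description from lspci (part of pciutils), which can
    include the form factor (e.g. SXM or NVL); fall back to the plain CUDA
    device name when lspci is unavailable, as on CI machines without
    pciutils installed.
    """
    if shutil.which("lspci") is not None:
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "NVIDIA" in line and ("3D controller" in line or "VGA" in line):
                return line.split(": ", 1)[-1]
    return torch.cuda.get_device_name()
```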
Somehow, when rebasing, the legacy float8 enabling flag stayed in the
405B toml. Let's remove it. This does not affect the perf numbers we
obtained, because the old flag is just a no-op after the rebase.
ghstack-source-id: 3ece57ae6d8dbf7ff66e3c41f1804ddb08078ba4
Pull Request resolved: pytorch#525
Latest torch titan changes
philippguevorguian marked this pull request as draft on September 2, 2024 16:23