[WIP] Adding support for FP8 training #218
base: main
Conversation
open_lm/model.py
Outdated
@@ -117,41 +128,72 @@ def __init__(self, layer_id, args: Params):
        super().__init__()
        self.n_heads = args.n_heads
        self.head_dim = args.dim // args.n_heads
        self.in_proj = nn.Linear(args.dim, 3 * args.n_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
        if using_te:
instead of if/elses, can we have a single helper function that recursively searches the model and replaces the Linears?
another option is having a linear/layernorm module `m` that is set to either `te` or `nn`
Ideally, yes. But that would also replace the Linear layer found here, which breaks training. So until a fix is found, we need to isolate that particular layer.
can't we just not recurse for special cases?
missed this earlier, but yeah this might also address my swiglu comment below (since it would recurse and replace the linear within the swiglu)
Completed in the latest commit.
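For context, a minimal sketch of what such a recursive swap could look like (the `replace_linear` name and the weight-copy details are illustrative, not the exact helper landed in this PR):

```python
import torch
import torch.nn as nn

try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError:
    using_te = False


def replace_linear(module: nn.Module) -> None:
    """Recursively swap nn.Linear children for te.Linear, copying weights over."""
    if not using_te:
        return  # nothing to do without Transformer Engine
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new_linear = te.Linear(child.in_features, child.out_features,
                                   bias=child.bias is not None)
            with torch.no_grad():
                new_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    new_linear.bias.copy_(child.bias)
            setattr(module, name, new_linear)
        else:
            replace_linear(child)  # recurse into submodules
```

A single `replace_linear(model)` call right after construction would then cover every Linear in the tree, without per-module if/else branches.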
open_lm/model.py
Outdated
        torch.nn.init.trunc_normal_(self.in_proj.weight_tensor.float(), std=std, a=-3 * std, b=3 * std)
        # scale init by depth as in https://arxiv.org/abs/1908.11365 -- worked slightly better.
        std = std / math.sqrt(2 * (self.layer_id + 1))
        torch.nn.init.trunc_normal_(self.out_proj.weight_tensor.float(), std=std, a=-3 * std, b=3 * std)
why do we need to cast to float? does a float cast happen in place?
We don't need the float cast. Removing that in the next commit.
Removed completely, as we are now recursively changing nn.Linear to te.Linear.
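To illustrate the in-place question above (a generic PyTorch sketch, not code from this PR): `.float()` is a no-op on a tensor that is already float32, but returns a new copy otherwise, so an in-place init through the cast can silently miss the real parameter.

```python
import torch

w32 = torch.zeros(4, 4)                        # float32 weight
w16 = torch.zeros(4, 4, dtype=torch.bfloat16)  # bf16 weight

# .float() on a float32 tensor returns the same tensor object, so the
# in-place init still reaches w32:
torch.nn.init.trunc_normal_(w32.float(), std=0.02, a=-0.06, b=0.06)
print(w32.abs().sum() > 0)  # tensor(True)

# .float() on a bf16 tensor returns a new float32 copy; the init lands
# on the copy and w16 stays all zeros:
torch.nn.init.trunc_normal_(w16.float(), std=0.02, a=-0.06, b=0.06)
print(w16.abs().sum() > 0)  # tensor(False)
```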
open_lm/model.py
Outdated
                eps=args.norm_eps,
            )
        else:
            self.attention_norm = args.norm_type(
can we just add te.LayerNorm as one of the args.norm_type options?
Added this in the latest commit: based on the presence of TE, either te.LayerNorm or nn.LayerNorm will be used.
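A hedged sketch of that selection (the function name is illustrative and open_lm's actual norm registry may differ):

```python
import torch.nn as nn

try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError:
    using_te = False


def get_layer_norm_class():
    # Prefer te.LayerNorm when Transformer Engine is available (P5 / H100 nodes),
    # otherwise fall back to the standard torch implementation.
    return te.LayerNorm if using_te else nn.LayerNorm
```

The returned class can then be assigned to args.norm_type so the rest of the model code stays unchanged.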
open_lm/norms.py
Outdated
@@ -55,7 +67,16 @@ def reset_parameters(self) -> None:
            self.bias.zero_()

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(input, self.normalized_shape, self.weight, self.bias, self.eps)
        if using_te:
@achalddave do you think we should have a separate class for TeLayerNorm, or do you prefer combining it with the existing layer norm?
We can have a separate class for TeLayerNorm.
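A possible shape for such a separate class in norms.py (a sketch assuming transformer_engine is importable; the TELayerNorm name and constructor arguments are illustrative):

```python
import torch
from torch import Tensor
import transformer_engine.pytorch as te


class TELayerNorm(torch.nn.Module):
    """LayerNorm that delegates to Transformer Engine's implementation."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.norm = te.LayerNorm(hidden_size, eps=eps)

    def forward(self, input: Tensor) -> Tensor:
        return self.norm(input)
```

Keeping it as its own class leaves the existing torch-based LayerNorm untouched for machines without TE.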
open_lm/params.py
Outdated
help="Using SMP Flash Attention.", | ||
) | ||
parser.add_argument( | ||
"--sharding-strategy", |
is this used? if so can we have a more specific name?
also not seeing where --use-smp-flash-attention is used
This is not used for FP8. These are just placeholder flags, defaulted to None, for SageMaker Model Parallel.
Removed this to avoid confusion
@@ -202,9 +245,14 @@ def __init__(self, layer_id, args: Params):
        elif args.ffn_type == "gelu":
Could we also support fp8 for swiglu above? We can make a copy of the Swiglu class in this file. Here's the source for Swiglu https://github.com/facebookresearch/xformers/blob/7f8c290183344343771f4e1d945a8ce10a9500ff/xformers/ops/swiglu_op.py#L430
@rams16592 seems like the recursive replace-linear pattern should take care of this automatically. A function like this seems like it would be great, and we can exclude certain Linears that need to be higher precision for stability. This function has an include field instead of exclude, but hopefully that's easy to flip:
https://github.com/mlfoundations/open_clip/blob/73fa7f03a33da53653f61841eb6d69aef161e521/src/open_clip/utils.py#L65
Applied this change in the latest commit. Excluding the last output Linear layer from the conversion to te.Linear, as it was running into errors.
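For reference, one way the exclude-style filter could look (a sketch that flips the include logic of the linked open_clip helper; the module name "output" for the final Linear is an assumption about open_lm's naming, and TE is assumed importable):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te


def replace_linear_fp8(module: nn.Module, exclude=("output",)) -> None:
    """Recursively swap nn.Linear for te.Linear, skipping names in `exclude`."""
    for name, child in module.named_children():
        if name in exclude:
            continue  # keep e.g. the final output projection in higher precision
        if isinstance(child, nn.Linear):
            new_linear = te.Linear(child.in_features, child.out_features,
                                   bias=child.bias is not None)
            with torch.no_grad():
                new_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    new_linear.bias.copy_(child.bias)
            setattr(module, name, new_linear)
        else:
            replace_linear_fp8(child, exclude)
```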
…til natively supports FSDP
- train.py, model.py, and norms.py: check if Transformer Engine can be imported (for P5 instances or H100s) and, if so, use FP8 for Linear and LayerNorm layers.
- main.py: changes for FP8 support.
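A rough sketch of the import check and the FP8 training context this describes (the recipe values and function structure are assumptions, not the exact code in this PR):

```python
try:
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    using_te = True
except ImportError:  # no TE wheel installed / not an H100-class node
    using_te = False


def forward_with_optional_fp8(model, inputs):
    if using_te:
        # HYBRID = E4M3 for forward activations/weights, E5M2 for backward grads.
        fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            return model(inputs)
    return model(inputs)
```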