Modernize MosaicBERT #440
Conversation
```python
if convert_dtype:
    # Triton implementation only supports fp16 and bf16
    orig_dtype = qkv.dtype
    qkv = qkv.to(torch.float16)
```
Do we need this to be in `torch.float16`?
we do not, this code was here before though.
How should we select between bfloat16 and float16 though?
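One possible way (not what the existing code does, just a sketch): keep fp16/bf16 inputs as they are, and otherwise prefer bfloat16 when the GPU supports it, falling back to float16. The helper name `select_attn_dtype` is hypothetical.

```python
import torch


def select_attn_dtype(qkv: torch.Tensor) -> torch.dtype:
    """Pick a low-precision dtype for the attention kernel (illustrative only)."""
    # Already in a supported low-precision dtype: keep it.
    if qkv.dtype in (torch.float16, torch.bfloat16):
        return qkv.dtype
    # Otherwise prefer bfloat16 on hardware that supports it (wider dynamic
    # range than float16), and fall back to float16 elsewhere.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16
```

The conversion above would then become `qkv = qkv.to(select_attn_dtype(qkv))`, with `orig_dtype` still used to cast back afterwards.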
```diff
@@ -266,8 +261,6 @@ def build_text_dataloader(
     cfg.dataset.get('validate_hash', None),
     keep_zip=stream.get('keep_zip', None) or
     cfg.dataset.get('keep_zip', False),
-    keep_raw=stream.get('keep_raw', None) or
```
Just noting that this is correct and that `keep_raw` is no longer a flag in mosaicml-streaming (see the Streaming docs).
can you check that the defaults here match the defaults currently set in llm foundry?
The defaults in llm foundry are a bit different. Should we update this function whole-hog?

From llmfoundry text_data.py:

```python
def __init__(self,
             tokenizer: PreTrainedTokenizerBase,
             max_seq_len: int,
             streams: Optional[Sequence[Stream]] = None,
             remote: Optional[str] = None,
             local: Optional[str] = None,
             split: Optional[str] = None,
             download_retry: int = 2,
             download_timeout: float = 60,
             validate_hash: Optional[str] = None,
             keep_zip: bool = False,
             epoch_size: Optional[Union[int, str]] = None,
             predownload: Optional[int] = None,
             cache_limit: Optional[Union[int, str]] = None,
             partition_algo: str = 'relaxed',
             num_canonical_nodes: Optional[int] = None,
             batch_size: Optional[int] = None,
             shuffle: bool = False,
             shuffle_algo: str = 'py1e',
             shuffle_seed: int = 9176,
             shuffle_block_size: Optional[int] = None,
             sampling_method: str = 'balanced',
             sampling_granularity: int = 1,
             batching_method: str = 'random',
             **kwargs: Any):
```
Since it's still text data, this should be good!
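For reference, a minimal usage sketch of `StreamingTextDataset` with those llm-foundry defaults (assuming `llmfoundry` is installed; the tokenizer, local path, and split below are illustrative):

```python
from llmfoundry.data.text_data import StreamingTextDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Only the required arguments are passed; everything else falls back to the
# llm-foundry defaults listed above (download_retry=2, keep_zip=False, ...).
dataset = StreamingTextDataset(
    tokenizer=tokenizer,
    max_seq_len=128,
    local='/tmp/my-copy-c4',  # illustrative local path with prepared shards
    split='val',
    batch_size=8,
    shuffle=False,
)
```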
This is just linting
Should be close to done @dakinggg, the two failed pytests were:
```diff
@@ -425,6 +499,7 @@ def __init__(self, config):
     (1, self.num_attention_heads, self._current_alibi_size,
      self._current_alibi_size))
     self.rebuild_alibi_tensor(size=config.alibi_starting_size)
+    self.slopes = None
```
Hey @Skylion007 many thanks for this PR! I am currently testing it (with my own dataset) and training is working (8x H100).

I had to remove this line, because:

- `self.slopes` is set in the `rebuild_alibi_tensor` function before
- it is later needed in line 583

Setting it to `None` will then cause an error in line 583.
This was on me trying to appease the linting gods. Thanks for catching! Should be removed now
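For anyone hitting the same thing, a minimal sketch of an ordering that keeps the linter happy without breaking later lookups (the names mirror the snippet above; the slope computation is a simplified stand-in, not the PR's actual code):

```python
from typing import Optional

import torch


class AlibiModuleSketch(torch.nn.Module):
    """Illustrative only: declare `slopes` before rebuilding the ALiBi tensor,
    rather than resetting it to None afterwards (which would discard the
    slopes that later code relies on)."""

    def __init__(self, num_attention_heads: int, alibi_starting_size: int):
        super().__init__()
        self.num_attention_heads = num_attention_heads
        self.slopes: Optional[torch.Tensor] = None  # satisfies the linter up front
        self.rebuild_alibi_tensor(size=alibi_starting_size)  # populates self.slopes

    def rebuild_alibi_tensor(self, size: int) -> None:
        # Simplified stand-in for the real ALiBi slope computation.
        self.slopes = torch.tensor(
            [2.0 ** -(8.0 * (i + 1) / self.num_attention_heads)
             for i in range(self.num_attention_heads)])
```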
UPDATE on 1/8/24: This was not an issue for me on a clean machine, so this is unlikely to be a real issue, and VERY unlikely to be an issue with this PR.

Details: Env: this is in WSL for Windows, but most of the time that's equivalent to a Ubuntu environment, and I don't think it's the source of this error. I just checked out the branch, created a clean conda env, and installed the requirements.

The flash-attn install failed at first: FA2 is assuming that torch is already installed, but it's being installed as a sibling in the same pip invocation, so it's not a module yet! Installing torch first and then installing FA2 separately worked.
One more bug that I'll report here just in case it is not just a "my machine" thing. I didn't see NVIDIA Apex mentioned in the requirements, but when I get to the point of running the pretraining command, it looks like I need to have NVIDIA Apex installed.
An update on the above: once I installed Apex from source, the command worked. You have already recommended the MosaicML PyTorch base image, which presumably comes with Apex pre-installed. I decided to ignore that handy tip and run from my existing WSL environment. Something that would have helped me would be to clarify that if the user does not use the recommended PyTorch base image, they will need to install Apex after pip-installing requirements.txt. If I'm not the target audience, or this is opening you up to way too much config specification, I get it.
With regards to my comment above: this was not an issue for me on a clean machine, so it is unlikely to be a real issue, and VERY unlikely to be an issue with this PR.
I believe that one of the test yamls is missing:

```yaml
algorithms:
  fused_layernorm: {}
```

I say that because in the README, it explains you can do a test run of training a Mosaic model by running:

```bash
# Run the pre-training script with the test config and MosaicBERT
composer main.py yamls/test/main.yaml model.name=mosaic_bert
```

However, yamls/test/main.yaml doesn't have these lines:

```yaml
algorithms:
  fused_layernorm: {}
```

but the full pretraining yamls do. That means that the first time it tries to load Apex's fused_layernorm is when you get to that section.

I noticed this because I got an error when it tried to load Apex and my environment didn't have it installed. I was surprised because all of my "tests" from the README worked.
Hi @Taytay, thanks for pointing this out. The MosaicML Composer library for a while used Fused LayerNorm as a Composer "algorithm" to speed up pretraining. It relies on NVIDIA Apex and enables a faster kernel for LayerNorm. More recently, we've been using Low Precision LayerNorm, which does not rely on Apex and works just as well as Fused LayerNorm (see the Composer docs).

In the yaml, you can replace fused_layernorm with:

```yaml
algorithms:
  low_precision_layernorm: {}
```

I've updated the mosaicbert pretraining and finetuning yamls to use low_precision_layernorm.
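For completeness, the same thing via Composer's Python API rather than the yaml (a sketch only; `composer_model` and `train_dataloader` are placeholders for whatever model and dataloader you already build):

```python
from composer import Trainer
from composer.algorithms import LowPrecisionLayerNorm

# composer_model and train_dataloader are assumed to be constructed elsewhere.
trainer = Trainer(
    model=composer_model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    precision='amp_bf16',
    # Low Precision LayerNorm needs no Apex, unlike the old Fused LayerNorm.
    algorithms=[LowPrecisionLayerNorm()],
)
trainer.fit()
```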
Thanks @jacobfulano. That's good news. It's worth mentioning that I ran into a bug in this branch that is fixed by #443.
This PR modernizes the MosaicBERT codebase with Flash Attention 2, PyTorch 2 (`torch==2.1.1`), and an updated version of Composer (`mosaicml>=0.17`). In particular, this updates MosaicBERT to be compatible with Flash Attention 2 (`flash-attn==2.4.2`), which now supports ALiBi slopes (PR #540).

Context: the original MosaicBERT used a Flash Attention kernel with ALiBi support written in `triton` in https://github.com/mosaicml/examples/blob/v0.0.4/examples/bert/src/flash_attn_triton.py. This version of `triton` also required PyTorch 1.13. This is also the kernel used for the MosaicBERT NeurIPS submission. Flash Attention 2 now supports ALiBi, so the custom `triton` implementation is no longer needed.

See the W&B runs here.
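As a small illustration of why the custom kernel can go away (a sketch assuming `flash-attn` >= 2.4 and a CUDA device; not the exact call this PR makes), ALiBi slopes are now passed straight into the fused kernel:

```python
import torch
from flash_attn import flash_attn_qkvpacked_func

# Requires a CUDA GPU; flash-attn kernels only run on fp16/bf16 CUDA tensors.
batch, seqlen, nheads, headdim = 2, 128, 12, 64
qkv = torch.randn(batch, seqlen, 3, nheads, headdim,
                  dtype=torch.bfloat16, device='cuda')

# Standard ALiBi slopes, one per head, in fp32 as the kernel expects.
slopes = torch.tensor(
    [2.0 ** -(8.0 * (i + 1) / nheads) for i in range(nheads)],
    dtype=torch.float32, device='cuda')

# Flash Attention 2 (>= 2.4) applies the ALiBi bias inside the kernel.
out = flash_attn_qkvpacked_func(qkv, dropout_p=0.0, alibi_slopes=slopes)
```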
Note that changes to files outside of `examples/benchmarks/bert` are simply formatting changes due to linting.