MosaicBERT: pretraining configuration for models > 128 seq. length #442
Comments
@stefan-it - I tried the commit in main, and ran into a number of errors, and was pointed to #440, so I am planning on basing my work on that unless I hear otherwise.
Hi @stefan-it, we did not experiment with training on sequence length 128 and then switching to 512 (as in the original BERT paper by Devlin et al. 2018). In our experiments, training MosaicBERT-Base on sequence length 512 with batch size 4096 for 70,000 steps took roughly 30 hours on 8 A100 80 GB GPUs. It might take us a few more days to merge the FA2 PR #440, but do let us know if you run into any issues!
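For a rough sense of scale, here is a back-of-the-envelope calculation from the figures quoted above (steps, batch size, sequence length, and wall-clock time come from this thread; the derived throughput is just arithmetic, not a measured benchmark):

```python
# Rough throughput estimate for the MosaicBERT-Base run described above.
# All input values are the figures quoted in this thread.
steps = 70_000             # optimizer steps
global_batch_size = 4_096  # sequences per step
seq_len = 512              # tokens per sequence
wall_clock_hours = 30      # reported time on 8x A100 80GB
num_gpus = 8

total_tokens = steps * global_batch_size * seq_len
tokens_per_sec = total_tokens / (wall_clock_hours * 3600)

print(f"total tokens seen:  {total_tokens:.2e}")                # ~1.47e11
print(f"overall tokens/sec: {tokens_per_sec:,.0f}")             # ~1.36M
print(f"tokens/sec per GPU: {tokens_per_sec / num_gpus:,.0f}")  # ~170k
```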
Hi @jacobfulano, do you also have an estimate for how long it will take to pre-train MosaicBERT-Large on a sequence length of 512 with batch size 4096 for 70,000 steps?
Hi @mmarius, we did not specifically train MosaicBERT-Large with sequence length 512, batch size 4096, and 70,000 steps. However, my estimate would be roughly 4x the time it takes to train MosaicBERT-Large with sequence length 128, batch size 4096, and 70,000 steps (~27.2 hours), so roughly 108 hours on 8 A100 80GB GPUs.
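For completeness, that estimate simply assumes per-step cost scales with the number of tokens per batch (512 / 128 = 4x); a minimal sketch of the arithmetic, using only the numbers stated above:

```python
# Scaling estimate for MosaicBERT-Large at sequence length 512, assuming
# per-step cost grows roughly in proportion to tokens per batch.
# 27.2 h is the reported time for Large at seq len 128 (same batch size/steps).
large_128_hours = 27.2
seq_len_ratio = 512 / 128  # 4x more tokens per step at seq len 512

large_512_hours_estimate = large_128_hours * seq_len_ratio
print(f"~{large_512_hours_estimate:.1f} hours on 8x A100 80GB")  # ~108.8 hours
```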
If you are going any larger than that, I would recommend looking at mosaicml/llm-foundry, which should have support for training encoders/embedding models soon.
Hi MosaicML team,
many thanks for releasing the code and models for MosaicBERT! I highly appreciate the effort you put into modernizing the BERT architecture.
I am interested in pretraining MosaicBERT, so I have some questions :)
Many thanks in advance!
Stefan