Hi,

we could successfully pretrain various MosaicBERT models, and evaluations with composer-based fine-tuning look really good :)
However, when using the conversion script llm-foundry/scripts/inference/convert_composer_to_hf.py, the converted HF model seems to be initialized randomly and the MLM predictions look super random.
I used the conversion script from the llm-foundry repository like this:
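Roughly like the following — the checkpoint and output paths here are placeholders, the flags are those of llm-foundry's convert_composer_to_hf.py:

```bash
# Assumed invocation; checkpoint and output paths are placeholders.
python llm-foundry/scripts/inference/convert_composer_to_hf.py \
  --composer_path ./checkpoints/mosaic-bert/latest-rank0.pt \
  --hf_output_path ./converted-3 \
  --output_precision fp32
```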
It then shows that various weights are not correctly initialized:
HF checkpoint folder successfully created at ./converted-3.
Loading model from ./converted-3
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of BertLMHeadModel were not initialized from the model checkpoint at ./converted-3 and are newly initialized: ['bert.encoder.layer.7.attention.self.key.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.7.attention.self.query.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.4.output.dense.bias', 'bert.encoder.layer.8.attention.self.key.bias', 'bert.encoder.layer.5.output.LayerNorm.bias', 'bert.encoder.layer.1.output.dense.weight', 'bert.encoder.layer.2.output.dense.bias', 'bert.encoder.layer.8.attention.self.value.bias', 'bert.encoder.layer.5.intermediate.dense.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.1.intermediate.dense.bias', 'bert.encoder.layer.1.attention.self.query.weight', 'bert.encoder.layer.8.attention.self.query.weight', 'bert.encoder.layer.2.attention.self.key.weight', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.3.attention.self.query.bias', 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.2.attention.self.value.bias', 'bert.encoder.layer.4.attention.self.value.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.2.attention.self.key.bias', 'bert.encoder.layer.6.attention.self.key.weight', 'bert.encoder.layer.5.attention.self.key.bias', 'bert.encoder.layer.9.attention.self.query.weight', 'bert.encoder.layer.7.attention.self.value.weight', 'bert.encoder.layer.8.output.dense.weight', 'bert.encoder.layer.4.attention.self.key.bias', 'bert.encoder.layer.11.attention.self.value.bias', 'bert.encoder.layer.4.attention.self.key.weight', 'bert.encoder.layer.7.intermediate.dense.bias', 'bert.encoder.layer.5.output.dense.bias', 'bert.encoder.layer.8.attention.self.value.weight', 'bert.encoder.layer.5.attention.self.query.weight', 'bert.encoder.layer.4.attention.self.value.weight', 'bert.encoder.layer.9.intermediate.dense.weight', 'bert.encoder.layer.3.output.LayerNorm.bias', 'bert.encoder.layer.6.intermediate.dense.bias', 'bert.encoder.layer.3.intermediate.dense.weight', 'bert.encoder.layer.9.attention.self.value.bias', 'bert.encoder.layer.4.output.LayerNorm.weight', 'bert.encoder.layer.3.output.LayerNorm.weight', 'bert.encoder.layer.5.attention.self.value.weight', 'bert.encoder.layer.10.attention.self.key.weight', 'bert.encoder.layer.3.intermediate.dense.bias', 'bert.encoder.layer.9.output.LayerNorm.bias',
'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.11.intermediate.dense.bias', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.7.output.LayerNorm.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.6.attention.self.query.weight', 'bert.encoder.layer.11.output.LayerNorm.bias', 'bert.encoder.layer.5.output.LayerNorm.weight', 'bert.encoder.layer.9.output.dense.bias', 'bert.encoder.layer.6.attention.self.key.bias', 'bert.encoder.layer.1.intermediate.dense.weight', 'bert.encoder.layer.10.attention.self.query.weight', 'bert.encoder.layer.3.attention.self.query.weight', 'bert.encoder.layer.9.output.dense.weight', 'bert.encoder.layer.1.attention.self.key.weight', 'bert.encoder.layer.10.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.2.attention.self.query.bias', 'bert.encoder.layer.8.output.dense.bias', 'bert.encoder.layer.0.output.LayerNorm.weight'
[...]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
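For illustration, the randomness is easy to see with a quick fill-mask sanity check on the converted folder — this is just a minimal sketch, and the prompt is only an example:

```python
from transformers import pipeline

# Fill-mask sanity check against the converted checkpoint folder.
# Any masked sentence shows the same behavior.
fill = pipeline("fill-mask", model="./converted-3", tokenizer="./converted-3")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))
# With the randomly initialized weights listed above, the top tokens are arbitrary.
```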
Are there any special conversion scripts or hints for converting a MosaicBERT composer checkpoint? 🤔
Any help is highly appreciated!
Hi, the conversion script in LLM Foundry is not intended for MosaicBERT, which still lives here in the examples repo. To export it properly, you'll need to manually move the MosaicBERT code files alongside the converted checkpoint. See my other answer as well: #401 (comment)
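As a rough sketch of what that looks like — assuming you copy the MosaicBERT code files (e.g. bert_layers.py and configuration_bert.py from examples/benchmarks/bert/src) into the exported folder and add an auto_map entry to config.json pointing at them, following the layout of the mosaicml/mosaic-bert-base Hub repo (verify the names against your checkout):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumes the MosaicBERT modeling files sit next to the exported weights and
# config.json contains something like:
#   "auto_map": {"AutoConfig": "configuration_bert.BertConfig",
#                "AutoModelForMaskedLM": "bert_layers.BertForMaskedLM"}
# trust_remote_code=True then loads the MosaicBERT classes instead of the stock
# BertLMHeadModel, whose parameter names don't match the fused MosaicBERT
# layers (hence the "newly initialized" warnings above).
tokenizer = AutoTokenizer.from_pretrained("./converted-3")
model = AutoModelForMaskedLM.from_pretrained("./converted-3", trust_remote_code=True)
```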