[QUESTION] vicuna-7b-v1.5 weight conversion from huggingface to megatron-lm format #1181
Replies: 6 comments
-
I'm also interested in this, and more generally in how Megatron-LM can be used to convert from HF, continue pretraining, and convert back to HF.
-
Same issue with a different model.
-
My understanding is that
-
Also, if you do need
-
Thanks, man. I used that, and I changed the convert command accordingly. Anyway, your answer really helped me; before that, I had spent a long time trying to figure out the problem.
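For readers landing on this thread: the original commands were not preserved above, but a conversion invocation generally looks like the sketch below. The loader name (`llama_mistral`; older trees ship `llama2_hf`) and the exact flag set are assumptions that vary across Megatron-LM versions — check `python tools/checkpoint/convert.py --help` in your checkout before running anything.

```shell
# Hypothetical sketch only -- loader name and flags are assumptions that
# differ between Megatron-LM versions; verify against convert.py --help.
CONVERT_CMD="python tools/checkpoint/convert.py \
  --model-type GPT \
  --loader llama_mistral \
  --saver megatron \
  --target-tensor-parallel-size 1 \
  --target-pipeline-parallel-size 1 \
  --load-dir ./vicuna-7b-v1.5 \
  --save-dir ./vicuna-7b-v1.5-megatron \
  --tokenizer-model ./vicuna-7b-v1.5/tokenizer.model"
echo "$CONVERT_CMD"
```

The `--target-*-parallel-size` flags pick the TP/PP sharding of the saved Megatron checkpoint; `1`/`1` produces an unsharded checkpoint you can later re-split.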
-
Marking as stale. No activity in 60 days.
-
I am trying to convert the weights for vicuna-7b-v1.5 from Hugging Face Transformers ( https://huggingface.co/lmsys/vicuna-7b-v1.5 ) so they can be used with Megatron-LM. I am using tools/checkpoint/convert.py to do the conversion. The command I used is as follows:

When I run it, I get an error like this:

I looked into it, and the error seems to happen here:

Megatron-LM/megatron/core/parallel_state.py
Lines 563 to 569 in 7fe863f

because _TENSOR_MODEL_PARALLEL_GROUP does not have a value set. However, I found that _TENSOR_MODEL_PARALLEL_GROUP is set in only one place in the whole codebase:

Megatron-LM/megatron/core/parallel_state.py
Line 379 in 7fe863f

and that function, initialize_model_parallel, does not seem to be called during the weight conversion. How can I correctly do the weight conversion?
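The failing assertion reflects a common module-level guard pattern: a global process-group handle starts as None and a getter asserts it was initialized. Below is a minimal, self-contained sketch of that pattern — illustrative only, not Megatron's actual implementation; the names are borrowed from the report above, and the real initializer builds torch.distributed process groups rather than a placeholder object:

```python
# Simplified sketch of the guard pattern seen in megatron/core/parallel_state.py.
# Not the real code: the actual initializer creates torch.distributed groups.
_TENSOR_MODEL_PARALLEL_GROUP = None  # set only by initialize_model_parallel()

def initialize_model_parallel():
    """Stand-in for Megatron's initializer; the real one needs torch.distributed."""
    global _TENSOR_MODEL_PARALLEL_GROUP
    _TENSOR_MODEL_PARALLEL_GROUP = object()  # real code stores a process group

def get_tensor_model_parallel_group():
    # Megatron raises an assertion like this when the group was never set,
    # which is the symptom described above: a code path reaches the getter
    # without initialize_model_parallel() ever having been called.
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
        "tensor model parallel group is not initialized"
    return _TENSOR_MODEL_PARALLEL_GROUP
```

Any path that reaches the getter before the initializer trips the assertion, so the question reduces to which conversion code path is responsible for calling the initializer (in real Megatron-LM, the loader/saver subprocesses set up model parallelism before touching checkpoint state).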