[BUG] Error during LoRA-merge in HF upload for Llama 3.1 70B model #782

Open
tmostak opened this issue Jul 24, 2024 · 22 comments
Assignees
Labels
type/bug Bug in code

Comments


tmostak commented Jul 24, 2024

🐛 Bug

Today, when attempting to upload a LoRA-trained Llama 3.1 70B model (the first time I've trained Llama 3.1), I hit the following error during the LoRA merge. Note that I used the cpu_shard method for the upload. I've tried it twice now with the same error.

2024-07-24 17:22:58,705 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
2024-07-24 17:22:59,686 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
2024-07-24 17:22:59,701 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id 128001.
2024-07-24 17:22:59,701 - INFO: Setting pretraining_tp of model config to 1.
2024-07-24 17:22:59,723 - INFO: Using bfloat16 for backbone
2024/07/24 17:23:07 # {"client":"3f76ec33-3e3f-4837-9673-cda3f39f377f","state":"DISCONNECT","t":"ws_disconnect"}
2024/07/24 17:23:07 # {"addr":"99.68.143.103:49420","client_id":"3f76ec33-3e3f-4837-9673-cda3f39f377f","t":"client_reconnect"}
2024-07-24 18:04:05,704 - INFO: Attention implementation: sdpa
2024-07-24 18:04:05,713 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-07-24 18:06:03,026 - INFO: Trainable parameters count: 6627000320
2024-07-24 18:06:03,027 - INFO: Total parameters count: 77180706816
2024-07-24 18:06:03,027 - INFO: Trainable %: 8.5863%
2024-07-24 18:08:56,811 - INFO: Weights loaded from: /home/ubuntu/h2o-llmstudio/output/user/heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1/checkpoint.pth
2024-07-24 18:10:15,356 - INFO: Merging LORA layers with base model.
2024-07-24 18:10:15,561 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2015, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 216, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/chat.py", line 241, in load_cfg_model_tokenizer
    model.backbone = model.backbone.merge_and_unload()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 838, in merge_and_unload
    return self._unload_and_optionally_merge(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 457, in _unload_and_optionally_merge
    target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 470, in merge
    delta_weight = self.get_delta_weight(active_adapter)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 533, in get_delta_weight
    output_tensor = transpose(weight_B @ weight_A, self.fan_in_fan_out) * self.scaling[adapter]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
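
The last frame multiplies the adapter's lora_B and lora_A matrices, so the merge can only succeed if both live on the same device. A minimal sketch of the same failure (my own illustration, not LLM Studio code; it assumes a machine with at least two CUDA devices and uses arbitrary shapes):

import torch

# lora_B on cuda:1, lora_A on cuda:0 -- mirrors the device placement in the error above.
weight_B = torch.randn(8192, 512, device="cuda:1")
weight_A = torch.randn(512, 1024, device="cuda:0")

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:1 and cuda:0!
delta_weight = weight_B @ weight_A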

To Reproduce

cfg.yaml

architecture:
    backbone_dtype: bfloat16
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    neftune_noise_alpha: 0.0
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_bos_token_to_answer: false
    add_bos_token_to_prompt: false
    add_bos_token_to_system: false
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: false
    add_eos_token_to_system: true
    answer_column: answer
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 1.0
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    parent_id_column: None
    personalize: false
    prompt_column:
    - prompt
    prompt_column_separator: ''
    system_column: None
    text_answer_separator: ''
    text_prompt_start: ''
    text_system_start: <|system|>
    train_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_train.csv
    validation_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_eval.csv
    validation_size: 0.01
    validation_strategy: custom
environment:
    compile_model: false
    deepspeed_allgather_bucket_size: 1000000
    deepspeed_method: ZeRO3
    deepspeed_reduce_bucket_size: 1000000
    deepspeed_stage3_param_persistence_threshold: 1000000
    deepspeed_stage3_prefetch_bucket_size: 1000000
    find_unused_parameters: false
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    huggingface_branch: main
    mixed_precision: false
    mixed_precision_dtype: bfloat16
    number_of_workers: 8
    seed: 2
    trust_remote_code: true
    use_deepspeed: true
experiment_name: heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
    logger: Neptune
    neptune_project: tmostak/heavyiq
"cfg.yaml" 119L, 3565B 1,1 Top
architecture:
backbone_dtype: bfloat16
gradient_checkpointing: true
intermediate_dropout: 0.0
pretrained: true
pretrained_weights: ''
augmentation:
neftune_noise_alpha: 0.0
random_parent_probability: 0.0
skip_parent_probability: 0.0
token_mask_probability: 0.0
dataset:
add_bos_token_to_answer: false
add_bos_token_to_prompt: false
add_bos_token_to_system: false
add_eos_token_to_answer: true
add_eos_token_to_prompt: false
add_eos_token_to_system: true
answer_column: answer
chatbot_author: H2O.ai
chatbot_name: h2oGPT
data_sample: 1.0
data_sample_choice:
- Train
- Validation
limit_chained_samples: false
mask_prompt_labels: true
parent_id_column: None
personalize: false
prompt_column:
- prompt
prompt_column_separator: ''
system_column: None
text_answer_separator: ''
text_prompt_start: ''
text_system_start: <|system|>
train_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_train.csv
validation_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_eval.csv
validation_size: 0.01
validation_strategy: custom
environment:
compile_model: false
deepspeed_allgather_bucket_size: 1000000
deepspeed_method: ZeRO3
deepspeed_reduce_bucket_size: 1000000
deepspeed_stage3_param_persistence_threshold: 1000000
deepspeed_stage3_prefetch_bucket_size: 1000000
find_unused_parameters: false
gpus:
- '0'
- '1'
- '2'
- '3'
- '4'
- '5'
- '6'
- '7'
huggingface_branch: main
mixed_precision: false
mixed_precision_dtype: bfloat16
number_of_workers: 8
seed: 2
trust_remote_code: true
use_deepspeed: true
experiment_name: heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
logger: Neptune
neptune_project: tmostak/heavyiq
"cfg.yaml" 119L, 3565B 1,1 Top

LLM Studio version

a1b2923 (tip of main)

@tmostak tmostak added the type/bug Bug in code label Jul 24, 2024

tmostak commented Jul 25, 2024

Ok so I did a bit more investigation, in particular logging max_memory and device_map in load_cfg_model_tokenizer in chat.py:

    logger.info("Before load checkpoint")
    with torch.device(cfg.environment._device):
        model = cfg.architecture.model_class(cfg)
        cfg.architecture.pretrained_weights = os.path.join(
            experiment_path, "checkpoint.pth"
        )
        load_checkpoint(cfg, model, strict=False)
    logger.info("After load checkpoint")

    if device == "cpu_shard":
        max_memory = get_balanced_memory(
            model,
        )
        logger.info("Max Memory: ")
        logger.info(max_memory)
        device_map = infer_auto_device_map(model, max_memory=max_memory)
        logger.info("Device Map: ")
        logger.info(device_map)
        model = dispatch_model(
            model,
            device_map=device_map,
        )
2024-07-25 13:19:52,479 - INFO: Device: cpu
2024-07-25 13:19:53,310 - INFO: Stop token ids: [tensor([  27,   91, 9125,   91,   29])]
2024-07-25 13:19:53,621 - INFO: Before load checkpoint
2024-07-25 13:19:54,354 - INFO: Stop token ids: [tensor([  27,   91, 9125,   91,   29])]
2024-07-25 13:19:54,367 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id 128001.
2024-07-25 13:19:54,367 - INFO: Setting pretraining_tp of model config to 1.
2024-07-25 13:19:54,389 - INFO: Using bfloat16 for backbone
2024-07-25 14:01:00,580 - INFO: Attention implementation: sdpa
2024-07-25 14:01:00,589 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-07-25 14:02:59,663 - INFO: Trainable parameters count: 6627000320
2024-07-25 14:02:59,663 - INFO: Total parameters count: 77180706816
2024-07-25 14:02:59,663 - INFO: Trainable %: 8.5863%
2024-07-25 14:05:28,328 - INFO: Weights loaded from: /home/ubuntu/h2o-llmstudio/output/user/heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1/checkpoint.pth
2024-07-25 14:05:28,328 - INFO: After load checkpoint
2024-07-25 14:05:30,090 - INFO: Max Memory:
2024-07-25 14:05:30,090 - INFO: {0: 19395466089, 1: 19395466089, 2: 19395466089, 3: 19395466089, 4: 19395466089, 5: 19395466089, 6: 19395466089, 7: 84537507840, 'cpu': 1717507887104}
2024-07-25 14:05:30,536 - INFO: Device Map:
2024-07-25 14:05:30,536 - INFO: OrderedDict([('backbone.base_model.model.model.embed_tokens', 0), ('backbone.base_model.model.model.layers.0', 0), ('backbone.base_model.model.model.layers.1', 0), ('backbone.base_model.model.model.layers.2', 0), ('backbone.base_model.model.model.layers.3', 0), ('backbone.base_model.model.model.layers.4', 0), ('backbone.base_model.model.model.layers.5', 0), ('backbone.base_model.model.model.layers.6', 0), ('backbone.base_model.model.model.layers.7', 0), ('backbone.base_model.model.model.layers.8.self_attn.q_proj', 0), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.base_layer', 0), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_dropout', 0), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_A', 0), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_B.default', 1), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_embedding_A', 1), ('backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_embedding_B', 1), ('backbone.base_model.model.model.layers.8.self_attn.v_proj', 1), ('backbone.base_model.model.model.layers.8.self_attn.o_proj', 1), ('backbone.base_model.model.model.layers.8.self_attn.rotary_emb', 1), ('backbone.base_model.model.model.layers.8.mlp', 1), ('backbone.base_model.model.model.layers.8.input_layernorm', 1), ('backbone.base_model.model.model.layers.8.post_attention_layernorm', 1), ('backbone.base_model.model.model.layers.9', 1), ('backbone.base_model.model.model.layers.10', 1), ('backbone.base_model.model.model.layers.11', 1), ('backbone.base_model.model.model.layers.12', 1), ('backbone.base_model.model.model.layers.13', 1), ('backbone.base_model.model.model.layers.14', 1), ('backbone.base_model.model.model.layers.15', 1), ('backbone.base_model.model.model.layers.16', 1), ('backbone.base_model.model.model.layers.17', 1), ('backbone.base_model.model.model.layers.18.self_attn', 1), ('backbone.base_model.model.model.layers.18.input_layernorm', 2), ('backbone.base_model.model.model.layers.18.post_attention_layernorm', 2), ('backbone.base_model.model.model.layers.19', 2), ('backbone.base_model.model.model.layers.20', 2), ('backbone.base_model.model.model.layers.21', 2), ('backbone.base_model.model.model.layers.22', 2), ('backbone.base_model.model.model.layers.23', 2), ('backbone.base_model.model.model.layers.24', 2), ('backbone.base_model.model.model.layers.25', 2), ('backbone.base_model.model.model.layers.26', 2), ('backbone.base_model.model.model.layers.27', 2), ('backbone.base_model.model.model.layers.28.self_attn', 2), ('backbone.base_model.model.model.layers.28.mlp.gate_proj', 2), ('backbone.base_model.model.model.layers.28.mlp.down_proj', 3), ('backbone.base_model.model.model.layers.28.mlp.act_fn', 3), ('backbone.base_model.model.model.layers.28.input_layernorm', 3), ('backbone.base_model.model.model.layers.28.post_attention_layernorm', 3), ('backbone.base_model.model.model.layers.29', 3), ('backbone.base_model.model.model.layers.30', 3), ('backbone.base_model.model.model.layers.31', 3), ('backbone.base_model.model.model.layers.32', 3), ('backbone.base_model.model.model.layers.33', 3), ('backbone.base_model.model.model.layers.34', 3), ('backbone.base_model.model.model.layers.35', 3), ('backbone.base_model.model.model.layers.36', 3), ('backbone.base_model.model.model.layers.37', 3), ('backbone.base_model.model.model.layers.38.self_attn', 3), ('backbone.base_model.model.model.layers.38.mlp.gate_proj', 3), ('backbone.base_model.model.model.layers.38.mlp.up_proj', 3), 
('backbone.base_model.model.model.layers.38.mlp.act_fn', 4), ('backbone.base_model.model.model.layers.38.input_layernorm', 4), ('backbone.base_model.model.model.layers.38.post_attention_layernorm', 4), ('backbone.base_model.model.model.layers.39', 4), ('backbone.base_model.model.model.layers.40', 4), ('backbone.base_model.model.model.layers.41', 4), ('backbone.base_model.model.model.layers.42', 4), ('backbone.base_model.model.model.layers.43', 4), ('backbone.base_model.model.model.layers.44', 4), ('backbone.base_model.model.model.layers.45', 4), ('backbone.base_model.model.model.layers.46', 4), ('backbone.base_model.model.model.layers.47', 4), ('backbone.base_model.model.model.layers.48', 4), ('backbone.base_model.model.model.layers.50', 5), ('backbone.base_model.model.model.layers.51', 5), ('backbone.base_model.model.model.layers.52', 5), ('backbone.base_model.model.model.layers.53', 5), ('backbone.base_model.model.model.layers.54', 5), ('backbone.base_model.model.model.layers.55', 5), ('backbone.base_model.model.model.layers.56', 5), ('backbone.base_model.model.model.layers.57', 5), ('backbone.base_model.model.model.layers.58', 5), ('backbone.base_model.model.model.layers.59.self_attn', 5), ('backbone.base_model.model.model.layers.59.input_layernorm', 6), ('backbone.base_model.model.model.layers.59.post_attention_layernorm', 6), ('backbone.base_model.model.model.layers.60', 6), ('backbone.base_model.model.model.layers.61', 6), ('backbone.base_model.model.model.layers.62', 6), ('backbone.base_model.model.model.layers.63', 6), ('backbone.base_model.model.model.layers.64', 6), ('backbone.base_model.model.model.layers.65', 6), ('backbone.base_model.model.model.layers.66', 6), ('backbone.base_model.model.model.layers.67', 6), ('backbone.base_model.model.model.layers.68', 6), ('backbone.base_model.model.model.layers.69.self_attn', 6), ('backbone.base_model.model.model.layers.69.mlp.gate_proj', 6), ('backbone.base_model.model.model.layers.69.mlp.down_proj', 7), ('backbone.base_model.model.model.layers.69.mlp.act_fn', 7), ('backbone.base_model.model.model.layers.69.input_layernorm', 7), ('backbone.base_model.model.model.layers.69.post_attention_layernorm', 7), ('backbone.base_model.model.model.layers.70', 7), ('backbone.base_model.model.model.layers.71', 7), ('backbone.base_model.model.model.layers.72', 7), ('backbone.base_model.model.model.layers.73', 7), ('backbone.base_model.model.model.layers.74', 7), ('backbone.base_model.model.model.layers.75', 7), ('backbone.base_model.model.model.layers.76', 7), ('backbone.base_model.model.model.layers.77', 7), ('backbone.base_model.model.model.layers.78', 7), ('backbone.base_model.model.model.layers.79', 7), ('backbone.base_model.model.model.norm', 7), ('backbone.base_model.model.model.rotary_emb', 7), ('backbone.base_model.model.lm_head', 7), ('loss_fn', 7), ('perplexity', 7), ('backbone.base_model.model.model.layers.18.mlp', 2), ('backbone.base_model.model.model.layers.38.mlp.down_proj', 4), ('backbone.base_model.model.model.layers.49', 5), ('backbone.base_model.model.model.layers.28.mlp.up_proj', 3), ('backbone.base_model.model.model.layers.59.mlp', 6), ('backbone.base_model.model.model.layers.69.mlp.up_proj', 7)])

2024-07-25 14:07:02,059 - INFO: Merging LORA layers with base model.
2024-07-25 14:07:02,263 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2015, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 216, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/chat.py", line 249, in load_cfg_model_tokenizer
    model.backbone = model.backbone.merge_and_unload()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 838, in merge_and_unload
    return self._unload_and_optionally_merge(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 457, in _unload_and_optionally_merge
    target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 470, in merge
    delta_weight = self.get_delta_weight(active_adapter)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 533, in get_delta_weight
    output_tensor = transpose(weight_B @ weight_A, self.fan_in_fan_out) * self.scaling[adapter]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

What seems off is that max_memory allots 19.3GB to each GPU except GPU 7, which gets 84.5GB.

I wonder if this then messes up the peft merge_and_unload logic and causes the tensor-device assignment to go wrong, leading to the error?
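
If the split device map is indeed the culprit, one possible workaround (a sketch only, not verified) would be to tell accelerate not to split decoder layers when building the map, so a given layer's lora_A and lora_B can never land on different GPUs:

from accelerate import dispatch_model
from accelerate.utils import get_balanced_memory, infer_auto_device_map

# Assumptions: `model` is the loaded LLM Studio model as in load_cfg_model_tokenizer,
# and "LlamaDecoderLayer" is the decoder block class name for Meta-Llama-3.1-70B.
no_split = ["LlamaDecoderLayer"]
max_memory = get_balanced_memory(model, no_split_module_classes=no_split)
device_map = infer_auto_device_map(
    model, max_memory=max_memory, no_split_module_classes=no_split
)
model = dispatch_model(model, device_map=device_map)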

Also note my nvidia-smi output taken at the beginning of merge_and_unload(), in particular that almost no GPU memory is in use on GPU 7 (although I'm not sure whether this is just an artifact of the GPUs being loaded up sequentially).

(base) ubuntu@149-130-217-69:~/h2o-llmstudio/output/user/heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1$ nvidia-smi
Thu Jul 25 14:06:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   36C    P0              72W / 400W |  17081MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   35C    P0              71W / 400W |  18695MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   34C    P0              72W / 400W |  19005MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0              69W / 400W |  19005MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   34C    P0              71W / 400W |  19005MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   33C    P0              71W / 400W |  18859MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   34C    P0              72W / 400W |  11725MiB / 81920MiB |      6%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0              71W / 400W |    427MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    17068MiB |
|    1   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    18682MiB |
|    2   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    18992MiB |
|    3   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    18992MiB |
|    4   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    18992MiB |
|    5   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    18846MiB |
|    6   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3    11712MiB |
|    7   N/A  N/A    959317      C   ...s/h2o_llm_studio_jul_24/bin/python3      414MiB |
+---------------------------------------------------------------------------------------+

Thoughts or suggestions?

@tmostak tmostak closed this as completed Jul 25, 2024

tmostak commented Jul 25, 2024

One thing I found that might be related:

https://discuss.huggingface.co/t/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least-two-devices-cuda-2-and-cuda-0-when-checking-argument-for-argument-index-in-method-wrapper-cuda-index-select/48991/4

They basically hit the same issue, and someone noted: "i have a similar error on the other model (minicpm), i change the version of deepspeed from 0.14.0 to 0.13.2. and it works".

Going to try downgrading Deepspeed to see if this helps.

@tmostak tmostak reopened this Jul 25, 2024

tmostak commented Jul 25, 2024

I tried again with deepspeed 0.13.2 and hit the same issue:

2024-07-25 15:14:39,542 - INFO: Weights loaded from: /home/ubuntu/h2o-llmstudio/output/user/heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1/checkpoint.pth
2024-07-25 15:16:00,567 - INFO: Merging LORA layers with base model.
2024-07-25 15:16:00,771 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2012, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 216, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/chat.py", line 241, in load_cfg_model_tokenizer
    model.backbone = model.backbone.merge_and_unload()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 838, in merge_and_unload
    return self._unload_and_optionally_merge(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 457, in _unload_and_optionally_merge
    target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 470, in merge
    delta_weight = self.get_delta_weight(active_adapter)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 533, in get_delta_weight
    output_tensor = transpose(weight_B @ weight_A, self.fan_in_fan_out) * self.scaling[adapter]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
2024-07-25 15:16:00,773 - INFO: {'home/gpu_stats', 'experiment/display/footer', 'experiment/display/charts/train_loss', 'experiment/display/tab', 'experiment/display/charts/validation_Perplexity', 'experiment/display/charts/validation_loss', 'home/experiments_stats', 'dataset/list', 'experiment/list', 'init_app', 'home/disk_usage', 'home/compute_stats', 'dataset/display/footer', 'experiment/display/charts/meta_lr'}

@pascal-pfeiffer
Collaborator

Thank you for the details. We recently upgraded deepspeed, so this could indeed be an issue caused by that. I'll look into it.


tmostak commented Jul 25, 2024

@pascal-pfeiffer I wrote a quick Python script to write out the layer names per GPU (a sketch of that kind of script follows the listing below), and it seems the issue might be how the LoRA layers for layer 8 are split between GPU 0 and GPU 1. Also, why are there only LoRA layers for layer 8 and not for the other layers?

GPU 0:
  backbone.base_model.model.model.embed_tokens
  backbone.base_model.model.model.layers.0
  backbone.base_model.model.model.layers.1
  backbone.base_model.model.model.layers.2
  backbone.base_model.model.model.layers.3
  backbone.base_model.model.model.layers.4
  backbone.base_model.model.model.layers.5
  backbone.base_model.model.model.layers.6
  backbone.base_model.model.model.layers.7
  backbone.base_model.model.model.layers.8.self_attn.k_proj.base_layer
  backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_A
  backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_dropout
  backbone.base_model.model.model.layers.8.self_attn.q_proj
GPU 1:
  backbone.base_model.model.model.layers.10
  backbone.base_model.model.model.layers.11
  backbone.base_model.model.model.layers.12
  backbone.base_model.model.model.layers.13
  backbone.base_model.model.model.layers.14
  backbone.base_model.model.model.layers.15
  backbone.base_model.model.model.layers.16
  backbone.base_model.model.model.layers.17
  backbone.base_model.model.model.layers.18.self_attn
  backbone.base_model.model.model.layers.8.input_layernorm
  backbone.base_model.model.model.layers.8.mlp
  backbone.base_model.model.model.layers.8.post_attention_layernorm
  backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_B.default
  backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_embedding_A
  backbone.base_model.model.model.layers.8.self_attn.k_proj.lora_embedding_B
  backbone.base_model.model.model.layers.8.self_attn.o_proj
  backbone.base_model.model.model.layers.8.self_attn.rotary_emb
  backbone.base_model.model.model.layers.8.self_attn.v_proj
  backbone.base_model.model.model.layers.9
GPU 2:
  backbone.base_model.model.model.layers.18.input_layernorm
  backbone.base_model.model.model.layers.18.mlp
  backbone.base_model.model.model.layers.18.post_attention_layernorm
  backbone.base_model.model.model.layers.19
  backbone.base_model.model.model.layers.20
  backbone.base_model.model.model.layers.21
  backbone.base_model.model.model.layers.22
  backbone.base_model.model.model.layers.23
  backbone.base_model.model.model.layers.24
  backbone.base_model.model.model.layers.25
  backbone.base_model.model.model.layers.26
  backbone.base_model.model.model.layers.27
  backbone.base_model.model.model.layers.28.mlp.gate_proj
  backbone.base_model.model.model.layers.28.self_attn
GPU 3:
  backbone.base_model.model.model.layers.28.input_layernorm
  backbone.base_model.model.model.layers.28.mlp.act_fn
  backbone.base_model.model.model.layers.28.mlp.down_proj
  backbone.base_model.model.model.layers.28.mlp.up_proj
  backbone.base_model.model.model.layers.28.post_attention_layernorm
  backbone.base_model.model.model.layers.29
  backbone.base_model.model.model.layers.30
  backbone.base_model.model.model.layers.31
  backbone.base_model.model.model.layers.32
  backbone.base_model.model.model.layers.33
  backbone.base_model.model.model.layers.34
  backbone.base_model.model.model.layers.35
  backbone.base_model.model.model.layers.36
  backbone.base_model.model.model.layers.37
  backbone.base_model.model.model.layers.38.mlp.gate_proj
  backbone.base_model.model.model.layers.38.mlp.up_proj
  backbone.base_model.model.model.layers.38.self_attn
GPU 4:
  backbone.base_model.model.model.layers.38.input_layernorm
  backbone.base_model.model.model.layers.38.mlp.act_fn
  backbone.base_model.model.model.layers.38.mlp.down_proj
  backbone.base_model.model.model.layers.38.post_attention_layernorm
  backbone.base_model.model.model.layers.39
  backbone.base_model.model.model.layers.40
  backbone.base_model.model.model.layers.41
  backbone.base_model.model.model.layers.42
  backbone.base_model.model.model.layers.43
  backbone.base_model.model.model.layers.44
  backbone.base_model.model.model.layers.45
  backbone.base_model.model.model.layers.46
  backbone.base_model.model.model.layers.47
  backbone.base_model.model.model.layers.48
GPU 5:
  backbone.base_model.model.model.layers.49
  backbone.base_model.model.model.layers.50
  backbone.base_model.model.model.layers.51
  backbone.base_model.model.model.layers.52
  backbone.base_model.model.model.layers.53
  backbone.base_model.model.model.layers.54
  backbone.base_model.model.model.layers.55
  backbone.base_model.model.model.layers.56
  backbone.base_model.model.model.layers.57
  backbone.base_model.model.model.layers.58
  backbone.base_model.model.model.layers.59.self_attn
GPU 6:
  backbone.base_model.model.model.layers.59.input_layernorm
  backbone.base_model.model.model.layers.59.mlp
  backbone.base_model.model.model.layers.59.post_attention_layernorm
  backbone.base_model.model.model.layers.60
  backbone.base_model.model.model.layers.61
  backbone.base_model.model.model.layers.62
  backbone.base_model.model.model.layers.63
  backbone.base_model.model.model.layers.64
  backbone.base_model.model.model.layers.65
  backbone.base_model.model.model.layers.66
  backbone.base_model.model.model.layers.67
  backbone.base_model.model.model.layers.68
  backbone.base_model.model.model.layers.69.mlp.gate_proj
  backbone.base_model.model.model.layers.69.self_attn
GPU 7:
  backbone.base_model.model.lm_head
  backbone.base_model.model.model.layers.69.input_layernorm
  backbone.base_model.model.model.layers.69.mlp.act_fn
  backbone.base_model.model.model.layers.69.mlp.down_proj
  backbone.base_model.model.model.layers.69.mlp.up_proj
  backbone.base_model.model.model.layers.69.post_attention_layernorm
  backbone.base_model.model.model.layers.70
  backbone.base_model.model.model.layers.71
  backbone.base_model.model.model.layers.72
  backbone.base_model.model.model.layers.73
  backbone.base_model.model.model.layers.74
  backbone.base_model.model.model.layers.75
  backbone.base_model.model.model.layers.76
  backbone.base_model.model.model.layers.77
  backbone.base_model.model.model.layers.78
  backbone.base_model.model.model.layers.79
  backbone.base_model.model.model.norm
  backbone.base_model.model.model.rotary_emb
  loss_fn
  perplexity
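
For reference, a grouping like the one above can be produced from the device_map logged earlier with something along these lines (a sketch; device_map is the OrderedDict returned by infer_auto_device_map):

from collections import defaultdict

def print_modules_per_device(device_map):
    # Group module names by the device they were assigned to.
    per_device = defaultdict(list)
    for module_name, device in device_map.items():
        per_device[device].append(module_name)
    for device in sorted(per_device, key=str):
        label = f"GPU {device}" if isinstance(device, int) else str(device)
        print(f"{label}:")
        for module_name in sorted(per_device[device]):
            print(f"  {module_name}")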


tmostak commented Jul 25, 2024

Another update: I tried to upload to Hugging Face the Llama 3 (not 3.1) model that I had previously uploaded successfully, and got the same "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)" error as with 3.1, suggesting that the requirements upgrade is what changed things?


tmostak commented Jul 26, 2024

Ok so I rolled back to b4d04c057c7b0a4894d57264d2df7e219e234db2 (fix prompt separator) and re-installed the deps (although I had to use transformers 4.43.2 to accommodate the Llama 3.1 RoPE change, while keeping deepspeed 0.13.2), and got the same error when exporting.

I then wrote a script to dump the layer names from the model (a sketch of such a dump follows the listing); note how all the LoRA layers are there, unlike in the device_map output above.

Just showing the first 4 layers but you get the idea

model.embed_tokens.weight
model.layers.0.self_attn.q_proj.base_layer.weight
model.layers.0.self_attn.q_proj.lora_A.default.weight
model.layers.0.self_attn.q_proj.lora_B.default.weight
model.layers.0.self_attn.k_proj.base_layer.weight
model.layers.0.self_attn.k_proj.lora_A.default.weight
model.layers.0.self_attn.k_proj.lora_B.default.weight
model.layers.0.self_attn.v_proj.base_layer.weight
model.layers.0.self_attn.v_proj.lora_A.default.weight
model.layers.0.self_attn.v_proj.lora_B.default.weight
model.layers.0.self_attn.o_proj.base_layer.weight
model.layers.0.self_attn.o_proj.lora_A.default.weight
model.layers.0.self_attn.o_proj.lora_B.default.weight
model.layers.0.mlp.gate_proj.base_layer.weight
model.layers.0.mlp.gate_proj.lora_A.default.weight
model.layers.0.mlp.gate_proj.lora_B.default.weight
model.layers.0.mlp.up_proj.base_layer.weight
model.layers.0.mlp.up_proj.lora_A.default.weight
model.layers.0.mlp.up_proj.lora_B.default.weight
model.layers.0.mlp.down_proj.base_layer.weight
model.layers.0.mlp.down_proj.lora_A.default.weight
model.layers.0.mlp.down_proj.lora_B.default.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
model.layers.1.self_attn.q_proj.base_layer.weight
model.layers.1.self_attn.q_proj.lora_A.default.weight
model.layers.1.self_attn.q_proj.lora_B.default.weight
model.layers.1.self_attn.k_proj.base_layer.weight
model.layers.1.self_attn.k_proj.lora_A.default.weight
model.layers.1.self_attn.k_proj.lora_B.default.weight
model.layers.1.self_attn.v_proj.base_layer.weight
model.layers.1.self_attn.v_proj.lora_A.default.weight
model.layers.1.self_attn.v_proj.lora_B.default.weight
model.layers.1.self_attn.o_proj.base_layer.weight
model.layers.1.self_attn.o_proj.lora_A.default.weight
model.layers.1.self_attn.o_proj.lora_B.default.weight
model.layers.1.mlp.gate_proj.base_layer.weight
model.layers.1.mlp.gate_proj.lora_A.default.weight
model.layers.1.mlp.gate_proj.lora_B.default.weight
model.layers.1.mlp.up_proj.base_layer.weight
model.layers.1.mlp.up_proj.lora_A.default.weight
model.layers.1.mlp.up_proj.lora_B.default.weight
model.layers.1.mlp.down_proj.base_layer.weight
model.layers.1.mlp.down_proj.lora_A.default.weight
model.layers.1.mlp.down_proj.lora_B.default.weight
model.layers.1.input_layernorm.weight
model.layers.1.post_attention_layernorm.weight
model.layers.2.self_attn.q_proj.base_layer.weight
model.layers.2.self_attn.q_proj.lora_A.default.weight
model.layers.2.self_attn.q_proj.lora_B.default.weight
model.layers.2.self_attn.k_proj.base_layer.weight
model.layers.2.self_attn.k_proj.lora_A.default.weight
model.layers.2.self_attn.k_proj.lora_B.default.weight
model.layers.2.self_attn.v_proj.base_layer.weight
model.layers.2.self_attn.v_proj.lora_A.default.weight
model.layers.2.self_attn.v_proj.lora_B.default.weight
model.layers.2.self_attn.o_proj.base_layer.weight
model.layers.2.self_attn.o_proj.lora_A.default.weight
model.layers.2.self_attn.o_proj.lora_B.default.weight
model.layers.2.mlp.gate_proj.base_layer.weight
model.layers.2.mlp.gate_proj.lora_A.default.weight
model.layers.2.mlp.gate_proj.lora_B.default.weight
model.layers.2.mlp.up_proj.base_layer.weight
model.layers.2.mlp.up_proj.lora_A.default.weight
model.layers.2.mlp.up_proj.lora_B.default.weight
model.layers.2.mlp.down_proj.base_layer.weight
model.layers.2.mlp.down_proj.lora_A.default.weight
model.layers.2.mlp.down_proj.lora_B.default.weight
model.layers.2.input_layernorm.weight
model.layers.2.post_attention_layernorm.weight
model.layers.3.self_attn.q_proj.base_layer.weight
model.layers.3.self_attn.q_proj.lora_A.default.weight
model.layers.3.self_attn.q_proj.lora_B.default.weight
model.layers.3.self_attn.k_proj.base_layer.weight
model.layers.3.self_attn.k_proj.lora_A.default.weight
model.layers.3.self_attn.k_proj.lora_B.default.weight
model.layers.3.self_attn.v_proj.base_layer.weight
model.layers.3.self_attn.v_proj.lora_A.default.weight
model.layers.3.self_attn.v_proj.lora_B.default.weight
model.layers.3.self_attn.o_proj.base_layer.weight
model.layers.3.self_attn.o_proj.lora_A.default.weight
model.layers.3.self_attn.o_proj.lora_B.default.weight
model.layers.3.mlp.gate_proj.base_layer.weight
model.layers.3.mlp.gate_proj.lora_A.default.weight
model.layers.3.mlp.gate_proj.lora_B.default.weight
model.layers.3.mlp.up_proj.base_layer.weight
model.layers.3.mlp.up_proj.lora_A.default.weight
model.layers.3.mlp.up_proj.lora_B.default.weight
model.layers.3.mlp.down_proj.base_layer.weight
model.layers.3.mlp.down_proj.lora_A.default.weight
model.layers.3.mlp.down_proj.lora_B.default.weight
model.layers.3.input_layernorm.weight
model.layers.3.post_attention_layernorm.weight
model.layers.4.self_attn.q_proj.base_layer.weight
model.layers.4.self_attn.q_proj.lora_A.default.weight
model.layers.4.self_attn.q_proj.lora_B.default.weight
model.layers.4.self_attn.k_proj.base_layer.weight
model.layers.4.self_attn.k_proj.lora_A.default.weight
model.layers.4.self_attn.k_proj.lora_B.default.weight
model.layers.4.self_attn.v_proj.base_layer.weight
model.layers.4.self_attn.v_proj.lora_A.default.weight
model.layers.4.self_attn.v_proj.lora_B.default.weight
model.layers.4.self_attn.o_proj.base_layer.weight
model.layers.4.self_attn.o_proj.lora_A.default.weight
model.layers.4.self_attn.o_proj.lora_B.default.weight
model.layers.4.mlp.gate_proj.base_layer.weight
model.layers.4.mlp.gate_proj.lora_A.default.weight
model.layers.4.mlp.gate_proj.lora_B.default.weight
model.layers.4.mlp.up_proj.base_layer.weight
model.layers.4.mlp.up_proj.lora_A.default.weight
model.layers.4.mlp.up_proj.lora_B.default.weight
model.layers.4.mlp.down_proj.base_layer.weight
model.layers.4.mlp.down_proj.lora_A.default.weight
model.layers.4.mlp.down_proj.lora_B.default.weight
model.layers.4.input_layernorm.weight
model.layers.4.post_attention_layernorm.weight
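
A dump like the one above can be produced by simply iterating over the parameter names in the checkpoint; something like this would do it (a sketch, assuming checkpoint.pth is a plain state dict or wraps one under a "model" key):

import torch

# List parameter names from the experiment checkpoint.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
for name in state_dict:
    print(name)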

@pascal-pfeiffer pascal-pfeiffer self-assigned this Jul 26, 2024
@pascal-pfeiffer
Collaborator

Thank you for all the further investigations @tmostak. I am trying to reproduce the issue starting with default parameters, mostly aligning with the ones you used, and the default dataset.
Using the cfg below, I ran a successful training experiment and upload to the Hugging Face Hub.

Everything ran on commit 87c2978, so basically what we have in the v1.9.0 release.

Could you by chance upload a reproducible config using the default dataset where you are facing the issue? Your config above, for example, doesn't include the LoRA settings.

architecture:
    backbone_dtype: bfloat16
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    neftune_noise_alpha: 0.0
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    add_eos_token_to_system: true
    answer_column: output
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 0.2
    data_sample_choice:
    - Train
    limit_chained_samples: false
    mask_prompt_labels: true
    only_last_answer: false
    parent_id_column: None
    personalize: false
    prompt_column:
    - instruction
    prompt_column_separator: \n\n
    system_column: None
    text_answer_separator: <|answer|>
    text_prompt_start: <|prompt|>
    text_system_start: <|system|>
    train_dataframe: /home/pascal/h2o-llmstudio/data/user/oasst/train_full.pq
    validation_dataframe: None
    validation_size: 0.01
    validation_strategy: automatic
environment:
    compile_model: false
    deepspeed_allgather_bucket_size: 1000000
    deepspeed_method: ZeRO3
    deepspeed_reduce_bucket_size: 1000000
    deepspeed_stage3_param_persistence_threshold: 1000000
    deepspeed_stage3_prefetch_bucket_size: 1000000
    find_unused_parameters: false
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    huggingface_branch: main
    mixed_precision: false
    mixed_precision_dtype: bfloat16
    number_of_workers: 8
    seed: -1
    trust_remote_code: true
    use_deepspeed: true
experiment_name: ruby-walrus
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
    logger: None
    neptune_project: ''
output_directory: /home/pascal/h2o-llmstudio/output/user/ruby-walrus/
prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 256
    max_time: 0.0
    metric: Perplexity
    metric_gpt_model: gpt-3.5-turbo-0301
    metric_gpt_template: general
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.0
    stop_tokens: ''
    temperature: 0.0
    top_k: 0
    top_p: 1.0
problem_type: text_causal_language_modeling
tokenizer:
    add_prompt_answer_tokens: false
    max_length: 8096
    padding_quantile: 1.0
    tokenizer_kwargs: '{"use_fast": true, "add_prefix_space": false}'
training:
    attention_implementation: auto
    batch_size: 2
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    freeze_layers: []
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    lora_unfreeze_layers: []
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_checkpoint: last
    schedule: Cosine
    train_validation_data: false
    use_dora: false
    warmup_epochs: 0.0
    weight_decay: 0.0

Training

[2024-07-26 08:48:22,582] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-07-26 08:48:22,981] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2024-07-26 08:48:22,982] [INFO] [utils.py:782:see_memory_usage] MA 16.52 GB         Max_MA 24.61 GB         CA 25.61 GB         Max_CA 26 GB
[2024-07-26 08:48:22,983] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 44.94 GB, percent = 2.2%
[2024-07-26 08:48:23,012] [INFO] [stage3.py:130:__init__] Reduce bucket size 1000000
[2024-07-26 08:48:23,012] [INFO] [stage3.py:131:__init__] Prefetch bucket size 1000000
[2024-07-26 08:48:23,289] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-07-26 08:48:23,290] [INFO] [utils.py:782:see_memory_usage] MA 16.52 GB         Max_MA 16.52 GB         CA 25.61 GB         Max_CA 26 GB
[2024-07-26 08:48:23,290] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 44.94 GB, percent = 2.2%
Parameter Offload: Total persistent parameters: 53092352 in 1281 params
[2024-07-26 08:48:24,352] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-07-26 08:48:24,353] [INFO] [utils.py:782:see_memory_usage] MA 16.44 GB         Max_MA 16.52 GB         CA 25.61 GB         Max_CA 26 GB
[2024-07-26 08:48:24,353] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.17 GB, percent = 2.2%
[2024-07-26 08:48:24,530] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2024-07-26 08:48:24,531] [INFO] [utils.py:782:see_memory_usage] MA 16.44 GB         Max_MA 16.44 GB         CA 25.61 GB         Max_CA 26 GB
[2024-07-26 08:48:24,531] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.18 GB, percent = 2.2%
[2024-07-26 08:48:25,043] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 1
[2024-07-26 08:48:25,044] [INFO] [utils.py:782:see_memory_usage] MA 16.44 GB         Max_MA 16.44 GB         CA 16.64 GB         Max_CA 26 GB
[2024-07-26 08:48:25,045] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.28 GB, percent = 2.2%
[2024-07-26 08:48:25,217] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2024-07-26 08:48:25,217] [INFO] [utils.py:782:see_memory_usage] MA 16.44 GB         Max_MA 16.44 GB         CA 16.64 GB         Max_CA 17 GB
[2024-07-26 08:48:25,218] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.28 GB, percent = 2.2%
[2024-07-26 08:48:25,420] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2024-07-26 08:48:25,420] [INFO] [utils.py:782:see_memory_usage] MA 16.46 GB         Max_MA 16.48 GB         CA 16.64 GB         Max_CA 17 GB
[2024-07-26 08:48:25,421] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.28 GB, percent = 2.2%
[2024-07-26 08:48:25,600] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-07-26 08:48:25,600] [INFO] [utils.py:782:see_memory_usage] MA 16.46 GB         Max_MA 16.46 GB         CA 16.64 GB         Max_CA 17 GB
[2024-07-26 08:48:25,601] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.28 GB, percent = 2.2%
[2024-07-26 08:48:25,770] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-07-26 08:48:25,771] [INFO] [utils.py:782:see_memory_usage] MA 16.46 GB         Max_MA 16.49 GB         CA 16.64 GB         Max_CA 17 GB
[2024-07-26 08:48:25,771] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.28 GB, percent = 2.2%
[2024-07-26 08:48:25,771] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2024-07-26 08:48:26,387] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-07-26 08:48:26,388] [INFO] [utils.py:782:see_memory_usage] MA 16.48 GB         Max_MA 16.48 GB         CA 16.64 GB         Max_CA 17 GB
[2024-07-26 08:48:26,388] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 45.3 GB, percent = 2.2%
[...]
[2024-07-26 08:48:26,397] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2024-07-26 08:48:26,397] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. False
[2024-07-26 08:48:26,397] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2024-07-26 08:48:26,397] [INFO] [config.py:987:print_user_config]   json = {
    "fp16": {
        "enabled": false,
        "loss_scale_window": 100
    },
    "bf16": {
        "enabled": true,
        "loss_scale_window": 100
    },
    "zero_force_ds_cpu_optimizer": false,
    "zero_optimization": {
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 1.000000e+06,
        "stage": 3,
        "stage3_prefetch_bucket_size": 1.000000e+06,
        "stage3_param_persistence_threshold": 1.000000e+06,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": 2.000000e+03,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "wall_clock_breakdown": false
}
2024-07-26 08:48:26,466 - INFO: Evaluation step: 161
2024-07-26 08:48:26,553 - INFO: Evaluation step: 161
2024-07-26 08:48:26,557 - INFO: Evaluation step: 161
2024-07-26 08:48:26,582 - INFO: Evaluation step: 161
2024-07-26 08:48:26,590 - INFO: Evaluation step: 161
2024-07-26 08:48:26,615 - INFO: Evaluation step: 161
2024-07-26 08:48:26,637 - INFO: Evaluation step: 161
2024-07-26 08:48:26,675 - INFO: Training Epoch: 1 / 1
2024-07-26 08:48:26,675 - INFO: train loss:   0%|          | 0/161 [00:00<?, ?it/s]
2024-07-26 08:48:26,807 - INFO: Evaluation step: 161
2024-07-26 08:48:28,215 - INFO: Stop token ids: [tensor([  27,   91, 9399,   91,   29]), tensor([  27,   91, 9125,   91,   29]), tensor([   27,    91, 41681,    91,    29])]
2024-07-26 08:49:08,638 - INFO: train loss: 1.14:   5%|4         | 8/161 [00:41<13:22,  5.25s/it]
2024-07-26 08:49:23,998 - INFO: train loss: 1.14:   5%|4         | 8/161 [00:57<13:22,  5.25s/it]
2024-07-26 08:49:28,649 - INFO: train loss: 1.13:  10%|9         | 16/161 [01:01<08:46,  3.63s/it
[...]
2024-07-26 08:54:54,019 - INFO: train loss: 1.16:  84%|########4 | 136/161 [06:27<01:08,  2.75s/it]
2024-07-26 08:55:01,619 - INFO: train loss: 1.00:  89%|########9 | 144/161 [06:34<00:45,  2.69s/it]
2024-07-26 08:55:14,021 - INFO: train loss: 1.00:  89%|########9 | 144/161 [06:47<00:45,  2.69s/it]
2024-07-26 08:55:22,118 - INFO: train loss: 0.97:  94%|#########4| 152/161 [06:55<00:23,  2.65s/it]
2024-07-26 08:55:34,023 - INFO: train loss: 0.97:  94%|#########4| 152/161 [07:07<00:23,  2.65s/it]
2024-07-26 08:55:43,204 - INFO: train loss: 1.21:  99%|#########9| 160/161 [07:16<00:02,  2.65s/it]
2024-07-26 08:55:46,551 - INFO: Saving last model checkpoint to /home/pascal/h2o-llmstudio/output/user/ruby-walrus/
2024-07-26 08:55:54,024 - INFO: train loss: 1.15: 100%|##########| 161/161 [07:27<00:00,  2.65s/it]
[2024-07-26 08:56:53,065] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,065] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,066] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,065] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,066] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,066] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,066] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[2024-07-26 08:56:53,069] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step0 is about to be saved!
[2024-07-26 08:56:53,069] [INFO] [engine.py:3591:save_16bit_model] Saving model weights to /home/pascal/h2o-llmstudio/output/user/ruby-walrus/checkpoint.pth, tag: global_step0
[2024-07-26 08:56:53,070] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /home/pascal/h2o-llmstudio/output/user/ruby-walrus/checkpoint.pth...
[2024-07-26 08:59:22,348] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /home/pascal/h2o-llmstudio/output/user/ruby-walrus/checkpoint.pth.
[2024-07-26 08:59:22,349] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step0 is ready now!
[...]
2024-07-26 09:01:21,682 - INFO: Starting validation inference
2024-07-26 09:01:21,683 - INFO: validation progress:   0%|          | 0/9 [00:00<?, ?it/s]
2024-07-26 09:01:24,152 - INFO: validation progress:  11%|#1        | 1/9 [00:02<00:19,  2.47s/it]
2024-07-26 09:01:26,059 - INFO: validation progress:  22%|##2       | 2/9 [00:04<00:14,  2.14s/it]
2024-07-26 09:01:26,915 - INFO: validation progress:  33%|###3      | 3/9 [00:05<00:09,  1.55s/it]
2024-07-26 09:01:27,717 - INFO: validation progress:  44%|####4     | 4/9 [00:06<00:06,  1.26s/it]
2024-07-26 09:01:28,391 - INFO: validation progress:  56%|#####5    | 5/9 [00:06<00:04,  1.05s/it]
2024-07-26 09:01:29,159 - INFO: validation progress:  67%|######6   | 6/9 [00:07<00:02,  1.05it/s]
2024-07-26 09:01:29,803 - INFO: validation progress:  78%|#######7  | 7/9 [00:08<00:01,  1.17it/s]
2024-07-26 09:01:30,438 - INFO: validation progress:  89%|########8 | 8/9 [00:08<00:00,  1.28it/s]
2024-07-26 09:01:31,069 - INFO: validation progress: 100%|##########| 9/9 [00:09<00:00,  1.36it/s]
2024-07-26 09:01:31,077 - INFO: validation progress: 100%|##########| 9/9 [00:09<00:00,  1.04s/it]
2024-07-26 09:01:31,103 - INFO: Validation Perplexity: 18.37264
2024-07-26 09:01:31,103 - INFO: Mean validation loss: 1.09179
2024-07-26 09:01:34,473 - INFO: train loss: 1.15: 100%|##########| 161/161 [13:07<00:00,  4.89s/it]
[2024-07-26 09:01:38,110] [INFO] [launch.py:351:main] Process 1912825 exits successfully.
[2024-07-26 09:01:38,111] [INFO] [launch.py:351:main] Process 1912823 exits successfully.
[2024-07-26 09:01:38,111] [INFO] [launch.py:351:main] Process 1912821 exits successfully.
[2024-07-26 09:01:39,113] [INFO] [launch.py:351:main] Process 1912822 exits successfully.
[2024-07-26 09:01:39,113] [INFO] [launch.py:351:main] Process 1912824 exits successfully.
[2024-07-26 09:01:39,113] [INFO] [launch.py:351:main] Process 1912826 exits successfully.
[2024-07-26 09:01:39,114] [INFO] [launch.py:351:main] Process 1912827 exits successfully.
[2024-07-26 09:01:41,116] [INFO] [launch.py:351:main] Process 1912820 exits successfully.
[...]

Upload with cpu_shard

2024-07-26 09:07:20,750 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id 128001.
2024-07-26 09:07:20,750 - INFO: Setting pretraining_tp of model config to 1.
2024-07-26 09:07:20,778 - INFO: Using bfloat16 for backbone
2024-07-26 09:36:05,021 - INFO: Attention implementation: sdpa
2024-07-26 09:36:05,026 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-07-26 09:36:05,954 - INFO: Trainable parameters count: 51773440
2024-07-26 09:36:05,954 - INFO: Total parameters count: 70605479936
2024-07-26 09:36:05,955 - INFO: Trainable %: 0.0733%
2024-07-26 09:37:46,950 - INFO: Weights loaded from: /home/pascal/h2o-llmstudio/output/user/ruby-walrus/checkpoint.pth
2024-07-26 09:38:23,721 - INFO: Merging LORA layers with base model.
2024-07-26 09:38:24,035 - INFO: Enough space available for saving model weights.Required space: 138607.63MB, Available space: 3927968.00MB.
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/pascal/.cache/huggingface/token
Login successful
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 7.75k/7.75k [00:00<00:00, 22.1MB/s]
model-00014-of-00030.safetensors: 4.98GB [00:09, 522MB/s]
model-00025-of-00030.safetensors: 4.67GB [00:12, 382MB/s]
model-00024-of-00030.safetensors: 4.98GB [00:11, 449MB/s]
model-00006-of-00030.safetensors: 4.67GB [00:17, 273MB/s]
model-00029-of-00030.safetensors: 4.98GB [00:10, 478MB/s]
model-00023-of-00030.safetensors: 5.01GB [00:09, 534MB/s]
model-00020-of-00030.safetensors: 4.67GB [00:09, 509MB/s]
model-00026-of-00030.safetensors: 4.67GB [00:09, 497MB/s]
model-00013-of-00030.safetensors: 5.01GB [00:09, 511MB/s]
model-00027-of-00030.safetensors: 4.67GB [00:11, 392MB/s]
model-00009-of-00030.safetensors: 4.98GB [00:10, 467MB/s]                                                             | 10/30 [01:55<03:43, 11.17s/it]
model-00003-of-00030.safetensors: 5.01GB [00:09, 533MB/s]
model-00019-of-00030.safetensors: 4.98GB [00:09, 534MB/s]
model-00021-of-00030.safetensors: 4.67GB [00:09, 498MB/s]
model-00022-of-00030.safetensors: 4.67GB [00:09, 507MB/s]                                                                                             
model-00030-of-00030.safetensors: 2.11GB [00:04, 510MB/s]                                                                                             
model-00001-of-00030.safetensors: 4.59GB [00:09, 496MB/s]                                                                                             
model-00011-of-00030.safetensors: 4.67GB [00:09, 505MB/s]                                                                                             
model-00016-of-00030.safetensors: 4.67GB [00:09, 516MB/s]                                                                                             
model-00015-of-00030.safetensors: 4.67GB [00:09, 510MB/s]                                                                                             
model-00007-of-00030.safetensors: 4.67GB [00:10, 458MB/s]                                                                                             
model-00028-of-00030.safetensors: 5.01GB [00:09, 532MB/s]                                                                                             
model-00012-of-00030.safetensors: 4.67GB [00:09, 514MB/s]                                                                                             
model-00005-of-00030.safetensors: 4.67GB [00:09, 501MB/s]                                                                                             
model-00010-of-00030.safetensors: 4.67GB [00:11, 402MB/s]                                                                                             
model-00004-of-00030.safetensors: 4.98GB [00:09, 499MB/s]                                                                                             
model-00018-of-00030.safetensors: 5.01GB [00:09, 521MB/s]                                                                                             
model-00002-of-00030.safetensors: 4.67GB [00:09, 492MB/s]                                                                                             
model-00017-of-00030.safetensors: 4.67GB [00:12, 385MB/s]                                                                                             
model-00008-of-00030.safetensors: 5.01GB [00:10, 499MB/s]                                                                                             
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [06:50<00:00, 13.68s/it]

Memory allocation on the GPUs (yes, this indeed isn't freed but that is another issue #736)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   33C    P0            143W /  700W |   17520MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   34C    P0            133W /  700W |   18969MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   35C    P0            133W /  700W |   18393MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   31C    P0            126W /  700W |   18841MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:84:00.0 Off |                    0 |
| N/A   32C    P0            126W /  700W |   18841MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0            131W /  700W |   18969MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:91:00.0 Off |                    0 |
| N/A   35C    P0            134W /  700W |   18393MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:E4:00.0 Off |                    0 |
| N/A   32C    P0            131W /  700W |   21741MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

@psinger
Collaborator

psinger commented Jul 26, 2024

Could some of these issues be related to this?
huggingface/transformers#32214

Maybe try updating transformers.

@pascal-pfeiffer
Collaborator

It worked for me on current main/v1.9.0, so there seems to be at least one issue that isn't easily reproducible.

@tmostak
Author

tmostak commented Jul 30, 2024

Hmm... as a sanity check I started a new instance, redid the dependency install, trained again, and got the same issue. I should note that I made one change to requirements.txt to pin transformers to the latest 4.43.3 version:

transformers==4.43.3; python_full_version >= '3.8.0'

Full config file

architecture:
    backbone_dtype: bfloat16
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    neftune_noise_alpha: 0.0
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    add_eos_token_to_system: true
    answer_column: answer
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 1.0
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    only_last_answer: false
    parent_id_column: None
    personalize: false
    prompt_column:
    - prompt
    prompt_column_separator: \n\n
    system_column: None
    text_answer_separator: ''
    text_prompt_start: ''
    text_system_start: <|system|>
    train_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_train.csv
    validation_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_eval.csv
    validation_size: 0.01
    validation_strategy: custom
environment:
    compile_model: false
    deepspeed_allgather_bucket_size: 1000000
    deepspeed_method: ZeRO3
    deepspeed_reduce_bucket_size: 1000000
    deepspeed_stage3_param_persistence_threshold: 1000000
    deepspeed_stage3_prefetch_bucket_size: 1000000
    find_unused_parameters: false
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    huggingface_branch: main
    mixed_precision: false
    mixed_precision_dtype: bfloat16
    number_of_workers: 8
    seed: 2
    trust_remote_code: true
    use_deepspeed: true
experiment_name: heavyai-heavyiq-llama-3.1-70b-combo-v61-5-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.2.1
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
    logger: None
    neptune_project: ''
output_directory: /home/ubuntu/h2o-llmstudio/output/user/heavyai-heavyiq-llama-3.1-70b-combo-v61-5-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.2.1/
prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 256
    max_time: 0.0
    metric: Perplexity
    metric_gpt_model: gpt-3.5-turbo-0301
    metric_gpt_template: general
    min_length_inference: 768
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.0
    stop_tokens: ''
    temperature: 0.0
    top_k: 0
    top_p: 1.0
problem_type: text_causal_language_modeling
tokenizer:
    add_prompt_answer_tokens: false
    max_length: 4416
    padding_quantile: 1.0
    tokenizer_kwargs: '{"use_fast": true, "add_prefix_space": false}'
training:
    attention_implementation: auto
    batch_size: 1
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 0.05
    freeze_layers: []
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 1.2e-05
    lora: true
    lora_alpha: 1024
    lora_dropout: 0.05
    lora_r: 512
    lora_target_modules: ''
    lora_unfreeze_layers: []
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_checkpoint: last
    schedule: Cosine
    train_validation_data: false
    use_dora: false
    warmup_epochs: 0.0

I will try training with a default dataset, but I'm not sure how that would make a difference.

@tmostak
Author

tmostak commented Jul 30, 2024

OK, I trained with the default dataset but set lora_r: 4 and lora_alpha: 16 per the config shared by @pascal-pfeiffer. And indeed it successfully merged the LoRA and is now uploading.

This makes me think there has been some regression (presumably in the underlying peft library?) that causes issues for large LoRA layers.

Here's my cfg

(base) ubuntu@164-152-107-167:~/h2o-llmstudio/output/user$ cat llama_3.1_70b_test/cfg.yaml
architecture:
    backbone_dtype: bfloat16
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    neftune_noise_alpha: 0.0
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    add_eos_token_to_system: true
    answer_column: output
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 0.05
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    only_last_answer: false
    parent_id_column: None
    personalize: false
    prompt_column:
    - instruction
    prompt_column_separator: \n\n
    system_column: None
    text_answer_separator: <|answer|>
    text_prompt_start: <|prompt|>
    text_system_start: <|system|>
    train_dataframe: /home/ubuntu/h2o-llmstudio/data/user/oasst/train_full.pq
    validation_dataframe: None
    validation_size: 0.02
    validation_strategy: automatic
environment:
    compile_model: false
    deepspeed_allgather_bucket_size: 1000000
    deepspeed_method: ZeRO3
    deepspeed_reduce_bucket_size: 1000000
    deepspeed_stage3_param_persistence_threshold: 1000000
    deepspeed_stage3_prefetch_bucket_size: 1000000
    find_unused_parameters: false
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    huggingface_branch: main
    mixed_precision: false
    mixed_precision_dtype: bfloat16
    number_of_workers: 8
    seed: 2
    trust_remote_code: true
    use_deepspeed: true
experiment_name: llama_3.1_70b_test
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
    logger: None
    neptune_project: ''
output_directory: /home/ubuntu/h2o-llmstudio/output/user/llama_3.1_70b_test/
prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 4096
    max_time: 0.0
    metric: Perplexity
    metric_gpt_model: gpt-3.5-turbo-0301
    metric_gpt_template: general
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.0
    stop_tokens: ''
    temperature: 0.0
    top_k: 0
    top_p: 1.0
problem_type: text_causal_language_modeling
tokenizer:
    add_prompt_answer_tokens: false
    max_length: 4864
    padding_quantile: 1.0
    tokenizer_kwargs: '{"use_fast": true, "add_prefix_space": false}'
training:
    attention_implementation: auto
    batch_size: 1
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    freeze_layers: []
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 1.2e-05
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    lora_unfreeze_layers: []
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_checkpoint: last
    schedule: Cosine
    train_validation_data: false
    use_dora: false
    warmup_epochs: 0.0
    weight_decay: 0.0

Would you be able to try a bigger LoRA (i.e. rank 512, alpha 1024) as I did to see if you can repro? I'll try some sizes between 4/16 and 512/1024 to see if I can find the breaking point.

@pascal-pfeiffer
Collaborator

Yes, I am starting the 512/1024 test right now. That could indeed be the issue. That is also why I was asking for the LoRA settings earlier, as the default settings seemed to work fine.

So it seems that very large LoRA layers are split across GPUs, while smaller ones stay on a single GPU, and the DeepSpeed wrapper isn't gathering them on a single (meta) device for the merge.

We will see how we can deal with that and whether there are any workarounds, such as a CPU-only merge.
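
Not the LLM Studio code, just a rough sketch of one possible workaround under that assumption: before calling merge_and_unload(), walk the PEFT modules and move each LoRA A/B weight onto the device of its base layer, so the delta-weight matmul never mixes cuda:0 and cuda:1 (colocate_lora_weights is a hypothetical helper name):

import torch
from peft.tuners.lora import LoraLayer

def colocate_lora_weights(backbone: torch.nn.Module) -> None:
    # Move every LoRA adapter weight onto the same device as its base layer.
    for module in backbone.modules():
        if isinstance(module, LoraLayer):
            target_device = module.get_base_layer().weight.device
            for adapter_name in module.lora_A:
                module.lora_A[adapter_name].to(target_device)
                module.lora_B[adapter_name].to(target_device)

# colocate_lora_weights(model.backbone)
# model.backbone = model.backbone.merge_and_unload()

Whether this plays well with the ZeRO-3 sharded state is untested; merging on a single device (or fewer GPUs) may still be the safer route.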

@tmostak
Author

tmostak commented Jul 30, 2024

Thanks @pascal-pfeiffer... I should just note that I've been training and uploading r512/a1024 models (Llama 3 70B) for some months, so it seems a recent change caused the issue.

@tmostak
Author

tmostak commented Jul 30, 2024

Also I tried a CPU-only merge and gave up after nearly 24 hours of waiting.

@tmostak
Author

tmostak commented Jul 30, 2024

OK, to follow up on this: altering my training config from above (#782 (comment)) to use LoRA rank 256 and alpha 512 worked, but when I changed it to rank 512 and alpha 1024 I got the same failure as before.

  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2015, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 216, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/chat.py", line 241, in load_cfg_model_tokenizer
    model.backbone = model.backbone.merge_and_unload()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 838, in merge_and_unload
    return self._unload_and_optionally_merge(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 457, in _unload_and_optionally_merge
    target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 470, in merge
    delta_weight = self.get_delta_weight(active_adapter)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 533, in get_delta_weight
    output_tensor = transpose(weight_B @ weight_A, self.fan_in_fan_out) * self.scaling[adapter]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

@pascal-pfeiffer
Collaborator

pascal-pfeiffer commented Jul 30, 2024

Interesting. I used your config with 512/1024 and was able to merge and upload. Though I was on slightly different GPUs, so maybe it came down to luck whether the layers got split or not.

@tmostak
Author

tmostak commented Aug 1, 2024

@pascal-pfeiffer would you be able to list all the versions of packages in your environment?

@pascal-pfeiffer
Collaborator

When testing, I checked out this commit (87c2978) and installed a fresh environment, so the requirements were:
https://github.com/h2oai/h2o-llmstudio/blob/87c2978698545c758b639fb83e0ceef7e43e91e5/requirements.txt

Given that this depends on the LoRA size, I have a strong feeling it can also be very hardware dependent.

By chance, what is the disk space left on your primary disk? I noticed that the export always uses the primary disk for intermediate saving, which is ~170GB for this model. It could be that this also somehow affects the sharding, as you also saw an unusual distribution across the 8 GPUs.
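
For reference, a quick way to check (the path is just an example; point it at whichever mount holds the LLM Studio output folder):

import shutil

usage = shutil.disk_usage("/")  # primary disk; substitute the output folder's mount if it lives elsewhere
print(f"free: {usage.free / 1e9:.0f} GB")  # the 70B export needs roughly 170 GB of intermediate space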

With a slightly different config, I was again able to export and upload, so this is hard to replicate for me now.

Most recently, I updated

transformers = "==4.43.3"
accelerate = "==0.33.0"
hf-transfer = "==0.1.8"
peft = "==0.12.0"

and the export was fine again (though without hf_transfer the upload often fails, as you reported earlier; setting it as an env var is required, see #801).
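
For completeness, a minimal sketch of that env var workaround, assuming the variable in question is huggingface_hub's HF_HUB_ENABLE_HF_TRANSFER (it has to be set before the upload starts for the hf_transfer backend to actually be used):

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable hf_transfer for the large safetensors shards

from huggingface_hub import HfApi

api = HfApi()
# api.upload_folder(folder_path="<local model dir>", repo_id="<user>/<model>", repo_type="model")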

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   33C    P0            144W /  700W |   20997MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   35C    P0            135W /  700W |   22509MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   36C    P0            133W /  700W |   22509MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   32C    P0            127W /  700W |   22509MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:84:00.0 Off |                    0 |
| N/A   32C    P0            128W /  700W |   22889MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   35C    P0            132W /  700W |   22509MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:91:00.0 Off |                    0 |
| N/A   36C    P0            138W /  700W |   22509MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:E4:00.0 Off |                    0 |
| N/A   33C    P0            134W /  700W |   24513MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

I'll do some more testing with even more extreme LoRA dimensions.

@pascal-pfeiffer
Collaborator

I'll do some more testing with even more extreme LoRA dimensions.

LoRA rank 1024 also worked fine. Now I'm thinking it might be something else.

2024-08-01 23:27:36,341 - INFO: Trainable parameters count: 13254000640
2024-08-01 23:27:36,342 - INFO: Total parameters count: 83807707136
2024-08-01 23:27:36,342 - INFO: Trainable %: 15.8148%
[...]
100%|███████████████████████████████████████████████████████████████████████████| 30/30 [05:20<00:00, 10

Though, that was again with the updated dependencies

transformers = "==4.43.3"
accelerate = "==0.33.0"
hf-transfer = "==0.1.8"
peft = "==0.12.0"

@pascal-pfeiffer
Collaborator

For 100% reproducibility, I am on 6755a58 (current main) and updated the dependencies as above. Attached are the Pipfile.lock and my train config:

experiment_llama31.zip

@tmostak
Author

tmostak commented Aug 20, 2024

Just to follow up on this: as a workaround, I was able to start LLM Studio with 4 GPUs via the CUDA_VISIBLE_DEVICES environment variable, and it worked fine. I still don't know why it was/is failing with 8 GPUs, but at least I was able to export my model.
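
For anyone hitting the same thing, the workaround boils down to restricting the visible devices before anything initializes CUDA; a minimal sketch (equivalent to exporting CUDA_VISIBLE_DEVICES=0,1,2,3 in the shell before starting LLM Studio):

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose only the first four GPUs

import torch  # must be imported (or CUDA initialized) only after the variable is set

print(torch.cuda.device_count())  # -> 4 on an 8-GPU machine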
