[BUG] Error during LoRA-merge in HF upload for Llama 3.1 70B model #782
Comments
Ok, so I did a bit more investigation, in particular logging …
What seems to be off is … I'm wondering if this then messes up the … Also note my …
Thoughts or suggestions?
One thing I found that might relate: … They basically hit the same issue, and someone noted that … Going to try downgrading DeepSpeed to see if this helps.
I tried again with …
Thank you for the details. We recently upgraded DeepSpeed, so this could indeed be an issue caused by that. I'll look into it.
@pascal-pfeiffer I wrote a quick Python script to write out the layer names per GPU, and it seems the issue might be how the LoRA layers for layer 8 are split between GPU 0 and GPU 1. Also, why are there only LoRA layers for layer 8 and not for the other layers?
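(A minimal sketch of such a per-GPU dump, not the exact script used above; the helper and file names are my own, and it assumes the LoRA parameters carry "lora" in their names, as peft's do.)

```python
# Sketch: write name, device, and shape of every LoRA parameter, one file per rank.
import torch
import torch.distributed as dist

def dump_lora_devices(model, out_prefix="lora_devices"):
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    with open(f"{out_prefix}_rank{rank}.txt", "w") as f:
        for name, param in model.named_parameters():
            if "lora" in name.lower():
                f.write(f"{name}\t{param.device}\t{tuple(param.shape)}\n")
```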
Another update: I tried to upload to Hugging Face the Llama 3 (not 3.1) model that I had previously successfully uploaded, and got the same error.
Ok, so I rolled back to … I then wrote a script to dump the layer names from the model… note how all the LoRA layers are there, unlike in the output of … Just showing the first 4 layers, but you get the idea.
Thank you for all the further investigations @tmostak. I am trying to reproduce the issue starting with default parameters, mostly aligning with the ones you used, and the default dataset. Everything ran on commit 87c2978, so basically what we have in the v1.9.0 release. Could you by chance upload a reproducible config using the default dataset where you are facing the issue? Your config above, for example, doesn't include LoRA settings.

architecture:
backbone_dtype: bfloat16
gradient_checkpointing: true
intermediate_dropout: 0.0
pretrained: true
pretrained_weights: ''
augmentation:
neftune_noise_alpha: 0.0
random_parent_probability: 0.0
skip_parent_probability: 0.0
token_mask_probability: 0.0
dataset:
add_eos_token_to_answer: true
add_eos_token_to_prompt: true
add_eos_token_to_system: true
answer_column: output
chatbot_author: H2O.ai
chatbot_name: h2oGPT
data_sample: 0.2
data_sample_choice:
- Train
limit_chained_samples: false
mask_prompt_labels: true
only_last_answer: false
parent_id_column: None
personalize: false
prompt_column:
- instruction
prompt_column_separator: \n\n
system_column: None
text_answer_separator: <|answer|>
text_prompt_start: <|prompt|>
text_system_start: <|system|>
train_dataframe: /home/pascal/h2o-llmstudio/data/user/oasst/train_full.pq
validation_dataframe: None
validation_size: 0.01
validation_strategy: automatic
environment:
compile_model: false
deepspeed_allgather_bucket_size: 1000000
deepspeed_method: ZeRO3
deepspeed_reduce_bucket_size: 1000000
deepspeed_stage3_param_persistence_threshold: 1000000
deepspeed_stage3_prefetch_bucket_size: 1000000
find_unused_parameters: false
gpus:
- '0'
- '1'
- '2'
- '3'
- '4'
- '5'
- '6'
- '7'
huggingface_branch: main
mixed_precision: false
mixed_precision_dtype: bfloat16
number_of_workers: 8
seed: -1
trust_remote_code: true
use_deepspeed: true
experiment_name: ruby-walrus
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
logger: None
neptune_project: ''
output_directory: /home/pascal/h2o-llmstudio/output/user/ruby-walrus/
prediction:
batch_size_inference: 0
do_sample: false
max_length_inference: 256
max_time: 0.0
metric: Perplexity
metric_gpt_model: gpt-3.5-turbo-0301
metric_gpt_template: general
min_length_inference: 2
num_beams: 1
num_history: 4
repetition_penalty: 1.0
stop_tokens: ''
temperature: 0.0
top_k: 0
top_p: 1.0
problem_type: text_causal_language_modeling
tokenizer:
add_prompt_answer_tokens: false
max_length: 8096
padding_quantile: 1.0
tokenizer_kwargs: '{"use_fast": true, "add_prefix_space": false}'
training:
attention_implementation: auto
batch_size: 2
differential_learning_rate: 1.0e-05
differential_learning_rate_layers: []
drop_last_batch: true
epochs: 1
evaluate_before_training: false
evaluation_epochs: 1.0
freeze_layers: []
grad_accumulation: 1
gradient_clip: 0.0
learning_rate: 0.0001
lora: true
lora_alpha: 16
lora_dropout: 0.05
lora_r: 4
lora_target_modules: ''
lora_unfreeze_layers: []
loss_function: TokenAveragedCrossEntropy
optimizer: AdamW
save_checkpoint: last
schedule: Cosine
train_validation_data: false
use_dora: false
warmup_epochs: 0.0
weight_decay: 0.0

Training
Upload with cpu_shard
Memory allocation on the GPUs (yes, this indeed isn't freed but that is another issue #736)
Could some of these issues be related to this? Maybe try updating transformers.
It worked for me on current main/v1.9.0, so there seems to be at least one issue that isn't easily reproducible.
Hmm… as a sanity check I started a new instance, redid the dependency install, trained again, and got the same issue. I should note that I did make one change to …
Full config file
I will try training with a default dataset, but I'm not sure how that would make a difference.
Ok, I trained with the default dataset but set … This makes me think there's been some regression (in the underlying peft library, perhaps?) that is causing issues for large LoRA layers. Here's my cfg:
Would you guys be able to try a bigger LoRA (i.e. rank 512, alpha 1024) as I did to see if you can repro? I'll try some sizes between 4/16 and 512/1024 to see if I can find the breaking point.
Yes, I am starting up the 512/1024 test right now. That could indeed be the issue then, and it is also why I was asking for the LoRA settings earlier, as default settings seemed to work fine. So it seems that very large LoRA layers are split across GPUs, while smaller ones sit on a single GPU, and the DeepSpeed wrapper isn't gathering them on a single (meta) device. We will see how we can deal with that and whether there are any workarounds such as a CPU-only merge.
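A rough back-of-the-envelope check of how LoRA tensor sizes scale with rank, compared against the config's deepspeed_stage3_param_persistence_threshold of 1,000,000. This is purely illustrative: it assumes a hidden size of 8192 for the 70B backbone and that ZeRO-3 partitions parameter tensors above the threshold instead of keeping them whole on every GPU, and the boundary does not line up exactly with the observations in this thread (rank 256 merged fine), so treat it only as a pointer to which tensors would get partitioned:

```python
# Sketch: parameters in a single LoRA matrix (lora_A or lora_B) of an 8192x8192
# q_proj-style layer, versus the ZeRO-3 persistence threshold from the config above.
hidden = 8192                # assumed hidden size of the 70B backbone
threshold = 1_000_000        # deepspeed_stage3_param_persistence_threshold

for r in (4, 16, 256, 512, 1024):
    params = r * hidden      # lora_A is (r x hidden), lora_B is (hidden x r)
    status = "partitioned across GPUs" if params > threshold else "kept whole on each GPU"
    print(f"r={r:4d}: {params:>9,} params per LoRA matrix -> {status}")
```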
Thanks @pascal-pfeiffer… I should just note that I've been training and uploading r512/a1024 models (Llama 3 70B) for some months, so it seems there was a recent change that caused the issues.
Also, I tried a CPU-only merge and gave up after nearly 24 hours of waiting.
Ok, to follow up on this, altering my training config from above (#782 (comment)) to use LoRA Rank 256 and Alpha 512 worked, but when I changed it to Rank 512 and Alpha 1024 I got the failure seen before.
Interesting, I used your config with 512/1024 and was able to merge and upload. Though, I had slightly different GPUs, so maybe it came down to luck whether the layers got split or not.
@pascal-pfeiffer, would you be able to list all the versions of the packages in your environment?
I checked out this commit (87c2978) when testing and installed a fresh environment. Given that this depends on the size of the LoRA, I have a strong feeling this can be very hardware dependent. By chance, how much disk space is left on your primary disk? I noticed that the export always uses the primary disk for intermediate saving, which is ~170 GB for this model. It could be that this also somehow affects the sharding, as you also saw an unusual distribution across the 8 GPUs. With a slightly different config, I was again able to export and upload, so it is hard to replicate for me now. Most recently, I did update …
and export was fine again.
I'll do some more testing with even more extreme LoRA dimensions.
LoRA rank 1024 also worked fine. Now I'm thinking it might be something else.
Though, that was again with the updated dependencies.
For 100% reproducibility, I am on 6755a58 (current main) and updated the dependencies as above. Attached are the Pipfile.lock and my train config.
Just to follow up on this: as a workaround, I was able to start LLM Studio with 4 GPUs via the CUDA_VISIBLE_DEVICES environment variable, and it worked fine. I still don't know why it was/is failing with 8 GPUs, but at least I was able to export my model.
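For anyone replicating the workaround: the restriction has to be in place before anything initializes CUDA, e.g. by exporting CUDA_VISIBLE_DEVICES in the shell before launching LLM Studio. The same effect from a Python entry point would look roughly like this (the GPU list is just an example):

```python
# Must run before torch (or anything else that touches CUDA) is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose only the first 4 of 8 GPUs

import torch
print(torch.cuda.device_count())  # now reports 4
```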
🐛 Bug
Today, when attempting to upload a LoRA-trained Llama 3.1 70B model (the first time I've trained Llama 3.1), I hit the following error during the LoRA merge. Note that I used the cpu_shard method to upload. I've tried it twice now with the same error.

2024-07-24 17:22:58,705 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
2024-07-24 17:22:59,686 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
2024-07-24 17:22:59,701 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id 128001.
2024-07-24 17:22:59,701 - INFO: Setting pretraining_tp of model config to 1.
2024-07-24 17:22:59,723 - INFO: Using bfloat16 for backbone
2024/07/24 17:23:07 # {"client":"3f76ec33-3e3f-4837-9673-cda3f39f377f","state":"DISCONNECT","t":"ws_disconnect"}
2024/07/24 17:23:07 # {"addr":"99.68.143.103:49420","client_id":"3f76ec33-3e3f-4837-9673-cda3f39f377f","t":"client_reconnect"}
2024-07-24 18:04:05,704 - INFO: Attention implementation: sdpa
2024-07-24 18:04:05,713 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-07-24 18:06:03,026 - INFO: Trainable parameters count: 6627000320
2024-07-24 18:06:03,027 - INFO: Total parameters count: 77180706816
2024-07-24 18:06:03,027 - INFO: Trainable %: 8.5863%
2024-07-24 18:08:56,811 - INFO: Weights loaded from: /home/ubuntu/h2o-llmstudio/output/user/heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1/checkpoint.pth
2024-07-24 18:10:15,356 - INFO: Merging LORA layers with base model.
2024-07-24 18:10:15,561 - ERROR: Unknown exception
Traceback (most recent call last):
File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
await experiment_push_to_huggingface_dialog(q)
File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2015, in experiment_push_to_huggingface_dialog
publish_model_to_hugging_face(
File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 216, in publish_model_to_hugging_face
cfg, model, tokenizer = load_cfg_model_tokenizer(
File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/chat.py", line 241, in load_cfg_model_tokenizer
model.backbone = model.backbone.merge_and_unload()
File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 838, in merge_and_unload
return self._unload_and_optionally_merge(
File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 457, in _unload_and_optionally_merge
target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 470, in merge
delta_weight = self.get_delta_weight(active_adapter)
File "/home/ubuntu/miniconda3/envs/h2o_llm_studio_jul_24/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 533, in get_delta_weight
output_tensor = transpose(weight_B @ weight_A, self.fan_in_fan_out) * self.scaling[adapter]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
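The failing call is peft's get_delta_weight, which computes weight_B @ weight_A and therefore needs both adapter matrices on the same device. Purely as an illustrative sketch (not a fix from this thread; it assumes peft's LoraLayer modules expose lora_A/lora_B ModuleDicts and a base_layer, as in the peft version shown in the traceback), one could co-locate each adapter pair with its base weight before calling merge_and_unload():

```python
def colocate_lora_with_base(backbone):
    """Move every lora_A/lora_B pair onto the device of its base weight before merging."""
    for module in backbone.modules():
        if hasattr(module, "lora_A") and hasattr(module, "lora_B"):
            base = getattr(module, "base_layer", module)
            device = base.weight.device
            module.lora_A.to(device)  # nn.ModuleDict.to() moves the adapters in place
            module.lora_B.to(device)

# Hypothetical usage, right before the merge in load_cfg_model_tokenizer:
# colocate_lora_with_base(model.backbone)
# model.backbone = model.backbone.merge_and_unload()
```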
To Reproduce
cfg.yaml
architecture:
backbone_dtype: bfloat16
gradient_checkpointing: true
intermediate_dropout: 0.0
pretrained: true
pretrained_weights: ''
augmentation:
neftune_noise_alpha: 0.0
random_parent_probability: 0.0
skip_parent_probability: 0.0
token_mask_probability: 0.0
dataset:
add_bos_token_to_answer: false
add_bos_token_to_prompt: false
add_bos_token_to_system: false
add_eos_token_to_answer: true
add_eos_token_to_prompt: false
add_eos_token_to_system: true
answer_column: answer
chatbot_author: H2O.ai
chatbot_name: h2oGPT
data_sample: 1.0
data_sample_choice:
- Train
- Validation
limit_chained_samples: false
mask_prompt_labels: true
parent_id_column: None
personalize: false
prompt_column:
- prompt
prompt_column_separator: ''
system_column: None
text_answer_separator: ''
text_prompt_start: ''
text_system_start: <|system|>
train_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_train.csv
validation_dataframe: /home/ubuntu/h2o-llmstudio/data/user/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1/heavyiq_combo_v61_5_no_cte_judgements_3584_tokens_gen1_eval.csv
validation_size: 0.01
validation_strategy: custom
environment:
compile_model: false
deepspeed_allgather_bucket_size: 1000000
deepspeed_method: ZeRO3
deepspeed_reduce_bucket_size: 1000000
deepspeed_stage3_param_persistence_threshold: 1000000
deepspeed_stage3_prefetch_bucket_size: 1000000
find_unused_parameters: false
gpus:
- '0'
- '1'
- '2'
- '3'
- '4'
- '5'
- '6'
- '7'
huggingface_branch: main
mixed_precision: false
mixed_precision_dtype: bfloat16
number_of_workers: 8
seed: 2
trust_remote_code: true
use_deepspeed: true
experiment_name: heavyiq-llama-3-1-70b-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5.1
llm_backbone: meta-llama/Meta-Llama-3.1-70B
logging:
logger: Neptune
neptune_project: tmostak/heavyiq
"cfg.yaml" 119L, 3565B 1,1 Top
LLM Studio version
a1b2923 (tip of main)