[BUG] Memory allocation left resident in GPU(s) after model upload to HuggingFace #736
It seems that the memory is mostly freed after the whole process finishes successfully (this may take a while, though). A small footprint is still left resident, which we should ideally remove as well. We see the same behavior when using models in the built-in chat tool. I think the cleanest solution would be to run the upload in a subprocess, as we already do for model training; that guarantees a clean environment even if the subprocess fails partway through. We might also consider merging LoRA weights back automatically at the end of each experiment.
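For illustration, a minimal sketch of that subprocess isolation (the function and argument names here are hypothetical, not LLM Studio's actual API) could look like this — all CUDA allocations made in the child are reclaimed by the driver when the child exits, even on failure:

```python
import multiprocessing as mp


def _push_to_hub(model_path: str, repo_id: str) -> None:
    # Everything CUDA-related is created inside the child process,
    # so it is released when the process terminates.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    model.push_to_hub(repo_id)


def upload_in_subprocess(model_path: str, repo_id: str) -> None:
    # "spawn" avoids inheriting the parent's CUDA state.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=_push_to_hub, args=(model_path, repo_id))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError(f"Upload subprocess failed with code {proc.exitcode}")
```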
🐛 Bug
When uploading a model to HuggingFace with the `cpu_shard` setting (and, I believe, with any available GPUs), allocations are left resident in GPU memory after the upload completes. This usually means I have to restart H2O LLM Studio before I can train another model, especially if I expect to be tight on memory.

To Reproduce
Upload any model to HuggingFace using the `cpu_shard` setting. After the upload finishes, check nvidia-smi. See the output below, taken after I uploaded a 22B-parameter model:

[nvidia-smi screenshot showing residual GPU memory allocations]
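As a stopgap, the standard PyTorch in-process cleanup pattern helps somewhat, though as noted above it cannot reclaim memory that is still referenced; this is a generic sketch, not something LLM Studio currently exposes:

```python
import gc

import torch

# `model` stands in for whatever object still holds the sharded weights
# after the upload (a hypothetical handle). All references must actually
# be dropped first, otherwise empty_cache() has nothing to return.
model = None
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # release cached blocks back to the driver
    print(f"allocated={torch.cuda.memory_allocated()} "
          f"reserved={torch.cuda.memory_reserved()}")
```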