[BUG] Memory allocation left resident in GPU(s) after model upload to HuggingFace #736

Open
tmostak opened this issue May 31, 2024 · 1 comment

tmostak commented May 31, 2024

🐛 Bug

When uploading a model to HuggingFace with the cpu_shard setting (and, I believe, with any of the available-GPU settings), allocations are left resident in GPU memory after the upload completes. This usually means I have to restart H2O LLM Studio before I can train another model, especially if I expect memory to be tight.

To Reproduce

Upload any model to HuggingFace using the cpu_shard setting. After the upload finishes, check nvidia-smi. The output below was captured after I uploaded a 22B-parameter model (a snippet for inspecting the allocator state from inside the process follows the output):

(base) ubuntu@207-211-184-180:~$ nvidia-smi
Fri May 31 17:48:30 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   35C    P0              70W / 400W |   5585MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   33C    P0              69W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P0              70W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0              70W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   33C    P0              68W / 400W |   5965MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   32C    P0              67W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   33C    P0              71W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   34C    P0              68W / 400W |   5589MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5572MiB |
|    1   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    2   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    3   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    4   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5952MiB |
|    5   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    6   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    7   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5576MiB |
+---------------------------------------------------------------------------------------+
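For reference, nvidia-smi only shows the total per-process footprint. From inside the running process, PyTorch's allocator can be queried to see how much of that is live tensors versus cached blocks; a minimal sketch, assuming PyTorch is the framework holding the memory:

```python
import torch

# Compare live tensor memory vs. allocator cache on every visible GPU.
# The gap between nvidia-smi's per-process number and `reserved` is
# roughly the CUDA context plus other driver-level overhead.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**20  # MiB held by live tensors
    reserved = torch.cuda.memory_reserved(i) / 2**20    # MiB cached by the allocator
    print(f"GPU {i}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")
```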

### LLM Studio version
c23a3c80f847561736217a1d355837c0e4a8f595 (master)

@pascal-pfeiffer (Collaborator)

It seems that the memory mostly gets freed once the whole process has finished successfully (this may take a while, though). What is left is still a small footprint that we should ideally remove as well. We see the same when using models in the built-in chat tool.
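For illustration, a minimal sketch of the in-process cleanup that releases the cached allocations but not the CUDA context, which is why a small footprint stays visible in nvidia-smi (`model` is a hypothetical name for whatever object still holds the uploaded weights):

```python
import gc
import torch

def release_after_upload(model) -> None:
    # `model` is a hypothetical reference to the weights that were sharded
    # onto the GPUs for the upload; the caller must not keep its own
    # reference, or the tensors cannot be garbage collected.
    model.cpu()               # move parameters off the GPUs
    del model                 # drop this reference
    gc.collect()              # collect the now-unreferenced tensors
    torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
    # The CUDA context itself (a few hundred MiB per GPU) stays resident
    # until the owning process exits.
```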

I think the cleanest solution would be to run this in a subprocess, as we do for model training. That ensures a clean environment even if the subprocess fails at some point (a rough sketch of this approach follows the nvidia-smi output below). We might also consider merging LoRA back automatically at the end of each experiment.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   33C    P0            143W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   34C    P0            133W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   35C    P0            133W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   31C    P0            126W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:84:00.0 Off |                    0 |
| N/A   31C    P0            126W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0            131W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:91:00.0 Off |                    0 |
| N/A   35C    P0            135W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:E4:00.0 Off |                    0 |
| N/A   32C    P0            130W /  700W |     715MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
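
A rough sketch of the subprocess approach mentioned above, assuming the actual upload logic is wrapped in some callable (`publish_to_hub` and `cfg` below are hypothetical names); because every CUDA allocation, including the context, belongs to the child process, it is all released when the child exits:

```python
import multiprocessing as mp

def run_in_subprocess(target, *args) -> None:
    """Run `target(*args)` in a fresh child process so that all of its
    GPU memory -- including the CUDA context -- is released on exit."""
    ctx = mp.get_context("spawn")  # fresh interpreter, no inherited CUDA state
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError(f"subprocess failed with exit code {proc.exitcode}")

# Usage (hypothetical): run_in_subprocess(publish_to_hub, cfg)
# Note: with "spawn", the target must be defined at module level so it can be pickled.
```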
