MIG partitioning leading to nvidia_a100_3g.39gb instead of 3g.40gb partition for NVIDIA driver versions 535.x and 545.x #31
Comments
Just to confirm, which version of |
Hi @elezar , I've tried
|
This used to be a known issue and was supposed to be resolved by this commit: I verified this is part of v0.5.4 and v0.5.5 so maybe something has changed again and we need to sync up. Can you run your memory sizes through this calculation manually to see if it gives the wrong value: |
Hi @klueska: I wasn't able to get the memory of the "40gb" partition manually using the logic in ef1220b. The full GPU and MIG partition memory sizes differ depending on the driver version:
The full GPU memory is reported as 669 MiB higher, but the "40GB" partition is 256 MiB smaller. |
Sorry for the late reply here. I am following up internally as to whether there were changes in how the profile names are calculated. We do qualify the mig manager as part of the GPU Operator releases on new driver versions, but it may be that we miss specific hardware-driver combinations. |
Thanks, @elezar . If you are able to reproduce the same behavior internally and have a patch to fix it, we would appreciate using that until an updated release comes out. |
Hi,
I have a number of servers with 4 A100 GPUs each. I have MIG-partitioned each GPU with the 'all-balanced' profile and manage them through Slurm.
With NVIDIA driver 495.x, I could partition them as follows without any issues.
However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into
I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:
I have tried using nvidia-mig-manager versions [0.5.3, 0.5.4.1 and 0.5.5] and I see the same behavior as long as the NVIDIA driver version is 535 or 545. I haven't tried 505, 515, 525.
=== w/ NVIDIA driver 495.x ===
=== w/ NVIDIA driver 545.x ===
Looking at the memory of the different partitions, the 10GB and 20GB partitions are the same regardless of the NVIDIA driver version, but the "40GB" partitions are slightly smaller with driver version 545.x (40192 MiB) than with driver version 495.x (40448 MiB). I see that the memory is smaller for
=== w/ NVIDIA driver 495.x ===
=== w/ NVIDIA driver 545.x ===
This is perhaps the reason why the partition is reported as 3g.39gb instead of 3g.40gb. Since we already have lots of GPUs with 3g.40gb partitions and people are trained to use them, having to work around this by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience. We would appreciate any guidance in resolving this issue.
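To illustrate why a 256 MiB difference could be enough to flip the label, here is a minimal sketch. It assumes the label is derived by rounding the partition's memory in MiB to the nearest GiB; I have not confirmed that this is the rule that Slurm or nvidia-mig-manager actually applies, and `profile_label` is a hypothetical helper, not a function from either project.

```python
MIB_PER_GIB = 1024


def profile_label(slices: int, memory_mib: int) -> str:
    """Build a '<slices>g.<N>gb' label from a partition's memory size.

    Assumption: N is the memory in MiB rounded to the nearest GiB
    (half rounds up). The real naming logic may differ.
    """
    gib = (memory_mib + MIB_PER_GIB // 2) // MIB_PER_GIB
    return f"{slices}g.{gib}gb"


# Memory sizes reported for the "40GB" partition in this issue.
print(profile_label(3, 40448))  # driver 495.x: 39.5 GiB
print(profile_label(3, 40192))  # driver 545.x: 39.25 GiB
```

Under this assumed rule, 40448 MiB sits exactly on the rounding boundary and becomes 3g.40gb, while 40192 MiB falls below it and becomes 3g.39gb, which matches the labels I am seeing.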
Thanks a lot.