This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

GPU_I_PROFILE="<<<NULL>>>" #193

Open
munir-georges opened this issue Jun 4, 2021 · 7 comments

Comments

@munir-georges

munir-georges commented Jun 4, 2021

Hello,

There is a strange issue on my system. It has two Tesla A100 GPUs with MIG configured, and everything runs as expected. I can also read non-zero values with: dcgmi dmon -e 1001,1004 -g 8

However, all numbers are 0 with dcgm-exporter and there are two issues:
(1) "Error getting supported metrics"
$ dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: The GPUs of this group are incompatible with each other for the requested operation
INFO[0000] Pipeline starting
INFO[0000] Starting webserver

(2) there is a NULL for GPU_I_PROFILE
curl localhost:9400/metrics
...
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",device="nvidia0",GPU_I_PROFILE="<<< NULL >>>",GPU_I_ID="6",Hostname="a0321111701"}
...

I'm using dcgmi version 2.2.3 and have tested both dcgm-exporter master and 2.4.0-rc.2.

Thank you for your help,
Munir
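
For anyone who wants to check for issue (2) automatically, here is a minimal sketch that pulls the labels out of one line of the /metrics output and flags the NULL placeholder. The function names `parse_labels` and `has_null_profile` are made up for illustration, and the regex assumes label values contain no escaped quotes, which holds for the sample above but is not guaranteed in general.

```python
import re

# Matches one key="value" pair inside the {...} label set.
# Assumption: values contain no escaped double quotes.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(line):
    """Return (metric_name, labels) for one exposition-format sample line."""
    name, _, rest = line.partition("{")
    body = rest.rpartition("}")[0]
    return name.strip(), dict(LABEL_RE.findall(body))


def has_null_profile(labels):
    """True when GPU_I_PROFILE carries the '<<< NULL >>>' placeholder."""
    value = labels.get("GPU_I_PROFILE", "").strip()
    return value.startswith("<<<") and "NULL" in value


# Sample line copied from the curl output in this issue.
SAMPLE = (
    'DCGM_FI_DEV_SM_CLOCK{gpu="0",'
    'UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",'
    'device="nvidia0",GPU_I_PROFILE="<<< NULL >>>",'
    'GPU_I_ID="6",Hostname="a0321111701"}'
)

if __name__ == "__main__":
    name, labels = parse_labels(SAMPLE)
    print(name, labels["gpu"], labels["GPU_I_ID"], has_null_profile(labels))
```

Running it over every line of the curl output would list which GPU / GPU_I_ID pairs are affected.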

@dbeer
Contributor

dbeer commented Jun 7, 2021

Munir - can you add the output of nvidia-smi mig -lgi to this ticket?

@munir-georges
Author

munir-georges commented Jun 7, 2021

> Munir - can you add the output of nvidia-smi mig -lgi to this ticket?

Thank you David, for your help. Here you go:

$ sudo nvidia-smi mig -lgi
[sudo] password for munir:
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 2g.10gb      14        5          0:2     |
+----------------------------------------------------+
|   0  MIG 2g.10gb      14        6          2:2     |
+----------------------------------------------------+
|   0  MIG 3g.20gb       9        1          4:4     |
+----------------------------------------------------+
|   1  MIG 2g.10gb      14        3          0:2     |
+----------------------------------------------------+
|   1  MIG 2g.10gb      14        4          2:2     |
+----------------------------------------------------+
|   1  MIG 3g.20gb       9        2          4:4     |
+----------------------------------------------------+

Not sure if it is relevant, but I have two more A6000 GPUs in the server. They are not visible here.
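
In case it helps with cross-checking, here is a rough sketch that parses the table above and looks up the profile name for a given GPU and instance ID. It assumes the GPU_I_ID label from the exporter corresponds to the "Instance ID" column, which the thread does not confirm. The names `GpuInstance`, `parse_mig_lgi` and `profile_for` are hypothetical helpers, not part of any NVIDIA tool.

```python
from typing import NamedTuple


class GpuInstance(NamedTuple):
    gpu: int
    profile_name: str
    profile_id: int
    instance_id: int
    placement: str


def parse_mig_lgi(text):
    """Extract GPU instance rows from `nvidia-smi mig -lgi` output."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().strip("|").split()
        # Data rows look like: 0 MIG 2g.10gb 14 5 0:2
        # Borders and header rows do not match this shape and are skipped.
        if len(fields) == 6 and fields[0].isdigit() and fields[1] == "MIG":
            rows.append(GpuInstance(int(fields[0]), fields[2],
                                    int(fields[3]), int(fields[4]), fields[5]))
    return rows


def profile_for(rows, gpu, instance_id):
    """Return the profile name for a (GPU, instance ID) pair, or None."""
    for row in rows:
        if row.gpu == gpu and row.instance_id == instance_id:
            return row.profile_name
    return None


# Data rows copied from the output posted in this issue.
SAMPLE = """\
| GPU Name Profile Instance Placement |
| 0 MIG 2g.10gb 14 5 0:2 |
| 0 MIG 2g.10gb 14 6 2:2 |
| 0 MIG 3g.20gb 9 1 4:4 |
| 1 MIG 2g.10gb 14 3 0:2 |
| 1 MIG 2g.10gb 14 4 2:2 |
| 1 MIG 3g.20gb 9 2 4:4 |
"""

if __name__ == "__main__":
    rows = parse_mig_lgi(SAMPLE)
    print(profile_for(rows, gpu=0, instance_id=6))
```

Under that assumption, the metric reported for gpu="0" with GPU_I_ID="6" would map to the second 2g.10gb instance rather than a NULL profile.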

@nikkon-dev

Munir,

The version you are using cannot monitor DCP metrics (1001-1012) for a heterogeneous group of GPUs (groups that contain GPUs of different SKUs).

It should work if you specify only A100 GPUs in the dcgmi command (e.g. dcgmi dmon -e 1001,1004 -i 0,1)

Unfortunately, dcgm-exporter does not offer that flexibility.

We have already fixed this issue, and upcoming versions of DCGM will support heterogeneous groups of GPUs.

WBR,
Nik
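
To illustrate the workaround, here is a minimal sketch that splits a mixed set of GPUs into per-model groups and builds the `dcgmi dmon` command for one homogeneous group. The GPU inventory is hard-coded and partly assumed: IDs 0 and 1 are the A100s as in the command above, while IDs 2 and 3 for the A6000s are a guess. `group_by_model` and `dmon_command` are hypothetical helper names.

```python
from collections import defaultdict


def group_by_model(gpu_models):
    """Map each model name to the sorted list of GPU IDs with that model."""
    groups = defaultdict(list)
    for gpu_id, model in gpu_models.items():
        groups[model].append(gpu_id)
    return {model: sorted(ids) for model, ids in groups.items()}


def dmon_command(gpu_ids, field_ids=(1001, 1004)):
    """Build the argv for `dcgmi dmon -e <fields> -i <gpus>`."""
    return [
        "dcgmi", "dmon",
        "-e", ",".join(str(f) for f in field_ids),
        "-i", ",".join(str(g) for g in gpu_ids),
    ]


# Hypothetical inventory. IDs 0 and 1 are the A100s from this thread;
# IDs 2 and 3 for the A6000s are an assumption.
GPU_MODELS = {0: "A100", 1: "A100", 2: "A6000", 3: "A6000"}

if __name__ == "__main__":
    groups = group_by_model(GPU_MODELS)
    # Only the A100 group is queried, matching the workaround above.
    print(" ".join(dmon_command(groups["A100"])))
```

The resulting list can be passed to `subprocess.run` as-is; the sketch only prints it so that it runs on machines without DCGM installed.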

@munir-georges
Author

Hello Nik,

dcgmi dmon -e 1001,1004 -i 0,1 is working well.

Awesome. I'm looking forward to using the most recent version of dcgm-exporter.
Is there a git branch/tag/commit that I can check out and use?

Kind regards,
Munir

@nikkon-dev

The fix will be available in the next public release, which is coming soon.
Unfortunately, the change is in the profiling module, which is not open source.

@munir-georges
Author

Thanks, I'll leave this ticket open until the next release. Will it be dcgmi version 2.2.4?

@nikkon-dev

=2.2.6
2.2.4 and 2.2.5 are skipped.
