GPU_I_PROFILE="<<<NULL>>>" #193

munir-georges · 2021-06-04T22:03:25Z

Hello,

There is a strange issue on my system. It has 2 Tesla A100 GPUs with MIG configured, and everything runs as expected. I can also read some non-zero values with: dcgmi dmon -e 1001,1004 -g 8

However, all numbers are 0 with dcgm-exporter and there are two issues:
(1) "Error getting supported metrics"
$ dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: The GPUs of this group are incompatible with each other for the requested operation
INFO[0000] Pipeline starting
INFO[0000] Starting webserver

(2) there is a NULL for GPU_I_PROFILE
curl localhost:9400/metrics
...
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",device="nvidia0",GPU_I_PROFILE="<<< NULL >>>",GPU_I_ID="6",Hostname="a0321111701"}
...

I'm using dcgmi version: 2.2.3 and tested with dcgm-exporter master as well as 2.4.0-rc.2.

Thank you for your help,
Munir
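
To check an exporter scrape for the second symptom without reading it by eye, the label set of each metric line can be parsed and tested for a NULL profile. The following is only a minimal sketch: the helper names `parse_labels` and `has_null_profile` are hypothetical, and it assumes label values contain no escaped quotes, as in the sample line above.

```python
import re

# Matches one label pair such as GPU_I_ID="6" (assumes no escaped quotes in values).
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(line):
    """Return the labels of one Prometheus exposition line as a dict."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def has_null_profile(line):
    """True if the line carries a GPU_I_PROFILE label whose value contains NULL."""
    return "NULL" in parse_labels(line).get("GPU_I_PROFILE", "")


# Sample line copied from the report above.
sample = (
    'DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",'
    'device="nvidia0",GPU_I_PROFILE="<<< NULL >>>",GPU_I_ID="6",Hostname="a0321111701"}'
)
print(has_null_profile(sample))  # prints True
```

In practice the lines would come from the output of `curl localhost:9400/metrics`, fed to `has_null_profile` one line at a time.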

dbeer · 2021-06-07T19:03:47Z

Munir - can you add the output of nvidia-smi mig -lgi to this ticket?

munir-georges · 2021-06-07T19:09:23Z

> Munir - can you add the output of nvidia-smi mig -lgi to this ticket?

Thank you David, for your help. Here you go:

$ sudo nvidia-smi mig -lgi
[sudo] password for munir:
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 2g.10gb      14        5          0:2     |
+----------------------------------------------------+
|   0  MIG 2g.10gb      14        6          2:2     |
+----------------------------------------------------+
|   0  MIG 3g.20gb       9        1          4:4     |
+----------------------------------------------------+
|   1  MIG 2g.10gb      14        3          0:2     |
+----------------------------------------------------+
|   1  MIG 2g.10gb      14        4          2:2     |
+----------------------------------------------------+
|   1  MIG 3g.20gb       9        2          4:4     |
+----------------------------------------------------+

Not sure if it is relevant, but I have two more A6000 GPUs in the server. They are not visible here.
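
The listing above is enough to work out which profile name the exporter could have reported for a given GPU_I_ID. As a rough sketch (the function names `parse_mig_lgi` and `profile_of` are made up for illustration, and the parser assumes data rows of the form "GPU, MIG, profile name, profile ID, instance ID, placement" as printed here):

```python
def parse_mig_lgi(text):
    """Extract GPU-instance rows from `nvidia-smi mig -lgi` style output."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().strip("|").split()
        # Data rows have six fields: <gpu> MIG <name> <profile id> <instance id> <start:size>
        if len(fields) == 6 and fields[0].isdigit() and fields[1] == "MIG":
            gpu, _, name, profile_id, instance_id, placement = fields
            rows.append({
                "gpu": int(gpu),
                "profile": name,
                "profile_id": int(profile_id),
                "instance_id": int(instance_id),
                "placement": placement,
            })
    return rows


def profile_of(rows, gpu, instance_id):
    """Return the profile name for (gpu, instance_id), or None if not listed."""
    for row in rows:
        if row["gpu"] == gpu and row["instance_id"] == instance_id:
            return row["profile"]
    return None


# Two rows copied from the listing above.
listing = """
| 0 MIG 2g.10gb 14 6 2:2 |
| 1 MIG 3g.20gb 9 2 4:4 |
"""
rows = parse_mig_lgi(listing)
print(profile_of(rows, gpu=0, instance_id=6))  # prints 2g.10gb
```

For the metric line in the original report (gpu="0", GPU_I_ID="6"), this lookup gives 2g.10gb, which is the value one would expect in place of the NULL label.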

nikkon-dev · 2021-06-08T04:44:36Z

Munir,

The version you are using cannot monitor DCP metrics (1001-1012) for a heterogeneous group of GPUs (groups that contain GPUs of different SKUs).

It should work if you specify only A100 GPUs in the dcgmi command (e.g. dcgmi dmon -e 1001,1004 -i 0,1)

Unfortunately, dcgm-exporter does not provide that flexibility.

We have already fixed this issue, and upcoming versions of DCGM will support heterogeneous groups of GPUs.

WBR,
Nik
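
The workaround above amounts to splitting the GPUs by SKU and querying one homogeneous set at a time. A small sketch of that idea follows; the inventory mapping is an assumption (the thread only says the host has two A100s as GPUs 0 and 1 plus two A6000s, so IDs 2 and 3 are guesses), and `homogeneous_groups` and `dmon_command` are hypothetical helpers that only build the command string rather than run it.

```python
from collections import defaultdict


def homogeneous_groups(inventory):
    """Group GPU IDs by model so that every group contains a single SKU."""
    groups = defaultdict(list)
    for gpu_id, model in sorted(inventory.items()):
        groups[model].append(gpu_id)
    return dict(groups)


def dmon_command(gpu_ids, field_ids=(1001, 1004)):
    """Build a `dcgmi dmon` command line restricted to the given GPU IDs."""
    fields = ",".join(str(f) for f in field_ids)
    gpus = ",".join(str(g) for g in gpu_ids)
    return "dcgmi dmon -e {} -i {}".format(fields, gpus)


# Assumed inventory: IDs 2 and 3 for the A6000 cards are illustrative only.
inventory = {0: "A100", 1: "A100", 2: "A6000", 3: "A6000"}
groups = homogeneous_groups(inventory)
print(dmon_command(groups["A100"]))  # prints dcgmi dmon -e 1001,1004 -i 0,1
```

The printed command is the same one suggested above for the A100-only case.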

munir-georges · 2021-06-08T08:31:04Z

Hello Nik,

dcgmi dmon -e 1001,1004 -i 0,1 is working well.

Awesome. I'm looking forward to using the most recent version of dcgm-exporter.
Is there a git branch/tag/commit that I can check out and use?

Kind regards,
Munir

nikkon-dev · 2021-06-08T19:04:54Z

This fix will be available in the next public release, which is coming soon.
Unfortunately, the change is in the profiling module, which is not open-sourced.

munir-georges · 2021-06-08T22:08:36Z

Thanks, I'll leave this ticket open until the next release. Will it be dcgmi version 2.2.4?

nikkon-dev · 2021-06-08T22:53:54Z

=2.2.6
2.2.4 and 2.2.5 are skipped.
