GPU_I_PROFILE="<<<NULL>>>" #193
Comments
Munir - can you add the output of |
Thank you for your help, David. Here you go:
$ sudo nvidia-smi mig -lgi
Not sure if it is relevant, but I have two more A6000 GPUs in the server. They are not visible here.
Munir, the version you are using cannot monitor DCP metrics (1001-1012) for a heterogeneous group of GPUs, that is, a group containing GPUs of different SKUs. It should work if you specify only the A100 GPUs in the dcgmi command (e.g. ...). Unfortunately, dcgm-exporter does not give you that flexibility. We have already fixed this issue, and upcoming versions of DCGM will support heterogeneous groups of GPUs. WBR,
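To illustrate the constraint described above, here is a minimal Python sketch that splits a mixed GPU inventory into per-SKU groups, so that each group contains only one kind of GPU. The `inventory` mapping and its GPU indices are hypothetical (the thread only says the server holds two A100s and two A6000s), and `homogeneous_groups` is an illustrative helper, not part of DCGM.

```python
from collections import defaultdict


def homogeneous_groups(inventory):
    """Group GPU indices by model name so each group has a single SKU."""
    groups = defaultdict(list)
    for gpu_id, model in sorted(inventory.items()):
        groups[model].append(gpu_id)
    return dict(groups)


# Hypothetical inventory: the indices are assumed, not taken from the thread.
inventory = {0: "A100", 1: "A100", 2: "A6000", 3: "A6000"}

groups = homogeneous_groups(inventory)

# Comma-separated ID list for the A100-only group.
a100_ids = ",".join(str(gpu_id) for gpu_id in groups["A100"])
print(a100_ids)
```

The resulting ID list is what you would hand to whatever group-creation step you use, so that DCP metrics are only requested for GPUs of the same SKU.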
Hello Nik,
Awesome. I'm looking forward to using the most recent version of dcgm-exporter. Kind regards,
This fix will be available in the next public release, which is coming soon.
Thanks, I'll leave this ticket open until the next release. Will it be dcgmi version 2.2.4?
Hello,
There is a strange issue on my system. There are 2 Tesla A100 GPUs with MIG configured. Everything is running as expected, and I can also read some nonzero values with: dcgmi dmon -e 1001,1004 -g 8
However, all numbers are 0 with dcgm-exporter and there are two issues:
(1) "Error getting supported metrics"
$ dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: The GPUs of this group are incompatible with each other for the requested operation
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
(2) there is a NULL for GPU_I_PROFILE
curl localhost:9400/metrics
...
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",device="nvidia0",GPU_I_PROFILE="<<< NULL >>>",GPU_I_ID="6",Hostname="a0321111701"}
...
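For anyone who wants to check their own exporter output for this symptom, here is a small Python sketch that extracts the labels from a metrics line and flags a NULL `GPU_I_PROFILE`. The helper names `parse_labels` and `has_null_profile` are hypothetical, and the regular expression assumes label values contain no escaped quotes, which holds for the sample above but not for the exposition format in general.

```python
import re

# Matches key="value" pairs; assumes values contain no escaped quotes.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(line):
    """Return the labels of one metrics line as a dict ({} if it has none)."""
    start = line.find("{")
    end = line.rfind("}")
    if start == -1 or end < start:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def has_null_profile(line):
    """True if the line carries a GPU_I_PROFILE label whose value is NULL."""
    value = parse_labels(line).get("GPU_I_PROFILE")
    # Tolerates both "<<< NULL >>>" and "<<<NULL>>>".
    return value is not None and value.strip("<> ") == "NULL"


# The line reported in this issue (metric value omitted, as in the report).
sample = (
    'DCGM_FI_DEV_SM_CLOCK{gpu="0",'
    'UUID="GPU-fc6d497e-d08c-374e-218c-7e44b065d764",device="nvidia0",'
    'GPU_I_PROFILE="<<< NULL >>>",GPU_I_ID="6",Hostname="a0321111701"}'
)

print(has_null_profile(sample))
```

Feeding each line of the `curl localhost:9400/metrics` output through `has_null_profile` would show which series are affected.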
I'm using dcgmi version: 2.2.3 and tested with dcgm-exporter master as well as 2.4.0-rc.2.
Thank you for your help,
Munir