Currently alnair-profiler uses NVIDIA DCGM-exporter to collect and view GPU metrics.
Problem: DCGM's resolution is at the GPU level (per card).
Because Alnair can run multiple jobs on one GPU, process-level utilization is needed to monitor each job's resource usage.
Plan: take advantage of the NVML library's per-process GPU utilization and memory usage functions (python example), and add a custom collector and exporter to achieve this.
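A minimal sketch of what such a collector could look like, assuming the `pynvml` bindings are used. The metric names (`alnair_gpu_process_*`), the `ProcSample` type, and the helper functions are hypothetical and only illustrate the idea. The NVML calls used are `nvmlDeviceGetComputeRunningProcesses` for per-process memory and `nvmlDeviceGetProcessUtilization` for per-process SM utilization.

```python
from typing import Dict, List, NamedTuple, Optional


class ProcSample(NamedTuple):
    gpu: int                  # GPU index on the node
    pid: int                  # process ID as reported by NVML
    sm_util: int              # SM utilization in percent
    mem_bytes: Optional[int]  # used GPU memory, None if NVML cannot report it


def merge_samples(gpu: int, util_by_pid: Dict[int, int],
                  mem_by_pid: Dict[int, Optional[int]]) -> List[ProcSample]:
    """Join per-process utilization and memory readings for one GPU."""
    pids = sorted(set(util_by_pid) | set(mem_by_pid))
    # A process without a utilization sample is reported as 0 percent.
    return [ProcSample(gpu, pid, util_by_pid.get(pid, 0), mem_by_pid.get(pid))
            for pid in pids]


def render_metrics(samples: List[ProcSample]) -> str:
    """Render samples in the Prometheus text exposition format."""
    lines = []
    for s in samples:
        labels = f'gpu="{s.gpu}",pid="{s.pid}"'
        lines.append(f"alnair_gpu_process_sm_util_percent{{{labels}}} {s.sm_util}")
        if s.mem_bytes is not None:
            lines.append(
                f"alnair_gpu_process_memory_used_bytes{{{labels}}} {s.mem_bytes}")
    return "".join(f"{line}\n" for line in lines)


def collect_from_nvml() -> List[ProcSample]:
    """Query every GPU on the node through NVML (needs a driver and pynvml)."""
    import pynvml  # lazy import so the helpers above work without a GPU

    pynvml.nvmlInit()
    try:
        samples: List[ProcSample] = []
        for gpu in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu)
            mem = {p.pid: p.usedGpuMemory
                   for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)}
            try:
                # Ask for samples newer than timestamp 0, i.e. all that NVML holds.
                util = {u.pid: u.smUtil
                        for u in pynvml.nvmlDeviceGetProcessUtilization(handle, 0)}
            except pynvml.NVMLError:
                util = {}  # e.g. no recent samples, or unsupported on this GPU
            samples.extend(merge_samples(gpu, util, mem))
        return samples
    finally:
        pynvml.nvmlShutdown()
```

On a GPU node, `print(render_metrics(collect_from_nvml()))` would emit one utilization line and, where available, one memory line per process. Serving that text over HTTP, or swapping the hand-written rendering for a `prometheus_client` custom collector, is left open here.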