Note
This template should be assigned to a Kubernetes node on which the dcgm-exporter pod is running.

Nvidia GPU Monitoring - v1 (15.0.0)
Collector Type: Agent
Category: Application Monitors
Application Name: nvidiagpumonitor
G2 Monitor Name: Agent G2 - Nvidia Gpu Monitor
Global Template Name: Agent G2 - Linux - Nvidia GPU Monitoring
Supported DCGM Version: 3.1.7
Supported Agent Version: 15.0.0
Configuration Parameters
Name | Description | Default Value |
---|---|---|
Namespace | Namespace in which dcgm-exporter is running | gpu-operator |
Port | Port on which metrics are exported | 9400 |
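
To confirm that the exporter is reachable with these parameters, the metrics endpoint can be read directly. The following is a minimal sketch, not the agent's own implementation. It assumes the exporter serves the Prometheus text exposition format at `/metrics` on the configured port; the host `localhost` (for example behind a port-forward) and the helper names `parse_metrics` and `fetch_metrics` are illustrative.

```python
import re
import urllib.request

# One sample per line: metric_name{label="value",...} number
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        # Label values are kept as written; escape sequences are not decoded.
        labels = dict(_LABEL.findall(raw_labels or ""))
        try:
            value = float(raw_value)
        except ValueError:
            continue
        samples.append((name, labels, value))
    return samples


def fetch_metrics(host="localhost", port=9400, timeout=5):
    """Fetch and parse metrics. Host and path are assumptions; 9400 is the default Port above."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

If the pod port is forwarded to the local machine in the `gpu-operator` namespace, `fetch_metrics()` returns one tuple per exported sample, which makes it easy to see whether the exporter is publishing data before the template is assigned.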
Collected Metrics
Monitor Name | Display Name | Description |
---|---|---|
nvidia_dcgm_power_usage | Nvidia Dcgm Power Usage | Power draw |
nvidia_dcgm_mem_clock | Nvidia Dcgm Mem Clock Freq | Memory clock frequency |
nvidia_dcgm_mem_copy_util | Nvidia Dcgm Mem Util | Memory utilization |
nvidia_dcgm_fb_mem_used | Nvidia Dcgm Framebuffer Memory Used | Framebuffer memory used |
nvidia_dcgm_gpu_temp | Nvidia Dcgm Gpu Temp | GPU temperature |
nvidia_dcgm_memory_temp | Nvidia Dcgm Memory Temp | Memory temperature |
nvidia_dcgm_gpu_util | Nvidia Dcgm Gpu Util | GPU utilization |
Nvidia GPU Monitoring - v2 (16.0.0)
Collector Type: Agent
Category: Application Monitors
Application Name: nvidiagpumonitor
G2 Monitor Name: Agent G2 - Nvidia Gpu Monitor - v2
Global Template Name: Agent G2 - Linux - Nvidia GPU Monitoring - v2
Supported DCGM Version: 3.1.7
Supported Agent Version: 16.0.0
Configuration Parameters
Name | Description | Default Value |
---|---|---|
Namespace | Namespace in which dcgm-exporter is running | gpu-operator |
Port | Port on which metrics are exported | 9400 |
Collected Metrics
Monitor Name | Display Name | Description |
---|---|---|
nvidia_dcgm_fi_dev_sm_clock | Nvidia Dcgm Fi Dev Sm Clock | SM clock frequency |
nvidia_dcgm_fi_dev_total_energy_consumption | Nvidia Dcgm Fi Dev Total Energy Consumption | Total energy consumption since boot (in mJ) |
nvidia_dcgm_fi_dev_pcie_replay_counter | Nvidia Dcgm Fi Dev Pcie Replay Counter | Total number of PCIe retries |
nvidia_dcgm_fi_dev_xid_errors | Nvidia Dcgm Fi Dev Xid Errors | Value of the last XID error encountered |
nvidia_dcgm_fi_dev_nvlink_bandwidth_total | Nvidia Dcgm Fi Dev Nvlink Bandwidth Total | Total number of NVLink bandwidth counters for all lanes |
nvidia_dcgm_fi_dev_vgpu_license_status | Nvidia Dcgm Fi Dev Vgpu License Status | vGPU License status |
nvidia_dcgm_fi_prof_gr_engine_active | Nvidia Dcgm Fi Prof Gr Engine Active | Ratio of time the graphics engine is active (in %) |
nvidia_dcgm_fi_prof_pipe_tensor_active | Nvidia Dcgm Fi Prof Pipe Tensor Active | Ratio of cycles the tensor (HMMA) pipe is active (in %) |
nvidia_dcgm_fi_prof_dram_active | Nvidia Dcgm Fi Prof Dram Active | Ratio of cycles the device memory interface is active sending or receiving data (in %) |
nvidia_dcgm_fi_prof_pcie_tx_bytes | Nvidia Dcgm Fi Prof Pcie Tx Bytes | The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second |
nvidia_dcgm_fi_prof_pcie_rx_bytes | Nvidia Dcgm Fi Prof Pcie Rx Bytes | The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second |
nvidia_dcgm_fi_dev_fb_free | Nvidia Dcgm Fi Dev Fb Free | Framebuffer memory free (in MiB) |
nvidia_dcgm_fi_dev_fb_used | Nvidia Dcgm Fi Dev Fb Used | Framebuffer memory used (in MiB) |
nvidia_dcgm_power_usage | Nvidia Dcgm Power Usage | Power draw |
nvidia_dcgm_mem_clock | Nvidia Dcgm Mem Clock Freq | Memory clock frequency |
nvidia_dcgm_mem_copy_util | Nvidia Dcgm Mem Util | Memory utilization |
nvidia_dcgm_fb_mem_used | Nvidia Dcgm Framebuffer Memory Used | Framebuffer memory used |
nvidia_dcgm_gpu_temp | Nvidia Dcgm Gpu Temp | GPU temperature |
nvidia_dcgm_memory_temp | Nvidia Dcgm Memory Temp | Memory temperature |
nvidia_dcgm_gpu_util | Nvidia Dcgm Gpu Util | GPU utilization |
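
Some of the v2 metrics are cumulative counters or absolute sizes, so a derived value is often easier to read on a dashboard. The sketch below shows two such calculations under stated assumptions: average power from two readings of the total energy counter (in mJ), and framebuffer usage as a percentage from the used and free values (in MiB). The function names are hypothetical, and taking the total as used plus free ignores any memory the driver reserves.

```python
def average_power_watts(energy_start_mj, energy_end_mj, interval_seconds):
    """Average power over an interval from two energy-counter readings in mJ."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_mj = energy_end_mj - energy_start_mj
    if delta_mj < 0:
        # Counter went backwards (for example after a reboot): no meaningful value.
        return None
    return delta_mj / 1000.0 / interval_seconds  # mJ -> J, then J/s = W


def framebuffer_used_percent(fb_used_mib, fb_free_mib):
    """Share of framebuffer in use, assuming total = used + free."""
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        return None
    return 100.0 * fb_used_mib / total
```

For example, energy readings of 1,000,000 mJ and 1,600,000 mJ taken 60 seconds apart correspond to an average draw of 10 W, and 4096 MiB used with 12288 MiB free corresponds to 25% framebuffer usage.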