Introduction

NVIDIA DCGM Exporter is a monitoring tool built on the NVIDIA DCGM Go APIs that collects GPU metrics. It helps users monitor GPU health, analyze workload behavior, and gain visibility into GPU usage across Kubernetes clusters.

The DCGM Exporter exposes GPU metrics through an HTTP endpoint (/metrics), which can be scraped by monitoring systems such as Prometheus.

Prerequisites

  • The NVIDIA DCGM Exporter pod must be deployed on every Kubernetes node that has GPUs.
  • Ensure that the exporter is running and accessible before configuring metric collection (a quick check is shown below).
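
A quick way to confirm this is to port-forward to one of the DCGM Exporter pods and request the /metrics endpoint. This is a minimal sketch; the label (app=nvidia-dcgm-exporter) and the placeholder namespace and pod names are assumptions and should be adjusted to match your deployment.

# Locate a DCGM Exporter pod (label is an assumption; adjust to your deployment)
kubectl get pods -A -l app=nvidia-dcgm-exporter

# In a separate terminal, forward the exporter's default port (9400)
kubectl -n <exporter-namespace> port-forward pod/<dcgm-exporter-pod> 9400:9400

# Fetch a few metric lines
curl -s http://localhost:9400/metrics | head

If the curl output shows DCGM_FI_* metric lines, the exporter is reachable.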

Kubernetes 2.0 ConfigMap

To enable GPU metric collection, update the existing ConfigMap named opsramp-workload-metric-user-config by appending the DCGM Exporter configuration under the workloads section.

Edit the ConfigMap

kubectl edit cm opsramp-workload-metric-user-config -n opsramp-agent

Example Configuration

Add the nvidia-dcgm/prometheus configuration under the workloads section as shown below:

  • Use port 9400, which is the default port on which DCGM Exporter exposes metrics.
    (You can confirm this by describing the DCGM Exporter pod.)
  • Specify a matching label in targetPodSelector to correctly identify the DCGM Exporter pods (a label check is shown after the example).

apiVersion: v1
kind: ConfigMap
metadata:
  name: opsramp-workload-metric-user-config
  namespace: opsramp-agent
data:
  workloads: |
    nvidia-dcgm/prometheus:
      - name: nvidia-dcgm
        collectionFrequency: 2m
        port: 9400
        auth: none
        metrics_path: "/metrics"
        scheme: "http"
        targetPodSelector:
          matchLabels:
            - key: app
              operator: ==
              value:
                - nvidia-dcgm-exporter
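
The label referenced in targetPodSelector must actually be present on the DCGM Exporter pods. A minimal check, assuming the pods carry the app=nvidia-dcgm-exporter label used in the example above:

# Show DCGM Exporter pods matching the selector label, along with all of their labels
kubectl get pods -A -l app=nvidia-dcgm-exporter --show-labels

If no pods are returned, inspect the exporter pods' labels and update targetPodSelector accordingly.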

Metrics Filtering

By default, all DCGM Exporter metrics are collected.
If required, you can filter metrics based on:

  • Metric name (full name or regular expression)
  • Action (include or exclude)

This allows you to control which metrics are included in the final metric list; a quick way to list the metric names the exporter emits is shown after the example below.

Default behavior: All DCGM Exporter metrics are included.

Example

nvidia-dcgm/prometheus:
  - name: nvidia-dcgm
    collectionFrequency: 2m
    port: 9400
    auth: none
    metrics_path: "/metrics"
    scheme: "http"
    filters: # optional
      - regex: 'DCGM_FI_PROF_GR_ENGINE_ACTIVE'
        action: exclude
    targetPodSelector:
      matchLabels:
        - key: app
          operator: ==
          value:
            - nvidia-dcgm-exporter
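
Before writing a filter, it can help to see which metric names the exporter currently emits so the regular expression matches the intended metrics. A quick sketch, reusing the port-forward from the Prerequisites section (9400 is the exporter's default port):

# List the unique DCGM metric names currently exposed by the exporter
curl -s http://localhost:9400/metrics | grep -oE '^DCGM_[A-Z0-9_]+' | sort -u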

ConfigMap for Alerting

To receive alerts when GPU metrics collected from the DCGM Exporter breach defined thresholds, configure alert rules in the opsramp-alert-user-config ConfigMap under the rules section.

This configuration enables alerting on DCGM Exporter metrics at the Kubernetes pod level.

Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: "opsramp-alert-user-config"
  namespace: opsramp-agent
  labels:
    app: "opsramp-alert-user-config"
data:
  alert-definitions.yaml: |
    alertDefinitions:
      - resourceType: k8s_pod
        rules:
          - name: gpu_fb_memory_used_percentage
            interval: 2m
            expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
            component: "${labels.gpu}"
            isAvailability: false
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '90'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '95'
            alertSub: '${severity} - Pod ${resource.name} GPU framebuffer memory usage is ${metric.value}%.'
            alertBody: '${severity}. GPU framebuffer memory usage on resource: ${resource.name} is ${metric.value}%.'
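
After saving the ConfigMap, you can confirm that the alert definitions were stored as expected. A minimal check, assuming the ConfigMap is created in the opsramp-agent namespace as in the earlier examples:

# View the stored alert definitions
kubectl get configmap opsramp-alert-user-config -n opsramp-agent -o yaml

As a sanity check on the expression itself: with DCGM_FI_DEV_FB_USED = 14000 MiB and DCGM_FI_DEV_FB_FREE = 2000 MiB, the rule evaluates to (14000 / (14000 + 2000)) * 100 = 87.5%, which is below the 90% warning threshold.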

Supported Metrics

The following metrics are supported for this workload by the Kubernetes 2.0 Agent:

Metric | Description | Unit
DCGM_FI_DEV_SM_CLOCK | SM clock frequency | MHz
DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency | MHz
DCGM_FI_DEV_MEMORY_TEMP | Memory temperature | °C
DCGM_FI_DEV_GPU_TEMP | Current temperature readings for the device | °C
DCGM_FI_DEV_POWER_USAGE | Power usage for the device | W
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Total energy consumption for the GPU since the driver was last reloaded | mJ
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Total number of PCIe retries | count
DCGM_FI_DEV_GPU_UTIL | GPU utilization | %
DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization | %
DCGM_FI_DEV_ENC_UTIL | Encoder utilization | %
DCGM_FI_DEV_DEC_UTIL | Decoder utilization | %
DCGM_FI_DEV_XID_ERRORS | Value of the last XID error encountered | count
DCGM_FI_DEV_FB_FREE | Free frame buffer | MiB
DCGM_FI_DEV_FB_USED | Used frame buffer | MiB
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Number of remapped rows for uncorrectable errors | count
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Number of remapped rows for correctable errors | count
DCGM_FI_DEV_ROW_REMAP_FAILURE | Indicates whether row remapping has failed | count (flag)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total number of NVLink bandwidth counters for all lanes | count
DCGM_FI_DEV_VGPU_LICENSE_STATUS | vGPU license status (0 = not licensed, 1 = licensed) | count (flag)
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active | ratio
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Ratio of cycles any tensor pipe is active | ratio
DCGM_FI_PROF_DRAM_ACTIVE | Ratio of cycles the device memory interface is active | ratio
DCGM_FI_PROF_PCIE_TX_BYTES | Number of bytes transmitted over PCIe from the GPU | bytes/sec
DCGM_FI_PROF_PCIE_RX_BYTES | Number of bytes received over PCIe by the GPU | bytes/sec