Introduction

NVIDIA DCGM Exporter is a monitoring tool built on the NVIDIA DCGM Go APIs that collects GPU metrics. It helps users monitor GPU health, analyze workload behavior, and gain visibility into GPU usage across Kubernetes clusters.

The DCGM Exporter exposes GPU metrics through an HTTP endpoint (/metrics), which can be scraped by monitoring systems such as Prometheus.

Prerequisites

  • The NVIDIA DCGM Exporter pod must be deployed on every Kubernetes node that has GPUs.
  • Ensure that the exporter is running and accessible before configuring metric collection (a quick check is shown below).
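
A quick way to confirm this is to port-forward to one of the DCGM Exporter pods and request the /metrics endpoint. This is a minimal sketch; the label (app=nvidia-dcgm-exporter) and the placeholder namespace and pod names are assumptions and should be adjusted to match your deployment.

# Locate a DCGM Exporter pod (label is an assumption; adjust to your deployment)
kubectl get pods -A -l app=nvidia-dcgm-exporter

# In a separate terminal, forward the exporter's default port (9400)
kubectl -n <exporter-namespace> port-forward pod/<dcgm-exporter-pod> 9400:9400

# Fetch a few metric lines
curl -s http://localhost:9400/metrics | head

If the curl output shows DCGM_FI_* metric lines, the exporter is reachable.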

Kubernetes 2.0 ConfigMap

To enable GPU metric collection, update the existing ConfigMap named opsramp-workload-metric-user-config by appending the DCGM Exporter configuration under the workloads section.

Edit the ConfigMap

kubectl edit cm opsramp-workload-metric-user-config -n opsramp-agent

Example Configuration

Add the nvidia-dcgm/prometheus configuration under the workloads section as shown below:

  • Use port 9400, which is the default port on which DCGM Exporter exposes metrics.
    (You can confirm this by describing the DCGM Exporter pod.)
  • Specify a matching label in targetPodSelector to correctly identify the DCGM Exporter pods (a label check is shown after the example).

apiVersion: v1
kind: ConfigMap
metadata:
  name: opsramp-workload-metric-user-config
  namespace: opsramp-agent
data:
  workloads: |
    nvidia-dcgm/prometheus:
      - name: nvidia-dcgm
        collectionFrequency: 2m
        port: 9400
        auth: none
        metrics_path: "/metrics"
        scheme: "http"
        targetPodSelector:
          matchLabels:
            - key: app
              operator: ==
              value:
                - nvidia-dcgm-exporter
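
The label referenced in targetPodSelector must actually be present on the DCGM Exporter pods. A minimal check, assuming the pods carry the app=nvidia-dcgm-exporter label used in the example above:

# Show DCGM Exporter pods matching the selector label, along with all of their labels
kubectl get pods -A -l app=nvidia-dcgm-exporter --show-labels

If no pods are returned, inspect the exporter pods' labels and update targetPodSelector accordingly.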

Metrics Filtering

By default, all DCGM Exporter metrics are collected.
If required, you can filter metrics based on:

  • Metric name (full name or regular expression)
  • Action (include or exclude)

This allows you to control which metrics are included in the final metric list; a quick way to list the metric names the exporter emits is shown after the example below.

Default behavior: All DCGM Exporter metrics are included.

Example

nvidia-dcgm/prometheus:
  - name: nvidia-dcgm
    collectionFrequency: 2m
    port: 9400
    auth: none
    metrics_path: "/metrics"
    scheme: "http"
    filters: # optional
      - regex: 'DCGM_FI_PROF_GR_ENGINE_ACTIVE'
        action: exclude
    targetPodSelector:
      matchLabels:
        - key: app
          operator: ==
          value:
            - nvidia-dcgm-exporter
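
Before writing a filter, it can help to see which metric names the exporter currently emits so the regular expression matches the intended metrics. A quick sketch, reusing the port-forward from the Prerequisites section (9400 is the exporter's default port):

# List the unique DCGM metric names currently exposed by the exporter
curl -s http://localhost:9400/metrics | grep -oE '^DCGM_[A-Z0-9_]+' | sort -u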

ConfigMap for Alerting

To receive alerts when GPU metrics collected from the DCGM Exporter breach defined thresholds, configure alert rules in the opsramp-alert-user-config ConfigMap under the rules section.

This configuration enables alerting on DCGM Exporter metrics at the Kubernetes pod level.

Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: "opsramp-alert-user-config"
  namespace: opsramp-agent
  labels:
    app: "opsramp-alert-user-config"
data:
  alert-definitions.yaml: |
    alertDefinitions:
      - resourceType: k8s_pod
        rules:
          - name: gpu_fb_memory_used_percentage
            interval: 2m
            expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
            component: "${labels.gpu}"
            isAvailability: false
            warnOperator: GREATER_THAN_EQUAL
            warnThreshold: '90'
            criticalOperator: GREATER_THAN_EQUAL
            criticalThreshold: '95'
            alertSub: '${severity} - Pod ${resource.name} GPU framebuffer memory usage is ${metric.value}%.'
            alertBody: '${severity}. GPU framebuffer memory usage on resource: ${resource.name} is ${metric.value}%.'
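
After saving the ConfigMap, you can confirm that the alert definitions were stored as expected. A minimal check, assuming the ConfigMap is created in the opsramp-agent namespace as in the earlier examples:

# View the stored alert definitions
kubectl get configmap opsramp-alert-user-config -n opsramp-agent -o yaml

As a sanity check on the expression itself: with DCGM_FI_DEV_FB_USED = 14000 MiB and DCGM_FI_DEV_FB_FREE = 2000 MiB, the rule evaluates to (14000 / (14000 + 2000)) * 100 = 87.5%, which is below the 90% warning threshold.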

Supported Metrics

The following metrics are supported for this workload by the Kubernetes 2.0 Agent:

Metric | Description | Unit
DCGM_FI_DEV_SM_CLOCK | SM clock frequency | MHz
DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency | MHz
DCGM_FI_DEV_MEMORY_TEMP | Memory temperature | °C
DCGM_FI_DEV_GPU_TEMP | Current temperature readings for the device | °C
DCGM_FI_DEV_POWER_USAGE | Power usage for the device | W
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Total energy consumption for the GPU since the driver was last reloaded | mJ
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Total number of PCIe retries | count
DCGM_FI_DEV_GPU_UTIL | GPU utilization | %
DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization | %
DCGM_FI_DEV_ENC_UTIL | Encoder utilization | %
DCGM_FI_DEV_DEC_UTIL | Decoder utilization | %
DCGM_FI_DEV_XID_ERRORS | Value of the last XID error encountered | count
DCGM_FI_DEV_FB_FREE | Free frame buffer | MiB
DCGM_FI_DEV_FB_USED | Used frame buffer | MiB
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Number of remapped rows for uncorrectable errors | count
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Number of remapped rows for correctable errors | count
DCGM_FI_DEV_ROW_REMAP_FAILURE | Indicates whether row remapping has failed | count (flag)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total number of NVLink bandwidth counters for all lanes | count
DCGM_FI_DEV_VGPU_LICENSE_STATUS | vGPU license status (0 = not licensed, 1 = licensed) | count (flag)
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active | ratio
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Ratio of cycles any tensor pipe is active | ratio
DCGM_FI_PROF_DRAM_ACTIVE | Ratio of cycles the device memory interface is active | ratio
DCGM_FI_PROF_PCIE_TX_BYTES | Number of bytes transmitted over PCIe from the GPU | bytes/sec
DCGM_FI_PROF_PCIE_RX_BYTES | Number of bytes received over PCIe by the GPU | bytes/sec