The Alert Definition Repository provides predefined Kubernetes alert rules with clear descriptions and PromQL expressions. Use them as a baseline to monitor and customize alerts for your environment.
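
For example, a copied definition with tightened thresholds might look like the sketch below. It reuses the k8s_node_disk_usage_percent rule listed later in this table and only changes the threshold values (85 and 92 here are illustrative, not recommendations); the remaining fields would be carried over unchanged from the baseline.

- name: k8s_node_disk_usage_percent
  interval: 5m
  expr: (sum by (k8s_node_name) (k8s_node_filesystem_usage) / sum by (k8s_node_name) (k8s_node_filesystem_capacity)) * 100
  isAvailability: false
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '85'        # baseline default is 90; illustrative override
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '92'    # baseline default is 95; illustrative override
  # alertSub and alertBody carried over unchanged from the baseline definition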

Resource Type: k8s_cluster

Rule Name: k8s_cluster_nodes_health
Description: Tracks the percentage of healthy nodes in the cluster. A healthy node is Ready and free from disk, memory, network, or PID pressure. Alerts when fewer than 80% of nodes are healthy, indicating potential cluster-wide issues.
Alert Definition Expression:
(
  sum(
    (
      (k8s_node_condition_ready == bool 1)
      * (k8s_node_condition_disk_pressure == bool 0)
      * (k8s_node_condition_memory_pressure == bool 0)
      * (k8s_node_condition_network_unavailable == bool 0)
      * (k8s_node_condition_pid_pressure == bool 0)
    )
  )
  / sum(k8s_node_condition_ready)
) * 100
Alert Definition:
- name: k8s_cluster_nodes_health
  interval: 5m
  expr: (sum(((k8s_node_condition_ready == bool 1) * (k8s_node_condition_disk_pressure == bool 0) * (k8s_node_condition_memory_pressure == bool 0) * (k8s_node_condition_network_unavailable == bool 0) * (k8s_node_condition_pid_pressure == bool 0)))/sum(k8s_node_condition_ready))*100
  isAvailability: true
  warnOperator: LESS_THAN
  warnThreshold: '80'
  criticalOperator: LESS_THAN
  criticalThreshold: '60'
  alertSub: '${severity} - Cluster ${resource.name} Healthy nodes percentage below ${threshold}%'
  alertBody: 'Cluster ${resource.name} has only ${metric.value}% healthy nodes, below the threshold of ${threshold}%. Verify node conditions (Ready, DiskPressure, MemoryPressure, Network).'
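
The baseline expression reports a single cluster-wide percentage. If your node condition metrics also carry a cluster label (the k8s_namespace_memory_mb rule further down groups by k8s_cluster_name, so such a label may exist in your metric set), a per-cluster breakdown is a small change. A minimal sketch of the expr field, under that assumption:

  # Hypothetical variant: break the healthy-node percentage out per cluster,
  # assuming the k8s_node_condition_* metrics carry a k8s_cluster_name label.
  expr: >-
    (sum by (k8s_cluster_name) ((k8s_node_condition_ready == bool 1)
      * (k8s_node_condition_disk_pressure == bool 0)
      * (k8s_node_condition_memory_pressure == bool 0)
      * (k8s_node_condition_network_unavailable == bool 0)
      * (k8s_node_condition_pid_pressure == bool 0))
    / sum by (k8s_cluster_name) (k8s_node_condition_ready)) * 100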

Rule Name: k8s_apiserver_requests_error_rate
Description: Measures the percentage of successful API server requests (excluding WATCH operations). Alert when the success rate falls below 85% → indicates API server instability or failures.
Alert Definition Expression:
(sum(increase(apiserver_request_total{verb!="WATCH",code=~"2.."}[5m]))
/ sum(increase(apiserver_request_total{verb!="WATCH"}[5m]))) * 100
Alert Definition:
- name: k8s_apiserver_requests_error_rate
  interval: 5m
  expr: (sum(increase(apiserver_request_total{verb!="WATCH",code=~"2.."}[5m]))/ sum(increase(apiserver_request_total{verb!="WATCH"}[5m])))*100
  isAvailability: true
  warnOperator: LESS_THAN
  warnThreshold: '85'
  alertSub: '${severity} - Cluster ${resource.name} API Server availability dropped below ${threshold}%'
  alertBody: 'The API server on cluster ${resource.name} is returning errors on the metric ${metric.name}. Only ${metric.value}% of non-WATCH requests succeeded in the last 5 minutes, below the defined threshold of ${threshold}%. Investigate API server logs and cluster health.'

Rule Name: k8s_apiserver_dropped_requests_total
Description: Total number of API server requests that were aborted by clients or terminated by the server.
Alert Definition Expression:
sum(increase(apiserver_request_aborts_total[5m])) + sum(increase(apiserver_request_terminations_total[5m]))
Alert Definition:
- name: k8s_apiserver_dropped_requests_total
  interval: 5m
  expr: sum(increase(apiserver_request_aborts_total[5m])) + sum(increase(apiserver_request_terminations_total[5m]))
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '3'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '10'
  alertSub: '${severity}. High number of dropped API server requests exceeds ${threshold}.'
  alertBody: '${severity}. The total number of API server requests that were aborted by clients or terminated by the server is ${metric.value} in the last 5 minutes, exceeding the threshold of ${threshold}.'

Rule Name: k8s_coredns_health_request_failures_total
Description: The number of CoreDNS health check request failures.
Alert Definition Expression:
sum(increase(coredns_health_request_failures_total[5m]))
Alert Definition:
- name: k8s_coredns_health_request_failures_total
  interval: 5m
  expr: sum(increase(coredns_health_request_failures_total[5m]))
  isAvailability: true
  warnOperator: GREATER_THAN
  warnThreshold: '0'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '3'
  alertSub: '${severity}: CoreDNS health check failures detected.'
  alertBody: '${severity}: The number of CoreDNS health check request failures has increased by ${metric.value} in the last 5 minutes, exceeding the threshold of ${threshold}.'

Resource Type: k8s_node

Rule Name: k8s_node_condition
Description: Checks if node is Ready and not under pressure (disk, memory, PID, network). Alert if a node becomes unhealthy.
Alert Definition Expression:
(
  (k8s_node_condition_ready == bool 1)
  * (k8s_node_condition_disk_pressure == bool 0)
  * (k8s_node_condition_memory_pressure == bool 0)
  * (k8s_node_condition_network_unavailable == bool 0)
  * (k8s_node_condition_pid_pressure == bool 0)
)
Alert Definition:
- name: k8s_node_condition
  interval: 5m
  expr: ((k8s_node_condition_ready == bool 1) * (k8s_node_condition_disk_pressure == bool 0) * (k8s_node_condition_memory_pressure == bool 0) * (k8s_node_condition_network_unavailable == bool 0) * (k8s_node_condition_pid_pressure == bool 0))
  isAvailability: true
  criticalOperator: EQUAL
  criticalThreshold: '0'
  alertSub: '${severity} - Node ${resource.name} is unhealthy.'
  alertBody: 'Node ${resource.name} failed one or more health conditions (Ready, DiskPressure, MemoryPressure, Network, PIDPressure). Metric: ${metric.value}. Immediate remediation needed.'

Rule Name: k8s_node_cpu_usage_percent
Description: Compares node CPU usage against allocatable CPU. Alert when ≥ 90% (warning) or ≥ 95% (critical) → node CPU bottleneck.
Alert Definition Expression:
(
  (sum by (k8s_node_name) (k8s_node_cpu_usage)
  / sum by (k8s_node_name) (k8s_node_allocatable_cpu))
) * 100
Alert Definition:
- name: k8s_node_cpu_usage_percent
  interval: 5m
  expr: ((sum by (k8s_node_name) (k8s_node_cpu_usage) / sum by (k8s_node_name) (k8s_node_allocatable_cpu)) * 100)
  isAvailability: false
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '90'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '95'
  alertSub: '${severity} - Node ${resource.name} CPU Usage is above ${threshold}%'
  alertBody: 'Node ${resource.name} CPU usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Consider scaling nodes or workloads.'

Rule Name: k8s_node_memory_usage_percent
Description: Compares node working set memory vs. available memory. Alert when ≥ 90% (warning) or ≥ 95% (critical) → node memory saturation.
Alert Definition Expression:
(
  sum by (k8s_node_name) (k8s_node_memory_working_set)
  / (
    sum by (k8s_node_name)
    (k8s_node_memory_working_set + k8s_node_memory_available)
  )
) * 100
Alert Definition:
- name: k8s_node_memory_usage_percent
  interval: 5m
  expr: (sum by (k8s_node_name) (k8s_node_memory_working_set) / (sum by (k8s_node_name) (k8s_node_memory_working_set + k8s_node_memory_available))) * 100
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '90'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '95'
  alertSub: '${severity} - Node ${resource.name} Memory Usage is above ${threshold}%'
  alertBody: 'Node ${resource.name} memory usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Investigate workload memory usage or scale resources.'

Rule Name: k8s_node_disk_usage_percent
Description: Compares node filesystem usage vs. filesystem capacity. Alert when ≥ 90% (warning) or ≥ 95% (critical) → node filesystem saturation.
Alert Definition Expression:
(sum by (k8s_node_name) (k8s_node_filesystem_usage) / sum by (k8s_node_name) (k8s_node_filesystem_capacity)) * 100
Alert Definition:
- name: k8s_node_disk_usage_percent
  interval: 5m
  expr: (sum by (k8s_node_name) (k8s_node_filesystem_usage) / sum by (k8s_node_name) (k8s_node_filesystem_capacity)) * 100
  isAvailability: false
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '90'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '95'
  alertSub: '${severity} - Node ${resource.name} Disk Usage on ${component.name} is above ${threshold}%'
  alertBody: 'Node ${resource.name} disk usage on ${component.name} is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Evaluate options to clean up disk space or scale resources.'

Rule Name: k8s_node_network_rx_error_rate
Description: The number of Rx (receive) errors per second on the node.
Alert Definition Expression:
rate(k8s_node_network_errors{direction="receive"}[5m])
Alert Definition:
- name: k8s_node_network_rx_error_rate
  interval: 5m
  expr: rate(k8s_node_network_errors{direction="receive"}[5m])
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '5'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '20'
  alertSub: '${severity}: High network Rx error rate on node ${resource.name}.'
  alertBody: '${severity}: The Rx error rate on node ${resource.name} is ${metric.value} errors/second.'

Rule Name: k8s_node_network_tx_error_rate
Description: The number of Tx (transmit) errors per second on the node.
Alert Definition Expression:
rate(k8s_node_network_errors{direction="transmit"}[5m])
Alert Definition:
- name: k8s_node_network_tx_error_rate
  interval: 5m
  expr: rate(k8s_node_network_errors{direction="transmit"}[5m])
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '2'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '20'
  alertSub: '${severity}: High network Tx error rate on node ${resource.name}.'
  alertBody: '${severity}: The Tx error rate on node ${resource.name} is ${metric.value} errors/second.'
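
Depending on the labels your k8s_node_network_errors series carry, the two network rules above may return multiple series per node (for example, one per interface and direction). If a single combined figure per node is preferred, the rate can be aggregated by node. A minimal sketch of the expr field, assuming the metric carries the same k8s_node_name label used by the node CPU and memory rules above:

  # Hypothetical aggregation: one combined error rate per node across both
  # directions (and any per-interface series, if present), assuming
  # k8s_node_network_errors carries a k8s_node_name label.
  expr: >-
    sum by (k8s_node_name) (rate(k8s_node_network_errors[5m]))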

Resource Type: k8s_pod

Rule Name: k8s_pod_phase
Description: Checks if a pod is in Failed (2) or Unknown (3) phase. Alert when pods are stuck in bad states.
Alert Definition Expression:
(k8s_pod_phase == bool 2) or (k8s_pod_phase == bool 3)
Alert Definition:
- name: k8s_pod_phase
  interval: 5m
  expr: (k8s_pod_phase == bool 2) or (k8s_pod_phase == bool 3)
  isAvailability: true
  criticalOperator: EQUAL
  criticalThreshold: '0'
  alertSub: '${severity} - Pod ${resource.name} is in Failed or Unknown state.'
  alertBody: 'Pod ${resource.name} has entered phase ${metric.value} (Failed/Unknown). Immediate attention required to restore workload.'

Rule Name: k8s_pod_cpu_usage_percent
Description: Compares pod CPU usage vs. defined CPU limits. Alert when usage ≥ 90% (warning) or ≥ 95% (critical) → CPU saturation risk.
Alert Definition Expression:
(
  sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_cpu_usage)
  / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_cpu_limit)
) * 100
Alert Definition:
- name: k8s_pod_cpu_usage_percent
  interval: 5m
  expr: (sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_cpu_usage) / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_cpu_limit)) * 100
  isAvailability: false
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '90'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '95'
  alertSub: '${severity} - Pod ${resource.name} CPU Usage is above ${threshold}%'
  alertBody: 'Pod ${resource.name} CPU usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Check workload resource requests/limits or scaling.'

Rule Name: k8s_pod_memory_usage_percent
Description: Compares pod memory usage vs. defined memory limits. Alert when usage ≥ 90% (warning) or ≥ 95% (critical) → memory exhaustion risk.
Alert Definition Expression:
(
  sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_memory_working_set)
  / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_memory_limit)
) * 100
Alert Definition:
- name: k8s_pod_memory_usage_percent
  interval: 5m
  expr: (sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_memory_working_set) / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_memory_limit)) * 100
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '90'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '95'
  alertSub: '${severity} - Pod ${resource.name} Memory Usage is above ${threshold}%'
  alertBody: 'Pod ${resource.name} memory usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Investigate memory leaks or adjust memory requests/limits.'

Resource Type: k8s_namespace

Rule Name: k8s_namespace_memory_mb
Description: Aggregates memory usage of all pods in a namespace (in MB). Alert when ≥ 50000 MB (warning) → prevents namespace-level resource exhaustion.
Alert Definition Expression:
(sum by (k8s_cluster_name, k8s_namespace_name) (k8s_pod_memory_usage/1000000))
Alert Definition:
- name: k8s_namespace_memory_mb
  interval: 5m
  expr: (sum by (k8s_cluster_name, k8s_namespace_name) (k8s_pod_memory_usage/1000000))
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '50000'
  alertSub: '${severity} - Namespace ${resource.name} Memory Usage is above ${threshold} MB.'
  alertBody: 'Namespace ${resource.name} memory usage is ${metric.value} MB, exceeding the defined threshold of ${threshold} MB. Investigate workload memory usage or scale resources.'
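
Because the rule groups by k8s_namespace_name, the same expression can be narrowed with a label matcher, for example to exclude system namespaces whose memory footprint is expected to be large. A minimal sketch of the expr field; the namespace names in the matcher are illustrative:

  # Hypothetical variant: exclude selected system namespaces (names are
  # illustrative) from the per-namespace memory total, still reported in MB.
  expr: >-
    sum by (k8s_cluster_name, k8s_namespace_name)
      (k8s_pod_memory_usage{k8s_namespace_name!~"kube-system|monitoring"} / 1000000)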

Resource Type: k8s_deployment

Rule Name: k8s_deployment_status
Description: Ratio of available pods to desired replicas in a deployment. Alert when ≤ 0.9 (warning) or ≤ 0.8 (critical) → deployment not meeting replica goals.
Alert Definition Expression:
(k8s_deployment_available / k8s_deployment_desired)
Alert Definition:
- name: k8s_deployment_status
  interval: 5m
  expr: (k8s_deployment_available/k8s_deployment_desired)
  isAvailability: true
  warnOperator: LESS_THAN_EQUAL
  warnThreshold: '0.9'
  criticalOperator: LESS_THAN_EQUAL
  criticalThreshold: '0.8'
  alertSub: '${severity} - Deployment ${resource.name} availability is below the threshold of ${threshold} on cluster ${resource.name}.'
  alertBody: 'Deployment ${resource.name} has an available-to-desired replica ratio of ${metric.value}, below the defined threshold of ${threshold}, on cluster ${resource.name}. Some Pods may not be running as expected.'

Rule Name: k8s_deployment_replicas_unavailable
Description: The number of unavailable replicas for a deployment.
Alert Definition Expression:
(k8s_deployment_desired - k8s_deployment_available)
Alert Definition:
- name: k8s_deployment_replicas_unavailable
  interval: 5m
  expr: (k8s_deployment_desired - k8s_deployment_available)
  isAvailability: true
  warnOperator: GREATER_THAN_EQUAL
  warnThreshold: '1'
  criticalOperator: GREATER_THAN_EQUAL
  criticalThreshold: '2'
  alertSub: '${severity}: Deployment ${resource.name} has ${metric.value} unavailable replicas.'
  alertBody: '${severity}: The number of unavailable replicas for deployment ${resource.name} is ${metric.value}.'

Resource Type: k8s_daemonset

Rule Name: k8s_daemonset_status
Description: Ratio of scheduled nodes vs. desired nodes in a DaemonSet. Alert when ≤ 0.9 (warning) or ≤ 0.8 (critical) → DaemonSet pods missing on nodes.
Alert Definition Expression:
(k8s_daemonset_current_scheduled_nodes
/ k8s_daemonset_desired_scheduled_nodes)
Alert Definition:
- name: k8s_daemonset_status
  interval: 5m
  expr: >-
    (k8s_daemonset_current_scheduled_nodes/k8s_daemonset_desired_scheduled_nodes)
  isAvailability: true
  warnOperator: LESS_THAN_EQUAL
  warnThreshold: '0.9'
  criticalOperator: LESS_THAN_EQUAL
  criticalThreshold: '0.8'
  alertSub: '${severity} - DaemonSet ${resource.name} scheduled-node ratio is below the threshold of ${threshold} on cluster ${resource.name}.'
  alertBody: 'DaemonSet ${resource.name} has a scheduled-to-desired node ratio of ${metric.value}, below the defined threshold of ${threshold}, on cluster ${resource.name}. Some nodes are missing DaemonSet Pods.'

Resource Type: k8s_replicaset

Rule Name: k8s_replicaset_status
Description: Ratio of available pods to desired replicas in a ReplicaSet. Alert when ≤ 0.9 (warning) or ≤ 0.8 (critical) → ReplicaSet not maintaining availability.
Alert Definition Expression:
(k8s_replicaset_available / k8s_replicaset_desired)
Alert Definition:
- name: k8s_replicaset_status
  interval: 5m
  expr: (k8s_replicaset_available/k8s_replicaset_desired)
  isAvailability: true
  warnOperator: LESS_THAN_EQUAL
  warnThreshold: '0.9'
  criticalOperator: LESS_THAN_EQUAL
  criticalThreshold: '0.8'
  alertSub: '${severity}. ${metric.name} on the resource ${resource.name} is ${metric.value}.'
  alertBody: '${severity}. ${metric.name} on the resource: ${resource.name} is ${metric.value}.'

Resource Type: k8s_statefulset

Rule Name: k8s_statefulset_status
Description: Ratio of current pods to desired pods in a StatefulSet. Alert when ≤ 0.9 (warning) or ≤ 0.8 (critical) → StatefulSet pods not fully running.
Alert Definition Expression:
(k8s_statefulset_current_pods / k8s_statefulset_desired_pods)
Alert Definition:
- name: k8s_statefulset_status
  interval: 5m
  expr: (k8s_statefulset_current_pods/k8s_statefulset_desired_pods)
  isAvailability: true
  warnOperator: LESS_THAN_EQUAL
  warnThreshold: '0.9'
  criticalOperator: LESS_THAN_EQUAL
  criticalThreshold: '0.8'
  alertSub: '${severity} - StatefulSet ${resource.name} availability is below the threshold of ${threshold} on cluster ${resource.name}.'
  alertBody: 'StatefulSet ${resource.name} has a current-to-desired Pod ratio of ${metric.value}, below the defined threshold of ${threshold}, on cluster ${resource.name}. Check persistent volumes and StatefulSet events.'