| k8s_cluster | k8s_cluster_nodes_health | Tracks the percentage of healthy nodes in the cluster. A healthy node is Ready and free from disk, memory, network, or PID pressure.
Alerts when fewer than 80% of nodes are healthy, indicating potential cluster-wide issues. | (
sum(
(
(k8s_node_condition_ready == bool 1)
* (k8s_node_condition_disk_pressure == bool 0)
* (k8s_node_condition_memory_pressure == bool 0)
* (k8s_node_condition_network_unavailable == bool 0)
* (k8s_node_condition_pid_pressure == bool 0)
)
)
/ count(k8s_node_condition_ready)
) * 100
| - name: k8s_cluster_nodes_health
interval: 5m
expr: (sum(((k8s_node_condition_ready == bool 1) * (k8s_node_condition_disk_pressure == bool 0) * (k8s_node_condition_memory_pressure == bool 0) * (k8s_node_condition_network_unavailable == bool 0) * (k8s_node_condition_pid_pressure == bool 0)))/count(k8s_node_condition_ready))*100
isAvailability: true
warnOperator: LESS_THAN
warnThreshold: '80'
criticalOperator: LESS_THAN
criticalThreshold: '60'
alertSub: '${severity} - Cluster ${resource.name} Healthy nodes percentage below ${threshold}%'
alertBody: 'Cluster ${resource.name} has only ${metric.value}% healthy nodes, below the threshold of ${threshold}%. Verify node conditions (Ready, DiskPressure, MemoryPressure, NetworkUnavailable, PIDPressure).'
|
| k8s_apiserver_requests_error_rate | Measures the percentage of successful API server requests (excluding WATCH operations).
Alerts when the success rate falls below 85% → indicates API server instability or failures. | (sum(increase(apiserver_request_total{verb!="WATCH",code=~"2.."}[5m]))
/ sum(increase(apiserver_request_total{verb!="WATCH"}[5m]))) * 100
| - name: k8s_apiserver_requests_error_rate
interval: 5m
expr: (sum(increase(apiserver_request_total{verb!="WATCH",code=~"2.."}[5m]))/ sum(increase(apiserver_request_total{verb!="WATCH"}[5m])))*100
isAvailability: true
warnOperator: LESS_THAN
warnThreshold: '85'
alertSub: '${severity} - Cluster ${resource.name} API Server availability dropped below ${threshold}%'
alertBody: 'The API server on cluster ${resource.name} is returning errors on the metric ${metric.name}. Only ${metric.value}% of non-WATCH requests succeeded in the last 5 minutes, below the defined threshold of ${threshold}%. Investigate API server logs and cluster health.'
|
| k8s_apiserver_dropped_requests_total | Total number of API server requests that were aborted by clients or terminated by the server. | sum(increase(apiserver_request_aborts_total[5m])) + sum(increase(apiserver_request_terminations_total[5m]))
| - name: k8s_apiserver_dropped_requests_total
interval: 5m
expr: sum(increase(apiserver_request_aborts_total[5m])) + sum(increase(apiserver_request_terminations_total[5m]))
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '3'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '10'
alertSub: '${severity}. High number of dropped API server requests exceeds ${threshold}.'
alertBody: '${severity}. The total number of API server requests that were aborted by clients or terminated by the server is ${metric.value} in the last 5 minutes, exceeding the threshold of ${threshold}.'
|
| k8s_coredns_health_request_failures_total | The number of CoreDNS health check request failures. | sum(increase(coredns_health_request_failures_total[5m]))
| - name: k8s_coredns_health_request_failures_total
interval: 5m
expr: sum(increase(coredns_health_request_failures_total[5m]))
isAvailability: true
warnOperator: GREATER_THAN
warnThreshold: '0'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '3'
alertSub: '${severity}: CoreDNS health check failures detected.'
alertBody: '${severity}: The number of CoreDNS health check request failures has increased by ${metric.value} in the last 5 minutes, exceeding the threshold of ${threshold}.'
|
| k8s_node | k8s_node_condition | Checks if node is Ready and not under pressure (disk, memory, PID, network).
Alert if a node becomes unhealthy. | (
(k8s_node_condition_ready == bool 1)
* (k8s_node_condition_disk_pressure == bool 0)
* (k8s_node_condition_memory_pressure == bool 0)
* (k8s_node_condition_network_unavailable == bool 0)
* (k8s_node_condition_pid_pressure == bool 0)
)
| - name: k8s_node_condition
interval: 5m
expr: ((k8s_node_condition_ready == bool 1) * (k8s_node_condition_disk_pressure == bool 0) * (k8s_node_condition_memory_pressure == bool 0) * (k8s_node_condition_network_unavailable == bool 0) * (k8s_node_condition_pid_pressure == bool 0))
isAvailability: true
criticalOperator: EQUAL
criticalThreshold: '0'
alertSub: '${severity} - Node ${resource.name} is unhealthy.'
alertBody: 'Node ${resource.name} failed one or more health conditions (Ready, DiskPressure, MemoryPressure, Network, PIDPressure). Metric: ${metric.value}. Immediate remediation needed.'
|
| k8s_node_cpu_usage_percent | Compares node CPU usage against allocatable CPU.
Alert when ≥ 90% (warning) or ≥ 95% (critical) → node CPU bottleneck. | (
(sum by (k8s_node_name) (k8s_node_cpu_usage)
/ sum by (k8s_node_name) (k8s_node_allocatable_cpu))
) * 100
| - name: k8s_node_cpu_usage_percent
interval: 5m
expr: ((sum by (k8s_node_name) (k8s_node_cpu_usage) / sum by (k8s_node_name) (k8s_node_allocatable_cpu)) * 100)
isAvailability: false
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '90'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '95'
alertSub: '${severity} - Node ${resource.name} CPU Usage is above ${threshold}%'
alertBody: 'Node ${resource.name} CPU usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Consider scaling nodes or workloads.'
|
| k8s_node_memory_usage_percent | Compares node working set memory vs. available memory.
Alert when ≥ 90% (warning) or ≥ 95% (critical) → node memory saturation. | (
sum by (k8s_node_name) (k8s_node_memory_working_set)
/ (
sum by (k8s_node_name)
(k8s_node_memory_working_set + k8s_node_memory_available)
)
) * 100
| - name: k8s_node_memory_usage_percent
interval: 5m
expr: (sum by (k8s_node_name) (k8s_node_memory_working_set) / (sum by (k8s_node_name) (k8s_node_memory_working_set + k8s_node_memory_available))) * 100
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '90'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '95'
alertSub: '${severity} - Node ${resource.name} Memory Usage is above ${threshold}%'
alertBody: 'Node ${resource.name} memory usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Investigate workload memory usage or scale resources.'
|
| k8s_node_disk_usage_percent | Compares node filesystem capacity vs. filesystem usage.
Alert when ≥ 90% (warning) or ≥ 95% (critical) → node filesystem saturation. | (sum by (k8s_node_name) (k8s_node_filesystem_usage) / sum by (k8s_node_name) (k8s_node_filesystem_capacity)) * 100
| - name: k8s_node_disk_usage_percent
interval: 5m
expr: (sum by (k8s_node_name) (k8s_node_filesystem_usage) / sum by (k8s_node_name) (k8s_node_filesystem_capacity)) * 100
isAvailability: false
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '90'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '95'
alertSub: '${severity} - Node ${resource.name} Disk Usage on ${component.name} is above ${threshold}%'
alertBody: 'Node ${resource.name} disk usage on ${component.name} is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Consider cleaning up disk space or scaling storage.'
|
| k8s_node_network_rx_error_rate | The amount of Rx (receive) errors per second on node. | rate(k8s_node_network_errors{direction="receive"}[5m])
| - name: k8s_node_network_rx_error_rate
interval: 5m
expr: rate(k8s_node_network_errors{direction="receive"}[5m])
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '5'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '20'
alertSub: '${severity}: High network Rx error rate on node ${resource.name}.'
alertBody: '${severity}: Node ${resource.name} is seeing ${metric.value} Rx (receive) errors per second, exceeding the threshold of ${threshold} errors/second.'
|
| k8s_node_network_tx_error_rate | The amount of Tx (transmit) errors per second on node. | rate(k8s_node_network_errors{direction="transmit"}[5m])
| - name: k8s_node_network_tx_error_rate
interval: 5m
expr: rate(k8s_node_network_errors{direction="transmit"}[5m])
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '2'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '20'
alertSub: '${severity}: High network Tx error rate on node ${resource.name}.'
alertBody: '${severity}: Node ${resource.name} is seeing ${metric.value} Tx (transmit) errors per second, exceeding the threshold of ${threshold} errors/second.'
|
| k8s_pod | k8s_pod_phase | Checks if a pod is in Failed (2) or Unknown (3) phase.
Alert when pods are stuck in bad states. | (k8s_pod_phase == bool 2) + (k8s_pod_phase == bool 3)
| - name: k8s_pod_phase
interval: 5m
expr: (k8s_pod_phase == bool 2) + (k8s_pod_phase == bool 3)
isAvailability: true
criticalOperator: EQUAL
criticalThreshold: '1'
alertSub: '${severity} - Pod ${resource.name} is in Failed or Unknown state.'
alertBody: 'Pod ${resource.name} is in a Failed or Unknown phase (metric value ${metric.value}). Immediate attention is required to restore the workload.'
|
| k8s_pod_cpu_usage_percent | Compares pod CPU usage vs. defined CPU limits.
Alert when usage ≥ 90% (warning) or ≥ 95% (critical) → CPU saturation risk | (
sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_cpu_usage)
/ sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_cpu_limit)
) * 100
| - name: k8s_pod_cpu_usage_percent
interval: 5m
expr: (sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_cpu_usage) / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_cpu_limit)) * 100
isAvailability: false
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '90'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '95'
alertSub: '${severity} - Pod ${resource.name} CPU Usage is above ${threshold}%'
alertBody: 'Pod ${resource.name} CPU usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Check workload resource requests/limits or scaling.'
|
| k8s_pod_memory_usage_percent | Compares pod memory usage vs. defined memory limits.
Alert when usage ≥ 90% (warning) or ≥ 95% (critical) → memory exhaustion risk | (
sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_memory_working_set)
/ sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_memory_limit)
) * 100
| - name: k8s_pod_memory_usage_percent
interval: 5m
expr: (sum by (k8s_pod_name, k8s_namespace_name) (k8s_pod_memory_working_set) / sum by (k8s_pod_name, k8s_namespace_name) (k8s_container_memory_limit)) * 100
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '90'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '95'
alertSub: '${severity} - Pod ${resource.name} Memory Usage is above ${threshold}%'
alertBody: 'Pod ${resource.name} memory usage is ${metric.value}%, exceeding the defined threshold of ${threshold}%. Investigate memory leaks or adjust memory requests/limits.'
|
| k8s_namespace | k8s_namespace_memory_mb | Aggregates memory usage of all pods in a namespace (in MB).
Alert when ≥ 50000MB (warning) → prevents namespace-level resource exhaustion. | (sum by (k8s_cluster_name, k8s_namespace_name) (k8s_pod_memory_usage/1000000))
| - name: k8s_namespace_memory_mb
interval: 5m
expr: (sum by (k8s_cluster_name, k8s_namespace_name) (k8s_pod_memory_usage/1000000))
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '50000'
alertSub: '${severity} - Namespace ${resource.name} Memory Usage is above ${threshold} MB.'
alertBody: 'Namespace ${resource.name} memory usage is ${metric.value} MB, exceeding the defined threshold of ${threshold} MB. Investigate workload memory usage or scale resources.'
|
| k8s_deployment | k8s_deployment_status | Ratio of available pods to desired replicas in a deployment.
Alert when <= 0.9 (warning) or <= 0.8 (critical) → deployment not meeting replica goals. | (k8s_deployment_available / k8s_deployment_desired)
| - name: k8s_deployment_status
interval: 5m
expr: (k8s_deployment_available/k8s_deployment_desired)
isAvailability: true
warnOperator: LESS_THAN_EQUAL
warnThreshold: '0.9'
criticalOperator: LESS_THAN_EQUAL
criticalThreshold: '0.8'
alertSub: '${severity} - Deployment ${resource.name} availability is below the threshold of ${threshold}.'
alertBody: 'Deployment ${resource.name} has an available-to-desired replica ratio of ${metric.value}, below the defined threshold of ${threshold}. Some Pods may not be running as expected.'
|
| k8s_deployment_replicas_unavailable | The number of unavailable replicas for deployment. | (k8s_deployment_desired - k8s_deployment_available)
| - name: k8s_deployment_replicas_unavailable
interval: 5m
expr: (k8s_deployment_desired - k8s_deployment_available)
isAvailability: true
warnOperator: GREATER_THAN_EQUAL
warnThreshold: '1'
criticalOperator: GREATER_THAN_EQUAL
criticalThreshold: '2'
alertSub: '${severity}: Deployment ${resource.name} has ${metric.value} unavailable replicas.'
alertBody: '${severity}: The number of unavailable replicas for deployment ${resource.name} is ${metric.value}.'
|
| k8s_daemonset | k8s_daemonset_status | Ratio of scheduled nodes vs. desired nodes in a DaemonSet.
Alert when <= 0.9 (warning) or <= 0.8 (critical) → DaemonSet pods missing on nodes. | (k8s_daemonset_current_scheduled_nodes
/ k8s_daemonset_desired_scheduled_nodes)
| - name: k8s_daemonset_status
interval: 5m
expr: >-
(k8s_daemonset_current_scheduled_nodes/k8s_daemonset_desired_scheduled_nodes)
isAvailability: true
warnOperator: LESS_THAN_EQUAL
warnThreshold: '0.9'
criticalOperator: LESS_THAN_EQUAL
criticalThreshold: '0.8'
alertSub: '${severity} - DaemonSet ${resource.name} scheduled node ratio is below the threshold of ${threshold}.'
alertBody: 'DaemonSet ${resource.name} has a scheduled-to-desired node ratio of ${metric.value}, below the defined threshold of ${threshold}. Some nodes are missing DaemonSet Pods.'
|
| k8s_replicaset | k8s_replicaset_status | Ratio of available pods to desired replicas in a ReplicaSet.
Alert when <= 0.9 (warning) or <= 0.8 (critical) → ReplicaSet not maintaining availability. | (k8s_replicaset_available / k8s_replicaset_desired)
| - name: k8s_replicaset_status
interval: 5m
expr: (k8s_replicaset_available/k8s_replicaset_desired)
isAvailability: true
warnOperator: LESS_THAN_EQUAL
warnThreshold: '0.9'
criticalOperator: LESS_THAN_EQUAL
criticalThreshold: '0.8'
alertSub: '${severity} - ReplicaSet ${resource.name} availability is below the threshold of ${threshold}.'
alertBody: 'ReplicaSet ${resource.name} has an available-to-desired replica ratio of ${metric.value}, below the defined threshold of ${threshold}. Some Pods may not be running as expected.'
|
| k8s_statefulset | k8s_statefulset_status | Ratio of current pods to desired pods in a StatefulSet.
Alert when <= 0.9 (warning) or <= 0.8 (critical) → StatefulSet pods not fully running. | (k8s_statefulset_current_pods / k8s_statefulset_desired_pods)
| - name: k8s_statefulset_status
interval: 5m
expr: (k8s_statefulset_current_pods/k8s_statefulset_desired_pods)
isAvailability: true
warnOperator: LESS_THAN_EQUAL
warnThreshold: '0.9'
criticalOperator: LESS_THAN_EQUAL
criticalThreshold: '0.8'
alertSub: '${severity} - StatefulSet ${resource.name} availability is below the threshold of ${threshold}.'
alertBody: 'StatefulSet ${resource.name} has a current-to-desired Pod ratio of ${metric.value}, below the defined threshold of ${threshold}. Check persistent volumes and StatefulSet events.'
|
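The `expr` values above are plain PromQL, so they can be sanity-checked outside the alerting platform by mirroring them into a standard Prometheus rules file and running `promtool test rules`. A minimal sketch for the `k8s_deployment_status` ratio, using hypothetical file names and a hypothetical `deployment` label:

```yaml
# deployment_status_test.yaml — run with: promtool test rules deployment_status_test.yaml
# rules.yaml is assumed to hold a recording rule wrapping the same expression.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 4 of 5 desired replicas available, held constant for 10 samples
      - series: 'k8s_deployment_available{deployment="web"}'
        values: '4x10'
      - series: 'k8s_deployment_desired{deployment="web"}'
        values: '5x10'
    promql_expr_test:
      - expr: (k8s_deployment_available / k8s_deployment_desired)
        eval_time: 5m
        exp_samples:
          - labels: '{deployment="web"}'
            value: 0.8
```

A ratio of 0.8 would trip both the warning (LESS_THAN_EQUAL 0.9) and critical (LESS_THAN_EQUAL 0.8) thresholds defined for `k8s_deployment_status` above.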