Introduction
When monitoring cloud metrics, it is important to understand the inherent delays in data reporting and alerting. These delays apply across major cloud service providers, including AWS, Azure, and GCP. This document outlines the nature of these delays and their impact on monitoring and alerting processes.
Cloud Metrics Delay
Every cloud service provider (AWS, Azure, and GCP) publishes metrics data with a delay. Up to 10 minutes can pass between the time an event or measurement occurs and the time the data becomes available in the cloud platform’s metrics dashboard.
Example:
- Event Time: 8:00 PM
- Data Available in Cloud Dashboard: As late as 8:10 PM (delay of up to 10 minutes)
Monitoring Frequency and Data Collection
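Once the metrics data is available on the cloud platform, our platform collects it at the monitoring frequency the customer has configured (for example, every 5 minutes). The data is picked up at the first collection cycle after it becomes available, so collection can add up to one monitoring interval of further delay.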
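As a rough illustration of how the collection cycle interacts with data availability, the sketch below computes the first collection cycle that can pick up a datapoint. It assumes collection runs on a fixed schedule starting at a known time. The function name first_collection_time and the 8:03 PM schedule are hypothetical and do not describe the platform's actual scheduler.

```python
from datetime import datetime, timedelta


def first_collection_time(available_at: datetime,
                          first_poll: datetime,
                          interval: timedelta) -> datetime:
    """Return the first scheduled poll at or after `available_at`.

    Hypothetical model: polls happen at first_poll, first_poll + interval, ...
    """
    if available_at <= first_poll:
        return first_poll
    elapsed = available_at - first_poll
    # Ceiling division on timedeltas: number of whole intervals to wait.
    polls_needed = -(-elapsed // interval)
    return first_poll + polls_needed * interval


# Data for an 8:00 PM event becomes available at 8:10 PM (assumed 10-minute delay).
available = datetime(2024, 1, 1, 20, 10)
# Assumed schedule: polls at 8:03 PM, 8:08 PM, 8:13 PM, ...
collected = first_collection_time(available,
                                  first_poll=datetime(2024, 1, 1, 20, 3),
                                  interval=timedelta(minutes=5))
print(collected.time())  # 20:13:00
```

In this example the data misses the 8:08 PM cycle and is collected at 8:13 PM, 13 minutes after the event.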
Alerting Mechanism
If the collected data breaches any predefined threshold, an alert is generated. Because the delay in data availability and the collection interval add up, the total delay from the actual event to the alert can be up to 15 minutes with a 5-minute monitoring interval.
Total Delay Breakdown:
- Cloud Metrics Delay: Up to 10 minutes
- Monitoring Interval: Up to 5 minutes (with a 5-minute monitoring frequency)
- Total Possible Delay: Up to 15 minutes
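The breakdown above is simple addition, and a minimal sketch can make it easy to repeat for other monitoring intervals. The constant and function names are illustrative only, and both values are upper bounds taken from this document rather than guaranteed figures.

```python
from datetime import datetime, timedelta

# Upper bounds from the breakdown above (assumed, not measured).
CLOUD_METRICS_DELAY = timedelta(minutes=10)
MONITORING_INTERVAL = timedelta(minutes=5)


def worst_case_alert_time(event_time: datetime,
                          cloud_delay: timedelta = CLOUD_METRICS_DELAY,
                          interval: timedelta = MONITORING_INTERVAL) -> datetime:
    """Latest time an alert could fire for an event at `event_time`."""
    return event_time + cloud_delay + interval


event = datetime(2024, 1, 1, 20, 0)  # event at 8:00 PM
print(worst_case_alert_time(event).time())  # 20:15:00
```

Passing a different interval, for example timedelta(minutes=1), shows how a shorter monitoring frequency reduces only the second component. The cloud provider's delay stays the same.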
Practical Implications
To illustrate this with a concrete example, let’s consider an AWS metric:
Scenario:
- Current Time: 8:00 PM
- Metrics Delay: The latest data shown reflects metrics as of approximately 7:50 PM
- AWS Portal Check: The AWS portal itself does not show real-time data; it shows the same delayed data.
When you check the AWS metrics graph at 8:00 PM, the most recent data shown is for approximately 7:50 PM, because the AWS platform can take up to 10 minutes to populate it. You will observe the same delay if you check the AWS portal directly for metrics reported at a 5-minute frequency.
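The scenario can be expressed as a small calculation that estimates the newest datapoint you should expect to see at a given moment. This is a sketch under two assumptions that are not stated by any provider here: the full 10-minute delay applies, and datapoints are aligned to clock boundaries of the reporting period. The function name latest_expected_datapoint is hypothetical.

```python
from datetime import datetime, timedelta


def latest_expected_datapoint(now: datetime,
                              delay: timedelta = timedelta(minutes=10),
                              period: timedelta = timedelta(minutes=5)) -> datetime:
    """Estimate the timestamp of the newest datapoint visible at `now`."""
    cutoff = now - delay
    # Round down to a period boundary (assumes periods align to the clock).
    periods = (cutoff - datetime.min) // period
    return datetime.min + periods * period


print(latest_expected_datapoint(datetime(2024, 1, 1, 20, 0)).time())  # 19:50:00
print(latest_expected_datapoint(datetime(2024, 1, 1, 20, 2)).time())  # 19:50:00
```

Because the delay is "up to" 10 minutes, newer datapoints may sometimes already be present. The estimate is a conservative lower bound, not a prediction.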
Conclusion
Understanding these delays is critical for effective cloud metrics monitoring and alerting. The inherent delay of up to 10 minutes from the cloud service provider, coupled with the chosen monitoring interval (for example, 5 minutes), means alerts may be delayed by up to 15 minutes from the actual event occurrence. This should be factored into any monitoring and response strategies to ensure timely and accurate responses to critical events.
By being aware of these delays, customers can set realistic expectations and configure their monitoring to accommodate the inherent latency in cloud metrics data availability and alerting processes.
Alternatives
To obtain metric data without this delay, customers should use agent or gateway monitoring. Metrics collected this way are reflected immediately, unlike regular cloud metric monitoring such as CloudWatch metrics for AWS, Insight metrics for Azure, or Stackdriver (now Cloud Monitoring) for GCP.
Source of the Delays
The delay originates in the cloud services themselves (AWS, Azure, and GCP), not in our platform. Because we fetch monitoring data through each cloud's public APIs, the delay is intrinsic to the cloud services and is not something we can directly mitigate.