Delay in AWS CloudWatch Alarm state change

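I have an alarm tracking the metric for LoadBalancer 5xx errors on a single ALB. It should be in the "In alarm" state if 1 out of the last 1 datapoints is above the threshold of 2. The period is set to 1 minute. See the alarm details: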
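For readers without the screenshot, the configuration described above might look roughly like the following parameters for boto3's `put_metric_alarm`. This is only a sketch: the alarm name, the load balancer dimension value, the choice of `HTTPCode_ELB_5XX_Count` as the metric, and the `Sum` statistic are assumptions for illustration, not values taken from the actual alarm.

```python
def build_alarm_params() -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": "alb-5xx-errors",               # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",      # assumed metric for "LoadBalancer 5xx"
        "Dimensions": [
            # hypothetical load balancer identifier
            {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"},
        ],
        "Statistic": "Sum",                          # assumed statistic
        "Period": 60,                                # 1-minute period
        "EvaluationPeriods": 1,                      # "1 out of the last 1 datapoints"
        "DatapointsToAlarm": 1,
        "Threshold": 2,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",          # missing data treated as good
    }


params = build_alarm_params()
# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```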

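On 2020-09-23 at 17:18 UTC, the load balancer started returning 502 errors. This shows up in the CloudWatch metric graph below, and I have confirmed that the time is correct (it was a forced 502 response, so I know when I triggered it, and I can see the 17:18 timestamp in the ALB logs).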

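But in the alarm log, the "In alarm" state was only triggered at 17:22 UTC, 4 minutes after more than 2 errors occurred in the 17:18 period. This is not a delay in receiving the notification; it is a delay in the state change compared with what I expected. The notification is correctly received within seconds of the state change.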

Here is the alarm log with the state change timestamps:

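We treat missing data as good (not breaching), so based on the metric graph I would expect the alarm to go back to OK at 17:22 (there were 0 errors in the 17:21 period), but it only went back to OK at 17:27, a delay of 5 minutes.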

I then expected it to go back to "In alarm" at 17:24, but it did not do so until 17:28.

Finally, I expected it to go back to OK at 17:31, but it did not do so until 17:40, a full 9 minutes later.

Why is there a 4-9 minute delay between when I expect a state transition and when it actually happens?
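To make the gaps concrete, the timestamps quoted above can be tabulated and subtracted. The snippet below is just arithmetic on the times already given in this question; the tuple layout and variable names are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M"

# (target state, time the transition was expected or triggered, actual state change), all UTC
transitions = [
    ("In alarm", "17:18", "17:22"),
    ("OK", "17:22", "17:27"),
    ("In alarm", "17:24", "17:28"),
    ("OK", "17:31", "17:40"),
]

delays = []
for state, expected, actual in transitions:
    delta = datetime.strptime(actual, FMT) - datetime.strptime(expected, FMT)
    minutes = int(delta.total_seconds() // 60)
    delays.append(minutes)
    print(f"{state:<8} expected {expected}, actual {actual}: {minutes} min late")
```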

I think the explanation is given in the following AWS forum thread:

Unexplainable delay between Alarm data breach and Alarm state change

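Basically, the alarm is evaluated over a longer window than the one you set, not just 1 minute. That window is the evaluation range, and as a user you have no direct control over it.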

From the forum:

The reporting criteria for the HTTPCode_Target_4XX_Count metric is if there is a non-zero value. That means data point will only be reported if a non-zero value is generated, otherwise nothing will be pushed to the metric.

CloudWatch standard alarm evaluates its state every minute and no matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods (1 in this case). The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range. Treat missing data as setting is applied if all the data in the evaluation range is missing, and not just if the data in evaluation period is missing.

Hence, CloudWatch alarms will look at some previous data points to evaluate its state, and will use the treat missing data as setting if all the data in evaluation range is missing. In this case, for the time when alarm did not transition to OK state, it was using the previous data points in the evaluation range to evaluate its state, as expected.

The alarm evaluation in case of missing data is explained in detail here, that will help in understanding this further: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-evaluating-missing-data
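The behavior described in the quote can be illustrated with a toy simulation. This is a simplified sketch, not CloudWatch's actual algorithm: it assumes a hypothetical evaluation range of 5 one-minute datapoints (the real size is chosen by CloudWatch and is not given here), assumes zero-error minutes are not reported as datapoints, and assumes that with Evaluation Periods set to 1 the most recent real datapoint in the range decides the state. It also does not model any metric delivery delay, so it only illustrates the late return to OK.

```python
from typing import Dict


def evaluate_alarm(
    datapoints: Dict[int, float],
    now: int,
    threshold: float = 2.0,
    evaluation_range: int = 5,        # hypothetical; not CloudWatch's documented value
    missing_is_breaching: bool = False,
) -> str:
    """Return "ALARM" or "OK" for an evaluation performed at minute `now`."""
    # Look back over the whole evaluation range, oldest minute first.
    window = [datapoints.get(m) for m in range(now - evaluation_range, now)]
    real = [v for v in window if v is not None]
    if not real:
        # Only when every datapoint in the range is missing does the
        # "treat missing data as" setting decide the state.
        return "ALARM" if missing_is_breaching else "OK"
    # Otherwise the most recent real datapoint is used.
    return "ALARM" if real[-1] > threshold else "OK"


# One breaching datapoint in minute 18 (standing in for 17:18); nothing reported afterwards.
datapoints = {18: 5.0}
states = {m: evaluate_alarm(datapoints, now=m) for m in range(19, 26)}
for minute, state in states.items():
    print(f"17:{minute} -> {state}")
```

Under these assumptions the alarm stays in ALARM while the 17:18 datapoint is still inside the range and only returns to OK once it has aged out, several minutes after the last error. Setting `evaluation_range=1` reproduces the naive expectation of recovering one minute later.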