Threshold exceeded, but no incident is created in Stackdriver

Question

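Problem: no incident is created when a time series exceeds the threshold.

I want to be alerted if 5% of the requests to Cloud Run return 4xx. I created an alerting policy with the following query: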

fetch cloud_run_revision::run.googleapis.com/request_count
| { filter metric.response_code_class = '4xx'
  ; ident }
| group_by [resource.service_name], 1m, max(val())
| ratio
| condition val() > 0.05 '10^2.%'

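In the Cloud Console, I can see that there are in fact time series that exceed the threshold.

I expected an incident to be created. However, that is not the case.

For completeness: I created the alert with Terraform: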

resource "google_monitoring_alert_policy" "cloudrun_http_4xx_errors" {
  display_name = "CloudRun 4xx errors"

  documentation {
    content = "CloudRun returned 4xx for more than 5% of its requests."
  }
  combiner = "OR"

  notification_channels = var.environment == "dev" ? [] : [
  google_monitoring_notification_channel.pubsubchannel.name]
  conditions {
    display_name = "4xx errors"
    condition_monitoring_query_language {
      query    = <<EOT
fetch cloud_run_revision::run.googleapis.com/request_count
| { filter metric.response_code_class = '4xx'
  ; ident }
| group_by [resource.service_name], 1m, max(val())
| ratio
| condition val() > 0.05 '10^2.%'
EOT
      duration = "60s"
    }
  }
}

Answer 1

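I agree with @c69.

Let me summarize a few points:

We need to increase the duration, because that lengthens the alignment period, which lets the condition look back further and include data that has already been ingested.
We should use a duration, or duration window, to prevent a condition from being met because of a single measurement. In the Google Cloud Console, the duration is configured with the following fields:
- Legacy interface: the For field in the Configuration pane of the alerting policy.
- Preview interface: the Time above threshold (or Time below threshold) field in the Configure trigger step.
So we should set the duration window long enough to minimize false positives, but short enough that incidents are still opened in a timely manner.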
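
As a rough sketch of the advice above, the policy from the question could be given a longer duration in Terraform. The resource name and the "300s" value are assumptions made for illustration only; the right duration depends on how late the data is ingested and how quickly an incident must open.

```hcl
# Sketch only: the policy from the question with a longer duration window.
# The resource name and the "300s" value are hypothetical.
resource "google_monitoring_alert_policy" "cloudrun_http_4xx_errors_longer_duration" {
  display_name = "CloudRun 4xx errors"
  combiner     = "OR"

  conditions {
    display_name = "4xx errors"
    condition_monitoring_query_language {
      # Query unchanged from the question.
      query = <<EOT
fetch cloud_run_revision::run.googleapis.com/request_count
| { filter metric.response_code_class = '4xx'
  ; ident }
| group_by [resource.service_name], 1m, max(val())
| ratio
| condition val() > 0.05 '10^2.%'
EOT
      # The question used "60s"; a longer window is assumed here.
      duration = "300s"
    }
  }
}
```

A longer window trades alerting speed for robustness, so it is worth trying a few values against real traffic before settling on one.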

For details, see Alerting behaviour.

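MQL alerting policies are also slightly different. For an MQL query, the alerting policy condition includes two values:

The number of input time series that must satisfy the condition. This value can be any of the following:

A single time series.
A specific number of time series.
A percentage of the time series.
All time series.

The duration of the alert state, that is, how long the alert condition must continuously evaluate to true.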
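
In Terraform, these two values appear to correspond to the trigger block and the duration argument of condition_monitoring_query_language. The following is a minimal sketch under that assumption; the resource name, the count of 1, and the "300s" duration are illustrative and not taken from the question.

```hcl
# Sketch only: setting both the number of time series and the duration.
# Resource name and values are hypothetical.
resource "google_monitoring_alert_policy" "cloudrun_http_4xx_errors_with_trigger" {
  display_name = "CloudRun 4xx errors"
  combiner     = "OR"

  # A single condition, since an MQL condition cannot be combined with others.
  conditions {
    display_name = "4xx errors"
    condition_monitoring_query_language {
      query = <<EOT
fetch cloud_run_revision::run.googleapis.com/request_count
| { filter metric.response_code_class = '4xx'
  ; ident }
| group_by [resource.service_name], 1m, max(val())
| ratio
| condition val() > 0.05 '10^2.%'
EOT
      # How long the condition must continuously evaluate to true.
      duration = "300s"

      # How many time series must satisfy the condition.
      # Use "percent" instead of "count" to require a share of the time series.
      trigger {
        count = 1
      }
    }
  }
}
```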

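Also, an MQL condition must be the only condition in the policy. You can't use multiple conditions in an MQL-based alerting policy.

For details, see Alerting policies with MQL.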

google-cloud-platform

google-cloud-monitoring

terraform-provider-gcp

monitoring-query-language