How to silence Prometheus Alertmanager using config files?

Question

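I'm deploying Prometheus with helm, using the official stable/prometheus-operator chart.

So far it works well, except that the annoying CPUThrottlingHigh alert is firing for many pods (including Prometheus's own config-reloader containers). This alert is currently under discussion, and I want to silence its notifications for now.

Alertmanager has a silence feature, but it is web-based: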

Silences are a straightforward way to simply mute alerts for a given time. Silences are configured in the web interface of the Alertmanager.

Is there a way to mute notifications from CPUThrottlingHigh using a config file?

Answer 1

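I doubt there is a way to silence alerts through configuration (other than routing the alerts to a /dev/null receiver, i.e. a receiver with no email or other notification mechanism configured; the alert would still show up in the Alertmanager UI).

You can apparently add silences with amtool, the command-line tool that ships with alertmanager (although I can't see a way to set the silence's expiration time).

Or you can use the API directly (even though it is undocumented and could in theory change). According to this prometheus-users thread, this should work: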

curl https://alertmanager/api/v1/silences -d '{
  "matchers": [
    {
      "name": "alername1",
      "value": ".*",
      "isRegex": true
    }
  ],
  "startsAt": "2018-10-25T22:12:33.533330795Z",
  "endsAt": "2018-10-25T23:11:44.603Z",
  "createdBy": "api",
  "comment": "Silence",
  "status": {
    "state": "active"
  }
}'

Answer 2

OK, I got it working by configuring a hackish inhibit_rule:

inhibit_rules:
- target_match:
     alertname: 'CPUThrottlingHigh'
  source_match:
     alertname: 'DeadMansSwitch'
  equal: ['prometheus']

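By design, DeadMansSwitch is an "always firing" alert shipped with prometheus-operator, and the prometheus label is common to all alerts, so CPUThrottlingHigh ends up inhibited forever. It stinks, but it works.

Pros:

This can be done through a config file (using the alertmanager.config helm parameter).
The CPUThrottlingHigh alert is still present in Prometheus for analysis.
The CPUThrottlingHigh alert only shows up in the Alertmanager UI if the "Inhibited" box is checked.
No annoying notifications on my receivers.

Cons:

Any change in the design of DeadMansSwitch or of the prometheus label will break this setup (which only means the alerts start firing again).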

Update: my con became reality...

The DeadMansSwitch alertname just changed in stable/prometheus-operator 4.0.0. If you use this version (or later), the new alert name is Watchdog.
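
For chart versions where the always-firing alert is called Watchdog, the same trick presumably only needs the source alert renamed. Below is a minimal sketch of a helm values fragment. It assumes the chart reads the Alertmanager configuration from alertmanager.config, as mentioned above, and that the prometheus label is still shared by both alerts. The route and receivers sections are left out and are assumed to be defined elsewhere in your values.

```yaml
# values.yaml fragment for the stable/prometheus-operator chart (sketch, untested).
alertmanager:
  config:
    inhibit_rules:
      # While Watchdog (formerly DeadMansSwitch) is firing, which by design
      # is always, suppress notifications for CPUThrottlingHigh.
      - source_match:
          alertname: 'Watchdog'
        target_match:
          alertname: 'CPUThrottlingHigh'
        # Only inhibit when both alerts carry the same 'prometheus' label value.
        equal: ['prometheus']
```

Newer Alertmanager releases also accept source_matchers and target_matchers in place of source_match and target_match, so check which version your chart deploys before switching syntax.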

Answer 3

One option is to route the alerts you want silenced to a "null" receiver. In alertmanager.yaml:

route:
  # Other settings...
  group_wait: 0s
  group_interval: 1m
  repeat_interval: 1h

  # Default receiver.
  receiver: "null"

  routes:
  # continue defaults to false, so the first match will end routing.
  - match:
      # This was previously named DeadMansSwitch
      alertname: Watchdog
    receiver: "null"
  - match:
      alertname: CPUThrottlingHigh
    receiver: "null"
  - receiver: "regular_alert_receiver"

receivers:
  - name: "null"
  - name: regular_alert_receiver
    <snip>

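If several alerts should end up at the same "null" receiver, the two match routes above could be collapsed into a single regular-expression route. This is a sketch rather than a tested configuration. It assumes match_re is available in your Alertmanager version, reuses the receiver names from the example above, and makes the regular receiver the default instead of adding a catch-all route.

```yaml
route:
  # Default receiver for anything that no child route claims.
  receiver: "regular_alert_receiver"
  routes:
    # match_re values are regular expressions, so one route covers both alerts.
    - match_re:
        alertname: "Watchdog|CPUThrottlingHigh"
      receiver: "null"

receivers:
  # A receiver with no notification integrations sends nothing.
  - name: "null"
  - name: "regular_alert_receiver"
    # Notification settings (email, Slack, ...) are omitted here.
```

As with the original example, alerts routed this way still appear in the Alertmanager UI; only the notifications are dropped.
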
Answer 4

You can silence it by sending your alerts through Robusta. (Disclaimer: I wrote Robusta.)

Here is an example:

- triggers:
  - on_prometheus_alert: {}
  actions:
  - name_silencer:
      names: ["Watchdog", "CPUThrottlingHigh"]

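However, this might not be what you want to do!

Some CPUThrottlingHigh alerts are spam and can't be fixed, such as the one for metrics-server on GKE.

In general, though, the alert is meaningful and can indicate a real problem. Typically the best practice is to change or remove the pod's CPU limit.

I've spent a lot of time looking at CPUThrottlingHigh, because I wrote an automated playbook for Robusta that analyzes each CPUThrottlingHigh alert and recommends the best practice.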


kubernetes

prometheus

prometheus-operator

prometheus-alertmanager