How to correctly configure alerting rules in YAML for Prometheus / Alertmanager
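I'm having trouble configuring alerting rules for Prometheus Alertmanager, so maybe someone can point me in the right direction.
These are the rules I'm currently trying to implement (taken directly from https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/):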
rules.yml:
groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
When checking the configuration with amtool and promtool, I get the following error:
Checking '/etc/prometheus/rules.yml' FAILED: yaml: unmarshal errors:
line 1: field groups not found in type config.plain
amtool: error: failed to validate 1 file(s)
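My first guess was an indentation error or some other kind of YAML syntax error.
However, I have already tried several alerting rules as well as different files and editors (currently I'm using nano), and the YAML has been checked with several YAML linters.
So far I keep getting the error on the line shown above.
Any help or suggestions would be greatly appreciated!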
prometheus, version 2.22.2 (branch: HEAD, revision: de1c1243f4dd66fbac3e8213e9a7bd8dbc9f38b2)
go version: go1.15.5
platform: linux/amd64
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
go version: go1.14.4
YAML linters used:
https://codebeautify.org/yaml-validator
https://onlineyamltools.com/validate-yaml
Alerting rules tested:
https://onlineyamltools.com/validate-yaml
https://rakeshjain-devops.medium.com/prometheus-alerting-most-common-alert-rules-e9e219d4e949
https://github.com/vegasbrianc/prometheus/blob/master/prometheus/alert.rules
Unmarshalling groups fails because it should be a list:
groups:
- name: GroupName
  rules:
  - alert: ...
See the documentation about recording rules; the file structure is the same for alerting rules.
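To make the required nesting concrete, here is a minimal Python sketch that checks that shape on an already-parsed rule file, i.e. the plain dicts and lists a YAML loader would return (the loader itself, for example PyYAML's `yaml.safe_load`, is an assumption and is not used here). The helper name `check_rule_file_shape` is hypothetical. It only inspects structure (groups is a list, each group has a name and a list of rules, each rule has exactly one of alert/record plus an expr); it does not parse PromQL or templates, so `promtool check rules` remains the real check.

```python
from typing import Any


def check_rule_file_shape(doc: Any) -> list[str]:
    """Return structural problems found in a parsed Prometheus rule file.

    Hypothetical helper: it mirrors only the nesting
    groups -> list of groups -> rules -> list of rules.
    """
    if not isinstance(doc, dict) or "groups" not in doc:
        return ["top level must be a mapping with a 'groups' key"]
    groups = doc["groups"]
    if not isinstance(groups, list):
        return ["'groups' must be a list (each group starts with '- name:')"]

    problems: list[str] = []
    for i, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"groups[{i}] must be a mapping")
            continue
        if not group.get("name"):
            problems.append(f"groups[{i}] is missing 'name'")
        rules = group.get("rules")
        if not isinstance(rules, list):
            problems.append(f"groups[{i}].rules must be a list")
            continue
        for j, rule in enumerate(rules):
            where = f"groups[{i}].rules[{j}]"
            if not isinstance(rule, dict):
                problems.append(f"{where} must be a mapping")
                continue
            # A rule is either an alerting rule or a recording rule, never both.
            if ("alert" in rule) == ("record" in rule):
                problems.append(f"{where} needs exactly one of 'alert' or 'record'")
            if "expr" not in rule:
                problems.append(f"{where} is missing 'expr'")
    return problems


# Part of the corrected rules.yml from the question, as equivalent Python data.
example = {
    "groups": [
        {
            "name": "example",
            "rules": [{"alert": "InstanceDown", "expr": "up == 0", "for": "5m"}],
        }
    ]
}
print(check_rule_file_shape(example))  # []
print(check_rule_file_shape({"groups": {"name": "example"}}))
```

The second call reports that groups must be a list, which is the mistake described in this answer.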
Update after the post was corrected:
Your file seems to be correct. The command to check it is:
promtool check rules /etc/prometheus/rules.yml
I suspect you ran the command with check config instead of check rules.
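The error message itself points in this direction: both `promtool check config` and `amtool check-config` decode their input strictly, so any top-level key their configuration type does not define is rejected. The rough Python imitation below shows why the `groups` key of a rule file trips that check. The key set is an assumption (the Alertmanager 0.21-era top-level keys; newer releases add more), and the function `strict_top_level_errors` is a hypothetical illustration of strict decoding, not the real Go implementation.

```python
from collections.abc import Iterable

# Assumption: top-level keys of an Alertmanager configuration around v0.21.
ALERTMANAGER_TOP_LEVEL = {"global", "route", "inhibit_rules", "receivers", "templates"}


def strict_top_level_errors(
    keys: Iterable[str], allowed: set[str], type_name: str = "config.plain"
) -> list[str]:
    """Imitate a strict YAML decoder: report every top-level key that the
    target type does not define, worded like the real error message."""
    return [
        f"field {key} not found in type {type_name}"
        for key in sorted(keys)
        if key not in allowed
    ]


# A rule file has a single top-level key: groups.
print(strict_top_level_errors({"groups"}, ALERTMANAGER_TOP_LEVEL))
# ['field groups not found in type config.plain']
```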
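Note that amtool validates the Alertmanager configuration, not the Prometheus configuration or its rule files. That matches the error you got: the rule file was decoded as a configuration file, which has no groups field, hence "field groups not found in type config.plain".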
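Because the error depends on which tool reads which file, the mapping can be summarized in a small Python sketch. The function `pick_checker` and the key sets it inspects are illustrative assumptions (commonly used top-level keys of each file type, not an exhaustive list and not an official API); it takes the top-level keys of an already-parsed file and returns the validator command one would run. The key `global` is deliberately ignored because it appears in both prometheus.yml and alertmanager.yml.

```python
def pick_checker(path: str, top_level_keys: set[str]) -> str:
    """Suggest a validator command for a YAML file based on its top-level keys.

    Hypothetical helper; the key sets are typical, not exhaustive.
    """
    if "groups" in top_level_keys:
        # Prometheus rule file (alerting and/or recording rules).
        return f"promtool check rules {path}"
    if top_level_keys & {"route", "receivers", "inhibit_rules"}:
        # Alertmanager configuration: the only kind of file amtool validates.
        return f"amtool check-config {path}"
    if top_level_keys & {"scrape_configs", "rule_files", "alerting"}:
        # Main Prometheus configuration (prometheus.yml).
        return f"promtool check config {path}"
    raise ValueError(f"cannot tell what kind of file {path} is")


print(pick_checker("/etc/prometheus/rules.yml", {"groups"}))
# promtool check rules /etc/prometheus/rules.yml
```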