Host down alert in Prometheus when all scrape jobs are down

I can't work out how to implement and/or alerting logic in Prometheus. I have two alert rules:

alert: JobDown
expr: up == 0
for: 5m
labels:
  severity: warning
annotations:
  summary: Scrape job {{ $labels.job }} down on {{ $labels.hostname }}.

alert: HostDown
expr: sum(up) == 0
for: 5m
labels:
  severity: critical
annotations:
  description: All scrape jobs down on {{ $labels.hostname }}.
  summary: Host {{ $labels.hostname }} down.

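I expected the HostDown alert to fire when all jobs on a host are down, but it doesn't: when a host goes down, Prometheus shows a JobDown alert for each scrape job, but HostDown never fires. Did I write the rule correctly?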

sum ignores hostname and adds up every series, so sum(up) is only 0 when every target on every host is down. To sum per hostname, you need

sum by (hostname) (up) == 0

Note: hostname is not a standard label on up; it is a custom label from the original poster's configuration.
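
For reference, here is a sketch of how the HostDown rule from the question might look with the per-host aggregation applied. It assumes, as in the question, that every target carries the custom hostname label; everything except the expr line is copied from the original rule.

```yaml
alert: HostDown
# Aggregate per hostname so each host gets its own result series.
# A plain sum(up) yields one series with no labels, which would also
# leave {{ $labels.hostname }} empty in the annotations.
expr: sum by (hostname) (up) == 0
for: 5m
labels:
  severity: critical
annotations:
  description: All scrape jobs down on {{ $labels.hostname }}.
  summary: Host {{ $labels.hostname }} down.
```

With this expression, the alert fires for a given host only when every up series carrying that hostname value has been 0 for 5 minutes, while JobDown keeps firing for the individual jobs.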