Prometheus: How to disable 1 rule for 1 specific job_name?

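I am setting up Prometheus alerts for 2 Elasticsearch clusters (using elasticsearch_exporter): one with 8 nodes and one with 3 nodes. I want an alert to be sent whenever either cluster loses 1 node, but currently every rule applies to both clusters, so that is not possible.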

prometheus.yml file:

global:
  scrape_interval: 10s

rule_files:
  - alert.rules.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

scrape_configs:
 - job_name: cluster1
   scrape_interval: 30s
   scrape_timeout:  30s
   metrics_path: "/metrics"
   static_configs:
   - targets: ['xxx1:9114' ]
     labels:
       service: cluster1
 - job_name: cluster2
   scrape_interval: 30s
   scrape_timeout:  30s
   metrics_path: "/metrics"
   static_configs:
   - targets: ['xxx2:9114' ]
     labels:
       service: cluster2

alert.rules.yml file:

groups:
- name: alert.rules
  rules:
    - alert: ElasticsearchLostNode
      expr: elasticsearch_cluster_health_number_of_nodes < 8
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
        description: Number Healthy Nodes less than 8
...

Of course, number_of_nodes < 8 will always fire for the small cluster, and if I set it to < 3, no alert fires when the large cluster loses 1 node.

Is there a way to exempt 1 specific job_name from 1 specific rule, or to define that rules A apply only to job_name A and rules B apply only to job_name B?

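Yes. Prometheus attaches a job label, taken from job_name, to every scraped series, so you can create one rule per job in the alert.rules.yml file by filtering on that label: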

groups:
- name: alert.rules
  rules:
    - alert: ElasticsearchLostNode1
      expr: elasticsearch_cluster_health_number_of_nodes{job="cluster1"} < 8
      ...
    - alert: ElasticsearchLostNode2
      expr: elasticsearch_cluster_health_number_of_nodes{job="cluster2"} < 3
      ...
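
For reference, here is a minimal sketch of what the complete file might look like. It assumes the for, labels, and annotations fields from the question are reused unchanged for both rules; the wording of the description lines is my own. The commented-out rule at the end shows the other option asked about, exempting one job with a negative matcher (job!="cluster2"). Its name, ElasticsearchLostNodeExceptCluster2, is purely illustrative.

```yaml
groups:
- name: alert.rules
  rules:
    # Fires only for series scraped by the job named cluster1 (8 nodes expected).
    - alert: ElasticsearchLostNode1
      expr: elasticsearch_cluster_health_number_of_nodes{job="cluster1"} < 8
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
        description: Cluster {{ $labels.job }} has fewer than 8 healthy nodes
    # Fires only for series scraped by the job named cluster2 (3 nodes expected).
    - alert: ElasticsearchLostNode2
      expr: elasticsearch_cluster_health_number_of_nodes{job="cluster2"} < 3
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
        description: Cluster {{ $labels.job }} has fewer than 3 healthy nodes
    # Alternative (illustrative name): apply a rule to every job except cluster2.
    # - alert: ElasticsearchLostNodeExceptCluster2
    #   expr: elasticsearch_cluster_health_number_of_nodes{job!="cluster2"} < 8
```

Since each target in the question's prometheus.yml also carries a service label, matching on {service="cluster1"} instead of job should work the same way. The file can be validated before reloading with promtool check rules alert.rules.yml.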