Alertmanager not sending alerts to Slack

I have set up Alertmanager with Prometheus, and according to the Prometheus UI the alerts are firing. However, no Slack message ever shows up, and I am wondering whether I need to configure ufw or whether there is some other configuration I have missed.

The alertmanager service is running, and Prometheus shows the alerts as "firing".
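Since the call to the Slack webhook is outbound HTTPS, ufw's default policy (allow outgoing) normally does not interfere, but the webhook can be tested directly with curl to rule out network and firewall problems (a quick sketch; substitute the real webhook URL for the elided one):

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "test message from curl"}' \
  'https://hooks.slack.com/services/my_id_removed...'

If a message shows up in the channel, the path to Slack is fine and the problem lies in Prometheus or Alertmanager.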

Here are my configuration files:

alertmanager.yml:

global:
  slack_api_url: 'https://hooks.slack.com/services/my_id_removed...'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack_general'
receivers:
#- name: 'web.hook'
#  webhook_configs:
#  - url: 'http://127.0.0.1:5001/'
- name: slack_general
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
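The file can be validated with amtool, which ships in the Alertmanager release tarball (a sketch, assuming the binary sits next to alertmanager in /opt/alertmanager, as the unit file below suggests):

/opt/alertmanager/amtool check-config /opt/alertmanager/alertmanager.yml

amtool can also simulate routing, to confirm that a label set like the ones produced by the rules below would reach the slack_general receiver:

/opt/alertmanager/amtool config routes test --config.file=/opt/alertmanager/alertmanager.yml severity=warning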

prometheus.yml:

global:
  scrape_interval: 10s
  evaluation_interval: 15s # Evaluates rules every 15s. Default is 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
rule_files:
  - rules.yml
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090', 'localhost:9104']

  - job_name: 'node_exporter_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['leo:9100', 'dog:9100']

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']
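One thing worth double-checking in this file: rule_files points at rules.yml, while the rules file shown below is called alerts.yml, and Prometheus only loads what rule_files names. promtool, shipped with the Prometheus release, validates the main config together with every rule file it references:

promtool check config prometheus.yml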

alerts.yml:

groups:
 - name: test
   rules:
   - alert: InstanceDown
     expr: up == 0
     for: 1m
   - alert: HostHighCpuLoad
     expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
     for: 0m
     labels:
       severity: warning
     annotations:
       summary: Host high CPU load (instance {{ $labels.instance }})
       description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
   - alert: HostOutOfMemory
     expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: Host out of memory (instance {{ $labels.instance }})
       description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
   # Please add ignored mountpoints in node_exporter parameters like
   # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
   # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
   - alert: HostOutOfDiskSpace
     expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 8 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
     for: 2m
     labels:
       severity: warning
     annotations:
       summary: Host out of disk space (instance {{ $labels.instance }})
       description: "Disk is almost full (< 8% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

sudo systemctl status alertmanager.service
● alertmanager.service - Alertmanager for prometheus
     Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-05 22:02:31 CET; 6min ago
   Main PID: 9398 (alertmanager)
      Tasks: 30 (limit: 154409)
     Memory: 21.8M
     CGroup: /system.slice/alertmanager.service
             └─9398 /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data

Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.098Z caller=cluster.go:184 component=cluster msg="setting advertise address explicitly" addr=192.168.0.2 port=9094
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.099Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.127Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/opt/alertmanager/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.128Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/opt/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=main.go:518 msg=Listening address=:9093
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
Mar 05 22:02:33 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:33.099Z caller=cluster.go:696 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000022377s
Mar 05 22:02:41 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:41.101Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.002298352s
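The log shows Alertmanager listening on :9093, so the next step is to confirm that the firing alerts actually arrive there. Alertmanager's v2 API lists every alert it has received (jq is optional, only for readability):

curl -s http://localhost:9093/api/v2/alerts | jq .

If the alerts show up here but never reach Slack, the problem is in routing or the webhook; if they do not show up at all, it is between Prometheus and Alertmanager.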

I believe I have found the reason why the alerts sometimes came through and sometimes did not: Alertmanager could not start automatically after a reboot, because it was not waiting for the network to come online.

The fix:

sudo nano /etc/systemd/system/alertmanager.service

Add the Wants= and After= lines:

[Unit]
Description=Alertmanager for prometheus
Wants=network-online.target
After=network-online.target

Reboot.
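A reboot picks up the new unit file; to apply it without rebooting, systemd first has to reload its unit definitions (standard systemd steps):

sudo systemctl daemon-reload
sudo systemctl restart alertmanager.service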