Alertmanager not sending alerts to Slack
I have configured Alertmanager with Prometheus, and according to the Prometheus UI the alerts are firing. However, no Slack message ever shows up, and I am wondering whether I need to configure ufw or whether there is some other configuration I have missed.
The alertmanager service is running and Prometheus shows the alerts as "firing".
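One way to rule out ufw or general connectivity issues is to post a test message to the Slack webhook directly from the Alertmanager host (sketch only; substitute the real webhook URL for the redacted one):

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "connectivity test from the alertmanager host"}' \
  'https://hooks.slack.com/services/my_id_removed...'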
Here are my configuration files:
alertmanager.yml:
global:
  slack_api_url: 'https://hooks.slack.com/services/my_id_removed...'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack_general'
receivers:
  #- name: 'web.hook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001/'
  - name: slack_general
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
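Assuming the amtool binary that ships with the Alertmanager release is available, this configuration can be validated and a synthetic alert injected directly into Alertmanager, bypassing Prometheus entirely (the path matches the systemd unit shown further down; the alert name is only illustrative):

# validate the configuration file
amtool check-config /opt/alertmanager/alertmanager.yml

# inject a test alert straight into Alertmanager
amtool --alertmanager.url=http://localhost:9093 alert add alertname=ManualTest severity=warning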
prometheus.yml:
global:
  scrape_interval: 10s
  evaluation_interval: 15s # Evaluates rules every 15s. Default is 1m
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - rules.yml
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090', 'localhost:9104']
  - job_name: 'node_exporter_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['leo:9100', 'dog:9100']
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']
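promtool, which ships with Prometheus, can validate this file together with the rule files it references; a sketch, assuming it is run from the directory containing prometheus.yml:

promtool check config prometheus.yml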
alerts.yml:
groups:
  - name: test
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 8 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of disk space (instance {{ $labels.instance }})
          description: "Disk is almost full (< 8% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
sudo systemctl status alertmanager.service
● alertmanager.service - Alertmanager for prometheus
     Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-03-05 22:02:31 CET; 6min ago
   Main PID: 9398 (alertmanager)
      Tasks: 30 (limit: 154409)
     Memory: 21.8M
     CGroup: /system.slice/alertmanager.service
             └─9398 /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.094Z caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.098Z caller=cluster.go:184 component=cluster msg="setting advertise address explicitly" addr=192.168.0.2 port=9094
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.099Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.127Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/opt/alertmanager/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.128Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/opt/alertma>
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=main.go:518 msg=Listening address=:9093
Mar 05 22:02:31 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:31.131Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
Mar 05 22:02:33 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:33.099Z caller=cluster.go:696 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000022377s
Mar 05 22:02:41 leo alertmanager[9398]: level=info ts=2022-03-05T21:02:41.101Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.002298352s
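To confirm whether Alertmanager is actually receiving the alerts from Prometheus (as opposed to failing only on delivery to Slack), its HTTP API can be queried on the port shown in the log above; a sketch using the defaults from this setup:

curl -s http://localhost:9093/api/v2/alerts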
I believe I have found the reason why the alerts were sometimes showing up and sometimes not: alertmanager was not starting automatically after a reboot because it was not waiting for prometheus.
Fixed with:
sudo nano /etc/systemd/system/alertmanager.service
Added "Wants" and "After":
[Unit]
Description=Alertmanager for prometheus
Wants=network-online.target
After=network-online.target
Reboot.
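If a full reboot is inconvenient, the edited unit can also be reloaded and restarted in place (sketch, assuming the unit name used above):

sudo systemctl daemon-reload
sudo systemctl restart alertmanager.service
systemctl status alertmanager.service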