How to configure Alertmanager installed by Helm on Kubernetes?
I used Helm to install Prometheus and Grafana in a Kubernetes cluster:
helm install stable/prometheus
helm install stable/grafana
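The installation includes an alertmanager service.
But I saw a blog post that explains how to set up the alertmanager configuration with yaml files.
Is it possible, with the current setup (installed by helm), to define some alert rules for CPU and memory and send email, without creating other yaml files?
I saw a section in the chart documentation about the configmap for alertmanager: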
https://github.com/kubernetes/charts/tree/master/stable/prometheus#configmap-files
But it is not clear to me how to use it or what to do.
Edit
I downloaded the source code of stable/prometheus to see what it does. In the values.yaml file I found:
serverFiles:
  alerts: ""
  rules: ""
  prometheus.yml: |-
    rule_files:
      - /etc/config/rules
      - /etc/config/alerts
    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets:
            - localhost:9090
https://github.com/kubernetes/charts/blob/master/stable/prometheus/values.yaml#L600
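So I think I should write into this config file myself to define the alert rules and the alertmanager configuration. But this block is not clear to me: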
rule_files:
- /etc/config/rules
- /etc/config/alerts
It probably refers to paths inside the container, but there are no files there now. Should the content be added here?
serverFiles:
  alerts: ""
  rules: ""
Edit 2
After setting the alert rules and the alertmanager configuration in values.yaml:
## Prometheus server ConfigMap entries
##
serverFiles:
  alerts: ""
  rules: |-
    #
    # CPU Alerts
    #
    ALERT HighCPU
      IF ((sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job)) - ( sum(node_cpu{mode=~"idle|iowait"}) by (instance,job) ) ) / (sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job)) * 100 > 95
      FOR 10m
      LABELS { service = "backend" }
      ANNOTATIONS {
        summary = "High CPU Usage",
        description = "This machine has really high CPU usage for over 10m",
      }
    # TEST
    ALERT APIHighRequestLatency
      IF api_http_request_latencies_second{quantile="0.5"} >1
      FOR 1m
      ANNOTATIONS {
        summary = "High request latency on {{$labels.instance }}",
        description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
      }
I ran helm install prometheus/ to install it.
Then I started a port-forward for the alertmanager component:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=alertmanager" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9093
Then I opened http://127.0.0.1:9093 in the browser and got these messages:
Forwarding from 127.0.0.1:9093 -> 9093
Handling connection for 9093
Handling connection for 9093
E0122 17:41:53.229084 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:54 socat[31237.140275133073152] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
Handling connection for 9093
E0122 17:41:53.243511 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:54 socat[31238.140565602109184] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
E0122 17:41:53.246011 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:54 socat[31239.140184300869376] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
Handling connection for 9093
Handling connection for 9093
E0122 17:41:53.846399 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:55 socat[31250.140004515874560] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
E0122 17:41:53.847821 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:55 socat[31251.140355466835712] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
Handling connection for 9093
E0122 17:41:53.858521 7159 portforward.go:331] an error occurred forwarding 9093 -> 9093: error forwarding port 9093 to pod 6614ee96df545c266e5fff18023f8f7c87981f3340ee8913acf3d8da0e39e906, uid : exit status 1: 2018/01/22 08:37:55 socat[31252.140268300003072] E connect(5, AF=2 127.0.0.1:9093, 16): Connection refused
Why?
When I check kubectl describe po illocutionary-heron-prometheus-alertmanager-587d747b9c-qwmm6, I get:
Name: illocutionary-heron-prometheus-alertmanager-587d747b9c-qwmm6
Namespace: default
Node: minikube/192.168.99.100
Start Time: Mon, 22 Jan 2018 17:33:54 +0900
Labels: app=prometheus
component=alertmanager
pod-template-hash=1438303657
release=illocutionary-heron
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"illocutionary-heron-prometheus-alertmanager-587d747b9c","uid":"f...
Status: Running
IP: 172.17.0.10
Created By: ReplicaSet/illocutionary-heron-prometheus-alertmanager-587d747b9c
Controlled By: ReplicaSet/illocutionary-heron-prometheus-alertmanager-587d747b9c
Containers:
prometheus-alertmanager:
Container ID: docker://0808a3ecdf1fa94b36a1bf4b8f0d9d2933bc38afa8b25e09d0d86f036ac3165b
Image: prom/alertmanager:v0.9.1
Image ID: docker-pullable://prom/alertmanager@sha256:ed926b227327eecfa61a9703702c9b16fc7fe95b69e22baa656d93cfbe098320
Port: 9093/TCP
Args:
--config.file=/etc/config/alertmanager.yml
--storage.path=/data
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 22 Jan 2018 17:55:24 +0900
Finished: Mon, 22 Jan 2018 17:55:24 +0900
Ready: False
Restart Count: 9
Readiness: http-get http://:9093/%23/status delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/data from storage-volume (rw)
/etc/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h5b8l (ro)
prometheus-alertmanager-configmap-reload:
Container ID: docker://b4a349bf7be4ea78abe6899ad0173147f0d3f6ff1005bc513b2c0ac726385f0b
Image: jimmidyson/configmap-reload:v0.1
Image ID: docker-pullable://jimmidyson/configmap-reload@sha256:2d40c2eaa6f435b2511d0cfc5f6c0a681eeb2eaa455a5d5ac25f88ce5139986e
Port: <none>
Args:
--volume-dir=/etc/config
--webhook-url=http://localhost:9093/-/reload
State: Running
Started: Mon, 22 Jan 2018 17:33:56 +0900
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/config from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h5b8l (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: illocutionary-heron-prometheus-alertmanager
Optional: false
storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: illocutionary-heron-prometheus-alertmanager
ReadOnly: false
default-token-h5b8l:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-h5b8l
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 29m (x2 over 29m) default-scheduler PersistentVolumeClaim is not bound: "illocutionary-heron-prometheus-alertmanager"
Normal Scheduled 29m default-scheduler Successfully assigned illocutionary-heron-prometheus-alertmanager-587d747b9c-qwmm6 to minikube
Normal SuccessfulMountVolume 29m kubelet, minikube MountVolume.SetUp succeeded for volume "config-volume"
Normal SuccessfulMountVolume 29m kubelet, minikube MountVolume.SetUp succeeded for volume "pvc-fa84b197-ff4e-11e7-a584-0800270fb7fc"
Normal SuccessfulMountVolume 29m kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-h5b8l"
Normal Started 29m kubelet, minikube Started container
Normal Created 29m kubelet, minikube Created container
Normal Pulled 29m kubelet, minikube Container image "jimmidyson/configmap-reload:v0.1" already present on machine
Normal Started 29m (x3 over 29m) kubelet, minikube Started container
Normal Created 29m (x4 over 29m) kubelet, minikube Created container
Normal Pulled 29m (x4 over 29m) kubelet, minikube Container image "prom/alertmanager:v0.9.1" already present on machine
Warning BackOff 9m (x91 over 29m) kubelet, minikube Back-off restarting failed container
Warning FailedSync 4m (x113 over 29m) kubelet, minikube Error syncing pod
Edit 3
The alertmanager configuration in the values.yaml file:
## alertmanager ConfigMap entries
##
alertmanagerFiles:
  alertmanager.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: smtp.gmail.com:587
      smtp_from: sender@gmail.com
      smtp_auth_username: sender@gmail.com
      smtp_auth_password: sender_password
    receivers:
      - name: default-receiver
        email_configs:
          - to: target_email@gmail.com
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
This does not work. It produces the errors shown above.
alertmanagerFiles:
  alertmanager.yml: |-
    global:
      # slack_api_url: ''
    receivers:
      - name: default-receiver
        # slack_configs:
        #  - channel: '@you'
        #    send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval
This works without any errors.
So the problem is in the way email_configs is configured.
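The alerts and rules keys in the serverFiles group of the values.yaml file are mounted into the Prometheus container under the /etc/config folder. You can put the configuration you want there (for example, taking inspiration from the blog post you linked), and Prometheus will use it to evaluate alerts.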
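To make the mapping concrete, here is a rough sketch of the kind of ConfigMap the chart renders from serverFiles. It is illustrative, not the chart's exact output: the ConfigMap name is an assumption that mirrors the "<release>-prometheus-alertmanager" pattern visible in the describe output in the question, and the placeholder comments stand in for whatever you put under each key.

```yaml
# Illustrative sketch only, not copied from the chart's rendered output.
apiVersion: v1
kind: ConfigMap
metadata:
  name: illocutionary-heron-prometheus-server   # assumed name
data:
  # Each key becomes a file under /etc/config in the server container.
  alerts: |
    # contents of serverFiles.alerts -> /etc/config/alerts
  rules: |
    # contents of serverFiles.rules  -> /etc/config/rules
  prometheus.yml: |
    rule_files:
      - /etc/config/rules
      - /etc/config/alerts
```

This is why the rule_files entries in prometheus.yml point at /etc/config/rules and /etc/config/alerts: those paths exist as soon as the corresponding keys are present in serverFiles, even when their values are empty strings.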
For example, a simple rule can be set up like this:
serverFiles:
  alerts: |
    ALERT cpu_threshold_exceeded
      IF (100 * (1 - avg by(job)(irate(node_cpu{mode='idle'}[5m])))) > 80
      FOR 300s
      LABELS {
        severity = "warning",
      }
      ANNOTATIONS {
        summary = "CPU usage > 80% for {{ $labels.job }}",
        description = "CPU usage avg for last 5m: {{ $value }}",
      }
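Note that the ALERT ... IF ... syntax above is the Prometheus 1.x rule format. If the chart version you install ships Prometheus 2.x, rule files are YAML instead. A minimal sketch of the same rule in that format follows; the group name `cpu` is a placeholder chosen for illustration, and how the file content is passed through serverFiles depends on the chart version, so check that chart's values.yaml.

```yaml
# Prometheus 2.x rule file format (sketch of the rule shown above).
groups:
  - name: cpu                     # arbitrary group name, chosen for illustration
    rules:
      - alert: cpu_threshold_exceeded
        expr: (100 * (1 - avg by(job)(irate(node_cpu{mode='idle'}[5m])))) > 80
        for: 300s
        labels:
          severity: warning
        annotations:
          summary: "CPU usage > 80% for {{ $labels.job }}"
          description: "CPU usage avg for last 5m: {{ $value }}"
```

The expression, duration, labels and annotations carry over unchanged; only the surrounding structure differs.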