How can I alert on pod eviction, or on Failed && Evicted pods, in Kubernetes?
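I can see from the pod description that my pod "Failed" because it was "Evicted" under memory pressure. But how can I test for too many "Failed && Evicted" pods, using a Prometheus alert rule or some other way?
I have the Prometheus Operator installed and I can see a metric for Failed pods, but not one for pods that are both Failed and Evicted.
kubectl describe pod gives: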
Name: besteffort-evictme-001
Namespace: skyfii
Priority: 0
Node: ip-172-17-2-169.ap-southeast-2.compute.internal/
Start Time: Fri, 24 Sep 2021 15:28:53 +1000
Labels: <none>
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container termination-demo-container was using 17165108Ki, which exceeds its request of 0.
IP:
IPs: <none>
Containers:
This Prometheus query:
kube_pod_status_phase{phase="Failed"} > 0
shows the failed pod:
kube_pod_status_phase{endpoint="http",instance="172.17.3.141:8080",job="kube-state-metrics",namespace="skyfii",phase="Failed",pod="besteffort-evictme-001",service="prometheus-kube-state-metrics"}
But this query returns nothing:
kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
Any ideas?
Thanks,
Karl
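It looks like I need to update my kube-prometheus-stack Helm chart version.
The "Evicted" Reason that we see in the pod description hangs off the podStatus rather than off an individual container's status, which would explain why the container-level metric kube_pod_container_status_terminated_reason returned nothing.
Newer kube-prometheus-stack versions pull in a later version of kube-state-metrics (v2), which in turn exposes kube_pod_status_reason.
I will upgrade, then refactor my Prometheus query to use this new metric, and post back an answer once it works.
Cheers,
Karl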
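Upgrading to kube-prometheus-stack v18.1.0 exposed the kube_pod_status_reason metric, which let me design the query I needed.
I added the following to my Prometheus alerting rules, in the additionalPrometheusRulesMap section of the kube-prometheus-stack values.yaml: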
- name: kubernetes-container-evictions
  rules:
    # Mem pressure evicted pods are left in a Failed state, alert if we see too many failed pods
    # NB you will need to delete the failed pods after investigating
    - alert: FailedEvictedPods
      expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'
    - alert: TooManyEvictedPods
      expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
      labels:
        severity: high
      annotations:
        message: 'Too many Failed Evicted Pods: {{ $value }}'
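For anyone unsure where that group goes, here is a rough sketch of how it might nest inside values.yaml. It assumes the chart's additionalPrometheusRulesMap key, which takes a map of named entries that each hold a groups: list. The entry name eviction-rules is made up for illustration, so check the layout against the values.yaml of your own chart version.

```yaml
# values.yaml for kube-prometheus-stack (sketch only; verify key names for your chart version)
additionalPrometheusRulesMap:
  # "eviction-rules" is a hypothetical entry name chosen for this example
  eviction-rules:
    groups:
      - name: kubernetes-container-evictions
        rules:
          - alert: FailedEvictedPods
            expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'
          # TooManyEvictedPods would follow here as a second list item at the same indentation
```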
Now I get the alerts I wanted :-)
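If you want to check the FailedEvictedPods rule without waiting for a real eviction, a promtool unit test is one option. The sketch below assumes the rule group has been copied into a standalone rule file under a top-level groups: key. The file names evicted-rules.yaml and evicted-rules-test.yaml are hypothetical, and the input series simply reuse the pod and namespace from the question.

```yaml
# evicted-rules-test.yaml (hypothetical file name)
# Run with: promtool test rules evicted-rules-test.yaml
rule_files:
  - evicted-rules.yaml   # hypothetical: the alert group saved under a top-level "groups:" key

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate one pod that stays Failed and Evicted for 20 minutes
    input_series:
      - series: 'kube_pod_status_phase{namespace="skyfii", pod="besteffort-evictme-001", phase="Failed"}'
        values: '1+0x20'
      - series: 'kube_pod_status_reason{namespace="skyfii", pod="besteffort-evictme-001", reason="Evicted"}'
        values: '1+0x20'
    alert_rule_test:
      # Evaluate after the 10m "for" window has elapsed
      - eval_time: 15m
        alertname: FailedEvictedPods
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: skyfii
              pod: besteffort-evictme-001
            exp_annotations:
              message: 'Failed Evicted pod:besteffort-evictme-001 namespace:skyfii'
```

The test only covers the first alert. TooManyEvictedPods would need at least two evicted pods in input_series before its threshold of 2 is reached.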