How can I alert on Pod Eviction or Failed && Evicted pods in Kubernetes?

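I can see from the pod description that my pod has "Failed" because it was "Evicted" due to memory pressure. But how can I test for too many "Failed && Evicted" pods using a Prometheus alert rule or some other way?

I have Prometheus Operator installed, and I can see a metric for Failed pods, but not one for pods that are both Failed and Evicted.

kubectl describe pod gives: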

Name:         besteffort-evictme-001
Namespace:    skyfii
Priority:     0
Node:         ip-172-17-2-169.ap-southeast-2.compute.internal/
Start Time:   Fri, 24 Sep 2021 15:28:53 +1000
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: memory. Container termination-demo-container was using 17165108Ki, which exceeds its request of 0. 
IP:           
IPs:          <none>
Containers:

The Prometheus rule:

kube_pod_status_phase{phase="Failed"} > 0

shows the failed pod:

kube_pod_status_phase{endpoint="http",instance="172.17.3.141:8080",job="kube-state-metrics",namespace="skyfii",phase="Failed",pod="besteffort-evictme-001",service="prometheus-kube-state-metrics"}

But this query returns nothing:

kube_pod_container_status_terminated_reason{reason="Evicted"} > 0

Any ideas?

Thanks, Karl

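It looks like I need to update my kube-prometheus-stack helm chart version.

The "Evicted" Reason we see in the pod description hangs off the podStatus, not off a container status.

Newer kube-prometheus-stack versions pull in a later version of kube-state-metrics (v2), which in turn exposes kube_pod_status_reason.

I will upgrade, then rework my Prometheus query to use this new metric, and post back an answer when it is working.

Cheers, Karl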

Upgrading to kube-prometheus-stack v18.1.0 made kube_pod_status_reason available, which let me write the query I needed.

I added the following to my Prometheus alerting rules, in the additionalPrometheusRulesMap section of the kube-prometheus-stack values.yaml:

      - name: kubernetes-container-evictions
        rules:
        # Mem pressure evicted pods are left in a Failed state, alert if we see too many failed pods
        # NB you will need to delete the failed pods after investigating
        - alert: FailedEvictedPods
          expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'

        - alert: TooManyEvictedPods
          expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
          labels:
            severity: high
          annotations:
            message: 'Too many Failed Evicted Pods: {{ $value }}'

Now I get the alerts I want :-)
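Since evicted pods stay in the Failed state until someone deletes them, a small helper can make the cleanup less tedious once you have investigated. The following is only a rough sketch, not part of the original setup: it assumes kubectl is on the PATH and pointed at the right cluster, and the function names (evicted_pods, delete_commands, main) are made up for illustration. It reads the same status.phase and status.reason fields that kubectl describe shows, and only prints the delete commands rather than running them.

```python
import json
import subprocess


def evicted_pods(pod_list):
    """Return (namespace, name) for pods whose status is Failed with reason Evicted.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    found = []
    for item in pod_list.get("items", []):
        status = item.get("status", {})
        # The eviction reason lives on the pod status, not on a container status.
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = item.get("metadata", {})
            found.append((meta.get("namespace"), meta.get("name")))
    return found


def delete_commands(pods):
    """Build one `kubectl delete pod` argument list per evicted pod."""
    return [
        ["kubectl", "delete", "pod", name, "-n", namespace]
        for namespace, name in pods
    ]


def main():
    # Hypothetical entry point: fetch all pods and print the cleanup commands.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for cmd in delete_commands(evicted_pods(json.loads(raw))):
        # Printed only; run them by hand once you have investigated.
        print(" ".join(cmd))
```

Calling main() prints one kubectl delete pod line per evicted pod. Once those pods are gone, the kube_pod_status_reason series for them should disappear and the alerts above should resolve.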