Alerts in K8s for failing Pods

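I want to create alerts in Grafana for my Kubernetes cluster. I have configured Prometheus, Node exporter, Kube-Metrics, and Alert Manager in my k8s cluster. I want to set up alerts for Pods that are unschedulable or have failed.

  1. Finding the cause of unschedulable or failed pods
  2. Generating an alert after a while
  3. Creating another alert to notify us when pods fail

Can you guide me on how to achieve this?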

As Suresh Vishnoi suggested in a comment:

it might be helpful awesome-prometheus-alerts.grep.to/rules.html#kubernetes

Yes, this could be very helpful. On this site you can find templates for failed pods (not healthy):

Pod has been in a non-ready state for longer than 15 minutes.

  - alert: KubernetesPodNotHealthy
    expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod has been in a non-ready state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

or for crash looping:

Pod {{ $labels.pod }} is crash looping

  - alert: KubernetesPodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
      description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
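
The templates above are fragments that start at `- alert:`, so they still need to be placed in a rule file and loaded by Prometheus. A minimal sketch of that wiring follows; the group name, the file path, and the Alertmanager address are assumptions for illustration and will differ in your setup (for example, if you use the Prometheus Operator, rules are normally supplied through its own resources instead).

```yaml
# pod-alerts.yml (hypothetical file name)
groups:
  - name: kubernetes-pods          # hypothetical group name
    rules:
      # Template rule from above, unchanged apart from indentation
      - alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

```yaml
# prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/*.yml    # assumed location of pod-alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager service address
```

Once the rules are loaded, firing alerts show up on the Prometheus Alerts page and are forwarded to Alertmanager, which handles the notifications.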

See also this good guide about monitoring a Kubernetes cluster with Prometheus:

The Kubernetes API and the kube-state-metrics (which natively uses prometheus metrics) solve part of this problem by exposing Kubernetes internal data, such as the number of desired / running replicas in a deployment, unschedulable nodes, etc.

Prometheus is a good fit for microservices because you just need to expose a metrics port, and don’t need to add too much complexity or run additional services. Often, the service itself is already presenting a HTTP interface, and the developer just needs to add an additional path like /metrics.
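
To make the `/metrics` idea concrete, here is a minimal sketch of a static scrape job. The job name and target address are hypothetical; in a real cluster you would more likely rely on Kubernetes service discovery than on a fixed target.

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: my-service            # hypothetical job name
    metrics_path: /metrics          # this is also the Prometheus default
    static_configs:
      - targets: ["my-service:8080"]   # hypothetical host:port exposing metrics
```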

When it comes to unschedulable nodes, you can use the metric kube_node_spec_unschedulable. It is described here or here: kube_node_spec_unschedulable - Whether a node can schedule new pods.
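
A rule built on this metric could look roughly like the sketch below. The alert name, the `for` duration, and the severity are assumptions chosen for illustration, not values taken from the template site; the rule goes under the `rules:` list of a rule group.

```yaml
  - alert: KubernetesNodeUnschedulable    # hypothetical alert name
    # The metric is 1 while the node is marked unschedulable (for example, cordoned)
    expr: kube_node_spec_unschedulable == 1
    for: 15m                              # assumed grace period, e.g. for planned maintenance
    labels:
      severity: warning
    annotations:
      summary: Kubernetes node unschedulable (node {{ $labels.node }})
      description: "Node {{ $labels.node }} has been unschedulable for more than 15 minutes.\n  VALUE = {{ $value }}"
```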

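See also this guide. Basically, you need to find the metric you want to monitor and set it up appropriately in Prometheus. Alternatively, you can use templates, as I showed at the beginning of the answer.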
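
For the unschedulable pods mentioned in the question, one candidate is the kube-state-metrics metric kube_pod_status_unschedulable. The following is only a sketch under that assumption; the alert name, duration, and severity are illustrative and should be tuned to your cluster.

```yaml
  - alert: KubernetesPodUnschedulable     # hypothetical alert name
    # The series is present with value 1 while the scheduler cannot place the pod
    expr: kube_pod_status_unschedulable > 0
    for: 10m                              # assumed delay so short scheduling hiccups do not alert
    labels:
      severity: warning
    annotations:
      summary: Kubernetes pod unschedulable ({{ $labels.namespace }}/{{ $labels.pod }})
      description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} could not be scheduled for more than 10 minutes.\n  VALUE = {{ $value }}"
```

The metric only tells you that a pod is unschedulable; to find the cause, `kubectl describe pod` on the affected pod shows the scheduler events.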