Alerts in K8s for Pod failing
I want to create alerts in Grafana for my Kubernetes cluster.
I have Prometheus, Node Exporter, Kube-Metrics, and Alertmanager configured in my k8s cluster.
I want to set up alerts for Pods that fail or cannot be scheduled:
- the reason Pods fail or cannot be scheduled
- generating an alert afterwards
- creating another alert to notify us when Pods fail.
Can you guide me on how to achieve this?
As per the comment from Suresh Vishnoi:

it might be helpful awesome-prometheus-alerts.grep.to/rules.html#kubernetes

Yes, this could be very helpful. On that site you can find templates for failed pods (not healthy):
Pod has been in a non-ready state for longer than 15 minutes.
```yaml
- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod has been in a non-ready state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
Pod {{ $labels.pod }} is crash looping
```yaml
- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
See also this good guide about monitoring a kubernetes cluster with Prometheus:

The Kubernetes API and the kube-state-metrics (which natively uses prometheus metrics) solve part of this problem by exposing Kubernetes internal data, such as the number of desired / running replicas in a deployment, unschedulable nodes, etc.

Prometheus is a good fit for microservices because you just need to expose a metrics port, and don't need to add too much complexity or run additional services. Often, the service itself is already presenting a HTTP interface, and the developer just needs to add an additional path like /metrics.
When it comes to unschedulable nodes, you can use the metric kube_node_spec_unschedulable. It is described here or here:

kube_node_spec_unschedulable - Whether a node can schedule new pods.
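Building on that metric, a minimal alert rule could look like the sketch below. The alert name, threshold, and `for` duration are my own assumptions, not taken from the templates above:

```yaml
- alert: KubernetesNodeUnschedulable
  # kube_node_spec_unschedulable is 1 when the node is cordoned / cannot schedule new pods
  expr: kube_node_spec_unschedulable == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes node unschedulable (node {{ $labels.node }})
    description: "Node {{ $labels.node }} cannot schedule new pods.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"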
See also this guide.
Basically, you need to find the metric you want to monitor and set it up appropriately in Prometheus. Alternatively, you can use templates, as I showed at the beginning of this answer.
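For completeness, a sketch of how such rules are typically wired into Prometheus: alerts live in a rules file referenced from the main config via `rule_files`, wrapped in a named group. The file names and paths here are assumptions for illustration:

```yaml
# prometheus.yml (fragment) - tell Prometheus where the rules file lives
rule_files:
  - /etc/prometheus/rules/pod-alerts.yml

---
# /etc/prometheus/rules/pod-alerts.yml - wrap the alert templates in a group
groups:
  - name: kubernetes-pods
    rules:
      - alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
```

After editing the rules file, reload Prometheus (e.g. a SIGHUP or the `/-/reload` endpoint if `--web.enable-lifecycle` is set) so the new rules take effect; firing alerts are then routed through Alertmanager.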