GCP Alerting Policy for failed GKE CronJob

Question

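What is the best way to set up a GCP monitoring alert policy for a failing Kubernetes CronJob? I couldn't find any good examples out there.

Right now I have a working solution based on monitoring logs from the pod with severity ERROR. However, I find it flaky. Sometimes a job fails for a transient reason outside my control (for example, an external server returns a temporary 500), and the job succeeds on the next retry.

What I really need is an alert that fires only when a CronJob is in a persistent failed state, that is, when Kubernetes has tried to rerun the whole thing several times and it still fails. Ideally, it would also handle the case where the pod could not start at all (for example, pulling the image failed).

Any ideas?

Thanks.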

Answer 1

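First, confirm which GKE version you are running. The following commands list the default version and the available versions for a release channel (RAPID in this example):

Default version: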

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.defaultVersion)"

Available versions:

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.validVersions)"

Now that you know your GKE version, and given that you want an alert only when the CronJob is in a persistent failed state: GKE Workload Metrics used to be GCP's fully managed and highly configurable solution for sending all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or a Deployment for an application) to Cloud Monitoring. However, it is deprecated as of GKE 1.24 and has been replaced by Google Cloud Managed Service for Prometheus. That service is the best option you have within GCP, because it lets you monitor and alert on your workloads using Prometheus without having to manually manage and operate Prometheus at scale.
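Since the choice above depends on whether a cluster is at or past GKE 1.24, a small version comparison may help. This is only a sketch: the function name `version_at_least` is made up for illustration, the version strings shown are example values, and it assumes GNU `sort` with the `-V` (version sort) flag.

```shell
#!/usr/bin/env bash
# Sketch: succeed if version $1 is at or above version $2.
# Assumes GNU sort (the -V flag); the function name is hypothetical.
version_at_least() {
  # Version-sort both values; if the smaller one is $2, then $1 >= $2.
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with an illustrative version string:
# version_at_least "1.24.3-gke.200" "1.24" && echo "Workload Metrics is deprecated here"
```

The cluster's own version could be fed in from something like `gcloud container clusters describe CLUSTER_NAME --format="value(currentMasterVersion)"`, where `CLUSTER_NAME` is a placeholder and a zone or region may also need to be specified.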

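In addition, you have two options from outside GCP: Prometheus itself and Ranch's Prometheus Push Gateway.

Finally, just as an FYI, this can be done manually by querying the job, checking its start time, and comparing that with the current time. In bash: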

START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq -r '.status.startTime')
echo "$START_TIME"
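The comparison with the current time is not shown above, so here is one way it might look. This is a minimal sketch, assuming GNU `date` (for the `-d` flag); the function name `seconds_since` and the optional second argument are introduced here for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: how many seconds ago did a Job start? Assumes GNU date (the -d flag).
# $1: ISO-8601 start time as found in .status.startTime, e.g. 2024-01-01T00:00:00Z
# $2: optional "now" in epoch seconds (defaults to the current time)
seconds_since() {
  local start_epoch now_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $((now_epoch - start_epoch))
}
```

Pass the raw timestamp without JSON quotes (for example, as printed by `jq -r`). The result can then be compared against whatever maximum age makes sense for the job, such as `[ "$(seconds_since "$START_TIME")" -gt 3600 ]` for a hypothetical one-hour limit.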

Alternatively, you can get the job's current status as a JSON blob, like this:

kubectl -n=your-namespace get job your-job-name -o json | jq '.status'

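You can also refer to the following thread.

Taking the "Failed" status as the core of your requirement, a bash script that uses kubectl and sends an email when it sees a job in the "Failed" state may be useful. Here are some examples: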

while true; do if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then mail -s jobfailed email@address < /dev/null; break; else sleep 1; fi; done

For newer versions of Kubernetes:

until kubectl wait --for=condition=failed job/myjob; do sleep 1; done; mail -s jobfailed email@address < /dev/null
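The one-liners above react to a single failed Job, whereas the question asks for an alert only on persistent failure. One possible approach is to count how many of the most recent Jobs failed in a row and alert past a threshold. The sketch below is an illustration under stated assumptions, not a tested production script: the function names, the input format, and the `THRESHOLD` value are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count how many of the most recent Jobs failed in a row.
# Input (stdin): one line per Job, oldest first, as "<job-name> <failed-flag>",
# where <failed-flag> is "True" if the Job has a Failed condition, else empty.
count_trailing_failures() {
  local name flag streak=0
  while read -r name flag || [ -n "$name" ]; do
    if [ "$flag" = "True" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any Job without a Failed condition resets the streak
    fi
  done
  echo "$streak"
}

# Hypothetical threshold: alert only after this many consecutive failed Jobs.
THRESHOLD=3

# Succeeds (exit 0) when the trailing failure streak reaches THRESHOLD.
should_alert() {
  local streak
  streak=$(count_trailing_failures)
  [ "$streak" -ge "$THRESHOLD" ]
}
```

The input could come from something like `kubectl -n your-namespace get jobs --sort-by=.status.startTime -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Failed")].status}{"\n"}{end}' | should_alert`, filtered down to the Jobs owned by the CronJob in question. Two caveats: a Job that is still running counts as not failed here, and a CronJob keeps only one failed Job by default (`.spec.failedJobsHistoryLimit` is 1), so that limit would need to be at least `THRESHOLD` for the streak to ever reach it.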
