Healthcheck to detect drained node

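We have a Kubernetes cluster behind an L4 load balancer, but we have no programmatic access to the load balancer to add or remove nodes when we need to update or reboot one (the LB is managed by our hosting provider's support team).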

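The load balancer does support health checks, but the current setup calls port 80 on each node to determine whether the node is healthy. That check succeeds even when the node is drained, so we have no choice but to reboot the node and wait up to 10 seconds for the LB to notice and take it out of the pool when the kubeapi dies.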

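I would like something like a per-node pod that we can use to determine whether the node is alive, presumably exposed through a node port. The problem is that I can't find out how to do this. If I use a DaemonSet, I don't think the pods are evicted during a drain, so that wouldn't work. If I use a normal Deployment, there is no guarantee that a healthy node will have an instance of the pod, so it would appear unhealthy. Even with anti-affinity settings, I don't think it is possible to guarantee that every healthy node has a running pod to check.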

Does anyone know of a way to detect a node drain with a TCP or HTTP call to the node?

The solution you are looking for seems to be fully described in this documentation:

Node Problem Detector is a daemon for monitoring and reporting about a node's health. You can run Node Problem Detector as a DaemonSet or as a standalone daemon. Node Problem Detector collects information about node problems from various daemons and reports these conditions to the API server as NodeCondition and Event.

You can build your node monitoring on top of the node's conditions.
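As a rough illustration of condition-based monitoring, here is a minimal sketch of an HTTP endpoint the load balancer could probe. It is not taken from the linked documentation. It assumes the script runs as a DaemonSet pod exposed on the node (for example through a host port), that a `NODE_NAME` environment variable is injected from `spec.nodeName` via the downward API, and that the pod's service account is allowed to `get` nodes. The names `node_is_serviceable`, `fetch_node`, `HealthHandler` and `serve` are made up for this example. Treating every non-`Ready` condition with status `True` as a problem is also an assumption; it matches the built-in pressure conditions and the conditions Node Problem Detector reports, but may not suit custom conditions.

```python
import json
import os
import ssl
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Standard in-cluster service account mount and API server address.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
API_SERVER = "https://kubernetes.default.svc"


def node_is_serviceable(node: dict) -> bool:
    """Decide from a Node object (parsed JSON) whether the LB should keep it."""
    # `kubectl cordon` and `kubectl drain` set spec.unschedulable to true.
    if node.get("spec", {}).get("unschedulable", False):
        return False
    ready = False
    for cond in node.get("status", {}).get("conditions", []):
        is_true = cond.get("status") == "True"
        if cond.get("type") == "Ready":
            ready = is_true
        elif is_true:
            # Assumption: any other condition that is "True" signals a problem
            # (MemoryPressure, DiskPressure, KernelDeadlock, ...).
            return False
    return ready


def fetch_node(name: str) -> dict:
    """Read one Node object from the API server using the pod's service account."""
    with open(os.path.join(SA_DIR, "token")) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=os.path.join(SA_DIR, "ca.crt"))
    req = urllib.request.Request(
        f"{API_SERVER}/api/v1/nodes/{name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, context=ctx, timeout=3) as resp:
        return json.load(resp)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            ok = node_is_serviceable(fetch_node(os.environ["NODE_NAME"]))
        except Exception:
            ok = False  # fail closed if the node state cannot be read
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"ok\n" if ok else b"unavailable\n")


def serve(port: int = 8080) -> None:
    """Blocking call; intended as the container entrypoint."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. With this sketch, cordoning or draining a node makes the probe return 503 while the DaemonSet pod keeps running, so the LB can drop the node before the reboot. Note that failing closed marks every node unhealthy if the API server is unreachable; caching the last known state may be preferable in that case.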

You also need to be aware of the limitations:

Node Problem Detector only supports file based kernel log. Log tools such as journald are not supported.

Node Problem Detector uses the kernel log format for reporting kernel issues. To learn how to extend the kernel log format, see Add support for another log format.

load-balancing

kubernetes