Pods stuck in Terminating state when worker node is down (never redeployed on healthy nodes), how to fix this?

We run a Kubernetes cluster provisioned with kubespray, and we have noticed that every time a node fails (we recently ran into hardware issues), the pods running on that node get stuck in the Terminating state indefinitely. Even after several hours, the pods are not redeployed on healthy nodes, so our entire application goes down and users are affected for a long time.

How can we configure Kubernetes to fail over in this situation?

Below is our StatefulSet manifest.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: project-stock
  name: ps-ra
spec:
  selector:
    matchLabels:
      infrastructure: ps
      application: report-api
      environment: staging
  serviceName: hl-ps-report-api
  replicas: 1
  template:
    metadata:
      namespace: project-stock
      labels:
        infrastructure: ps
        application: report-api
        environment: staging
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: ps-report-api
          image: localhost:5000/ps/nodejs-chrome-application:latest
          ports:
            - containerPort: 3000
              protocol: TCP
              name: nodejs-rest-api
          resources:
            limits:
              cpu: 1000m
              memory: 8192Mi
            requests:
              cpu: 333m
              memory: 8192Mi
          livenessProbe:
            httpGet:
              path: /health/
              port: 3000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 12
            timeoutSeconds: 10

Posting this as a community wiki for better visibility. Feel free to expand it.


In my opinion, the behavior of your kubespray cluster (pods stuck in the Terminating state) is entirely by design. From the Kubernetes documentation:

A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.
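You can observe this state from the control plane: the failed node reports NotReady while its pods stay in Terminating. For example (namespace taken from the manifest above):

kubectl get nodes
kubectl get pods -n project-stock -o wide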

The same documentation describes the ways to remove a Pod stuck in the Terminating state, along with the recommended best practices:

The only ways in which a Pod in such a state can be removed from the apiserver are as follows:

  • The Node object is deleted (either by you, or by the Node Controller).
  • The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
  • Force deletion of the Pod by the user.

The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver. Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
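If you do have to fall back on the third option, a Pod stuck on an unreachable node can be force-deleted from the apiserver. A minimal example, assuming the single replica of the StatefulSet above is named ps-ra-0 (the name a StatefulSet generates for ordinal 0):

kubectl delete pod ps-ra-0 -n project-stock --grace-period=0 --force

Be aware that force deletion does not wait for confirmation from the kubelet that the container has actually stopped, so if the node comes back you may briefly end up with two pods claiming the same StatefulSet identity.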

You can also implement Graceful Node Shutdown if your node is shut down in one of the following ways:

On Linux, your system can shut down in many different situations. For example:

  • A user or script running shutdown -h now or systemctl poweroff or systemctl reboot.
  • Physically pressing a power button on the machine.
  • Stopping a VM instance on a cloud provider, e.g. gcloud compute instances stop on GCP.
  • A Preemptible VM or Spot Instance that your cloud provider can terminate unexpectedly, but with a brief warning.

Keep in mind that this feature was introduced as alpha in version 1.20 and is supported as beta as of 1.21.
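A minimal KubeletConfiguration sketch that enables it; the two grace periods below are placeholder values you should tune for your workloads:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true         # only required on 1.20, where the feature is alpha
shutdownGracePeriod: 30s             # total time the node delays shutdown for pod termination
shutdownGracePeriodCriticalPods: 10s # portion of the 30s reserved for critical pods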

Another solution mentioned in the documentation is to delete the node manually, for example with kubectl delete node <your-node-name>:

If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object.

The pods will then be rescheduled onto healthy nodes.
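A minimal sequence, with worker-2 standing in for your dead node's name:

kubectl get nodes                           # the dead node shows NotReady
kubectl delete node worker-2                # remove the Node object
kubectl get pods -n project-stock -o wide   # the pod is recreated on a healthy node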

The last workaround is to set terminationGracePeriodSeconds to 0, but this is strongly discouraged:

For the above to lead to graceful termination, the Pod must not specify a pod.Spec.TerminationGracePeriodSeconds of 0. The practice of setting a pod.Spec.TerminationGracePeriodSeconds of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
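For completeness, this is where that field sits in the manifest from the question; setting it to 0 trades safety for speed and can let two pods with the same identity run at once, so treat it strictly as a last resort:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 0  # unsafe for StatefulSet Pods: skips graceful shutdown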