K8S Pod with startupProbe and initialDelaySeconds specified waits too long to become Ready

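I have been trying to debug a very strange delay in my K8S deployment, and I have tracked it down to the simple reproduction below. It appears that if I set initialDelaySeconds on the startup probe, or leave it at 0 and the probe fails once, the probe does not run again for a while, and the Pod ends up taking at least 1-1.5 minutes to reach the Ready:true state.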

I'm running locally on Ubuntu 18.04 with microk8s v1.19.3, using the following manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: microbot
  name: microbot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: microbot
  strategy: {}
  template:
    metadata:
      labels:
        app: microbot
    spec:
      containers:
      - image: cdkbot/microbot-amd64
        name: microbot
        command: ["/bin/sh"]
        args: ["-c", "sleep 3; /start_nginx.sh"]
        #args: ["-c", "/start_nginx.sh"]
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0  # 5 also has same issue
          periodSeconds: 1
          failureThreshold: 10
          successThreshold: 1
        ##livenessProbe:
        ##  httpGet:
        ##    path: /
        ##    port: 80
        ##  initialDelaySeconds: 0
        ##  periodSeconds: 10
        ##  failureThreshold: 1
        resources: {}
      restartPolicy: Always
      serviceAccountName: ""
status: {}
---
apiVersion: v1
kind: Service
metadata:
  name: microbot
  labels:
    app: microbot
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: microbot

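The issue is that if I have any delay in the startupProbe, or if there is an initial failure, the pod gets into the Initialized:true state but has Ready:False and ContainersReady:False. It does not change from this state for 1-1.5 minutes. I haven't found a pattern in the settings.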

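I left the commented-out settings in as well so you can see what I am trying to get to here. What I have is a container that starts up and runs a service that takes a few seconds to become available. I want to tell the startupProbe to wait a little and then check every second to see whether the service is ready. The configuration seems to work, but there is a delay I can't track down: even after the startup probe passes, the pod does not transition to Ready for more than a minute.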

Is there some setting elsewhere in k8s that delays how long it takes a Pod to move into Ready if it isn't Ready initially?

Any ideas are greatly appreciated.

Actually, I made a mistake in my comment: you can use initialDelaySeconds in a startupProbe, but you should use failureThreshold and periodSeconds instead.


As mentioned here:

Kubernetes Probes

Kubernetes supports readiness and liveness probes for versions ≤ 1.15. Startup probes were added in 1.16 as an alpha feature and graduated to beta in 1.18 (WARNING: 1.16 deprecated several Kubernetes APIs. Use this migration guide to check for compatibility). All the probes have the following parameters:

  • initialDelaySeconds : number of seconds to wait before initiating liveness or readiness probes
  • periodSeconds: how often to check the probe
  • timeoutSeconds: number of seconds before marking the probe as timing out (failing the health check)
  • successThreshold : minimum number of consecutive successful checks for the probe to pass
  • failureThreshold : number of retries before marking the probe as failed. For liveness probes, this will lead to the pod restarting. For readiness probes, this will mark the pod as unready.
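
To make the parameters above concrete, here is a sketch of a probe with all five fields written out. The values are the Kubernetes defaults; the path and port are placeholders chosen for illustration, not something taken from the quoted article.

```yaml
readinessProbe:
  httpGet:
    path: /healthz         # placeholder path (assumption)
    port: 8080             # placeholder port (assumption)
  initialDelaySeconds: 0   # default: no extra wait before the first probe
  periodSeconds: 10        # default: probe every 10 seconds
  timeoutSeconds: 1        # default: no response within 1s counts as a failure
  successThreshold: 1      # default: one success marks the probe as passing
  failureThreshold: 3      # default: three consecutive failures mark the probe as failed
```

Note that successThreshold must be 1 for liveness and startup probes; only readiness probes accept a larger value.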

So why should you use failureThreshold and periodSeconds?

Consider an application where it occasionally needs to download large amounts of data or do an expensive operation at the start of the process. Since initialDelaySeconds is a static number, we are forced to always take the worst-case scenario (or extend the failureThreshold that may affect long-running behavior) and wait for a long time even when that application does not need to carry out long-running initialization steps. With startup probes, we can instead configure failureThreshold and periodSeconds to model this uncertainty better. For example, setting failureThreshold to 15 and periodSeconds to 5 means the application will get 15 (fifteen) x 5 (five) = 75s to start up before it fails.
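
As a rough sketch, the 75-second example from the quote would correspond to a startupProbe like the one below. The httpGet path and port are assumptions borrowed from the manifest in the question; only the two threshold values come from the quoted text.

```yaml
startupProbe:
  httpGet:
    path: /                # assumed, taken from the question's manifest
    port: 80               # assumed, taken from the question's manifest
  periodSeconds: 5         # probe every 5 seconds
  failureThreshold: 15     # up to 15 failed probes: 15 x 5 = 75s startup budget
```

A container that starts quickly passes on an early probe and does not wait out the full 75 seconds, which is the advantage over a fixed initialDelaySeconds.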

Additionally, if you need more information, check out this article on Medium.


Quoting the Kubernetes documentation on Protect slow starting containers with startup probes:

Sometimes, you have to deal with legacy applications that might require an additional startup time on their first initialization. In such cases, it can be tricky to set up liveness probe parameters without compromising the fast response to deadlocks that motivated such a probe. The trick is to set up a startup probe with the same command, HTTP or TCP check, with a failureThreshold * periodSeconds long enough to cover the worse case startup time.

So, the previous example would become:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
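
Applied to the microbot container from the question, the same pattern might look like the fragment below. This is only a sketch: the threshold values are illustrative, and it has not been verified to remove the 1-1.5 minute delay described in the question.

```yaml
# Fragment of the container spec from the question's Deployment (illustrative values)
containers:
- image: cdkbot/microbot-amd64
  name: microbot
  command: ["/bin/sh"]
  args: ["-c", "sleep 3; /start_nginx.sh"]
  ports:
  - containerPort: 80
  startupProbe:
    httpGet:
      path: /
      port: 80
    periodSeconds: 1       # check once per second
    failureThreshold: 10   # 10 x 1 = 10s startup budget, no initialDelaySeconds
  livenessProbe:
    httpGet:
      path: /
      port: 80
    periodSeconds: 10
    failureThreshold: 1    # restart on the first failure once startup has succeeded
```

The liveness probe here is the one left commented out in the question; it only begins running after the startup probe has succeeded.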