Kubernetes Pod's readinessProbe errored but endpoint not removed from Service

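I am running Spinnaker on Kubernetes 1.10.11 [1]. One of the Spinnaker services is a Pod running a service called Clouddriver. This Pod was running fine, but then its readinessProbe started erroring continuously. The Kubernetes documentation says: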

readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

But this Pod's IP is still in the Service's endpoints. Why?

Clouddriver Pod YAML

kubectl -n spinnaker-test get pods spin-clouddriver-5559d44484-mp8q9 -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: spotify.backend-service
  creationTimestamp: 2019-02-15T20:46:38Z
  generateName: spin-clouddriver-5559d44484-
  labels:
    app: spin
    app.kubernetes.io/managed-by: halyard
    app.kubernetes.io/name: clouddriver
    app.kubernetes.io/part-of: spinnaker
    app.kubernetes.io/version: 1.12.1
    cluster: spin-clouddriver
    pod-template-hash: "1115800040"
  name: spin-clouddriver-5559d44484-mp8q9
  namespace: spinnaker-test
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: spin-clouddriver-5559d44484
    uid: ce79561c-3161-11e9-acdf-42010a800082
  resourceVersion: "53541277"
  selfLink: /api/v1/namespaces/spinnaker-test/pods/spin-clouddriver-5559d44484-mp8q9
  uid: caa66d7c-3162-11e9-acdf-42010a800082
spec:
  containers:
  - env:
    - name: JAVA_OPTS
      value: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
    - name: SPRING_PROFILES_ACTIVE
      value: local
    image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    imagePullPolicy: IfNotPresent
    lifecycle: {}
    name: clouddriver
    ports:
    - containerPort: 7002
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - wget
        - --no-check-certificate
        - --spider
        - -q
        - http://localhost:7002/health
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "20"
        memory: 5000Mi
      requests:
        cpu: "20"
        memory: 5000Mi
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/spinnaker/config
      name: spin-clouddriver-files-1952526246
    - mountPath: /home/halyard/.hal/k8s-spinnaker/staging/dependencies
      name: spin-clouddriver-files-1757773194
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-w2lt5
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-production-us-ce-terraform-201812-d63606d6-9vq9
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 720
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: spin-clouddriver-files-1952526246
    secret:
      defaultMode: 420
      secretName: spin-clouddriver-files-1952526246
  - name: spin-clouddriver-files-1757773194
    secret:
      defaultMode: 420
      secretName: spin-clouddriver-files-1757773194
  - name: default-token-w2lt5
    secret:
      defaultMode: 420
      secretName: default-token-w2lt5
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:46:38Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:53:40Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:46:38Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
    image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    imageID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
    lastState: {}
    name: clouddriver
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2019-02-15T20:47:26Z
  hostIP: 10.178.32.98
  phase: Running
  podIP: 10.179.34.24
  qosClass: Guaranteed
  startTime: 2019-02-15T20:46:38Z

Describing the Pod shows that the readinessProbe has been erroring continuously for more than a day.

kubectl -n spinnaker-test describe pods spin-clouddriver-5559d44484-mp8q9

Name:           spin-clouddriver-5559d44484-mp8q9
Namespace:      spinnaker-test
Node:           gke-production-us-ce-terraform-201812-d63606d6-9vq9/10.178.32.98
Start Time:     Fri, 15 Feb 2019 15:46:38 -0500
Labels:         app=spin
                app.kubernetes.io/managed-by=halyard
                app.kubernetes.io/name=clouddriver
                app.kubernetes.io/part-of=spinnaker
                app.kubernetes.io/version=1.12.1
                cluster=spin-clouddriver
                pod-template-hash=1115800040
Annotations:    kubernetes.io/psp=spotify.backend-service
Status:         Running
IP:             10.179.34.24
Controlled By:  ReplicaSet/spin-clouddriver-5559d44484
Containers:
  clouddriver:
    Container ID:   docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
    Image:          gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    Image ID:       docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
    Port:           7002/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 15 Feb 2019 15:47:26 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20
      memory:  5000Mi
    Requests:
      cpu:      20
      memory:   5000Mi
    Readiness:  exec [wget --no-check-certificate --spider -q http://localhost:7002/health] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      JAVA_OPTS:               -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
      SPRING_PROFILES_ACTIVE:  local
    Mounts:
      /home/halyard/.hal/k8s-spinnaker/staging/dependencies from spin-clouddriver-files-1757773194 (rw)
      /opt/spinnaker/config from spin-clouddriver-files-1952526246 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-w2lt5 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  spin-clouddriver-files-1952526246:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spin-clouddriver-files-1952526246
    Optional:    false
  spin-clouddriver-files-1757773194:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spin-clouddriver-files-1757773194
    Optional:    false
  default-token-w2lt5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-w2lt5
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                                          Message
  ----     ------     ----               ----                                                          -------
  Warning  Unhealthy  3m (x321 over 1d)  kubelet, gke-production-us-ce-terraform-201812-d63606d6-9vq9  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

But the Service still has the Pod's IP, 10.179.34.24, in its endpoints.

kubectl -n spinnaker-test describe services spin-clouddriver

Name:              spin-clouddriver
Namespace:         spinnaker-test
Labels:            app=spin
                   cluster=spin-clouddriver
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"spin","cluster":"spin-clouddriver"},"name":"spin-clouddriver","namesp...
Selector:          app=spin,cluster=spin-clouddriver
Type:              ClusterIP
IP:                10.178.65.100
Port:              <unset>  7002/TCP
TargetPort:        7002/TCP
Endpoints:         10.179.34.24:7002
Session Affinity:  None
Events:            <none>

kubectl -n spinnaker-test describe endpoints spin-clouddriver

Name:         spin-clouddriver
Namespace:    spinnaker-test
Labels:       app=spin
              cluster=spin-clouddriver
Annotations:  <none>
Subsets:
  Addresses:          10.179.34.24
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  7002  TCP

Events:  <none>

Footnotes

  1. GKE 1.10.11-gke.1 to be exact, but the fact that it is GKE should not matter.

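A probe run by the kubelet can end in one of three states:

  • Success
  • Failure (the command returned a non-zero exit code)
  • Error (the command did not return before the timeout, the command does not exist in the container, etc.)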
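
To make the three states concrete, here is a rough Python sketch that classifies one run of an exec-style command. It is only an analogy for what the list above describes, not the kubelet's code: run_exec_probe and the three example commands are names made up for illustration, and the current Python interpreter stands in for the wget call in the Pod spec.

```python
import subprocess
import sys


def run_exec_probe(command, timeout_seconds=1.0):
    """Classify one run of `command` as 'success', 'failure', or 'error'."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The command did not return before the timeout.
        return "error"
    except OSError:
        # The command could not be started at all (for example, it does not exist).
        return "error"
    # The command returned: the exit code decides between success and failure.
    return "success" if completed.returncode == 0 else "failure"


# Hypothetical example commands, using the current interpreter as a stand-in.
EXIT_ZERO = [sys.executable, "-c", "raise SystemExit(0)"]
EXIT_ONE = [sys.executable, "-c", "raise SystemExit(1)"]
SLEEPY = [sys.executable, "-c", "import time; time.sleep(5)"]
```

With these, run_exec_probe(EXIT_ONE, timeout_seconds=30) is a failure, while run_exec_probe(SLEEPY, timeout_seconds=0.5) is an error because the command never gets to return an exit code.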

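Here is the code (in 1.10.11) that records the "probe errored" event. Note that err != nil.

Here is the code that calls the above function: when err != nil (the probe returned an error), the result is discarded.

Only a failed probe actually causes the Pod's readiness state to change.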
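
The effect on readiness can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not a port of the kubelet's Go code: replay_probe_results and the string constants are hypothetical names, and the starting state (already Ready, as in the question) is an assumption. The thresholds mirror failureThreshold: 3 and successThreshold: 1 from the Pod spec.

```python
SUCCESS, FAILURE, ERROR = "success", "failure", "error"


def replay_probe_results(outcomes, failure_threshold=3, success_threshold=1, ready=True):
    """Replay a sequence of probe outcomes and return the final readiness."""
    last_result = SUCCESS if ready else FAILURE
    result_run = 0  # consecutive count of last_result
    for outcome in outcomes:
        if outcome == ERROR:
            # Errored probe: the result is thrown away. In this model it
            # neither counts toward a threshold nor resets the streak.
            continue
        if outcome == last_result:
            result_run += 1
        else:
            last_result = outcome
            result_run = 1
        if outcome == FAILURE and result_run < failure_threshold:
            continue  # below failureThreshold, readiness unchanged
        if outcome == SUCCESS and result_run < success_threshold:
            continue  # below successThreshold, readiness unchanged
        ready = outcome == SUCCESS
    return ready


# 321 errored probes, as in the Events output: the Pod stays Ready.
still_ready = replay_probe_results([ERROR] * 321)
# Three real failures in a row: the Pod becomes NotReady.
after_failures = replay_probe_results([FAILURE] * 3)
```

In this model, no number of errored probes ever moves the Pod out of Ready, which matches what the Endpoints object shows: 10.179.34.24 stays under Addresses and NotReadyAddresses stays empty.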