Updating kubernetes with kOps causes calico-node to fail with "BIRD is not ready: BGP not established"

First, let me say that this is running on a production cluster, so any 'destructive' solution that would cause downtime is not an option (unless absolutely necessary).

My environment

I have a Kubernetes cluster on AWS (11 nodes, 3 of which are masters) running v1.13.1. The cluster was created with kOps, like so:

kops create cluster \
    --yes \
    --authorization RBAC \
    --cloud aws \
    --networking calico \
    ...

I don't think it matters, but everything on the cluster was installed via helm3.

These are my exact versions:

$ helm version
version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.5"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:31:33Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
$ kops version
Version 1.18.2
$ kubectl get nodes                                                                                                                                                                            
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-2-147-44.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-149-115.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-150-124.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-151-33.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-167-145.ec2.internal   Ready    master   43h   v1.18.14
ip-10-2-167-162.ec2.internal   Ready    node     2d    v1.13.1
ip-10-2-172-248.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-173-134.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-177-100.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-181-235.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-182-14.ec2.internal    Ready    node     47h   v1.13.1

What I'm trying to do

I'm trying to update the cluster from v1.13.1 -> v1.18.14.

I edited the configuration via

$ kops edit cluster

and changed

kubernetesVersion: 1.18.14

Then I ran

kops update cluster --yes
kops rolling-update cluster --yes

which kicked off the rolling-update process:

NAME                STATUS        NEEDUPDATE    READY   MIN   TARGET   MAX   NODES
master-us-east-1a   NeedsUpdate   1             0       1     1        1     1
master-us-east-1b   NeedsUpdate   1             0       1     1        1     1
master-us-east-1c   NeedsUpdate   1             0       1     1        1     1
nodes               NeedsUpdate   8             0       8     8        8     8

The problem:

The process gets stuck upgrading the first node, with this error:

I0108 10:48:40.137256   59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": master "ip-10-2-167-145.ec2.internal" is not ready, system-node-critical pod "calico-node-m255f" is not ready (calico-node).
I0108 10:49:12.474458   59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-node-critical pod "calico-node-m255f" is not ready (calico-node).

calico-node-m255f is the only calico-node pod in the cluster (I'm fairly sure there should be one per k8s node?).
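
A quick way to sanity-check that expectation (a DaemonSet should schedule one pod per eligible node) is to compare the DaemonSet's desired count against the node count:

$ kubectl get daemonset calico-node -n kube-system
$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

With 11 nodes, DESIRED should be 11; anything less suggests the DaemonSet isn't targeting the other nodes at all.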

Info on that pod:

$ kubectl get pods -n kube-system -o wide | grep calico-node
calico-node-m255f                                            0/1     Running             0          35m   10.2.167.145      ip-10-2-167-145.ec2.internal   <none>           <none>

$ kubectl describe pod calico-node-m255f -n kube-system

Name:                 calico-node-m255f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-2-167-145.ec2.internal/10.2.167.145
Start Time:           Fri, 08 Jan 2021 10:18:05 -0800
Labels:               controller-revision-hash=59875785d9
                      k8s-app=calico-node
                      pod-template-generation=5
                      role.kubernetes.io/networking=1
Annotations:          <none>
Status:               Running
IP:                   10.2.167.145
IPs:                  <none>
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://9a6d035ee4a9d881574f45075e033597a33118e1ed2c964204cc2a5b175fbc60
    Image:         calico/cni:v3.15.3
    Image ID:      docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:06 -0800
      Finished:     Fri, 08 Jan 2021 10:18:06 -0800
    Ready:          True
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
  install-cni:
    Container ID:  docker://5788e3519a2b1c1b77824dbfa090ad387e27d5bb16b751c3cf7637a7154ac576
    Image:         calico/cni:v3.15.3
    Image ID:      docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:07 -0800
      Finished:     Fri, 08 Jan 2021 10:18:08 -0800
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
  flexvol-driver:
    Container ID:   docker://bc8ad32a2dd0eb5bbb21843d4d248171bc117d2eede9e1efa9512026d9205888
    Image:          calico/pod2daemon-flexvol:v3.15.3
    Image ID:       docker-pullable://calico/pod2daemon-flexvol@sha256:cec7a31b08ab5f9b1ed14053b91fd08be83f58ddba0577e9dabd8b150a51233f
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:08 -0800
      Finished:     Fri, 08 Jan 2021 10:18:08 -0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Containers:
  calico-node:
    Container ID:   docker://8911e4bdc0e60aa5f6c553c0e0d0e5f7aa981d62884141120d8f7cc5bc079884
    Image:          calico/node:v3.15.3
    Image ID:       docker-pullable://calico/node@sha256:1d674438fd05bd63162d9c7b732d51ed201ee7f6331458074e3639f4437e34b1
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 08 Jan 2021 10:18:09 -0800
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                         kubernetes
      WAIT_FOR_DATASTORE:                     true
      NODENAME:                                (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:              <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                           kops,bgp
      IP:                                     autodetect
      CALICO_IPV4POOL_IPIP:                   Always
      CALICO_IPV4POOL_VXLAN:                  Never
      FELIX_IPINIPMTU:                        <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_VXLANMTU:                         <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_WIREGUARDMTU:                     <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:                   100.96.0.0/11
      CALICO_DISABLE_FILE_LOGGING:            true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:      ACCEPT
      FELIX_IPV6SUPPORT:                      false
      FELIX_LOGSEVERITYSCREEN:                info
      FELIX_HEALTHENABLED:                    true
      FELIX_IPTABLESBACKEND:                  Auto
      FELIX_PROMETHEUSMETRICSENABLED:         false
      FELIX_PROMETHEUSMETRICSPORT:            9091
      FELIX_PROMETHEUSGOMETRICSENABLED:       true
      FELIX_PROMETHEUSPROCESSMETRICSENABLED:  true
      FELIX_WIREGUARDENABLED:                 false
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-mnnrd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-mnnrd
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  35m   default-scheduler  Successfully assigned kube-system/calico-node-m255f to ip-10-2-167-145.ec2.internal
  Normal   Pulled     35m   kubelet            Container image "calico/cni:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container upgrade-ipam
  Normal   Started    35m   kubelet            Started container upgrade-ipam
  Normal   Started    35m   kubelet            Started container install-cni
  Normal   Pulled     35m   kubelet            Container image "calico/cni:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container install-cni
  Normal   Pulled     35m   kubelet            Container image "calico/pod2daemon-flexvol:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container flexvol-driver
  Normal   Started    35m   kubelet            Started container flexvol-driver
  Normal   Started    35m   kubelet            Started container calico-node
  Normal   Pulled     35m   kubelet            Container image "calico/node:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container calico-node
  Warning  Unhealthy  35m   kubelet            Readiness probe failed: 2021-01-08 18:18:12.731 [INFO][130] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:22.727 [INFO][169] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:32.733 [INFO][207] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:42.730 [INFO][237] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:52.736 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:02.731 [INFO][294] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:12.734 [INFO][318] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:22.739 [INFO][360] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:32.748 [INFO][391] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  45s (x202 over 34m)  kubelet  (combined from similar events): Readiness probe failed: 2021-01-08 18:53:12.726 [INFO][6053] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14

I can ssh into the node and check calico from there:

$ sudo ./calicoctl-linux-amd64 node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.2.147.44  | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.149.115 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.150.124 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.151.33  | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.167.162 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.172.248 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.173.134 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.177.100 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.181.235 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.182.14  | node-to-node mesh | start | 00:21:18 | Passive                        |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
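
The "Connection refused" entries mean those peers aren't accepting connections on the BGP port at all. Calico's node-to-node mesh runs BGP over TCP port 179, so it's worth probing that port from the affected node (a quick sketch, assuming nc is installed on the host):

$ nc -zv -w 3 10.2.147.44 179
$ nc -zv -w 3 10.2.167.162 179

If the probes are refused even though nothing blocks port 179, the BIRD daemons on those peers are likely not listening.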

This is the calico-node DaemonSet config (I assume it was generated by kops and hasn't been modified):

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: calico-node
  namespace: kube-system
  selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/calico-node
  uid: 33dfb80a-c840-11e9-af87-02fc30bb40d6
  resourceVersion: '142850829'
  generation: 5
  creationTimestamp: '2019-08-26T20:29:28Z'
  labels:
    k8s-app: calico-node
    role.kubernetes.io/networking: '1'
  annotations:
    deprecated.daemonset.template.generation: '5'
    kubectl.kubernetes.io/last-applied-configuration: '[cut out to save space]'
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
        role.kubernetes.io/networking: '1'
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        - name: var-run-calico
          hostPath:
            path: /var/run/calico
            type: ''
        - name: var-lib-calico
          hostPath:
            path: /var/lib/calico
            type: ''
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
            type: ''
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
            type: ''
        - name: host-local-net-dir
          hostPath:
            path: /var/lib/cni/networks
            type: ''
        - name: policysync
          hostPath:
            path: /var/run/nodeagent
            type: DirectoryOrCreate
        - name: flexvol-driver-host
          hostPath:
            path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
            type: DirectoryOrCreate
      initContainers:
        - name: upgrade-ipam
          image: 'calico/cni:v3.15.3'
          command:
            - /opt/cni/bin/calico-ipam
            - '-upgrade'
          env:
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
          resources: {}
          volumeMounts:
            - name: host-local-net-dir
              mountPath: /var/lib/cni/networks
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
        - name: install-cni
          image: 'calico/cni:v3.15.3'
          command:
            - /install-cni.sh
          env:
            - name: CNI_CONF_NAME
              value: 10-calico.conflist
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CNI_MTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: SLEEP
              value: 'false'
          resources: {}
          volumeMounts:
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin
            - name: cni-net-dir
              mountPath: /host/etc/cni/net.d
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
        - name: flexvol-driver
          image: 'calico/pod2daemon-flexvol:v3.15.3'
          resources: {}
          volumeMounts:
            - name: flexvol-driver-host
              mountPath: /host/driver
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
      containers:
        - name: calico-node
          image: 'calico/node:v3.15.3'
          env:
            - name: DATASTORE_TYPE
              value: kubernetes
            - name: WAIT_FOR_DATASTORE
              value: 'true'
            - name: NODENAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
            - name: CLUSTER_TYPE
              value: 'kops,bgp'
            - name: IP
              value: autodetect
            - name: CALICO_IPV4POOL_IPIP
              value: Always
            - name: CALICO_IPV4POOL_VXLAN
              value: Never
            - name: FELIX_IPINIPMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: FELIX_VXLANMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: FELIX_WIREGUARDMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: CALICO_IPV4POOL_CIDR
              value: 100.96.0.0/11
            - name: CALICO_DISABLE_FILE_LOGGING
              value: 'true'
            - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
              value: ACCEPT
            - name: FELIX_IPV6SUPPORT
              value: 'false'
            - name: FELIX_LOGSEVERITYSCREEN
              value: info
            - name: FELIX_HEALTHENABLED
              value: 'true'
            - name: FELIX_IPTABLESBACKEND
              value: Auto
            - name: FELIX_PROMETHEUSMETRICSENABLED
              value: 'false'
            - name: FELIX_PROMETHEUSMETRICSPORT
              value: '9091'
            - name: FELIX_PROMETHEUSGOMETRICSENABLED
              value: 'true'
            - name: FELIX_PROMETHEUSPROCESSMETRICSENABLED
              value: 'true'
            - name: FELIX_WIREGUARDENABLED
              value: 'false'
          resources:
            requests:
              cpu: 100m
          volumeMounts:
            - name: lib-modules
              readOnly: true
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
            - name: var-run-calico
              mountPath: /var/run/calico
            - name: var-lib-calico
              mountPath: /var/lib/calico
            - name: policysync
              mountPath: /var/run/nodeagent
          livenessProbe:
            exec:
              command:
                - /bin/calico-node
                - '-felix-live'
                - '-bird-live'
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
                - /bin/calico-node
                - '-felix-ready'
                - '-bird-ready'
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: calico-node
      serviceAccount: calico-node
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - operator: Exists
          effect: NoSchedule
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 1
  numberMisscheduled: 0
  desiredNumberScheduled: 1
  numberReady: 0
  observedGeneration: 5
  updatedNumberScheduled: 1
  numberUnavailable: 1

There's nothing really useful in the pod logs either; no errors or anything that stands out. It's mostly just this:

2021-01-08 19:08:21.603 [INFO][48] felix/int_dataplane.go 1245: Applying dataplane updates
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 223: Asked to resync with the dataplane on next update. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 306: Resyncing ipsets with dataplane. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/wireguard.go 578: Wireguard is not enabled
2021-01-08 19:08:21.605 [INFO][48] felix/ipsets.go 356: Finished resync family="inet" numInconsistenciesFound=0 resyncDuration=1.573324ms
2021-01-08 19:08:21.605 [INFO][48] felix/int_dataplane.go 1259: Finished applying updates to dataplane. msecToApply=2.03915
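
Since felix looks healthy, BIRD's own view of the sessions can be more telling than the readiness-probe summary. The calico/node image bundles BIRD, and you can usually query it in place (a sketch; the socket path is calico/node's default, adjust if yours differs):

$ kubectl exec -n kube-system calico-node-m255f -c calico-node -- \
    birdcl -s /var/run/calico/bird.ctl show protocols all

This prints each BGP session with its state and last error, if any.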

Things I've tried

Unfortunately I'm not a networking expert, so I haven't dug into the finer details of calico.

I've tried restarting the relevant pods, restarting the actual EC2 instances, and deleting the DaemonSet and re-adding it with the config above.

I can also assure you that there are no network restrictions (firewalls, security groups, etc.) on the internal network that could be blocking the connections.

It's also worth pointing out that the cluster was working fine before the kops rolling-update attempt.

I've pretty much hit a wall here and I'm not sure what else to try.

I solved this by updating all the masters at the same time, without validation:

kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --node-interval=1s

Now everything works!
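
If you go this route, I'd suggest re-validating before touching the worker nodes:

$ kops validate cluster
$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

Once every calico-node pod reports 1/1 Ready, it should be safe to let the rolling update continue through the remaining instance groups.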

I don't have a definitive answer as to why this works. However, jumping straight from k8s 1.13 to 1.18 skips a number of incremental changes, and that could cause the kind of problems you're seeing.

While it's always safe to use the latest kOps version (as long as it supports the k8s version you're running), k8s itself only supports upgrading one minor version at a time: https://kubernetes.io/docs/setup/release/version-skew-policy/
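
If you want to stay within that policy next time, one option is to step through the minor versions one at a time. A rough, untested sketch (kops set cluster is available in recent kOps releases; the intermediate patch versions below are only examples):

# Step from 1.13 to 1.18 one minor version at a time (illustrative versions)
for v in 1.14.10 1.15.12 1.16.15 1.17.17 1.18.14; do
  kops set cluster spec.kubernetesVersion=$v
  kops update cluster --yes
  kops rolling-update cluster --yes
  kops validate cluster --wait 10m
done

Each iteration upgrades one minor version and waits for the cluster to validate before moving on.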