Updating kubernetes with kOps causes calico-node to fail with "BIRD is not ready: BGP not established"
First, let me say that this is running on a production cluster, so any 'destructive' solution that would cause downtime is not an option (unless absolutely necessary).
My environment
I have a Kubernetes cluster on AWS (11 nodes, 3 of which are masters) running v1.13.1. The cluster was created with kOps, roughly like this:
kops create cluster \
--yes \
--authorization RBAC \
--cloud aws \
--networking calico \
...
I don't think it's relevant, but everything on the cluster is installed via helm3.
Here are my exact versions:
$ helm version
version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.5"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:31:33Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
$ kops version
Version 1.18.2
$ kubectl get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-2-147-44.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-149-115.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-150-124.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-151-33.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-167-145.ec2.internal   Ready    master   43h   v1.18.14
ip-10-2-167-162.ec2.internal   Ready    node     2d    v1.13.1
ip-10-2-172-248.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-173-134.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-177-100.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-181-235.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-182-14.ec2.internal    Ready    node     47h   v1.13.1
What I'm trying to do
I'm trying to update the cluster from v1.13.1 -> v1.18.14.
I edited the config with
$ kops edit cluster
and changed
kubernetesVersion: 1.18.14
Then I ran
kops update cluster --yes
kops rolling-update cluster --yes
and the rolling update process started:
NAME                STATUS        NEEDUPDATE   READY   MIN   TARGET   MAX   NODES
master-us-east-1a   NeedsUpdate   1            0       1     1        1     1
master-us-east-1b   NeedsUpdate   1            0       1     1        1     1
master-us-east-1c   NeedsUpdate   1            0       1     1        1     1
nodes               NeedsUpdate   8            0       8     8        8     8
The problem:
The process gets stuck upgrading the first node, with this error:
I0108 10:48:40.137256 59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": master "ip-10-2-167-145.ec2.internal" is not ready, system-node-critical pod "calico-node-m255f" is not ready (calico-node).
I0108 10:49:12.474458 59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-node-critical pod "calico-node-m255f" is not ready (calico-node).
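For reference, the same validation can be run on demand while the rolling update is stuck (standard kOps command):
$ kops validate cluster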
calico-node-m255f is the only calico-node pod in the cluster (I'm pretty sure there should be one per k8s node?).
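A quick way to sanity-check that is to look at the DaemonSet's own counters (standard kubectl); DESIRED and READY should normally match the number of nodes in the cluster:
$ kubectl get daemonset calico-node -n kube-system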
Info about that pod:
$ kubectl get pods -n kube-system -o wide | grep calico-node
calico-node-m255f 0/1 Running 0 35m 10.2.167.145 ip-10-2-167-145.ec2.internal <none> <none>
$ kubectl describe pod calico-node-m255f -n kube-system
Name: calico-node-m255f
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: ip-10-2-167-145.ec2.internal/10.2.167.145
Start Time: Fri, 08 Jan 2021 10:18:05 -0800
Labels: controller-revision-hash=59875785d9
k8s-app=calico-node
pod-template-generation=5
role.kubernetes.io/networking=1
Annotations: <none>
Status: Running
IP: 10.2.167.145
IPs: <none>
Controlled By: DaemonSet/calico-node
Init Containers:
upgrade-ipam:
Container ID: docker://9a6d035ee4a9d881574f45075e033597a33118e1ed2c964204cc2a5b175fbc60
Image: calico/cni:v3.15.3
Image ID: docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/calico-ipam
-upgrade
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:06 -0800
Finished: Fri, 08 Jan 2021 10:18:06 -0800
Ready: True
Restart Count: 0
Environment:
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/lib/cni/networks from host-local-net-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
install-cni:
Container ID: docker://5788e3519a2b1c1b77824dbfa090ad387e27d5bb16b751c3cf7637a7154ac576
Image: calico/cni:v3.15.3
Image ID: docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:07 -0800
Finished: Fri, 08 Jan 2021 10:18:08 -0800
Ready: True
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CNI_MTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
SLEEP: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
flexvol-driver:
Container ID: docker://bc8ad32a2dd0eb5bbb21843d4d248171bc117d2eede9e1efa9512026d9205888
Image: calico/pod2daemon-flexvol:v3.15.3
Image ID: docker-pullable://calico/pod2daemon-flexvol@sha256:cec7a31b08ab5f9b1ed14053b91fd08be83f58ddba0577e9dabd8b150a51233f
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:08 -0800
Finished: Fri, 08 Jan 2021 10:18:08 -0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Containers:
calico-node:
Container ID: docker://8911e4bdc0e60aa5f6c553c0e0d0e5f7aa981d62884141120d8f7cc5bc079884
Image: calico/node:v3.15.3
Image ID: docker-pullable://calico/node@sha256:1d674438fd05bd63162d9c7b732d51ed201ee7f6331458074e3639f4437e34b1
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 08 Jan 2021 10:18:09 -0800
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Liveness: exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
NODENAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: kops,bgp
IP: autodetect
CALICO_IPV4POOL_IPIP: Always
CALICO_IPV4POOL_VXLAN: Never
FELIX_IPINIPMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
FELIX_VXLANMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
FELIX_WIREGUARDMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
CALICO_IPV4POOL_CIDR: 100.96.0.0/11
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSCREEN: info
FELIX_HEALTHENABLED: true
FELIX_IPTABLESBACKEND: Auto
FELIX_PROMETHEUSMETRICSENABLED: false
FELIX_PROMETHEUSMETRICSPORT: 9091
FELIX_PROMETHEUSGOMETRICSENABLED: true
FELIX_PROMETHEUSPROCESSMETRICSENABLED: true
FELIX_WIREGUARDENABLED: false
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-local-net-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/networks
HostPathType:
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
calico-node-token-mnnrd:
Type: Secret (a volume populated by a Secret)
SecretName: calico-node-token-mnnrd
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 35m default-scheduler Successfully assigned kube-system/calico-node-m255f to ip-10-2-167-145.ec2.internal
Normal Pulled 35m kubelet Container image "calico/cni:v3.15.3" already present on machine
Normal Created 35m kubelet Created container upgrade-ipam
Normal Started 35m kubelet Started container upgrade-ipam
Normal Started 35m kubelet Started container install-cni
Normal Pulled 35m kubelet Container image "calico/cni:v3.15.3" already present on machine
Normal Created 35m kubelet Created container install-cni
Normal Pulled 35m kubelet Container image "calico/pod2daemon-flexvol:v3.15.3" already present on machine
Normal Created 35m kubelet Created container flexvol-driver
Normal Started 35m kubelet Started container flexvol-driver
Normal Started 35m kubelet Started container calico-node
Normal Pulled 35m kubelet Container image "calico/node:v3.15.3" already present on machine
Normal Created 35m kubelet Created container calico-node
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:12.731 [INFO][130] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:22.727 [INFO][169] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:32.733 [INFO][207] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:42.730 [INFO][237] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:52.736 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:02.731 [INFO][294] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:12.734 [INFO][318] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:22.739 [INFO][360] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:32.748 [INFO][391] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 45s (x202 over 34m) kubelet (combined from similar events): Readiness probe failed: 2021-01-08 18:53:12.726 [INFO][6053] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
I can ssh into the node and check calico from there:
$ sudo ./calicoctl-linux-amd64 node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.2.147.44 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.149.115 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.150.124 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.151.33 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.167.162 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.172.248 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.173.134 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.177.100 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.181.235 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.182.14 | node-to-node mesh | start | 00:21:18 | Passive |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
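In that output, "Active Socket: Connection refused" means this node is actively trying to open the BGP session and the peer's TCP port is refusing it, while "Passive" means BIRD is waiting for the peer to connect in. A rough transport-level check from the stuck node, assuming nc and ss are available on the host:
# Can we reach the BGP port on one of the refusing peers?
nc -zv -w 3 10.2.147.44 179
# Is BIRD listening on 179 locally on this node?
sudo ss -tnlp | grep ':179'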
Here is the calico-node DaemonSet config (I assume this was generated by kops and has not been modified):
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: calico-node
namespace: kube-system
selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/calico-node
uid: 33dfb80a-c840-11e9-af87-02fc30bb40d6
resourceVersion: '142850829'
generation: 5
creationTimestamp: '2019-08-26T20:29:28Z'
labels:
k8s-app: calico-node
role.kubernetes.io/networking: '1'
annotations:
deprecated.daemonset.template.generation: '5'
kubectl.kubernetes.io/last-applied-configuration: '[cut out to save space]'
spec:
selector:
matchLabels:
k8s-app: calico-node
template:
metadata:
creationTimestamp: null
labels:
k8s-app: calico-node
role.kubernetes.io/networking: '1'
spec:
volumes:
- name: lib-modules
hostPath:
path: /lib/modules
type: ''
- name: var-run-calico
hostPath:
path: /var/run/calico
type: ''
- name: var-lib-calico
hostPath:
path: /var/lib/calico
type: ''
- name: xtables-lock
hostPath:
path: /run/xtables.lock
type: FileOrCreate
- name: cni-bin-dir
hostPath:
path: /opt/cni/bin
type: ''
- name: cni-net-dir
hostPath:
path: /etc/cni/net.d
type: ''
- name: host-local-net-dir
hostPath:
path: /var/lib/cni/networks
type: ''
- name: policysync
hostPath:
path: /var/run/nodeagent
type: DirectoryOrCreate
- name: flexvol-driver-host
hostPath:
path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
type: DirectoryOrCreate
initContainers:
- name: upgrade-ipam
image: 'calico/cni:v3.15.3'
command:
- /opt/cni/bin/calico-ipam
- '-upgrade'
env:
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
resources: {}
volumeMounts:
- name: host-local-net-dir
mountPath: /var/lib/cni/networks
- name: cni-bin-dir
mountPath: /host/opt/cni/bin
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
- name: install-cni
image: 'calico/cni:v3.15.3'
command:
- /install-cni.sh
env:
- name: CNI_CONF_NAME
value: 10-calico.conflist
- name: CNI_NETWORK_CONFIG
valueFrom:
configMapKeyRef:
name: calico-config
key: cni_network_config
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CNI_MTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: SLEEP
value: 'false'
resources: {}
volumeMounts:
- name: cni-bin-dir
mountPath: /host/opt/cni/bin
- name: cni-net-dir
mountPath: /host/etc/cni/net.d
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
- name: flexvol-driver
image: 'calico/pod2daemon-flexvol:v3.15.3'
resources: {}
volumeMounts:
- name: flexvol-driver-host
mountPath: /host/driver
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
containers:
- name: calico-node
image: 'calico/node:v3.15.3'
env:
- name: DATASTORE_TYPE
value: kubernetes
- name: WAIT_FOR_DATASTORE
value: 'true'
- name: NODENAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
- name: CLUSTER_TYPE
value: 'kops,bgp'
- name: IP
value: autodetect
- name: CALICO_IPV4POOL_IPIP
value: Always
- name: CALICO_IPV4POOL_VXLAN
value: Never
- name: FELIX_IPINIPMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: FELIX_VXLANMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: FELIX_WIREGUARDMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: CALICO_IPV4POOL_CIDR
value: 100.96.0.0/11
- name: CALICO_DISABLE_FILE_LOGGING
value: 'true'
- name: FELIX_DEFAULTENDPOINTTOHOSTACTION
value: ACCEPT
- name: FELIX_IPV6SUPPORT
value: 'false'
- name: FELIX_LOGSEVERITYSCREEN
value: info
- name: FELIX_HEALTHENABLED
value: 'true'
- name: FELIX_IPTABLESBACKEND
value: Auto
- name: FELIX_PROMETHEUSMETRICSENABLED
value: 'false'
- name: FELIX_PROMETHEUSMETRICSPORT
value: '9091'
- name: FELIX_PROMETHEUSGOMETRICSENABLED
value: 'true'
- name: FELIX_PROMETHEUSPROCESSMETRICSENABLED
value: 'true'
- name: FELIX_WIREGUARDENABLED
value: 'false'
resources:
requests:
cpu: 100m
volumeMounts:
- name: lib-modules
readOnly: true
mountPath: /lib/modules
- name: xtables-lock
mountPath: /run/xtables.lock
- name: var-run-calico
mountPath: /var/run/calico
- name: var-lib-calico
mountPath: /var/lib/calico
- name: policysync
mountPath: /var/run/nodeagent
livenessProbe:
exec:
command:
- /bin/calico-node
- '-felix-live'
- '-bird-live'
initialDelaySeconds: 10
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
readinessProbe:
exec:
command:
- /bin/calico-node
- '-felix-ready'
- '-bird-ready'
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
restartPolicy: Always
terminationGracePeriodSeconds: 0
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: calico-node
serviceAccount: calico-node
hostNetwork: true
securityContext: {}
schedulerName: default-scheduler
tolerations:
- operator: Exists
effect: NoSchedule
- key: CriticalAddonsOnly
operator: Exists
- operator: Exists
effect: NoExecute
priorityClassName: system-node-critical
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
revisionHistoryLimit: 10
status:
currentNumberScheduled: 1
numberMisscheduled: 0
desiredNumberScheduled: 1
numberReady: 0
observedGeneration: 5
updatedNumberScheduled: 1
numberUnavailable: 1
There's nothing really useful in the pod logs either; no errors or anything obvious. It's mostly just this:
2021-01-08 19:08:21.603 [INFO][48] felix/int_dataplane.go 1245: Applying dataplane updates
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 223: Asked to resync with the dataplane on next update. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 306: Resyncing ipsets with dataplane. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/wireguard.go 578: Wireguard is not enabled
2021-01-08 19:08:21.605 [INFO][48] felix/ipsets.go 356: Finished resync family="inet" numInconsistenciesFound=0 resyncDuration=1.573324ms
2021-01-08 19:08:21.605 [INFO][48] felix/int_dataplane.go 1259: Finished applying updates to dataplane. msecToApply=2.03915
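The readiness probe only prints a summary of BIRD's state; BIRD itself can be queried for per-peer detail. A diagnostic sketch, assuming the calico/node image ships the birdcl client with its usual control socket path:
$ kubectl exec -n kube-system calico-node-m255f -c calico-node -- \
    birdcl -s /var/run/calico/bird.ctl show protocols all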
What I've tried
Unfortunately, I'm not a networking expert, so I haven't dug deep into calico's internals.
I've tried restarting the relevant pods, rebooting the actual EC2 instances, and deleting the DaemonSet and re-adding it with the config above.
I can also assure you that there are no network restrictions (firewalls, security groups, etc.) on the internal network that could be blocking connectivity.
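For anyone wanting to verify that kind of claim themselves: Calico's BGP mesh needs TCP 179 open between all nodes, plus IP protocol 4 (IP-in-IP), since CALICO_IPV4POOL_IPIP is set to Always. A sketch with a hypothetical security group id:
# sg-0123456789abcdef0 is a placeholder; use the cluster's actual node security group
$ aws ec2 describe-security-groups \
    --group-ids sg-0123456789abcdef0 \
    --query 'SecurityGroups[].IpPermissions'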
It's also worth pointing out that this cluster was working fine before the kops rolling-update attempt.
I've pretty much hit a wall here and am not sure what else to try.
I solved this by updating all the masters at the same time, skipping validation:
kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --node-interval=1s
Everything works now!
I don't have a definitive answer as to why this happened, but jumping straight from k8s 1.13 to 1.18 skips a number of incremental changes, and that may well cause the kind of problem you saw.
While it's always safe to use the latest kOps version (as long as it supports the k8s version you're running), k8s itself only supports upgrading one minor version at a time: https://kubernetes.io/docs/setup/release/version-skew-policy/
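For reference, a skew-policy-compliant path would look roughly like this (a sketch; each hop should be an actual patch release supported by your kOps version):
# Repeat for each minor version: 1.14 -> 1.15 -> 1.16 -> 1.17 -> 1.18
$ kops edit cluster                    # set kubernetesVersion to the next minor
$ kops update cluster --yes
$ kops rolling-update cluster --yes    # let validation pass between hops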