calico-kube-controllers and calico-node are not ready (CrashLoopBackOff)
I deployed a brand-new k8s cluster with kubespray and everything went fine, but none of the calico-related pods ever become ready. After several hours of debugging I cannot find the reason the calico pods are crashing. I even disabled/stopped the entire firewalld service, but nothing changed.
Another important detail is that the output of calicoctl node status is unstable and shows something different on every invocation:
Calico process is not running.
Calico process is running.
None of the BGP backend processes (BIRD or GoBGP) are running.
Calico process is running.
IPv4 BGP status
+----------------+-------------------+-------+----------+---------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+----------------+-------------------+-------+----------+---------+
| 192.168.231.42 | node-to-node mesh | start | 06:23:41 | Passive |
+----------------+-------------------+-------+----------+---------+
IPv6 BGP status
No IPv6 peers found.
Another log message that shows up frequently is the following:
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
I also tried changing IP_AUTODETECTION_METHOD with each of the following, but nothing changed:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=8.8.8.8
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth1
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth.*
Expected Behavior
All calico-related pods, daemonsets, deployments, and replicasets should be in the READY state.
Current Behavior
All calico-related pods, daemonsets, deployments, and replicasets are in the NOT READY state.
Possible Solution
None yet; I am looking for help on how to debug/overcome this issue.
Steps to Reproduce (for bugs)
It is the latest version of kubespray, with the following context and environment.
git reflog
7e4b176 HEAD@{0}: clone: from https://github.com/kubernetes-sigs/kubespray.git
Context
I am trying to deploy a k8s cluster with one master node and one worker node. Also note that the servers in this cluster sit in an almost airgapped/offline environment with restricted access to the global internet. The kubespray deployment process itself completes successfully, but I am facing this problem with the calico pods.
Your Environment
cat inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.231.41
      ansible_port: 32244
      ip: 192.168.231.41
      access_ip: 192.168.231.41
    node2:
      ansible_host: 192.168.231.42
      ansible_port: 32244
      ip: 192.168.231.42
      access_ip: 192.168.231.42
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
calicoctl version
Client Version: v3.19.2
Git commit: 6f3d4900
Cluster Version: v3.19.2
Cluster Type: kubespray,bgp,kubeadm,kdd,k8s
cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
uname -r
3.10.0-1160.42.2.el7.x86_64
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:10:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready control-plane,master 19h v1.21.4 192.168.231.41 <none> CentOS Linux 7 (Core) 3.10.0-1160.42.2.el7.x86_64 docker://20.10.8
node2 Ready <none> 19h v1.21.4 192.168.231.42 <none> CentOS Linux 7 (Core) 3.10.0-1160.42.2.el7.x86_64 docker://20.10.8
kubectl get all --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/calico-kube-controllers-8575b76f66-57zw4 0/1 CrashLoopBackOff 327 19h 192.168.231.42 node2 <none> <none>
kube-system pod/calico-node-4hkzb 0/1 Running 245 14h 192.168.231.42 node2 <none> <none>
kube-system pod/calico-node-hznhc 0/1 Running 245 14h 192.168.231.41 node1 <none> <none>
kube-system pod/coredns-8474476ff8-b6lqz 1/1 Running 0 19h 10.233.96.1 node2 <none> <none>
kube-system pod/coredns-8474476ff8-gdkml 1/1 Running 0 19h 10.233.90.1 node1 <none> <none>
kube-system pod/dns-autoscaler-7df78bfcfb-xnn4r 1/1 Running 0 19h 10.233.90.2 node1 <none> <none>
kube-system pod/kube-apiserver-node1 1/1 Running 0 19h 192.168.231.41 node1 <none> <none>
kube-system pod/kube-controller-manager-node1 1/1 Running 0 19h 192.168.231.41 node1 <none> <none>
kube-system pod/kube-proxy-dmw22 1/1 Running 0 19h 192.168.231.41 node1 <none> <none>
kube-system pod/kube-proxy-wzpnv 1/1 Running 0 19h 192.168.231.42 node2 <none> <none>
kube-system pod/kube-scheduler-node1 1/1 Running 0 19h 192.168.231.41 node1 <none> <none>
kube-system pod/nginx-proxy-node2 1/1 Running 0 19h 192.168.231.42 node2 <none> <none>
kube-system pod/nodelocaldns-6h5q2 1/1 Running 0 19h 192.168.231.42 node2 <none> <none>
kube-system pod/nodelocaldns-7fwbd 1/1 Running 0 19h 192.168.231.41 node1 <none> <none>
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.233.0.1 <none> 443/TCP 19h <none>
kube-system service/coredns ClusterIP 10.233.0.3 <none> 53/UDP,53/TCP,9153/TCP 19h k8s-app=kube-dns
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
kube-system daemonset.apps/calico-node 2 2 0 2 0 kubernetes.io/os=linux 19h calico-node quay.io/calico/node:v3.19.2 k8s-app=calico-node
kube-system daemonset.apps/kube-proxy 2 2 2 2 2 kubernetes.io/os=linux 19h kube-proxy k8s.gcr.io/kube-proxy:v1.21.4 k8s-app=kube-proxy
kube-system daemonset.apps/nodelocaldns 2 2 2 2 2 kubernetes.io/os=linux 19h node-cache k8s.gcr.io/dns/k8s-dns-node-cache:1.17.1 k8s-app=nodelocaldns
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
kube-system deployment.apps/calico-kube-controllers 0/1 1 0 19h calico-kube-controllers quay.io/calico/kube-controllers:v3.19.2 k8s-app=calico-kube-controllers
kube-system deployment.apps/coredns 2/2 2 2 19h coredns k8s.gcr.io/coredns/coredns:v1.8.0 k8s-app=kube-dns
kube-system deployment.apps/dns-autoscaler 1/1 1 1 19h autoscaler k8s.gcr.io/cpa/cluster-proportional-autoscaler-amd64:1.8.3 k8s-app=dns-autoscaler
NAMESPACE NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
kube-system replicaset.apps/calico-kube-controllers-8575b76f66 1 1 0 19h calico-kube-controllers quay.io/calico/kube-controllers:v3.19.2 k8s-app=calico-kube-controllers,pod-template-hash=8575b76f66
kube-system replicaset.apps/coredns-8474476ff8 2 2 2 19h coredns k8s.gcr.io/coredns/coredns:v1.8.0 k8s-app=kube-dns,pod-template-hash=8474476ff8
kube-system replicaset.apps/dns-autoscaler-7df78bfcfb 1 1 1 19h autoscaler k8s.gcr.io/cpa/cluster-proportional-autoscaler-amd64:1.8.3 k8s-app=dns-autoscaler,pod-template-hash=7df78bfcfb
Fortunately, increasing the timeoutSeconds of the livenessProbe and readinessProbe from 1 to 60 fixed the problem:
kubectl edit -n kube-system daemonset.apps/calico-node
kubectl edit -n kube-system deployment.apps/calico-kube-controllers
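For reference, after the edit the probe sections of the calico-node container looked roughly like this. This is only a sketch: the exec commands shown are the calico-node defaults shipped by kubespray and may differ slightly between versions; the only field actually changed was timeoutSeconds.

```yaml
# Sketch of the calico-node container probes (only timeoutSeconds was changed)
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
      - -bird-live
  timeoutSeconds: 60   # was 1
readinessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-ready
      - -bird-ready
  timeoutSeconds: 60   # was 1
```

The longer timeout gives BIRD time to come up on slow nodes, so the kubelet stops killing the container before confd has written bird.cfg.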