Rancher: kube-system pods stuck in ContainerCreating

I am trying to bring up a cluster with a single node (a VM), but some of the kube-system pods get stuck in ContainerCreating:

> kubectl get pods,svc -owide --all-namespaces
NAMESPACE       NAME                                          READY   STATUS              RESTARTS   AGE     IP            NODE            NOMINATED NODE   READINESS GATES
cattle-system   pod/cattle-cluster-agent-7db88c6b68-bz5dp     0/1     ContainerCreating   0          7m13s   <none>        hdn-dev-app66   <none>           <none>
cattle-system   pod/cattle-node-agent-ccntw                   1/1     Running             0          7m13s   10.105.1.76   hdn-dev-app66   <none>           <none>
cattle-system   pod/kube-api-auth-9kdpw                       1/1     Running             0          7m13s   10.105.1.76   hdn-dev-app66   <none>           <none>
ingress-nginx   pod/default-http-backend-598b7d7dbd-rwvhm     0/1     ContainerCreating   0          7m29s   <none>        hdn-dev-app66   <none>           <none>
ingress-nginx   pod/nginx-ingress-controller-62vhq            1/1     Running             0          7m29s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/coredns-849545576b-w87zr                  0/1     ContainerCreating   0          7m39s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/coredns-autoscaler-5dcd676cbd-pj54d       0/1     ContainerCreating   0          7m38s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/kube-flannel-d9m6q                        2/2     Running             0          7m43s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/metrics-server-697746ff48-q7cpx           0/1     ContainerCreating   0          7m33s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/rke-coredns-addon-deploy-job-npjll        0/1     Completed           0          7m40s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-ingress-controller-deploy-job-b9rs4   0/1     Completed           0          7m30s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-metrics-addon-deploy-job-5rpbj        0/1     Completed           0          7m35s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-network-plugin-deploy-job-lvk2q       0/1     Completed           0          7m50s   10.105.1.76   hdn-dev-app66   <none>           <none>

NAMESPACE       NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
default         service/kubernetes             ClusterIP   10.43.0.1      <none>        443/TCP                  8m19s   <none>
ingress-nginx   service/default-http-backend   ClusterIP   10.43.144.25   <none>        80/TCP                   7m29s   app=default-http-backend
kube-system     service/kube-dns               ClusterIP   10.43.0.10     <none>        53/UDP,53/TCP,9153/TCP   7m39s   k8s-app=kube-dns
kube-system     service/metrics-server         ClusterIP   10.43.251.47   <none>        443/TCP                  7m34s   k8s-app=metrics-server

When I describe the failing pods, I get:

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "345460c8f6399a0cf20956d8ea24d52f5a684ae47c3e8ec247f83d66d56b2baa" network for pod "cattle-cluster-agent-7db88c6b68-bz5dp": networkPlugin cni failed to set up pod "cattle-cluster-agent-7db88c6b68-bz5dp_cattle-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope, failed to clean up sandbox container "345460c8f6399a0cf20956d8ea24d52f5a684ae47c3e8ec247f83d66d56b2baa" network for pod "cattle-cluster-agent-7db88c6b68-bz5dp": networkPlugin cni failed to teardown pod "cattle-cluster-agent-7db88c6b68-bz5dp_cattle-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope]

I tried re-registering the node, but that did not help. Any ideas?

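Since the error says the connection is unauthorized, you have to grant RBAC permissions for it to work.

Try adding a ClusterRoleBinding that binds the calico-node ClusterRole to the system:nodes group: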

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
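
To see whether the binding has any effect, you could check the permission directly. The following is only a sketch: `check_calico_rbac` is a made-up helper name, and it assumes the `calico-node` ClusterRole referenced by the binding already exists in the cluster and that your own account is allowed to impersonate users and groups. It impersonates the user and group named in the error message and asks the API server whether they may read `clusterinformations`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or Rancher): reports whether the
# identity from the error message can read Calico's ClusterInformation.
check_calico_rbac() {
  # The ClusterRoleBinding is useless if the ClusterRole it points to is missing.
  if ! kubectl get clusterrole calico-node >/dev/null 2>&1; then
    echo "missing-clusterrole"
    return 1
  fi

  # Impersonate user "system:node" in group "system:nodes";
  # `kubectl auth can-i` prints "yes" or "no".
  local answer
  answer=$(kubectl auth can-i get clusterinformations.crd.projectcalico.org \
    --as=system:node --as-group=system:nodes 2>/dev/null)

  if [ "$answer" = "yes" ]; then
    echo "allowed"
  else
    echo "denied"
    return 1
  fi
}
```

Run `check_calico_rbac` before and after `kubectl apply -f` on the manifest above. If it prints `missing-clusterrole`, the binding has nothing to refer to and will not help on its own.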

Solved the problem by following this article on how to clean up broken nodes: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cleaning-cluster-nodes/
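
A possible explanation, which is my guess and not confirmed in the thread: the pod list shows kube-flannel running, yet the error comes from the Calico CNI plugin, so the node may have been carrying CNI configuration left over from an earlier install. A non-destructive way to look for that before cleaning the node is sketched below. `find_stale_calico_cni` is a hypothetical helper, and `/etc/cni/net.d` is assumed to be the CNI configuration directory (the usual default).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list CNI config files that mention Calico.
# The first argument is the directory to scan; it defaults to /etc/cni/net.d.
find_stale_calico_cni() {
  local dir="${1:-/etc/cni/net.d}"
  # -r recurse, -l print only file names, -i ignore case,
  # -s stay quiet about missing or unreadable files.
  grep -rlis calico "$dir"
}
```

Running `find_stale_calico_cni` on the node prints the matching files, if any. It only reads files; the actual cleanup should follow the steps in the linked Rancher article.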