不正确的 cni 安装阻止 coredns pods 启动

Improper cni install preventing coredns pods from starting

刚刚使用 kubeadm v1.15.0 安装了一个主集群。然而,coredns 似乎停留在挂起模式:

coredns-5c98db65d4-4pm65                      0/1     Pending    0          2m17s   <none>        <none>                <none>           <none>
coredns-5c98db65d4-55hcc                      0/1     Pending    0          2m2s    <none>        <none>                <none>           <none>

以下是为广告连播显示的内容:

kubectl describe pods coredns-5c98db65d4-4pm65 --namespace=kube-system
Name:                 coredns-5c98db65d4-4pm65
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-5t2wn (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-5t2wn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-5t2wn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  61s (x4 over 5m21s)  default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

我删除了主节点上的污点,但无济于事。我不应该能够毫无问题地创建一个单节点主机吗?我知道如果不移除污点就不可能在 master 上调度 pods,但这很奇怪。

我尝试添加最新的 calico cni,也没有用。

我得到以下 运行 journalctl(systemctl 显示没有错误):

sudo journalctl -xn --unit kubelet.service
[sudo] password for gms:
-- Logs begin at Fri 2019-07-12 04:31:34 CDT, end at Tue 2019-07-16 16:58:17 CDT. --
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:54.122355   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:54.400606   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:59.124863   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:59.400924   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:04.127120   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:04.401266   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:09.129287   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:09.401520   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:14.133059   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:14.402008   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d

确实,当我查看 /etc/cni/net.d 时,那里什么也没有 -> 是的,我 运行 kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml...这是我应用它时的输出:

configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created

我在 calico-node 的 pod 上 运行 以下内容,卡在以下状态:

calico-node-tcfhw    0/1     Init:0/3   0          11m   10.32.3.158




describe pods calico-node-tcfhw --namespace=kube-system
Name:                 calico-node-tcfhw
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 thalia0.ahc.umn.edu/10.32.3.158
Start Time:           Tue, 16 Jul 2019 18:08:25 -0500
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:                   10.32.3.158
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://1e1bf9e65cb182656f6f06a1bb8291237562f0f5a375e557a454942e81d32063
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://docker.io/calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Running
      Started:      Tue, 16 Jul 2019 18:08:26 -0500
    Ready:          False
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
  install-cni:
    Container ID:
    Image:         calico/cni:v3.8.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
  flexvol-driver:
    Container ID:
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Containers:
  calico-node:
    Container ID:
    Image:          calico/node:v3.8.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-b9c6p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-b9c6p
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                          Message
  ----    ------     ----   ----                          -------
  Normal  Scheduled  9m15s  default-scheduler             Successfully assigned kube-system/calico-node-tcfhw to thalia0.ahc.umn.edu
  Normal  Pulled     9m14s  kubelet, thalia0.ahc.umn.edu  Container image "calico/cni:v3.8.0" already present on machine
  Normal  Created    9m14s  kubelet, thalia0.ahc.umn.edu  Created container upgrade-ipam
  Normal  Started    9m14s  kubelet, thalia0.ahc.umn.edu  Started container upgrade-ipam

我尝试将 Flannel 作为 cni,但情况更糟。由于污染,kube-proxy 甚至无法启动!

编辑附录

kube-controller-managerkube-scheduler 应该没有定义端点吗?

[gms@thalia0 ~]$ kubectl get ep --namespace=kube-system -o wide
NAME                      ENDPOINTS   AGE
kube-controller-manager   <none>      19h
kube-dns                  <none>      19h
kube-scheduler            <none>      19h

[gms@thalia0 ~]$ kubectl get pods --namespace=kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
coredns-5c98db65d4-nmn4g                      0/1     Pending   0          19h
coredns-5c98db65d4-qv8fm                      0/1     Pending   0          19h
etcd-thalia0.x.x.edu.                         1/1     Running   0          19h
kube-apiserver-thalia0.x.x.edu                1/1     Running   0          19h
kube-controller-manager-thalia0.x.x.edu       1/1     Running   0          19h
kube-proxy-4hrdc                              1/1     Running   0          19h
kube-proxy-vb594                              1/1     Running   0          19h
kube-proxy-zwrst                              1/1     Running   0          19h
kube-scheduler-thalia0.x.x.edu                1/1     Running   0          19h

最后,为了理智起见,我尝试了 v1.13.1,瞧!成功:

NAME                                          READY   STATUS    RESTARTS   AGE
calico-node-pbrps                             2/2     Running   0          15s
coredns-86c58d9df4-g5944                      1/1     Running   0          2m40s
coredns-86c58d9df4-zntjl                      1/1     Running   0          2m40s
etcd-thalia0.ahc.umn.edu                      1/1     Running   0          110s
kube-apiserver-thalia0.ahc.umn.edu            1/1     Running   0          105s
kube-controller-manager-thalia0.ahc.umn.edu   1/1     Running   0          103s
kube-proxy-qxh2h                              1/1     Running   0          2m39s
kube-scheduler-thalia0.ahc.umn.edu            1/1     Running   0          117s

编辑 2

已尝试 sudo kubeadm upgrade plan,但在 api 服务器的健康状况和错误证书方面出现错误。

运行 在 api-服务器上:

kubectl logs kube-apiserver-thalia0.x.x.edu --namespace=kube-system1

并得到大量 TLS handshake error from 10.x.x.157:52384: remote error: tls: bad certificate 类型的错误,这些错误来自早已从集群中删除的节点,并且在 master 上出现 kubeadm resets 很久之后,以及 [= kubelet、kubeadm 等的 65=]

为什么会出现这些旧节点?证书不会在 kubeadm init 上重新创建吗?

Coreos-dns 在 Calico 启动之前不会启动,请使用此命令检查您的工作节点是否准备就绪

kubectl get nodes -owide

kubectl describe node <your-node>  

kubectl get node <your-node> -oyaml

要检查的另一件事是日志中的以下消息:

"Unable to update cni config: No networks found in /etc/cni/net.d"

那个目录里有什么?

可能是cni配置不正确

该目录 /etc/cni/net.d 应包含 2 个文件:

10-calico.conflist calico-kubeconfig

下面是这两个文件的内容,看看你的目录下有没有这样的文件

[root@master net.d]# cat 10-calico.conflist 
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "master",
      "mtu": 1440,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}

[root@master net.d]#猫calico-kubeconfig

# Kubeconfig file for Calico CNI plugin.
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://[10.20.0.1]:443
    certificate-authority-data: LSRt....  tLQJ=
users:
- name: calico
  user:
    token: "eUJh .... ZBoIA"
contexts:
- name: calico-context
  context:
    cluster: local
    user: calico
current-context: calico-context

这个问题 https://github.com/projectcalico/calico/issues/2699 有类似的症状,表明删除 /var/lib/cni/ 解决了这个问题。您可以查看它是否存在,如果存在则将其删除。