Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network

在 k8s(v1.10) 集群上创建 Redis POD 并且 POD 创建卡在 "ContainerCreating"

Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   Scheduled               30m                   default-scheduler   Successfully assigned redis to k8snode02
  Normal   SuccessfulMountVolume   30m                   kubelet, k8snode02  MountVolume.SetUp succeeded for volume "default-token-f8tcg"
  Warning  FailedCreatePodSandBox  5m (x1202 over 30m)   kubelet, k8snode02  Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "redis_default" network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
  Normal   SandboxChanged          47s (x1459 over 30m)  kubelet, k8snode02  Pod sandbox changed, it will be killed and re-created.

确保 /etc/cni/net.d 和它的 /opt/cni/bin 朋友都存在并且正确填充了 CNI configuration files and binaries on all Nodes. For flannel specifically, one might make use of the flannel cni repo

当我使用 calico 作为 CNI 时,我遇到了类似的问题。

容器仍处于创建状态,我检查了 master 上的 /etc/cni/net.d 和 /opt/cni/bin 都存在,但不确定工作节点是否也需要。

root@KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
nginx-5c7588df-5zds6   0/1     ContainerCreating   0          21m
root@KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME               STATUS   ROLES    AGE   VERSION
kubernetesmaster   Ready    master   26m   v1.13.4
kubernetesslave1   Ready    <none>   22m   v1.13.4
root@KubernetesMaster:/opt/cni/bin#

kubectl describe pods
Name:               nginx-5c7588df-5zds6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               kubernetesslave1/10.0.3.80
Start Time:         Sun, 17 Mar 2019 05:13:30 +0000
Labels:             app=nginx
                    pod-template-hash=5c7588df
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      ReplicaSet/nginx-5c7588df
Containers:
  nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qtfbs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-qtfbs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qtfbs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                       Message
  ----     ------                  ----                   ----                       -------
  Normal   Scheduled               18m                    default-scheduler          Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          18m (x12 over 18m)     kubelet, kubernetesslave1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  3m39s (x829 over 18m)  kubelet, kubernetesslave1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
root@KubernetesMaster:/etc/cni/net.d#

在工作节点上,NGINX 试图启动但退出了,我不确定这里发生了什么 - 我是 kubernetes 的新手并且无法解决这个问题 -

root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
    94b2994401d0        k8s.gcr.io/pause:3.1   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
    f72500cae2b7        k8s.gcr.io/pause:3.1   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   5 minutes ago       Up 5 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

我也在工作节点上检查了 /etc/cni/net.d & /opt/cni/bin,它在那里 -

root@kubernetesslave1:/home/ubuntu# cd /etc/cni
root@kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root@kubernetesslave1:/etc/cni# cd /opt/cni
root@kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root@kubernetesslave1:/opt/cni# cd bin
root@kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root  3890407 Aug 17  2017 bridge
-rwxr-xr-x 1 root root  3475802 Aug 17  2017 ipvlan
-rwxr-xr-x 1 root root  3520724 Aug 17  2017 macvlan
-rwxr-xr-x 1 root root  3877986 Aug 17  2017 ptp
-rwxr-xr-x 1 root root  3475750 Aug 17  2017 vlan
-rwxr-xr-x 1 root root  9921982 Aug 17  2017 dhcp
-rwxr-xr-x 1 root root  2605279 Aug 17  2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root  2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root  3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root  3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root  3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root  2850029 Mar 17 05:19 tuning
root@kubernetesslave1:/opt/cni/bin#

AWS EKS 尚不支持 t3a、m5ad r5ad 实例

当我在 AWS EKS 上添加 PVC 时,我遇到了这个问题。

将 aws-node CNI 插件更新到最新版本解决了它 -

https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html

以下步骤重置了 kubernetes 集群并帮助我解决了问题。

  1. 全部停止运行pods
  2. 从集群中删除所有工作节点
  3. 在主节点和节点上执行 kubeadm 重置
  4. 启动主节点 kubeadm init --apiserver-advertise-address
  5. 安装 Pod 网络“WeaveNet” kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=192.168.0.0 /16
  6. 将节点加入集群
  7. 重启所有节点

我在 GCP 上的 GKE 集群和我的抢占式节点池之一遇到了这个问题。感谢@mdaniel 检查 /etc/cni/net.d 完整性的提示,我可以使用命令 gcloud compute ssh <name of some node> --zone <zone-of-cluster> --internal-ip 通过 ssh 将问题再次重现到测试集群的节点中。然后我简单地编辑了文件 /etc/cni/net.d/10-gke-ptp.conflist 并弄乱了 "routes": [ {"dst": "0.0.0.0/0"} ] 上的值(从 0.0.0.0/0 更改为 1.0.0.0/0)。

之后,我删除了其中 运行 的 pods,它们都陷入了 ContainerCreating 状态,永远生成 kublet 事件,错误为 Failed create pod sandbox: rpc error: code...

请注意,为了进行测试,我将我的节点池设置为最多有 1 个节点。否则它将扩展一个新节点,并且 pods 将在新节点上重新创建。在我的生产事件中,节点池达到了最大节点数,因此将我的测试设置为最大 1 个节点重现了类似的情况。

从那以后,从 GKE 中删除节点解决了生产中的问题,我创建了一个 Python 脚本,列出集群上的所有事件并过滤具有关键字 "Failed create pod sandbox: rpc error: code" 的事件。然后我遍历所有事件并获取它们的 pods,然后从 pods 获取节点。最后,我遍历节点,从 Kubernetes API 和 Compute API 及其各自的 Python 客户端中删除它们。对于 Python 脚本,我使用了库:kubernetesgoogle-cloud-compute.

这是脚本的简化版本。使用前测试一下:

from kubernetes import client, config
from google.cloud.compute_v1.services.instances import InstancesClient


ERROR_KEYWORDS = [
    'Failed to create pod sandbox'.lower()
]

config.load_kube_config()
v1 = client.CoreV1Api()

events_result = v1.list_event_for_all_namespaces()

filtered_events = []

# filter only the events containing ERROR_KEYWORDS
for event in events_result.items:
    for error_keyword in ERROR_KEYWORDS:
        if error_keyword in event.message.lower():
            filtered_events.append(event)


# gets the list of pods from those events
pods_list = {}

for event in filtered_events:
    try:
        pod = v1.read_namespaced_pod(
            event.involved_object.name,
            namespace=event.involved_object.namespace
        )

        pod_dict = {
            "name": event.involved_object.name,
            "namespace": event.involved_object.namespace,
            "node": pod.spec.node_name
        }

        pods_list[event.involved_object.name] = pod_dict

    except Exception as e:
        pass


# Get the nodes from those pods
broken_nodes = set()

for name, pod_dict in pods_list.items():
    if pod_dict.get('node'):
        broken_nodes.add(pod_dict["node"])


broken_nodes = list(broken_nodes)

# Deletes the nodes from both Kubernetes API and Compute Engine API
if broken_nodes:
    broken_nodes_str = ", ".join(broken_nodes)
    print(f'BROKEN NODES: "{broken_nodes_str}"')
    for node in broken_nodes:

        try:
            api_response = v1.delete_node(node)
        except Exception as e:
            pass

        time.sleep(30)
        
        try:
            result = gcp_client.delete(project=PROJECT_ID, zone=CLUSTER_ZONE, instance=node)
        except Exception as e:
            pass
kubectl drain node1 node2 --delete-local-data --force --ignore-daemonsets

刚打算在所有节点上驱逐pods,没想到一直报错的pods变成了运行。你可以尝试执行一下,希望对你有用