Newly provisioned kubernetes nodes are inaccessible by kubectl

I'm using Kubespray with Kubernetes 1.9.

When I try to interact with pods on the new nodes via kubectl, I see the following. Importantly, the nodes are reported as healthy and pods are being scheduled onto them as expected. The pods themselves are perfectly functional.

➜  Scripts k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

I can ping the kubeworker node by both IP and DNS, locally (where I run kubectl) and from all of the master nodes.

➜  Scripts ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111): 56 data bytes
64 bytes from 10.0.0.111: icmp_seq=0 ttl=63 time=88.972 ms
^C

ubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=2 ttl=64 time=0.213 ms


➜  Scripts k get nodes
NAME                       STATUS    ROLES     AGE       VERSION
kubemaster-rwva1-prod-1    Ready     master    174d      v1.9.2+coreos.0
kubemaster-rwva1-prod-2    Ready     master    174d      v1.9.2+coreos.0
kubemaster-rwva1-prod-3    Ready     master    174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-1    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-10   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-11   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-12   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-14   Ready     node      16d       v1.9.2+coreos.0
kubeworker-rwva1-prod-15   Ready     node      14d       v1.9.2+coreos.0
kubeworker-rwva1-prod-16   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-17   Ready     node      4d        v1.9.2+coreos.0
kubeworker-rwva1-prod-18   Ready     node      4d        v1.9.2+coreos.0
kubeworker-rwva1-prod-19   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-2    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-20   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-21   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-3    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-4    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-5    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-6    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-7    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-8    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-9    Ready     node      174d      v1.9.2+coreos.0

When I describe one of the broken nodes, it looks identical to one of my functioning nodes.

➜  Scripts k describe node kubeworker-rwva1-prod-14
Name:               kubeworker-rwva1-prod-14
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=kubeworker-rwva1-prod-14
                    node-role.kubernetes.io/node=true
                    role=app-tier
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 17 Jul 2018 19:35:08 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:18 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.0.0.111
  Hostname:    kubeworker-rwva1-prod-14
Capacity:
 cpu:     32
 memory:  147701524Ki
 pods:    110
Allocatable:
 cpu:     31900m
 memory:  147349124Ki
 pods:    110
System Info:
 Machine ID:                 da30025a3f8fd6c3f4de778c5b4cf558
 System UUID:                5ACCBB64-2533-E611-97F0-0894EF1D343B
 Boot ID:                    6b42ba3e-36c4-4520-97e6-e7c6fed195e2
 Kernel Version:             4.4.0-130-generic
 OS Image:                   Ubuntu 16.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.1
 Kubelet Version:            v1.9.2+coreos.0
 Kube-Proxy Version:         v1.9.2+coreos.0
ExternalID:                  kubeworker-rwva1-prod-14
Non-terminated Pods:         (5 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                         ------------  ----------  ---------------  -------------
  kube-system                calico-node-cd7qg                            150m (0%)     300m (0%)   64M (0%)         500M (0%)
  kube-system                kube-proxy-kubeworker-rwva1-prod-14          150m (0%)     500m (1%)   64M (0%)         2G (1%)
  kube-system                nginx-proxy-kubeworker-rwva1-prod-14         25m (0%)      300m (0%)   32M (0%)         512M (0%)
  prometheus                 prometheus-prometheus-node-exporter-gckzj    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rabbit-relay               rabbit-relay-844d6865c7-q6fr2                0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  325m (1%)     1100m (3%)  160M (0%)        3012M (1%)
Events:         <none>

➜  Scripts k describe node kubeworker-rwva1-prod-11
Name:               kubeworker-rwva1-prod-11
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=kubeworker-rwva1-prod-11
                    node-role.kubernetes.io/node=true
                    role=test
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Fri, 09 Feb 2018 21:03:46 -0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Fri, 03 Aug 2018 18:46:31 -0700   Fri, 09 Feb 2018 21:03:38 -0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.0.0.218
  Hostname:    kubeworker-rwva1-prod-11
Capacity:
 cpu:     32
 memory:  131985484Ki
 pods:    110
Allocatable:
 cpu:     31900m
 memory:  131633084Ki
 pods:    110
System Info:
 Machine ID:                 0ff6f3b9214b38ad07c063d45a6a5175
 System UUID:                4C4C4544-0044-5710-8037-B3C04F525631
 Boot ID:                    4d7ec0fc-428f-4b4c-aaae-8e70f374fbb1
 Kernel Version:             4.4.0-87-generic
 OS Image:                   Ubuntu 16.04.3 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.1
 Kubelet Version:            v1.9.2+coreos.0
 Kube-Proxy Version:         v1.9.2+coreos.0
ExternalID:                  kubeworker-rwva1-prod-11
Non-terminated Pods:         (6 in total)
  Namespace                  Name                                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                         ------------  ----------  ---------------  -------------
  ingress-nginx-internal     default-http-backend-internal-7c8ff87c86-955np               10m (0%)      10m (0%)    20Mi (0%)        20Mi (0%)
  kube-system                calico-node-8fzk6                                            150m (0%)     300m (0%)   64M (0%)         500M (0%)
  kube-system                kube-proxy-kubeworker-rwva1-prod-11                          150m (0%)     500m (1%)   64M (0%)         2G (1%)
  kube-system                nginx-proxy-kubeworker-rwva1-prod-11                         25m (0%)      300m (0%)   32M (0%)         512M (0%)
  prometheus                 prometheus-prometheus-kube-state-metrics-7c5cbb6f55-jq97n    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  prometheus                 prometheus-prometheus-node-exporter-7gn2x                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  335m (1%)     1110m (3%)  176730Ki (0%)    3032971520 (2%)
Events:         <none>

What's going on?

➜  k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

➜  cat /etc/hosts | head -n1
10.0.0.111 kubeworker-rwva1-prod-14

ubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.275 ms
^C
--- kubeworker-rwva1-prod-14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.275/0.275/0.275/0.000 ms

ubuntu@kubemaster-rwva1-prod-1:~$ kubectl logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

What's going on?

That name needs to be resolvable from your workstation, because for kubectl logs and kubectl exec the API sends down a URL for the client to interact directly with the kubelet on the target node (to ensure that all the traffic in the world doesn't route through the API server).
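
As a quick sanity check (a sketch only: the jsonpath query and getent lookup below are illustrative, not taken from the question), you can list the addresses the API server has registered for the node and see whether its Hostname entry actually resolves from the machine doing the lookup:

# List the address types (Hostname / InternalIP / ...) registered for the node;
# one of these ends up in the kubelet URL shown in the error above.
kubectl get node kubeworker-rwva1-prod-14 \
  -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'

# Check whether the short hostname resolves through the normal resolver path.
getent hosts kubeworker-rwva1-prod-14 || echo "kubeworker-rwva1-prod-14 does not resolve"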

Thankfully, kubespray has a knob through which you can tell kubernetes to prefer the node's ExternalIP (or, of course, the InternalIP if you'd rather): https://github.com/kubernetes-incubator/kubespray/blob/v2.5.0/roles/kubernetes/master/defaults/main.yml#L82
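
If you go that route, a minimal sketch of the change looks like the following (the variable name is assumed from that defaults file, and the inventory path and apiserver manifest path are assumptions about a typical kubespray layout, not details from this cluster):

# Prefer IPs over the bare hostname when the apiserver dials kubelets.
echo "kubelet_preferred_address_types: 'InternalIP,ExternalIP,Hostname'" \
  >> inventory/mycluster/group_vars/k8s-cluster.yml   # hypothetical inventory path

# After re-running the kubespray master play, confirm the flag reached the apiserver:
grep -- --kubelet-preferred-address-types /etc/kubernetes/manifests/kube-apiserver.manifest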

Crazy issue. I don't know exactly how I fixed it, but I somehow beat it back together by deleting one of my non-functional nodes and re-registering it with its full FQDN. That somehow fixed everything. I was then able to delete the FQDN-registered node and re-create it under its short name.
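
Roughly, the re-registration amounts to the standard drain/delete/re-register dance (a hedged sketch rather than the exact commands used here; --hostname-override is the stock kubelet flag that controls the name a node registers under):

# From a machine with admin kubectl: evict pods and remove the broken node object.
kubectl drain kubeworker-rwva1-prod-14 --ignore-daemonsets --delete-local-data
kubectl delete node kubeworker-rwva1-prod-14

# On the worker itself: set the name the kubelet should register as
# (the full FQDN first, then later the short name), and restart it.
#   --hostname-override=<full FQDN of the node>
#   --hostname-override=kubeworker-rwva1-prod-14
sudo systemctl restart kubelet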

After a great deal of tcpdumping, the best explanation I can come up with is that the error message is accurate, but in a profoundly dumb and confusing way.

    {"kind":"Pod","apiVersion":"v1","metadata":{"name":"prometheus-prometheus-node-exporter-gckzj","generateName":"prometheus-prometheus-node-exporter-","namespace":"prometheus","selfLink":"/api/v1/namespaces/prometheus/pods/prometheus-prometheus-node-exporter-gckzj","uid":"2fa4b744-8a33-11e8-9b15-bc305bef2e18","resourceVersion":"37138627","creationTimestamp":"2018-07-18T02:35:08Z","labels":{"app":"prometheus","component":"node-exporter","controller-revision-hash":"1725903292","pod-template-generation":"1","release":"prometheus"},"ownerReferences":[{"apiVersion":"extensions/v1beta1","kind":"DaemonSet","name":"prometheus-prometheus-node-exporter","uid":"e9216885-1616-11e8-b853-d4ae528b79ed","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"proc","hostPath":{"path":"/proc","type":""}},{"name":"sys","hostPath":{"path":"/sys","type":""}},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","secret":{"secretName":"prometheus-prometheus-node-exporter-token-zvrdk","defaultMode":420}}],"containers":[{"name":"prometheus-node-exporter","image":"prom/node-exporter:v0.15.2","args":["--path.procfs=/host/proc","--path.sysfs=/host/sys"],"ports":[{"name":"metrics","hostPort":9100,"containerPort":9100,"protocol":"TCP"}],"resources":{},"volumeMounts":[{"name":"proc","readOnly":true,"mountPath":"/host/proc"},{"name":"sys","readOnly":true,"mountPath":"/host/sys"},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"prometheus-prometheus-node-exporter","serviceAccount":"prometheus-prometheus-node-exporter","nodeName":"kubeworker-rwva1-prod-14","hostNetwork":true,"hostPID":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/disk-pressure","operator":"Exists","effect":"NoSchedule"},{"key":"node.kubernetes.io/memory-pressure","operator":"Exists","effect":"NoSchedule"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:13Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-20T08:02:58Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:14Z"}],"hostIP":"10.0.0.111","podIP":"10.0.0.111","startTime":"2018-07-18T02:35:13Z","containerStatuses":[{"name":"prometheus-node-exporter","state":{"running":{"startedAt":"2018-07-20T08:02:58Z"}},"lastState":{"terminated":{"exitCode":143,"reason":"Error","startedAt":"2018-07-20T08:02:27Z","finishedAt":"2018-07-20T08:02:39Z","containerID":"docker://db44927ad64eb130a73bee3c7b250f55ad911584415c373d3e3fa0fc838c146e"}},"ready":true,"restartCount":2,"image":"prom/node-exporter:v0.15.2","imageID":"docker-pullable://prom/node-exporter@sha256:6965ed8f31c5edba19d269d10238f59624e6b004f650ce925b3408ce222f9e49","containerID":"docker://4743ad5c5e60c31077e57d51eb522270c96ed227bab6522b4fcde826c4abc064"}],"qosClass":"BestEffort"}}
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host","code":500}

The cluster's internal DNS was not reading from the API correctly to generate the records it needed. Without a name it was authoritative for, the cluster fell back to my upstream DNS servers to try to resolve the name recursively. The upstream DNS server had no idea what to do with a short-form name that has no TLD suffix.
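
That failure mode is easy to reproduce by hand (a sketch; 10.0.0.3 is just the upstream resolver named in the error message, and dig does not apply search domains unless asked to):

# Ask the upstream resolver for the bare short name, exactly as the failing lookup
# does: it is not authoritative for it, so the answer comes back empty ("no such host").
dig +short @10.0.0.3 kubeworker-rwva1-prod-14

# The same name only resolves where something serves it locally or appends a
# search domain, e.g. the /etc/hosts entry that let the plain ping succeed.
getent hosts kubeworker-rwva1-prod-14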