kube-dns getsockopt no route to host

I'm trying to work out how to correctly configure kube-dns on Kubernetes 1.10 with flannel, using containerd as the CRI.

kube-dns fails to run, with the following errors:

kubectl -n kube-system logs kube-dns-595fdb6c46-9tvn9 -c kubedns
I0424 14:56:34.944476       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
I0424 14:56:35.444469       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
E0424 14:56:35.815863       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host
E0424 14:56:35.815863       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:189: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host
I0424 14:56:35.944444       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
I0424 14:56:36.444462       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
I0424 14:56:36.944507       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
F0424 14:56:37.444434       1 dns.go:209] Timeout waiting for initialization

kubectl -n kube-system describe pod kube-dns-595fdb6c46-9tvn9
  Type     Reason     Age                 From              Message
  ----     ------     ----                ----              -------
  Warning  Unhealthy  47m (x181 over 3h)  kubelet, worker1  Readiness probe failed: Get http://10.244.0.2:8081/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    27m (x519 over 3h)  kubelet, worker1  Back-off restarting failed container
  Normal   Killing    17m (x44 over 3h)   kubelet, worker1  Killing container with id containerd://dnsmasq:Container failed liveness probe.. Container will be killed and recreated.
  Warning  Unhealthy  12m (x178 over 3h)  kubelet, worker1  Liveness probe failed: Get http://10.244.0.2:10054/metrics: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    2m (x855 over 3h)   kubelet, worker1  Back-off restarting failed container

There is indeed no route to the 10.96.0.1 endpoint:

ip route
default via 10.240.0.254 dev ens160 
10.240.0.0/24 dev ens160  proto kernel  scope link  src 10.240.0.21 
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink 
10.244.0.0/16 dev cni0  proto kernel  scope link  src 10.244.0.1 
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink 
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink 
10.244.4.0/24 via 10.244.4.0 dev flannel.1 onlink 
10.244.5.0/24 via 10.244.5.0 dev flannel.1 onlink
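
For a more direct check, asking the kernel how it would route that exact address (given the table above I'd expect it to just fall back to the default route):

ip route get 10.96.0.1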

What is responsible for configuring the cluster service address range and the associated routes? Is it the container runtime, the overlay network (flannel in this case), or something else? Where should it be configured?

10-containerd-net.conflist configures the bridge between the host and my pod network. Can the service network be configured there as well?

cat /etc/cni/net.d/10-containerd-net.conflist 
{
  "cniVersion": "0.3.1",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "promiscMode": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.0.0/16",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}

Edit:

Just came across this from 2016:

As of a few weeks ago (I forget the release but it was a 1.2.x where x != 0) (#24429) we fixed the routing such that any traffic that arrives at a node destined for a service IP will be handled as if it came to a node port. This means you should be able to set up static routes for your service cluster IP range to one or more nodes and the nodes will act as bridges. This is the same trick most people do with flannel to bridge the overlay.

It's imperfect but it works. In the future we will need to get more precise with the routing if you want optimal behavior (i.e. not losing the client IP), or we will see more non-kube-proxy implementations of services.

Is this still relevant? Do I need to set up static routes for the service CIDR? Or is the problem actually with kube-proxy rather than flannel or containerd?
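
If that advice still applied, I assume it would mean something like the following on a machine that needs to reach service IPs, pointing the service range at one of the nodes (10.96.0.0/12 here is just the usual default service CIDR, and 10.240.0.21 is the node address from the routing table above):

ip route add 10.96.0.0/12 via 10.240.0.21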

My flannel configuration:

cat /etc/cni/net.d/10-flannel.conflist 
{
  "name": "cbr0",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

And kube-proxy:

[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes

[Service]
ExecStart=/usr/local/bin/kube-proxy \
  --cluster-cidr=10.244.0.0/16 \
  --feature-gates=SupportIPVSProxyMode=true \
  --ipvs-min-sync-period=5s \
  --ipvs-sync-period=5s \
  --ipvs-scheduler=rr \
  --kubeconfig=/etc/kubernetes/kube-proxy.conf \
  --logtostderr=true \
  --master=https://192.168.160.1:6443 \
  --proxy-mode=ipvs \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Edit:

Going through the kube-proxy debugging steps, it looks like kube-proxy cannot reach the master. I suspect this is a large part of the problem. I have 3 controller/master nodes behind a HAProxy load balancer bound to 192.168.160.1:6443, which round-robins to each of the masters on 10.240.0.1[1|2|3]:6443. This can be seen in the output/configs above.
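
For example (my own sanity check, not something from the debugging guide): with kube-proxy working in ipvs mode, I'd expect the service VIP to appear both as an IPVS virtual server and on the kube-ipvs0 dummy interface:

ipvsadm -Ln | grep -A 2 10.96.0.1
ip addr show kube-ipvs0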

In kube-proxy.service I specified --master=192.168.160.1:6443. Why are connections being attempted to port 443? Can I change this - there doesn't seem to be a port flag? Does it need to be port 443 for some reason?

This answer comes in two parts: one about running kube-proxy, and one about where that :443 URL is coming from.

First, about kube-proxy: please don't run kube-proxy as a systemd service like that. It is designed to be launched by kubelet inside the cluster so that the SDN addresses behave sensibly, since they are effectively "fake" addresses. By running kube-proxy outside of kubelet's control, all kinds of strange things will happen unless you expend a great deal of energy replicating the way kubelet configures its subordinate docker containers.
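
For comparison (purely illustrative, since your cluster is hand-rolled rather than kubeadm-provisioned): in a kubeadm cluster kube-proxy runs as a DaemonSet in kube-system, which is the "launched inside the cluster" model described above, and can be inspected like so:

kubectl -n kube-system get daemonset kube-proxy
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide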


Now, about that :443 URL:

E0424 14:56:35.815863 1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host

...

Why are connections being attempted to port 443? Can I change this - there doesn't seem to be a port flag? Does it need to be port 443 for some reason?

10.96.0.1 comes from your cluster's Service CIDR, which is (and should be) separate from the Pod CIDR, which in turn should be separate from the nodes' subnets, etc. The .1 of the cluster's Service CIDR is reserved for (or traditionally assigned to) the kubernetes.default.svc.cluster.local Service, whose single Service.port is 443.
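
You can confirm both the address and the port straight from the cluster (expected output sketched in the comments; the range itself is whatever --service-cluster-ip-range was passed to kube-apiserver, with 10.96.0.0/12 being the common default that yields 10.96.0.1):

kubectl get service kubernetes -o wide
# NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
# kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   ...
kubectl get endpoints kubernetes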

I'm not quite sure why the --master flag doesn't supersede the value in /etc/kubernetes/kube-proxy.conf, but since that file is very clearly only meant to be used by kube-proxy, why not just update the value in that file to remove all doubt?
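
For example (the cluster entry name inside that kubeconfig is a guess on my part; check the file for the actual name):

kubectl config set-cluster kubernetes \
    --server=https://192.168.160.1:6443 \
    --kubeconfig=/etc/kubernetes/kube-proxy.conf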