kube-dns pod crashes after upgrading host OS to Ubuntu 18
I am trying to upgrade my kube cluster from Ubuntu 16 to 18. After the upgrade, the kube-dns pod keeps crashing. If I roll back to U16 everything works fine; the problem occurs only on U18.
Kube version: "v1.10.11"
kube-dns pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28m default-scheduler Successfully assigned kube-dns-75966d58fb-pqxz4 to
Normal SuccessfulMountVolume 28m kubelet, MountVolume.SetUp succeeded for volume "kube-dns-config"
Normal SuccessfulMountVolume 28m kubelet, MountVolume.SetUp succeeded for volume "kube-dns-token-h4q66"
Normal Pulling 28m kubelet, pulling image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
Normal Pulled 28m kubelet, Successfully pulled image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
Normal Started 28m kubelet, Started container
Normal Created 28m kubelet, Created container
Normal Pulling 28m kubelet, pulling image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
Normal Pulling 28m kubelet, pulling image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
Normal Pulled 28m kubelet, Successfully pulled image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
Normal Created 28m kubelet, Created container
Normal Pulled 28m kubelet, Successfully pulled image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
Normal Started 28m kubelet, Started container
Normal Created 25m (x2 over 28m) kubelet, Created container
Normal Started 25m (x2 over 28m) kubelet, Started container
Normal Killing 25m kubelet, Killing container with id docker://dnsmasq:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 25m kubelet, Container image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10" already present on machine
Warning Unhealthy 4m (x26 over 27m) kubelet, Liveness probe failed: HTTP probe failed with statuscode: 503
kube-dns sidecar container logs:
kubectl logs kube-dns-75966d58fb-pqxz4 -n kube-system -c sidecar
I0809 16:31:26.768964 1 main.go:51] Version v1.14.8.3
I0809 16:31:26.769049 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
I0809 16:31:26.769079 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
I0809 16:31:26.769117 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
W0809 16:31:33.770594 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:49305->127.0.0.1:53: i/o timeout
W0809 16:31:40.771166 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:49655->127.0.0.1:53: i/o timeout
W0809 16:31:47.771773 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:53322->127.0.0.1:53: i/o timeout
W0809 16:31:54.772386 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:58999->127.0.0.1:53: i/o timeout
W0809 16:32:01.772972 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:35034->127.0.0.1:53: i/o timeout
W0809 16:32:08.773540 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:33250->127.0.0.1:53: i/o timeout
kube-dns dnsmasq container logs:
kubectl logs kube-dns-75966d58fb-pqxz4 -n kube-system -c dnsmasq
I0809 16:29:51.596517 1 main.go:74] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0809 16:29:51.596679 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0809 16:29:52.135179 1 nanny.go:119]
W0809 16:29:52.135211 1 nanny.go:120] Got EOF from stdout
I0809 16:29:52.135277 1 nanny.go:116] dnsmasq[20]: started, version 2.78 cachesize 1000
I0809 16:29:52.135293 1 nanny.go:116] dnsmasq[20]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0809 16:29:52.135303 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0809 16:29:52.135314 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0809 16:29:52.135323 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0809 16:29:52.135329 1 nanny.go:116] dnsmasq[20]: reading /etc/resolv.conf
I0809 16:29:52.135334 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0809 16:29:52.135343 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0809 16:29:52.135348 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0809 16:29:52.135353 1 nanny.go:116] dnsmasq[20]: using nameserver 127.0.0.53#53
I0809 16:29:52.135397 1 nanny.go:116] dnsmasq[20]: read /etc/hosts - 7 addresses
I0809 16:31:28.728897 1 nanny.go:116] dnsmasq[20]: Maximum number of concurrent DNS queries reached (max: 150)
I0809 16:31:38.746899 1 nanny.go:116] dnsmasq[20]: Maximum number of concurrent DNS queries reached (max: 150)
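I deleted the existing pods, but the newly created ones hit the same error after a while. I am not sure why this happens only on Ubuntu 18. Is there a fix?
Ubuntu 18 uses systemd-resolved as its DNS server, which listens on 127.0.0.53. You can take a look at your resolv.conf file. When that /etc/resolv.conf is passed to CoreDNS, 127.0.0.53 is used as the upstream DNS server; inside the pod that is a loopback address, so queries are forwarded back to the DNS pod itself and the loop detection plugin halts CoreDNS. The same loop most likely explains the "Maximum number of concurrent DNS queries reached" messages from dnsmasq in your logs, which show it using nameserver 127.0.0.53#53. You can take a look at the CoreDNS troubleshooting page.
In my Ubuntu 18 cluster, I disabled systemd-resolved.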
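To check whether a node is affected, you can look for loopback nameservers in its resolv.conf. This is only a minimal sketch: `loopback_nameservers` is a hypothetical helper name, not part of any tool, and it assumes a standard resolv.conf layout with one `nameserver ADDRESS` per line.

```shell
# Print every nameserver in a resolv.conf-style file that is a loopback address
# (127.0.0.0/8 or ::1). Such an address points back at the pod itself once the
# file is used inside a container.
# Usage: loopback_nameservers [FILE]   (FILE defaults to /etc/resolv.conf)
loopback_nameservers() {
    awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }' \
        "${1:-/etc/resolv.conf}"
}
```

Running `loopback_nameservers` on an affected Ubuntu 18 host should print 127.0.0.53; an empty result means the file lists no loopback resolvers.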
In my case, I found that on Ubuntu 18 /etc/resolv.conf is a symlink: /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
It contains the entry nameserver 127.0.0.53.
There should also be another resolv.conf under /run/systemd/resolve:
/run/systemd/resolve$ ll
total 8
drwxr-xr-x 2 systemd-resolve systemd-resolve 80 Aug 12 13:24 ./
drwxr-xr-x 23 root root 520 Aug 12 11:54 ../
-rw-r--r-- 1 systemd-resolve systemd-resolve 607 Aug 12 13:24 resolv.conf
-rw-r--r-- 1 systemd-resolve systemd-resolve 735 Aug 12 13:24 stub-resolv.conf
In my case that resolv.conf contains a nameserver with a private IP, 172.27.0.2.
Just relink /etc/resolv.conf to ../run/systemd/resolve/resolv.conf on all cluster machines and restart the kube-dns pods.
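The relink and restart could look roughly like this. It is a sketch, not a tested procedure: the function names are hypothetical, `ln -sfn` assumes GNU coreutils, and the `k8s-app=kube-dns` label selector is an assumption about how the kube-dns pods are labelled in your cluster, so check it with `kubectl get pods -n kube-system --show-labels` first.

```shell
# Point the resolv.conf symlink at the file that lists the real upstream servers.
# Usage: relink_resolv_conf [LINK] [TARGET]
# Defaults: LINK=/etc/resolv.conf, TARGET=../run/systemd/resolve/resolv.conf
relink_resolv_conf() {
    # -s: create a symlink, -f: replace the existing link,
    # -n: do not follow LINK if it is a symlink to a directory
    ln -sfn "${2:-../run/systemd/resolve/resolv.conf}" "${1:-/etc/resolv.conf}"
    # Print the new link target so the change can be verified
    readlink "${1:-/etc/resolv.conf}"
}

# Delete the kube-dns pods so their controller recreates them with the new
# resolv.conf. The label selector is an assumption; adjust it to your cluster.
restart_kube_dns() {
    kubectl delete pod -n kube-system -l k8s-app=kube-dns
}
```

On each node, run `relink_resolv_conf` as root, then run `restart_kube_dns` once from a machine with cluster access.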