kubernetes集群间歇性丢失

Intermittent loss of kubernetes cluster

我一直在尝试诊断几天前才开始出现的问题。

运行 kubelet, kubeadm 版本 1.13.1。集群有 5 个节点,并且在上周晚些时候之前几个月都没有问题。 运行 这在 RHEL 7.x 盒子上有足够的免费资源。

遇到集群资源(api、调度程序等)变得不可用的奇怪问题。这最终会自行纠正,集群会再次恢复一段时间。

如果我执行 sudo systemctl restart kubelet 集群中的所有内容再次正常工作,直到出现间歇性异常。

我正在监视 journactl 日志以查看发生这种情况时发生了什么,突出的块是:

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.762004 I | etcdserver: skipped leadership transfer for single member cluster
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763648       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1beta1.Event ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.762788       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.762910       1 reflector.go:270] storage/cacher.go:/podsecuritypolicy: watch of *policy.PodSecurityPolicy ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763149       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763232       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763439       1 reflector.go:270] storage/cacher.go:/apiregistration.k8s.io/apiservices: watch of *apiregistration.APIService ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763719       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.763786       1 reflector.go:270] storage/cacher.go:/daemonsets: watch of *apps.DaemonSet ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.763937       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764016       1 reflector.go:270] storage/cacher.go:/cronjobs: watch of *batch.CronJob ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764250       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.764324       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764386       1 reflector.go:270] storage/cacher.go:/services/endpoints: watch of *core.Endpoints ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.764440       1 reflector.go:270] storage/cacher.go:/deployments: watch of *apps.Deployment ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: WARNING: 2018/12/26 21:28:06 grpc: addrConn.transportMonitor exits due to: context canceled
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765201 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: 2018-12-26 21:28:06.765384 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")

...

Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784805       1 reflector.go:270] storage/cacher.go:/controllerrevisions: watch of *apps.ControllerRevision ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.784871       1 reflector.go:270] storage/cacher.go:/pods: watch of *core.Pod ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.786587       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.786700       1 reflector.go:270] storage/cacher.go:/horizontalpodautoscalers: watch of *autoscaling.HorizontalPodAutoscaler ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: E1226 21:28:06.788274       1 watcher.go:208] watch chan error: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain dockerd-current[1609]: W1226 21:28:06.788385       1 reflector.go:270] storage/cacher.go:/crd.projectcalico.org/clusterinformations: watch of *unstructured.Unstructured ended with: Internal error occurred: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
Dec 26 15:28:06 thalia0.domain oci-systemd-hook[9353]: systemdhook <debug>: 02cb55687848: Skipping as container command is etcd, not init or systemd
Dec 26 15:28:06 thalia0.domain oci-umount[9355]: umounthook <debug>: 02cb55687848: only runs in prestart stage, ignoring
Dec 26 15:28:07 thalia0.domain dockerd-current[1609]: time="2018-12-26T15:28:07.003175741-06:00" level=warning msg="02cb556878485b24e4705dd0efe1051c02f3e3bbbe7b8a7ab23ea71bd6d82b2f cleanup: failed to unmount secrets: invalid argument"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.006714   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"
Dec 26 15:28:07 thalia0.domain kubelet[24604]: E1226 15:28:07.040361   24604 pod_workers.go:190] Error syncing pod 0264932236d6afef396f466fc3bd3181 ("etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=etcd pod=etcd-thalia0.domain_kube-system(0264932236d6afef396f466fc3bd3181)"

为了减少日志中的噪音,我封锁了其他节点。

如前所述,如果我重新启动 kubelet 服务,一段时间内一切正常,然后出现间歇性行为。

欢迎提出任何建议。我正在与我们的系统管理员合作,他说 etcd 似乎经常重启。我认为当 CrashLoopBackOff 开始发生时,麻烦就开始了。

实际上似乎是一个 RHEL/Docker 错误。参见 Bug 1655214 - docker exec does not work with registry.access.redhat.com/rhel7:7.3

我们已经应用了修复程序,到目前为止情况似乎很稳定。