kubernetes UnexpectedAdmissionError after rollout

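I have a service that fails to respond to some HTTP requests. Digging through its logs, the cause appears to be some kind of DNS failure when reaching the proxy service: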

'proxy' failed to resolve 'proxy.default.svc.cluster.local' after 2 queries

I couldn't find anything wrong, so I tried kubectl rollout restart deployment/backend. Immediately afterwards, these showed up in the pod list:

backend-54769cbb4-xkwf2              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-xlpgf              0/1     UnexpectedAdmissionError   0          4h4m
backend-54769cbb4-xmnr5              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-xmq5n              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-xphrw              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-xrmrq              0/1     UnexpectedAdmissionError   0          4h1m
backend-54769cbb4-xrmw8              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-xt4ck              0/1     UnexpectedAdmissionError   0          4h4m
backend-54769cbb4-xws8r              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-xx6r4              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-xxpfd              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-xzjql              0/1     UnexpectedAdmissionError   0          4h4m
backend-54769cbb4-xzzlk              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-z46ms              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-z4sl7              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-z6jpj              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-z6ngq              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-z8w4h              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-z9jqb              0/1     UnexpectedAdmissionError   0          4h3m
backend-54769cbb4-zbvqm              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zcfxg              0/1     UnexpectedAdmissionError   0          4h3m
backend-54769cbb4-zcvqm              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-zf2f8              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zgnkh              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-zhdr8              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zhx6g              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-zj8f2              0/1     UnexpectedAdmissionError   0          4h3m
backend-54769cbb4-zjbwp              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-zjc8g              0/1     UnexpectedAdmissionError   0          4h3m
backend-54769cbb4-zjdcp              0/1     UnexpectedAdmissionError   0          4h4m
backend-54769cbb4-zkcrb              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-zlpll              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zm2cx              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-zn7mr              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-znjkp              0/1     UnexpectedAdmissionError   0          4h3m
backend-54769cbb4-zpnk7              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zrrl7              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zsdsz              0/1     UnexpectedAdmissionError   0          4h4m
backend-54769cbb4-ztdx8              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-ztln6              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-ztplg              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-ztzfh              0/1     UnexpectedAdmissionError   0          4h2m
backend-54769cbb4-zvb8g              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-zwsr8              0/1     UnexpectedAdmissionError   0          4h7m
backend-54769cbb4-zwvxr              0/1     UnexpectedAdmissionError   0          4h5m
backend-54769cbb4-zwx6h              0/1     UnexpectedAdmissionError   0          4h6m
backend-54769cbb4-zz4bf              0/1     UnexpectedAdmissionError   0          4h1m
backend-54769cbb4-zzq6t              0/1     UnexpectedAdmissionError   0          4h2m

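(and many more)

So I added two more nodes, and now everything seems fine, except for this long list of pods stuck in an error state I don't understand. What is UnexpectedAdmissionError, and what should I do about it?

Note: this is a DigitalOcean cluster.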

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:36Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

The following seems important, from kubectl describe one_failed_pod:

Events:
  Type     Reason                    Age    From                    Message
  ----     ------                    ----   ----                    -------
  Normal   Scheduled                 2m51s  default-scheduler       Successfully assigned default/backend-549f576d5f-xzdv4 to std-16gb-g7mo
  Warning  UnexpectedAdmissionError  2m51s  kubelet, std-16gb-g7mo  Update plugin resources failed due to failed to write checkpoint file "kubelet_internal_checkpoint": write /var/lib/kubelet/device-plugins/.543592130: no space left on device, which is unexpected.

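I ran into the same issue. When describing one of the pods in UnexpectedAdmissionError, I saw the following:

Update plugin resources failed due to failed to write deviceplugin checkpoint file "kubelet_internal_checkpoint": write /var/lib/kubelet/device-plugins/.525608957: no space left on device, which is unexpected.

When describing the node:

OutOfDisk Unknown Tue, 30 Jun 2020 14:07:04 -0400 Tue, 30 Jun 2020 14:12:05 -0400 NodeStatusUnknown Kubelet stopped posting node status.

I resolved it by rebooting the node.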
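
Both error messages above point at a full filesystem under /var/lib/kubelet. Here is a minimal sketch for confirming that on the node itself before rebooting. It assumes you have shell access to the node and a df that accepts -P and -i (GNU coreutils does). disk_report is a hypothetical helper name, not part of any Kubernetes tooling. It prints inode usage too, because "no space left on device" can also mean the filesystem has run out of inodes while blocks are still free.

```shell
#!/bin/sh
# disk_report DIR
# Hypothetical helper: print block and inode usage for the filesystem
# that holds DIR. Intended to be run on the affected node.
disk_report() {
  dir="$1"
  # -P gives one line per filesystem; column 5 is the capacity percentage.
  df -Pk "$dir" | awk 'NR == 2 { print "blocks used: " $5 }'
  # -i switches df to inode counts; column 5 is then the inode usage percentage.
  df -Pi "$dir" | awk 'NR == 2 { print "inodes used: " $5 }'
}
```

On the node this would be called as disk_report /var/lib/kubelet. A value at or near 100% on either line would be consistent with the checkpoint write failing.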

Because the pod never even started, you can't actually check its logs. Describing the pod, however, gave me the error. We had some disk/CPU/memory utilization issues on the worker5 node.

kubectl get pods -A -o wide | grep -i err
kube-system      coredns-autoscaler-79599b9dc6-6l8s8                            0/1     UnexpectedAdmissionError   0          10h     <none>        worker5   <none>           <none>
kube-system      coredns-autoscaler-79599b9dc6-kzt9z                            0/1     UnexpectedAdmissionError   0          10h     <none>        worker5   <none>           <none>
kube-system      coredns-autoscaler-79599b9dc6-tgkrc                            0/1     UnexpectedAdmissionError   0          10h     <none>        worker5   <none>           <none>
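
When many pods fail at once, it can help to see whether they all landed on the same node, as they did on worker5 here. The sketch below counts pods with a given status per node. It assumes the input has the column layout of the listing above (namespace first, node name in the eighth column, as produced with -A -o wide) and that no column contains embedded spaces. failed_by_node is a hypothetical helper name.

```shell
#!/bin/sh
# failed_by_node STATUS
# Hypothetical helper: read a wide, all-namespaces pod listing on stdin
# and print "<count> <node>" for pods whose STATUS column equals STATUS,
# highest count first.
failed_by_node() {
  awk -v status="$1" '
    $4 == status { count[$8]++ }                    # column 4: STATUS, column 8: NODE
    END { for (n in count) print count[n], n }
  ' | sort -rn
}
```

A possible invocation would be kubectl get pods -A -o wide | failed_by_node UnexpectedAdmissionError. A single node dominating the output suggests a node-level problem, such as a full disk, and not a problem with the workload.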

kubectl describe pod -n kube-system coredns-autoscaler-79599b9dc6-kzt9z

Reason:         UnexpectedAdmissionError
Message:        Pod Allocate failed due to failed to write checkpoint file "kubelet_internal_checkpoint": mkdir /var: file exists, which is unexpected

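The first step was to reboot the node, which resolved the problem. The root cause was that we had restored some backups onto a new cluster, and the restore process triggered this issue.

As for the pods, they are part of a ReplicaSet, so replacements were created on other worker nodes. We therefore simply deleted the failed pods.

To delete a large number of pods quickly, you can use: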

kubectl get pods -n namespace | grep -i Error | cut -d' ' -f 1 | xargs kubectl delete pod -n namespace

To delete all errored pods across the whole cluster:

kubectl get pods -A | grep -i Error | awk '{print $1, $2}' | xargs -n 2 kubectl delete pod -n

The -A/--all-namespaces flag lists pods from every namespace in the cluster.
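
Matching with grep -i Error inspects the whole line, so it would also catch a pod whose name happens to contain "error". A slightly stricter sketch compares the STATUS column exactly. It assumes the default kubectl get pods -A layout (a header line, then NAMESPACE, NAME, READY, STATUS, ...). failed_pods is a hypothetical helper name.

```shell
#!/bin/sh
# failed_pods STATUS
# Hypothetical helper: read `kubectl get pods -A` output on stdin and
# print "<namespace> <name>" for each pod whose STATUS column equals STATUS.
failed_pods() {
  # NR > 1 skips the header line; column 4 is STATUS in the -A layout.
  awk -v status="$1" 'NR > 1 && $4 == status { print $1, $2 }'
}

# Possible usage (review the list before piping it into a delete):
#   kubectl get pods -A | failed_pods UnexpectedAdmissionError
#   kubectl get pods -A | failed_pods UnexpectedAdmissionError \
#     | xargs -n 2 kubectl delete pod -n
```

Printing the namespace together with the name lets xargs -n 2 pass each pair to kubectl delete pod -n, so pods outside the default namespace are handled as well.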

However, if they are not recreated automatically, which would be odd, you can run kubectl replace:

kubectl get pod coredns-autoscaler-79599b9dc6-6l8s8 -n kube-system -o yaml | kubectl replace --force -f -

For further reading, see kubectl replace --help and the following blog