Kubernetes Cluster with finished jobs unstable; kubelet logs filled with "http2: no cached connection was available"
Summary
I have various single-node Kubernetes clusters that become unstable after accumulating roughly 300 completed Jobs.
Taking one cluster as an example, it has 303 completed Jobs:
root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303
Observations
What I am observing is that:
- The kubelet logs fill up with error messages like this:
kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available
- The node status is not being updated, with similar error messages:
kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available
- Eventually the node is marked NotReady and no new pods are scheduled:
NAME STATUS ROLES AGE VERSION
xxxxx NotReady master 6d4h v1.12.1
- The cluster keeps entering and exiting master disruption mode (from the kube-controller-manager logs):
I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.
The real culprit seems to be the http2: no cached connection was available error message. The only real references I could find are a few issues in the Go repository (such as #16582), which appear to have been fixed a long time ago.
In most cases, deleting the completed Jobs seems to restore system stability.
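For reference, that cleanup can be scripted with kubectl. This is a sketch of my own, assuming everything lives in the default namespace and that the cluster version supports the status.successful field selector for Jobs:
#!/bin/bash
# Bluntest form: remove every Job in the namespace.
# kubectl delete jobs --all
# More selective: only remove Jobs that completed successfully
# (assumes Jobs support the status.successful field selector in this version).
kubectl delete jobs --field-selector status.successful=1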
Minimal reproduction (TBD)
I seem to be able to reproduce the issue by creating lots of Jobs whose containers mount a ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
      - name: config-volume
        configMap:
          name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4
Then schedule a lot of these Jobs:
#!/bin/bash
for i in $(seq 100 399); do
  sed "s/%JOB_ID%/$i/g" job.yaml | kubectl create -f -
  sleep 0.1
done
Question
I am curious what is causing this, since ~300 completed Jobs seems like a fairly low number.
Is this a configuration issue with my cluster? A possible bug in Kubernetes/Go? Is there anything else I can try?
Just to summarize the issue and why it happens: this is really a problem related to 1.12 and 1.13. As described in the GitHub issue (probably created by the author), it seems to be an issue with the http2 connection pool implementation, or, as explained in one of the comments, a connection management problem in the kubelet. Described ways of mitigating it can be found here. If you need more information, all the links can be found in the linked GitHub issue.
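As an illustration only (my own sketch, not taken from that issue): one way to keep the number of accumulated finished Jobs low is the TTL-after-finished mechanism, assuming the alpha TTLAfterFinished feature gate is enabled on 1.12/1.13. A trimmed variant of the reproduction Job above would look like:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  # Assumption: requires the alpha TTLAfterFinished feature gate on 1.12/1.13.
  # The Job (and its pods) are deleted automatically 10 minutes after it finishes.
  ttlSecondsAfterFinished: 600
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(20)"]
      restartPolicy: Never
  backoffLimit: 4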