Accidentally drained all nodes in Kubernetes (even the master). How can I bring my Kubernetes back?
I accidentally drained all of the nodes in my Kubernetes cluster (even the master). How can I bring my Kubernetes back? kubectl no longer works:
kubectl get nodes
Result:
The connection to the server 172.16.16.111:6443 was refused - did you specify the right host or port?
Here is the output of systemctl status kubelet on the master node (node1):
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2020-06-23 21:42:39 UTC; 25min ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 15541 (kubelet)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/kubelet.service
└─15541 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.16.16.111 --hostname-override=node1 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --cpu-manager-policy=static --kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330009 15541 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330201 15541 setters.go:73] Using node IP: "172.16.16.111"
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331475 15541 kubelet_node_status.go:472] Recording NodeHasSufficientMemory event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331494 15541 kubelet_node_status.go:472] Recording NodeHasNoDiskPressure event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331500 15541 kubelet_node_status.go:472] Recording NodeHasSufficientPID event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331661 15541 policy_static.go:244] [cpumanager] static policy: RemoveContainer (container id: 6dd59735cabf973b6d8b2a46a14c0711831daca248e918bfcfe2041420931963)
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.332058 15541 pod_workers.go:191] Error syncing pod 93ff1a9840f77f8b2b924a85815e17fe ("kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.427587 15541 kubelet.go:2267] node "node1" not found
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.506152 15541 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Get https://172.16.16.111:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.16.16.111:6443: connect: connection refused
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.527813 15541 kubelet.go:2267] node "node1" not found
I am on Ubuntu 18.04 and there are 7 compute nodes in my cluster. All of them got drained (accidentally, sort of!). I installed the K8s cluster using Kubespray.
Is there any way to uncordon these nodes so that the essential k8s pods can be scheduled again?
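Normally I would just run something like the following for each node, but kubectl cannot reach the API server at all right now:
kubectl uncordon node1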
Any help would be much appreciated.
Update:
I have asked a separate question here about how to connect to etcd:
If you have production or 'live' workloads, the safest approach is to provision a new cluster and gradually switch the workloads over.
Kubernetes keeps its state in etcd, so you could potentially connect to etcd and clear the 'drained' state, but you would likely have to look at the source code to see where that happens and which specific keys/values are stored in etcd.
The logs you shared basically show that kube-apiserver cannot start, so it is most likely trying to connect to etcd on startup and etcd is telling it: "you cannot start on this node because it has been drained."
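As a first check, it may help to confirm whether etcd itself is actually up on that master. With Kubespray the deployment details vary (etcd may run as a systemd unit or as a container), so the following is only a rough sketch, and the container id is a placeholder you would have to fill in:
# if etcd runs as a systemd service
systemctl status etcd
# if etcd runs as a Docker container
docker ps -a | grep etcd
# the kube-apiserver container logs usually say why it keeps crash-looping
docker ps -a | grep kube-apiserver
docker logs <kube-apiserver-container-id>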
The typical startup sequence on the masters looks something like this (a quick way to check each stage is sketched right after the list):
- etcd
- kube-apiserver
- kube-controller-manager
- kube-scheduler
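One rough way to sanity-check that chain, assuming Kubespray's usual etcd certificate locations under /etc/ssl/etcd/ssl/ (the exact file names and the node name here are assumptions, so adjust them to your environment):
# 1. is etcd itself healthy? (cert paths below are assumed Kubespray defaults)
sudo ETCDCTL_API=3 etcdctl --endpoints=https://172.16.16.111:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/member-node1.pem \
  --key=/etc/ssl/etcd/ssl/member-node1-key.pem \
  endpoint health
# 2. does kube-apiserver respond at all?
curl -k https://172.16.16.111:6443/healthz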
You could also follow any guide on connecting to etcd and see if you can troubleshoot further, for example this one. Then you could examine/delete some of the node keys, at your own risk (a rough etcdctl sketch follows the list below):
/registry/minions/node-x1
/registry/minions/node-x2
/registry/minions/node-x3
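As a purely hypothetical sketch (it reuses the same assumed endpoint and certificate paths as the health check above; note that Node objects are stored as protobuf, so the values will look binary), you could list the keys first and only then decide whether to delete anything:
# list the Node keys under /registry/minions (cert paths are assumptions for a Kubespray install)
ETCDCTL_API=3 etcdctl --endpoints=https://172.16.16.111:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/member-node1.pem \
  --key=/etc/ssl/etcd/ssl/member-node1-key.pem \
  get /registry/minions --prefix --keys-only
# deleting a key removes the Node object entirely, at your own risk:
# ETCDCTL_API=3 etcdctl ... del /registry/minions/node-x1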