Re-installed node cannot join Kubernetes cluster
I installed a working 3-node k8s cluster (v1.21.0 on Ubuntu 20.04 bare metal) using kubeadm. I removed one of the nodes and re-installed it from scratch (wiped disk, fresh OS, but same IP address). Now it cannot join the cluster:
# kubeadm join k8s.example.com:6443 --token who21h.jolq7z79twv7bf4m \
--discovery-token-ca-cert-hash sha256:f63c5786cea2be46c999f4b5c595abd0aa24896c3b37616c347df318d7406c00 \
--control-plane
...
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://65.21.128.36:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
I ran the same command with --v=5 (after kubeadm reset), and it got stuck, repeatedly logging this:
Failed to get etcd status for https://123.123.123.123:2379: failed to dial endpoint https://123.123.123.123:2379 with maintenance client: context deadline exceeded
123.123.123.123 is the IP address of the node I am trying to return to the cluster.
Running kubectl get nodes on one of the other masters lists only the 2 remaining masters. I removed the problematic node properly:
kubectl get nodes
kubectl drain <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
kubectl delete node <node-name>
Any ideas? Thanks.
Look closely at the error message you received:
Failed to get etcd status for https://123.123.123.123:2379: failed to dial endpoint https://123.123.123.123:2379 with maintenance client: context deadline exceeded
This is a fairly common, well-documented issue related to the etcd cluster. Compare with the following threads:
- Control plain won't join #81071
- kubeadm join is not fault tolerant to etcd endpoint failures #1432
- etcd becomes unhealthy after I delete one of master node, I am look for a fix
- Restoring etcd quorum
Specifically, it is related to a loss of etcd quorum. You can check this as described here.
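One quick way to check quorum is to ask etcd itself for cluster-wide endpoint health. This is a sketch assuming a kubeadm-managed stacked etcd, where the certificates live under /etc/kubernetes/pki/etcd and the command is run inside one of the surviving etcd pods:

```shell
# Query the health of every member's endpoint; the removed/re-installed
# node's endpoint should report as unhealthy or unreachable.
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```

If a member is listed as unhealthy but still appears in `member list`, that stale entry is what blocks kubeadm's check-etcd phase.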
The workaround is described step by step in this comment:
For the record, here are the commands to run on one of the remaining etcd pods:
Find the id of the member to remove
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
5a4945140f0b39d9, started, sbg2-k8s001, https://192.168.208.12:2380, https://192.168.208.12:2379
740381e3c57ef823, started, gra3-k8s001, https://192.168.208.13:2380, https://192.168.208.13:2379
77a8fbb530b10f4a, started, rbx4-k8s001, https://192.168.208.14:2380, https://192.168.208.14:2379
I want to remove 740381e3c57ef823
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member remove 740381e3c57ef823
Member 740381e3c57ef823 removed from cluster a2c90ef66bb95cc9
Checking
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
5a4945140f0b39d9, started, sbg2-k8s001, https://192.168.208.12:2380, https://192.168.208.12:2379
77a8fbb530b10f4a, started, rbx4-k8s001, https://192.168.208.14:2380, https://192.168.208.14:2379
Now I can join my new master.
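Once the stale member is removed, the re-installed node can be joined again. A sketch of the remaining steps, assuming kubeadm defaults (the token, hash, and certificate key below are placeholders, not the real values from the original command):

```shell
# On an existing control-plane node: re-upload the control-plane
# certificates and print a fresh --certificate-key (valid ~2 hours).
kubeadm init phase upload-certs --upload-certs

# On the re-installed node: clear any leftover state from the failed
# attempt, then join as a control-plane member.
kubeadm reset -f
kubeadm join k8s.example.com:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>
```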