Kubernetes HA Cluster using kubeadm with nginx LB not working when 1 master node is down --Error from server: etcdserver: request timed out
I have set up a Kubernetes HA cluster (stacked etcd) using kubeadm. When I intentionally shut down one master node, the whole cluster stops working and I get this error:
[vagrant@k8s-master01 ~]$ kubectl get nodes
Error from server: etcdserver: request timed out
I am using Nginx to load-balance the kube-apiservers:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master01 Ready master 27d v1.19.2 192.168.30.5 <none> CentOS Linux 7 (Core) 3.10.0-1127.19.1.el7.x86_64 docker://19.3.11
k8s-master02 Ready master 27d v1.19.2 192.168.30.6 <none> CentOS Linux 7 (Core) 3.10.0-1127.19.1.el7.x86_64 docker://19.3.11
k8s-worker01 Ready <none> 27d v1.19.2 192.168.30.10 <none> CentOS Linux 7 (Core) 3.10.0-1127.19.1.el7.x86_64 docker://19.3.11
k8s-worker02 Ready <none> 27d v1.19.2 192.168.30.11 <none> CentOS Linux 7 (Core) 3.10.0-1127.19.1.el7.x86_64 docker://19.3.11
[vagrant@k8s-master01 ~]$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-wkknl 0/1 Running 9 27d
coredns-f9fd979d6-wp854 1/1 Running 8 27d
etcd-k8s-master01 1/1 Running 46 27d
etcd-k8s-master02 1/1 Running 10 27d
kube-apiserver-k8s-master01 1/1 Running 60 27d
kube-apiserver-k8s-master02 1/1 Running 13 27d
kube-controller-manager-k8s-master01 1/1 Running 20 27d
kube-controller-manager-k8s-master02 1/1 Running 15 27d
kube-proxy-7vn9l 1/1 Running 7 26d
kube-proxy-9kjrj 1/1 Running 7 26d
kube-proxy-lbmkz 1/1 Running 8 27d
kube-proxy-ndbp5 1/1 Running 9 27d
kube-scheduler-k8s-master01 1/1 Running 20 27d
kube-scheduler-k8s-master02 1/1 Running 15 27d
weave-net-77ck8 2/2 Running 21 26d
weave-net-bmpsf 2/2 Running 24 27d
weave-net-frchk 2/2 Running 27 27d
weave-net-zqjzf 2/2 Running 22 26d
[vagrant@k8s-master01 ~]$
Nginx config:
stream {
    upstream apiserver_read {
        server 192.168.30.5:6443;
        server 192.168.30.6:6443;
    }

    server {
        listen     6443;
        proxy_pass apiserver_read;
    }
}
Nginx logs:
2020/10/19 09:12:01 [error] 1215#0: *12460 no live upstreams while connecting to upstream, client: 192.168.30.11, server: 0.0.0.0:6443, upstream: "apiserver_read", bytes from/to client:0/0, bytes from/to upstream:0/0
2020/10/19 09:12:01 [error] 1215#0: *12465 no live upstreams while connecting to upstream, client: 192.168.30.5, server: 0.0.0.0:6443, upstream: "apiserver_read", bytes from/to client:0/0, bytes from/to upstream:0/0
2020/10/19 09:12:02 [error] 1215#0: *12466 no live upstreams while connecting to upstream, client: 192.168.30.10, server: 0.0.0.0:6443, upstream: "apiserver_read", bytes from/to client:0/0, bytes from/to upstream:0/0
2020/10/19 09:12:02 [error] 1215#0: *12467 no live upstreams while connecting to upstream, client: 192.168.30.11, server: 0.0.0.0:6443, upstream: "apiserver_read", bytes from/to client:0/0, bytes from/to upstream:0/0
2020/10/19 09:12:02 [error] 1215#0: *12468 no live upstreams while connecting to upstream, client: 192.168.30.5, server: 0.0.0.0:6443, upstream: "apiserver_read", bytes from/to client:0/0, bytes from/to upstream:0/0
I have the same setup (stacked etcd, but with keepalived and HAProxy instead of nginx) and I ran into the same problem.
You need at least 3 (!) control-plane nodes. Only then can you take one of the three down without losing functionality.
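For completeness, adding a third control-plane node to a kubeadm cluster usually looks roughly like the sketch below. This is only an illustration: the load-balancer endpoint, token, CA hash and certificate key are placeholders, not values from this cluster.

# On an existing control-plane node: re-upload the control-plane certificates
# and note the certificate key that kubeadm prints
$ sudo kubeadm init phase upload-certs --upload-certs
# Print a fresh join command (token + discovery hash)
$ sudo kubeadm token create --print-join-command
# On the new node: join as an additional control-plane member
$ sudo kubeadm join <load-balancer>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>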
With 3 of 3 control-plane nodes running:
$ kubectl get pods -n kube-system
[...list of pods...]
With 2 of 3 control-plane nodes running:
$ kubectl get pods -n kube-system
[...list of pods...]
With 1 of 3 control-plane nodes running:
$ kubectl get pods -n kube-system
Error from server: etcdserver: request timed out
Back to 2 of 3:
$ kubectl get pods -n kube-system
[...list of pods...]
ETCD
The timeouts come from etcd: it is a distributed key-value store that needs a quorum to be healthy. This basically means that all members of the etcd cluster vote on certain decisions and the majority decides what to do. With 3 members you can always lose 1, because the remaining 2 still form a majority.
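As a rule of thumb, quorum is floor(n/2) + 1, which is why going from 1 to 2 members adds no fault tolerance at all:

members   quorum   failures tolerated
1         1        0
2         2        0
3         2        1
5         3        2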
The problem with 2 members is that when 1 goes down, the last remaining etcd member keeps waiting for a majority vote before deciding anything, and on its own it can never reach one.
That is why you always need an odd number of control-plane (master) nodes in a Kubernetes cluster.
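If you want to see this from etcd's side, you can query the member status directly. A minimal sketch, assuming the stacked-etcd pod name from the output above and the default kubeadm certificate paths under /etc/kubernetes/pki/etcd:

$ kubectl -n kube-system exec etcd-k8s-master01 -- etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint status --cluster -w table

Note that once quorum is lost the API server itself stops answering, so in that state you would run the same etcdctl command via docker exec (or crictl exec) on a surviving control-plane node instead of kubectl exec.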