Safely remove master from Kubernetes HA cluster
I deployed a development K8S cluster with kops on AWS EC2 instances, originally as an HA setup with 3 masters and 3 nodes.
To save costs I now want to shut down 2 of the 3 masters and keep only 1 running.
I tried kubectl drain, but it had no effect, and simply terminating the nodes made cluster connectivity unstable.
Is there a safe way to remove the masters?
This has been discussed in the Github question - HA to single master migration.
Here is a solution.
Since kops 1.12 introduced etcd-manager, the main and events etcd clusters are backed up to S3 automatically on a regular schedule (the same bucket as KOPS_STATE_STORE).
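You can confirm these backups exist by listing the state-store bucket directly (a quick sketch; the prefix matches the restore paths used further down in this answer):
$ aws s3 ls s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main/
$ aws s3 ls s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events/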
So if your cluster was created with kops 1.12 or later, the following steps may be what you need:
- Remove the extra etcd zones from the cluster spec
$ kops edit cluster
In the etcdClusters section, remove etcdMembers entries so that only one instanceGroup remains for main and events. For example:
etcdClusters:
- etcdMembers:
  - instanceGroup: master-ap-southeast-1a
    name: a
  name: main
- etcdMembers:
  - instanceGroup: master-ap-southeast-1a
    name: a
  name: events
- Apply the changes
$ kops update cluster --yes
$ kops rolling-update cluster --yes
- Delete the 2 master instance groups
$ kops delete ig master-xxxxxx-1b
$ kops delete ig master-xxxxxx-1c
This operation cannot be undone and will delete the 2 master nodes immediately.
With 2 of your 3 masters now gone, the k8s etcd service will likely fail and the kube-api service will become unreachable. It is normal for your kops and kubectl commands to stop working after this step.
- Restart the etcd cluster on the single remaining master
This is the tricky part. SSH into the remaining master, then
$ sudo systemctl stop protokube
$ sudo systemctl stop kubelet
Download the etcd-manager-ctl tool. If you run a different etcd-manager version, adjust the download link accordingly.
$ wget https://github.com/kopeio/etcd-manager/releases/download/3.0.20190930/etcd-manager-ctl-linux-amd64
$ mv etcd-manager-ctl-linux-amd64 etcd-manager-ctl
$ chmod +x etcd-manager-ctl
$ mv etcd-manager-ctl /usr/local/bin/
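To confirm which etcd-manager version the master actually runs (so the download above matches), you can check the image tag in the static pod manifests; these paths are the usual kops locations but may differ between kops versions:
$ grep image: /etc/kubernetes/manifests/etcd.manifest
$ grep image: /etc/kubernetes/manifests/etcd-events.manifest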
Restore the backups from S3. See the official docs.
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main restore-backup 2019-10-16T09:42:37Z-000001
# do the same for events
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events restore-backup 2019-10-16T09:42:37Z-000001
This does not start the restore right away; you need to restart etcd by killing the relevant containers and starting kubelet again.
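For the container part, a minimal sketch assuming Docker is the container runtime on the master (newer kops releases use containerd, where crictl ps / crictl stop replace the docker commands):
$ sudo docker ps | grep etcd
$ sudo docker stop <container id>   # repeat for each etcd container listed above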
$ sudo systemctl start kubelet
$ sudo systemctl start protokube
Wait for the restore to complete; afterwards kubectl get nodes and kops validate cluster should work again. If they do not, you can terminate the remaining master's EC2 instance in the AWS console; the Auto Scaling Group will create a new master and the etcd cluster will be restored.
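If you prefer the AWS CLI over the console for that last resort, a hedged equivalent (the Name tag filter is an assumption based on how kops usually names master instances; verify it in your account first):
$ aws ec2 describe-instances --filters "Name=tag:Name,Values=master-ap-southeast-1a.masters.<cluster name>" --query "Reservations[].Instances[].InstanceId" --output text
$ aws ec2 terminate-instances --instance-ids <instance id>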
These are the steps required to reduce the number of master nodes in a kops-deployed cluster.
Note: before attempting the steps described here, consider whether recreating the cluster is an option.
Following these steps did eventually get me from 3 masters down to 1, but it required additional troubleshooting on several occasions. Below is everything I learned from the process; your situation may differ, so success is not guaranteed.
Prerequisites
Go to the AWS console and note the private IP (the MASTER_IP variable used later) and the availability zone (AZ) of the master node that will remain as the single master after this procedure.
You need AWS CLI access to S3 configured so that kops can work.
You need kubectl configured to work with the cluster we are about to operate on.
You may need an SSH key in case something goes wrong, so that you can reach the remaining master and recover etcd there (kubectl will no longer be usable in that situation). That scenario is currently outside the scope of this document.
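A quick sanity check of these prerequisites (optional; a minimal sketch assuming the AWS CLI and kubectl are already installed and configured):
# Confirm S3 access to the kops state store
aws s3 ls s3://<kops state store bucket>/
# Confirm kubectl points at the cluster you intend to modify
kubectl config current-context
kubectl get nodes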
Provide values for MASTER_IP, AZ, KOPS_STATE_BUCKET and CLUSTER_NAME that match your environment.
# MASTER_IP is the IP of master node in availability zone AZ (so "c" in this example)
export MASTER_IP="172.20.115.115"
export AZ="c"
export KOPS_STATE_BUCKET="mironq-prod-eu-central-1-state-store"
export CLUSTER_NAME="mironq.prod.eu-central-1.aws.svc.example.com"
# no need to change the following commands unless you use a different version of etcd
export BACKUP_MAIN="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/main"
export BACKUP_EVENT="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/events"
export ETCD_CMD="/opt/etcd-v3.4.3-linux-amd64/etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001"
CONTAINER=$(kubectl get pod -l k8s-app=etcd-manager-main -o=jsonpath='{.items[*].metadata.name}'|tr ' ' '\n'|grep ${MASTER_IP})
Note: your CONTAINER variable should now contain the name of the etcd-manager pod running on the master that will be kept, i.e.:
$ echo $CONTAINER
etcd-manager-main-ip-172-20-109-104.eu-central-1.compute.internal
Now confirm that etcd backups exist and are recent (no more than ~15 minutes old), and check the current members of the cluster.
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n
# Confirm current members of existing Etcd cluster
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list
Delete the etcd nodes
Get the IDs of the etcd nodes that are to be removed
MASTERS2DELETE=$(kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list|grep -v etcd-${AZ}|cut -d, -f1)
#$ echo $MASTERS2DELETE
#efb9893f347468eb ffea6e819b91a131
Now you are ready to remove the unwanted etcd nodes
for MASTER in ${MASTERS2DELETE};do echo "Deleting ETCD node ${MASTER}"; kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member remove ${MASTER}; done
# a few minutes may be needed after this has been executed
# Confirm only one member is left
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list
You will also see that some master nodes are NotReady
$ kubectl get node
Schedule the backup restore
!!! IMPORTANT !!!
Now make sure a fresh backup has been taken before continuing. By default, etcd-manager takes a backup every 15 minutes. Wait for a new one to appear, because it will contain the expected number of nodes (=1).
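A convenience sketch for this wait: re-run the same list-backups call until an entry newer than the etcd member removal shows up at the bottom of the list.
# Repeat every minute until a fresh backup appears (etcd-manager's default interval is 15 minutes)
watch -n 60 "kubectl exec -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups | sort -n | tail -3"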
Now that a new backup of this single-node cluster exists, we can schedule it to be restored after the restart.
The code below includes commented-out responses to help you determine whether your commands executed as expected.
Schedule the restore of the "main" cluster.
BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n)
#echo "$BACKUP_LIST"
#[...]
#2020-12-17T14:55:55Z-000001
#2020-12-17T15:11:05Z-000002
#2020-12-17T15:26:13Z-000003
#2020-12-17T15:41:14Z-000001
#2020-12-17T15:56:20Z-000001
#2020-12-17T16:11:35Z-000004
#2020-12-17T16:26:41Z-000005
LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)
# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP
#2020-12-17T16:26:41Z-000005
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main
#I1217 16:41:59.101078 11608 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main/control/2020-12-17T16:41:59Z-000000/_command.json: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" >
#added restore-backup command: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" >
Schedule the restore of the "events" cluster.
BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n)
#$ echo "$BACKUP_LIST"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:48:41.230896 17761 vfs.go:102] listed backups in s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events: [2020-12-17T14:56:08Z-000001 2020-12-17T15:11:17Z-000001 2020-12-17T15:26:26Z-000002 2020-12-17T15:41:27Z-000002 2020-12-17T15:56:32Z-000003 2020-12-17T16:11:41Z-000003 2020-12-17T16:26:48Z-000004 2020-12-17T16:41:56Z-000001]
#2020-12-17T14:56:08Z-000001
#2020-12-17T15:11:17Z-000001
#2020-12-17T15:26:26Z-000002
#2020-12-17T15:41:27Z-000002
#2020-12-17T15:56:32Z-000003
#2020-12-17T16:11:41Z-000003
#2020-12-17T16:26:48Z-000004
#2020-12-17T16:41:56Z-000001
LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)
# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP
#2020-12-17T16:41:56Z-000001
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:53:17.876318 21958 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events/control/2020-12-17T16:53:17Z-000000/_command.json: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" >
#added restore-backup command: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" >
Check that the endpoint is still healthy (it should be)
# check if endpoint is healthy
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} endpoint health
#https://127.0.0.1:4001 is healthy: successfully committed proposal: took = 7.036109ms
Delete the instance groups
Example list of instance groups
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} get ig
#NAME ROLE MACHINETYPE MIN MAX ZONES
#bastions Bastion t3.micro 1 1 eu-central-1a,eu-central-1b,eu-central-1c
#master-eu-central-1a Master t3.medium 1 1 eu-central-1a
#master-eu-central-1b Master t3.medium 1 1 eu-central-1b
#master-eu-central-1c Master t3.medium 1 1 eu-central-1c
#nodes Node t3.medium 2 6 eu-central-1a,eu-central-1b
Delete the master instance groups in the availability zones whose etcd nodes we disabled (a and b in this example, since we want to keep c running as the only master).
Edit the command below, replacing [AZ-letter] to match your case.
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} delete ig master-eu-central-1[AZ-letter]
#InstanceGroup "master-eu-central-1a" found for deletion
#I1217 17:01:39.035294 2538280 delete.go:54] Deleting "master-eu-central-1a"
#Deleted InstanceGroup: "master-eu-central-1a"
Next the cluster is edited manually by invoking edit mode with the command shown below. The goal here is to make the cluster configuration match the remaining etcd node: following the example below, you need to remove the entries for the nodes that no longer exist.
Change this:
etcdClusters:
- cpuRequest: 200m
  etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: main
- cpuRequest: 100m
  etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: events
to this (leaving only the zone that still has a master):
etcdClusters:
- cpuRequest: 200m
  etcdMembers:
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: main
- cpuRequest: 100m
  etcdMembers:
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: events
The command below opens an editor where these changes can be made.
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} edit cluster
Apply the kops changes
Apply the changes and force re-creation of the master node (the second command will make the cluster unresponsive until the new master has been created and is back online).
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} update cluster --yes
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} rolling-update cluster --cloudonly --yes
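Once the replacement master is back online, the usual checks should pass again (this can take several minutes while the new node boots and etcd restores its data):
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} validate cluster
kubectl get nodes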
Troubleshooting
Both the "main" and "events" etcd clusters must come back online before the API starts again. If the API server logs complain about being unable to connect to port 4001, your "main" etcd cluster has not come up; if the port number is 4002, it is "events".
Just above, you instructed the etcd clusters to import their backups; that has to complete before the cluster can start.
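A hedged way to check this from the remaining master over SSH (the log paths below are the usual kops defaults and may differ in your setup):
# etcd connection errors from the API server show which cluster is still down
sudo grep -E "127.0.0.1:(4001|4002)" /var/log/kube-apiserver.log | tail
# etcd-manager logs for "main" (port 4001) and "events" (port 4002)
sudo tail -n 50 /var/log/etcd.log /var/log/etcd-events.log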