Adding a "cooldown" or "pause time" period between terminating 2 or more instances while reducing the Desired Capacity
(Apologies in advance, as I'm new to AWS.)
I'm managing my ECS cluster with a CloudFormation stack.
Say we have an ASG with a Desired Capacity of 5 EC2 instances (MinSize: 1, MaxSize: 7). If I manually change the Desired Capacity from 5 to 2, the cluster change set reduces the instance count and all the surplus instances shut down immediately. ECS has no time to reschedule the containers they were running onto the remaining instances.
So going from 5 instances to 2, all 3 surplus instances are terminated at once. With bad luck, if every container of a given type happens to be on those 3 machines, those containers are gone and the service is down.
Is it possible to have a "cooldown" between each termination?
Scaling policies obviously don't help here, since we don't want to configure metrics: the available metrics don't fit my use case.
Please find some logs below:
2021-01-15 15:45:52 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
2021-01-15 15:45:52 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2021-01-15 15:45:54 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 15:47:40 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 15:47:40 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Successfully terminated instance(s) [i-X] (Progress 33%).
2021-01-15 15:52:42 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 15:53:59 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 15:53:59 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Successfully terminated instance(s) [i-X] (Progress 67%).
2021-01-15 15:59:02 UTC+0100 dev-cluster UPDATE_ROLLBACK_IN_PROGRESS The following resource(s) failed to update: [autoScalingGroup].
2021-01-15 15:59:17 UTC+0100 securityGroup UPDATE_IN_PROGRESS -
2021-01-15 15:59:32 UTC+0100 securityGroup UPDATE_COMPLETE -
2021-01-15 15:59:33 UTC+0100 launchConfiguration UPDATE_COMPLETE -
2021-01-15 15:59:34 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS -
2021-01-15 15:59:37 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
2021-01-15 15:59:37 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2021-01-15 15:59:38 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 16:01:25 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 16:01:25 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Successfully terminated instance(s) [i-X] (Progress 50%).
2021-01-15 16:01:46 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Received SUCCESS signal with UniqueId i-X
2021-01-15 16:01:47 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 16:03:34 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 16:03:34 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Received SUCCESS signal with UniqueId i-X
2021-01-15 16:03:34 UTC+0100 autoScalingGroup UPDATE_IN_PROGRESS Successfully terminated instance(s) [i-X] (Progress 100%).
2021-01-15 16:03:37 UTC+0100 autoScalingGroup UPDATE_COMPLETE -
2021-01-15 16:03:37 UTC+0100 dev-cluster UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS -
2021-01-15 16:03:38 UTC+0100 launchConfiguration DELETE_IN_PROGRESS -
2021-01-15 16:03:39 UTC+0100 dev-cluster UPDATE_ROLLBACK_COMPLETE -
2021-01-15 16:03:39 UTC+0100 launchConfiguration DELETE_COMPLETE -
Thanks in advance for your help!
To answer your direct question: there is no feature that forces an ASG to remove only X instances at a time when the desired count drops.
If you haven't already, you should attach a lifecycle hook to the ASG that triggers a script telling ECS to drain the containers off the instance being terminated (I'm assuming from context that you're using ECS). In that scenario you would still need to lower the desired count manually, 1 at a time.
https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
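As a rough sketch, such a hook could look like the fragment below. The resource names (EcsAutoScalingGroup, DrainTopic, HookRole) are assumptions; the drain script itself (e.g. a Lambda subscribed to the SNS topic, as in the linked blog post) is not shown.

```yaml
# Hypothetical sketch: a termination lifecycle hook that holds the
# instance in Terminating:Wait so a drain script can set the ECS
# container instance to DRAINING before it is actually terminated.
DrainLifecycleHook:
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: !Ref EcsAutoScalingGroup
    LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    # Keep the instance alive for up to 15 minutes while draining
    HeartbeatTimeout: 900
    # If the timeout expires, proceed with the termination anyway
    DefaultResult: CONTINUE
    # SNS topic that triggers the drain script
    # (!Ref on an AWS::SNS::Topic returns its ARN)
    NotificationTargetARN: !Ref DrainTopic
    RoleARN: !GetAtt HookRole.Arn
```

The drain script then calls complete-lifecycle-action once the instance has zero running tasks, which releases the instance for termination.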
If instead you lower the desired value through CloudFormation, you can attach an UpdatePolicy to the group telling CFN to perform a RollingUpdate that replaces instances in batches of 1.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
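A minimal sketch of such a policy on the ASG resource, matching the behaviour already visible in your logs (batches of 1, PT5M signal timeout):

```yaml
EcsAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MaxBatchSize: 1              # replace at most 1 instance at a time
      MinInstancesInService: 1     # always keep at least 1 instance in service
      PauseTime: PT5M              # wait up to 5 minutes per batch
      WaitOnResourceSignals: true  # wait for cfn-signal from new instances
  Properties:
    # ... your existing ASG properties (MinSize, MaxSize,
    # DesiredCapacity, LaunchConfigurationName, etc.)
```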
If you're using ECS, it's usually a good idea to set up 2 target-tracking scaling policies: 1 on CPUReservation and 1 on MemoryReservation. If you want to force the ASG to move no more than 1 instance at a time, you could instead hand-build step scaling policies on those metrics, but creating the 4 CloudWatch alarms in CFN is painful.
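As a sketch, the two target-tracking policies could be declared like this. The target value of 75% and the EcsCluster/EcsAutoScalingGroup references are assumptions to adapt to your stack.

```yaml
CpuReservationScaling:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref EcsAutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      TargetValue: 75  # aim for ~75% reserved CPU across the cluster
      CustomizedMetricSpecification:
        Namespace: AWS/ECS
        MetricName: CPUReservation
        Statistic: Average
        Dimensions:
          - Name: ClusterName
            Value: !Ref EcsCluster

MemoryReservationScaling:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref EcsAutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      TargetValue: 75  # aim for ~75% reserved memory across the cluster
      CustomizedMetricSpecification:
        Namespace: AWS/ECS
        MetricName: MemoryReservation
        Statistic: Average
        Dimensions:
          - Name: ClusterName
            Value: !Ref EcsCluster
```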
Another option is to use a Capacity Provider in ECS, which enables scale-in protection on any instance that has tasks running.
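A sketch of that setup, with assumed resource names; note that managed termination protection also requires NewInstancesProtectedFromScaleIn: true on the ASG itself:

```yaml
EcsCapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Properties:
    AutoScalingGroupProvider:
      # !Ref on the ASG returns its name, which this property accepts
      AutoScalingGroupArn: !Ref EcsAutoScalingGroup
      # ECS sets scale-in protection on instances with running tasks
      ManagedTerminationProtection: ENABLED
      ManagedScaling:
        Status: ENABLED
        TargetCapacity: 100  # keep the ASG sized to fit running tasks
```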