CloudFormation AutoScalingGroup not waiting for signal on update/scale-up

Question

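I'm working on a CloudFormation template that brings up as many instances as I request. I want the stack creation/update to wait for them to finish initialising (via user data) before it is considered complete.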

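Expectation

Creating or updating the stack should wait for signals from all newly created instances, so that their initialisation is known to be complete.

If any of the created instances fails to initialise, I don't want the stack creation or update to be considered successful.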

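Reality

CloudFormation only seems to wait for signals from instances when the stack is first created. Updating the stack to increase the number of instances seems to disregard signalling: the update finishes successfully very quickly, while the instances are still initialising.

Instances created as a result of a stack update can fail to initialise, yet the update has already been considered a success.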

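Question

Using CloudFormation, how can I make reality meet my expectation?

I want the same behaviour to apply both when the stack is created and when it is updated.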

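Similar questions

The only question I have found that matches my problem is: UpdatePolicy in Autoscaling group not working correctly for AWS CloudFormation update

It has been open for a year without receiving an answer.

I'm creating another question because I have more information to add, and I'm not sure these details match those of the author of that question.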

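Reproducing

To demonstrate the problem, I have created a template based on the example beneath the Auto Scaling Group header on this AWS documentation page, which includes signalling.

The template has been modified as follows:

It uses an Ubuntu AMI (in region ap-northeast-1). Given this change, the cfn-signal command is bootstrapped and called as necessary.
A new parameter dictates how many instances to launch in the Auto Scaling group.
A 2-minute sleep has been added before signalling, to simulate time spent on initialisation.

Here's the template, saved to template.yml: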

Parameters:
  DesiredCapacity:
    Type: Number
    Description: How many instances would you like in the Auto Scaling Group?

Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref LaunchConfig
      MinSize: !Ref DesiredCapacity
      MaxSize: !Ref DesiredCapacity
    CreationPolicy:
      ResourceSignal:
        Count: !Ref DesiredCapacity
        Timeout: PT5M
    UpdatePolicy:
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: true
      AutoScalingRollingUpdate:
        MinInstancesInService: 1
        MaxBatchSize: 2
        PauseTime: PT5M
        WaitOnResourceSignals: true

  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-b7d829d6
      InstanceType: t2.micro
      UserData:
        'Fn::Base64':
          !Sub |
            #!/bin/bash -xe
            sleep 120

            apt-get -y install python-setuptools
            TMP=`mktemp -d`
            curl https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz | \
              tar xz -C $TMP --strip-components 1
            easy_install $TMP

            /usr/local/bin/cfn-signal -e $? \
              --stack ${AWS::StackName} \
              --resource AutoScalingGroup \
              --region ${AWS::Region}

Now I create the stack with a single instance, via:

$ aws cloudformation create-stack \
  --region=ap-northeast-1 \
  --stack-name=asg-test \
  --template-body=file://template.yml \
  --parameters ParameterKey=DesiredCapacity,ParameterValue=1

After waiting a few minutes for creation to complete, let's look at some key stack events:

$ aws cloudformation describe-stack-events \
  --region=ap-northeast-1 \
  --stack-name=asg-test

    ...
    {
        "Timestamp": "2017-02-03T05:36:45.445Z",
        ...
        "LogicalResourceId": "AutoScalingGroup",
        ...
        "ResourceStatus": "CREATE_COMPLETE",
        ...
    },
    {
        "Timestamp": "2017-02-03T05:36:42.487Z",
        ...
        "LogicalResourceId": "AutoScalingGroup",
        ...
        "ResourceStatusReason": "Received SUCCESS signal with UniqueId ...",
        "ResourceStatus": "CREATE_IN_PROGRESS"
    },
    {
        "Timestamp": "2017-02-03T05:33:33.274Z",
        ...
        "LogicalResourceId": "AutoScalingGroup",
        ...
        "ResourceStatusReason": "Resource creation Initiated",
        "ResourceStatus": "CREATE_IN_PROGRESS",
        ...
    }
    ...

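You can see that the Auto Scaling group started being created at 05:33:33. At 05:36:42 (about 3 minutes later), it received a success signal. Shortly afterwards, at 05:36:45, the Auto Scaling group reached its own success status.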

Awesome, it works like a charm.

Now let's try increasing the number of instances in this Auto Scaling group to 2 by updating the stack:

$ aws cloudformation update-stack \
  --region=ap-northeast-1 \
  --stack-name=asg-test \
  --template-body=file://template.yml \
  --parameters ParameterKey=DesiredCapacity,ParameterValue=2

After a much shorter wait for the update to complete, let's look at some new stack events:

$ aws cloudformation describe-stack-events \
  --region=ap-northeast-1 \
  --stack-name=asg-test

    {
        "ResourceStatus": "UPDATE_COMPLETE",
        ...
        "ResourceType": "AWS::CloudFormation::Stack",
        ...
        "Timestamp": "2017-02-03T05:45:47.063Z"
    },
    ...
    {
        "ResourceStatus": "UPDATE_COMPLETE",
        ...
        "LogicalResourceId": "AutoScalingGroup",
        "Timestamp": "2017-02-03T05:45:43.047Z"
    },
    {
        "ResourceStatus": "UPDATE_IN_PROGRESS",
        ...
        "LogicalResourceId": "AutoScalingGroup",
        "Timestamp": "2017-02-03T05:44:20.845Z"
    },
    {
        "ResourceStatus": "UPDATE_IN_PROGRESS",
        ...
        "ResourceType": "AWS::CloudFormation::Stack",
        ...
        "Timestamp": "2017-02-03T05:44:15.671Z",
        "ResourceStatusReason": "User Initiated"
    },
    ...

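Now you can see that while the Auto Scaling group started updating at 05:44:20, it completed at 05:45:43, less than a minute and a half (about 82 seconds) later. That's impossible given the 120-second sleep in the user data.

The stack update then went on to complete without the Auto Scaling group ever receiving a signal.

The new instance does exist.

In my real use case, I SSHed into one of these new instances and found it was still initialising even after the stack update had completed.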

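What I've tried

I've read and re-read the documentation on CreationPolicy and UpdatePolicy, but have failed to identify what I'm missing.

Looking at the update policy in use above, I don't understand what it's actually doing. Why is WaitOnResourceSignals true, yet it doesn't wait? Does it serve some other purpose?

Or are these new instances not covered by the "rolling update" policy? If they don't belong there, I'd expect them to fall under the creation policy, but that doesn't seem to apply either.

As such, I really don't know what else to try.

I have a sneaking feeling that it's functioning as designed/expected, but if so, what is the point of the WaitOnResourceSignals property, and how can I meet the expectation set out above?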

Answer 1

Rolling updates only apply to existing instances. The documentation says:

Rolling updates enable you to specify whether AWS CloudFormation updates instances that are in an Auto Scaling group in batches or all at once.

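So, to test this, create a stack from your template. Then make a small modification to the launch configuration (for example, change sleep 120 to sleep 121) and update the stack. Now you should see a rolling update.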
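A minimal sketch of the kind of change meant here, assuming the template from the question: only the sleep duration in the LaunchConfig user data differs. Because a change to a LaunchConfiguration's UserData replaces the LaunchConfiguration, the group's LaunchConfigurationName changes, and that is one of the triggers for the AutoScalingRollingUpdate policy (and its WaitOnResourceSignals setting).

```yaml
  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-b7d829d6
      InstanceType: t2.micro
      UserData:
        'Fn::Base64':
          !Sub |
            #!/bin/bash -xe
            # Only change from the question's template: 120 -> 121.
            # New UserData replaces the LaunchConfiguration, which is what
            # makes CloudFormation perform a rolling update of the group.
            sleep 121

            # ... rest of the user data unchanged from the question ...
```

Running the same update-stack command as in the question, keeping DesiredCapacity unchanged, and then comparing the describe-stack-events timestamps should show whether CloudFormation now waits for the signals.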

Answer 2

The AutoScalingRollingUpdate policy handles rotating out an entire set of instances in an Auto Scaling group in response to changes to the underlying LaunchConfiguration. It doesn't apply to individual changes to the number of instances in the existing group. According to the UpdatePolicy Attribute documentation,

The AutoScalingReplacingUpdate and AutoScalingRollingUpdate policies apply only when you do one or more of the following:

Change the Auto Scaling group's AWS::AutoScaling::LaunchConfiguration.

Change the Auto Scaling group's VPCZoneIdentifier property

Update an Auto Scaling group that contains instances that don't match the current LaunchConfiguration.

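Changing the size of the Auto Scaling group (its DesiredCapacity property, or, as in this template, MinSize and MaxSize) is not in this list, so the AutoScalingRollingUpdate policy does not apply to this type of change.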
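One hypothetical way to bring a capacity change under the rolling-update policy anyway is to make the DesiredCapacity parameter part of the launch configuration, for example by echoing it into the user data. A different value then produces different UserData, which replaces the LaunchConfiguration and should trigger an AutoScalingRollingUpdate. This is only a sketch against the question's template: it also replaces every existing instance, and whether the instance added by the size change is itself waited on is something to confirm in describe-stack-events rather than assume.

```yaml
  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-b7d829d6
      InstanceType: t2.micro
      UserData:
        'Fn::Base64':
          !Sub |
            #!/bin/bash -xe
            # Hypothetical marker line: embedding the parameter means a new
            # DesiredCapacity value yields new UserData, which replaces this
            # LaunchConfiguration and so triggers AutoScalingRollingUpdate.
            echo "desired capacity: ${DesiredCapacity}"
            sleep 120

            # ... rest of the user data unchanged from the question ...
```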

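As far as I know, it's not possible (using standard AWS CloudFormation resources) to delay the completion of a stack update that modifies DesiredCapacity until any new instances added to the Auto Scaling group are fully provisioned.

Here are some alternatives: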

1. Instead of modifying only DesiredCapacity, modify a LaunchConfiguration property at the same time. This will trigger an AutoScalingRollingUpdate to the desired capacity (the downside being that it will also update existing instances, which may not actually need to be modified).
2. Add an AWS::AutoScaling::LifecycleHook resource to your Auto Scaling group, and call aws autoscaling complete-lifecycle-action in addition to cfn-signal, to signal lifecycle-hook completion. This won't delay your CloudFormation stack update as desired, but it will delay the individual auto-scaled instances from entering the InService state until the lifecycle signal is received. (See the Lifecycle Hooks documentation for more information.)
3. As an extension of #2, it should be possible to add a lifecycle hook to your Auto Scaling group, along with a Custom Resource that polls your Auto Scaling group and only completes once the group contains DesiredCapacity instances in the InService state.
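A sketch of alternative #2 applied to the question's template, under several assumptions: the group gets a fixed, hypothetical name (so the user data can refer to it without a circular dependency between LaunchConfig and AutoScalingGroup), the hook name wait-for-init and the IAM resources are illustrative, the AWS CLI is installed from the Ubuntu awscli package, and the instance metadata service is reachable at its usual address. As noted above, this holds new instances in Pending:Wait rather than delaying the stack update itself. Because the template creates IAM resources, create-stack/update-stack would need --capabilities CAPABILITY_IAM.

```yaml
Parameters:
  DesiredCapacity:
    Type: Number
    Description: How many instances would you like in the Auto Scaling Group?

Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # Hypothetical fixed name, so the user data can refer to the group
      # without a circular !Ref between LaunchConfig and AutoScalingGroup.
      AutoScalingGroupName: !Sub '${AWS::StackName}-asg'
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref LaunchConfig
      MinSize: !Ref DesiredCapacity
      MaxSize: !Ref DesiredCapacity
    # CreationPolicy and UpdatePolicy as in the question (omitted here)

  # Holds each newly launched instance in Pending:Wait until it calls
  # complete-lifecycle-action, or until HeartbeatTimeout (seconds) expires.
  LaunchHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      LifecycleHookName: wait-for-init   # hypothetical name
      LifecycleTransition: autoscaling:EC2_INSTANCE_LAUNCHING
      HeartbeatTimeout: 600
      DefaultResult: ABANDON             # terminate the instance on timeout

  # Assumed IAM setup that lets instances complete the lifecycle action.
  InstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: { Service: ec2.amazonaws.com }
            Action: sts:AssumeRole
      Policies:
        - PolicyName: complete-lifecycle-action
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: autoscaling:CompleteLifecycleAction
                Resource: '*'

  InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles: [!Ref InstanceRole]

  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-b7d829d6
      InstanceType: t2.micro
      IamInstanceProfile: !Ref InstanceProfile
      UserData:
        'Fn::Base64':
          !Sub |
            #!/bin/bash -xe
            sleep 120

            apt-get -y update
            apt-get -y install python-setuptools awscli
            TMP=`mktemp -d`
            curl https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz | \
              tar xz -C $TMP --strip-components 1
            easy_install $TMP

            # Release this instance from Pending:Wait. "|| true" because during
            # stack creation the hook does not exist yet (it depends on the group).
            INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
            aws autoscaling complete-lifecycle-action \
              --region ${AWS::Region} \
              --auto-scaling-group-name ${AWS::StackName}-asg \
              --lifecycle-hook-name wait-for-init \
              --instance-id "$INSTANCE_ID" \
              --lifecycle-action-result CONTINUE || true

            /usr/local/bin/cfn-signal -e $? \
              --stack ${AWS::StackName} \
              --resource AutoScalingGroup \
              --region ${AWS::Region}
```

Calling complete-lifecycle-action before cfn-signal is a deliberate choice in this sketch: during a scale-up update, cfn-signal may be rejected because no resource is waiting for signals, and with bash -e that failure would otherwise stop the script before the lifecycle action is completed.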
