How to autoscale servers in ECS?

Question

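I recently started using ECS. I was able to deploy a container image to ECR and create a task definition for my container with CPU/memory limits. In my use case each container is a long-running application (no web server, no port mapping needed). Containers will be spawned on demand one at a time, and deleted on demand one at a time.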

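I was able to create a cluster with N server instances, but I'd like the server instances to scale up/down automatically. For example, if there isn't enough CPU/memory in the cluster, I want a new instance to be created.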

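And if an instance is running with no containers on it, I want that specific instance to be scaled down/removed. This is to avoid auto scale-in terminating a server instance that still has running tasks on it.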

What steps are needed to achieve this?

Answer 1

Considering you have already created an ECS cluster, AWS provides instructions on Scaling cluster instances with CloudWatch Alarms.

Assuming you want to scale the cluster based on memory reservation, at a high level you would need to do the following:

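Create a Launch Configuration for your Auto Scaling group.
Create an Auto Scaling group, so that the size of the cluster can scale out and in.
Create a CloudWatch alarm to scale the cluster out if memory reservation is over 70%.
Create a CloudWatch alarm to scale the cluster in if memory reservation is under 30%.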

Since this is my specialty, I wrote up an example CloudFormation template that should get you started on most of this:

Parameters:
  MinInstances:
    Type: Number
  MaxInstances:
    Type: Number
  InstanceType:
    Type: String
    AllowedValues:
      - t2.nano
      - t2.micro
      - t2.small
      - t2.medium
      - t2.large
  VpcSubnetIds:
    Type: String

Mappings:
  EcsInstanceAmis:
    us-east-2:
      Ami: ami-1c002379
    us-east-1:
      Ami: ami-9eb4b1e5
    us-west-2:
      Ami: ami-1d668865
    us-west-1:
      Ami: ami-4a2c192a
    eu-west-2:
      Ami: ami-cb1101af
    eu-west-1:
      Ami: ami-8fcc32f6
    eu-central-1:
      Ami: ami-0460cb6b
    ap-northeast-1:
      Ami: ami-b743bed1
    ap-southeast-2:
      Ami: ami-c1a6bda2
    ap-southeast-1:
      Ami: ami-9d1f7efe
    ca-central-1:
      Ami: ami-b677c9d2

Resources:
  Cluster:
    Type: AWS::ECS::Cluster
  Role:
    Type: AWS::IAM::Role
    Properties:
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - ec2.amazonaws.com    
  InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref Role    
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !FindInMap [EcsInstanceAmis, !Ref "AWS::Region", Ami]
      InstanceType: !Ref InstanceType
      IamInstanceProfile: !Ref InstanceProfile
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          echo ECS_CLUSTER=${Cluster} >> /etc/ecs/ecs.config  
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: !Ref MinInstances
      MaxSize: !Ref MaxInstances
      LaunchConfigurationName: !Ref LaunchConfiguration
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      VPCZoneIdentifier: !Split [",", !Ref VpcSubnetIds]
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      # Cooldown is in seconds
      Cooldown: '1'
      ScalingAdjustment: '1'
  MemoryReservationAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      EvaluationPeriods: '2'
      Statistic: Average
      Threshold: '70'
      AlarmDescription: Alarm if Cluster Memory Reservation is too high
      Period: '60'
      AlarmActions:
      # Ref on a scaling policy returns its ARN
      - Ref: ScaleUpPolicy
      Namespace: AWS/ECS
      Dimensions:
      - Name: ClusterName
        Value: !Ref Cluster
      ComparisonOperator: GreaterThanThreshold
      MetricName: MemoryReservation
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      # Cooldown is in seconds
      Cooldown: '1'
      ScalingAdjustment: '-1'
  MemoryReservationAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      EvaluationPeriods: '2'
      Statistic: Average
      Threshold: '30'
      AlarmDescription: Alarm if Cluster Memory Reservation is too low
      Period: '60'
      AlarmActions:
      - Ref: ScaleDownPolicy
      Namespace: AWS/ECS
      Dimensions:
      - Name: ClusterName
        Value: !Ref Cluster
      ComparisonOperator: LessThanThreshold
      MetricName: MemoryReservation

This creates an ECS cluster, a launch configuration, an Auto Scaling group, and alarms based on the cluster's ECS memory reservation.
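AWS now recommends launch templates over launch configurations, and the AMI IDs in the EcsInstanceAmis mapping are hard-coded and will go stale. A sketch of one alternative, assuming the rest of the template stays as above: replace the LaunchConfiguration resource with an AWS::EC2::LaunchTemplate, point the AutoScalingGroup at it, and read the ECS-optimized Amazon Linux 2 AMI from the public SSM parameter that AWS maintains. The parameter name EcsAmiId and the resource name LaunchTemplate are illustrative choices.

```yaml
Parameters:
  # Hypothetical parameter name; the Default is AWS's public SSM parameter
  # holding the current recommended ECS-optimized Amazon Linux 2 AMI ID.
  EcsAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id

Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref EcsAmiId
        InstanceType: !Ref InstanceType
        # Launch templates reference the instance profile by ARN (or Name)
        IamInstanceProfile:
          Arn: !GetAtt InstanceProfile.Arn
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            echo ECS_CLUSTER=${Cluster} >> /etc/ecs/ecs.config
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: !Ref MinInstances
      MaxSize: !Ref MaxInstances
      # Replaces LaunchConfigurationName
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      VPCZoneIdentifier: !Split [",", !Ref VpcSubnetIds]
```

With this in place, the EcsInstanceAmis mapping and the LaunchConfiguration resource are no longer referenced and can be removed.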

Now we can get into the interesting discussion.

Why can't we scale based on both CPU utilization and memory reservation?

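The short answer is that you totally can, but you're likely to pay a lot for it. EC2 has a well-known property: when you create an instance, you pay for a minimum of one hour, because partial instance hours are billed as full hours. (That was true when this answer was written; since October 2017, Linux instances are billed per second with a 60-second minimum, but the churn described below is still undesirable.) Why this matters: imagine you have multiple alarms. Say you have a bunch of services that are currently idle, and they have filled the cluster. The CPU alarm wants to scale the cluster in, while the memory alarm wants to scale it out. One of them will likely scale the cluster to the point where its own alarm is no longer triggered. After the cooldown period, the other alarm undoes that last action, and after the next cooldown the action is likely redone. So instances are repeatedly created and then destroyed on every other cooldown.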

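After giving this a lot of thought, the strategy I came up with is to use Application Auto Scaling for ECS services based on CPU utilization, and to scale the cluster based on memory reservation. So if one service is running hot, an extra task is added to share the load. This slowly fills up the cluster's memory reservation capacity. When memory gets full, the cluster scales out. When a service cools down, it starts shutting down tasks, and as the memory reservation on the cluster drops, the cluster scales in.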
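The service side of this strategy could look roughly like the following, assuming an AWS::ECS::Service resource named Service (hypothetical, not part of the template above) running on Cluster. It uses a target-tracking policy on the predefined ECSServiceAverageCPUUtilization metric; that is one way to get CPU-based service scaling, not necessarily the exact policy type meant here. MinCapacity, MaxCapacity, TargetValue and the cooldowns are illustrative values.

```yaml
Resources:
  ServiceScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      # Format: service/<cluster-name>/<service-name>; Service is hypothetical
      ResourceId: !Sub service/${Cluster}/${Service.Name}
      MinCapacity: 1
      MaxCapacity: 10
      # RoleARN omitted: for ECS, Application Auto Scaling uses its
      # service-linked role (creating it if it does not exist yet)
  ServiceCpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: service-cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ServiceScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        # Keep the service's average CPU near 60%: add tasks above, remove below
        TargetValue: 60
        # Seconds; scale out quickly, scale in more cautiously
        ScaleOutCooldown: 60
        ScaleInCooldown: 300
```

Each task added this way reserves its task-definition memory, which is what pushes the cluster's MemoryReservation toward the 70% alarm in the template above.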

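The thresholds of the CloudWatch alarms may need some experimentation, depending on your task definitions. The reason is that if you set the scale-up threshold too high, the cluster may not scale out as memory gets consumed; then, when auto scaling goes to place another task, it finds that no instance in the cluster has enough free memory, so the task cannot be placed.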
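One way to get a starting point for that experimentation is a rough bound, under simplifying assumptions: all N container instances are the same type and each registers M MiB of memory with ECS, instance i has r_i MiB reserved, the largest task reserves m MiB, and CPU and other placement constraints are ignored. MemoryReservation is total reserved memory over total registered memory, so:

```latex
\begin{align*}
R &= \frac{\sum_{i=1}^{N} r_i}{N M}
  && \text{cluster MemoryReservation, as a fraction} \\
\frac{1}{N}\sum_{i=1}^{N} (M - r_i) &= M(1 - R)
  && \text{average free memory per instance} \\
\max_i \,(M - r_i) &\ge M(1 - R)
  && \text{the most free memory on one instance is at least the average} \\
R \le 1 - \frac{m}{M} &\;\Rightarrow\; \max_i \,(M - r_i) \ge m
  && \text{some instance can still fit the largest task}
\end{align*}
```

So while the alarm has not fired (R at or below the threshold T), a scale-up threshold of T ≤ 100(1 − m/M) percent guarantees, memory-wise, that the next task fits somewhere. For example, with hypothetical values M = 4096 MiB and m = 1024 MiB, T ≤ 75%; the template's 70% satisfies the bound whenever the largest task reserves at most 30% of an instance's registered memory. Because the alarm averages over two 60-second periods, leaving extra margin below the bound is prudent.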

Answer 2

At the re:Invent 2019 conference, AWS announced cluster auto scaling for Amazon ECS. Clusters configured with auto scaling can now add capacity when it is needed and remove capacity that is not. You can find more information about this in the Amazon ECS documentation on cluster auto scaling.
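A minimal sketch of wiring cluster auto scaling into the template from Answer 1, assuming its Cluster and AutoScalingGroup resources. A capacity provider with managed scaling lets ECS drive the Auto Scaling group's desired capacity from the tasks it needs to place, and managed termination protection addresses the question's concern about scale-in terminating instances that still run tasks. The resource names CapacityProvider and ClusterCapacityProviders and the TargetCapacity value are illustrative choices.

```yaml
Resources:
  # Same AutoScalingGroup as in Answer 1, plus NewInstancesProtectedFromScaleIn,
  # which ManagedTerminationProtection requires.
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: !Ref MinInstances
      MaxSize: !Ref MaxInstances
      LaunchConfigurationName: !Ref LaunchConfiguration
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      VPCZoneIdentifier: !Split [",", !Ref VpcSubnetIds]
      NewInstancesProtectedFromScaleIn: true
  CapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        # Accepts the Auto Scaling group name (returned by Ref) or its ARN
        AutoScalingGroupArn: !Ref AutoScalingGroup
        ManagedScaling:
          Status: ENABLED
          # Aim for instances to be fully used by tasks
          TargetCapacity: 100
        # ECS keeps scale-in protection on instances that still run tasks
        ManagedTerminationProtection: ENABLED
  ClusterCapacityProviders:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      Cluster: !Ref Cluster
      CapacityProviders:
        - !Ref CapacityProvider
      # Tasks started without a launch type or strategy use this provider
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref CapacityProvider
          Weight: 1
```

With this in place you would most likely drop the ScaleUpPolicy/ScaleDownPolicy and both MemoryReservation alarms, since managed scaling creates its own target-tracking policy on the Auto Scaling group. Tasks must also be started through a capacity provider strategy (here, the cluster default) rather than with launch type EC2; ECS then scales out while unplaceable tasks wait in the PROVISIONING state.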

However, depending on what you're trying to run, AWS Fargate could be a better option. Fargate lets you run containers without provisioning and managing the underlying infrastructure; that is, you don't have to deal with any EC2 instances. With Fargate, you make an API call to run your container, the container runs, and there's nothing to clean up once it stops. Fargate is billed per second (with a 1-minute minimum) and priced based on the amount of CPU and memory allocated (see the AWS Fargate pricing page for details).
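If Fargate fits, a minimal sketch of a task definition for one of the question's long-running containers (no port mappings). WorkerImageUri, TaskExecutionRole, WorkerTaskDefinition, the container name worker and the 256 CPU units / 512 MiB sizing are hypothetical; the execution role with the AWS managed AmazonECSTaskExecutionRolePolicy is what lets Fargate pull the image from ECR.

```yaml
Parameters:
  # Hypothetical: the ECR image URI of the long-running worker
  WorkerImageUri:
    Type: String

Resources:
  TaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
  WorkerTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - FARGATE
      # Fargate tasks must use awsvpc networking
      NetworkMode: awsvpc
      # Task-level sizing is required on Fargate; 256/512 is a valid pair
      Cpu: "256"
      Memory: "512"
      ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn
      ContainerDefinitions:
        - Name: worker
          Image: !Ref WorkerImageUri
          Essential: true
```

Each container would then be started on demand with a RunTask call using launch type FARGATE and an awsvpc network configuration (subnets, plus a public IP or a NAT route so the task can reach ECR), and removed with StopTask; there are no instances to scale in.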

amazon-ec2

amazon-web-services

amazon-ecs

autoscaling

amazon-ecr