ECS Fargate autoscaling more rapidly?
I'm load testing my autoscaling AWS ECS Fargate stack, which consists of:
- an Application Load Balancer (ALB) with a target group pointing to ECS,
- an ECS cluster, service, task, ApplicationAutoScaling::ScalableTarget and ApplicationAutoScaling::ScalingPolicy,
- the Application Auto Scaling policy defines a target tracking policy:
- Type: TargetTrackingScaling,
- PredefinedMetricType: ALBRequestCountPerTarget,
- threshold = 1000 requests,
- the alarm fires when 1 datapoint breaches the threshold over the last 1-minute evaluation period.
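The setup above can be sketched in CloudFormation roughly as follows. This is a minimal sketch, not the actual template from the question: the Min/MaxCapacity values and the ResourceLabel are placeholders, and only the cluster, service and policy names are taken from the logs below.

```yaml
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: service/my-ecs-cluster/my-service
    MinCapacity: 1        # placeholder values
    MaxCapacity: 10

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: alb-requests-per-target-per-minute
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 1000   # requests per target
      PredefinedMetricSpecification:
        PredefinedMetricType: ALBRequestCountPerTarget
        # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>
        ResourceLabel: app/my-alb/0123456789abcdef/targetgroup/my-tg/abcdef0123456789
```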
Everything works. The alarm does go off and I can see the scaling actions happen. But detecting the "threshold breach" feels slow. Here is the timing of my load test and the AWS events (compiled from the JMeter logs and various places in the AWS console):
10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), the load test reaches its 1000th request to the ALB. At this point in time, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in the alarm graph on the CloudWatch alarm detail page, BUT the alarm is still in OK state!
NOTE: notice the 1m34s delay in detecting the spike; if it gets a datapoint every 60 seconds, detection should take at MOST 60 seconds: my load test blasts out 1000 requests every 4 seconds!!
10:46:50 the alarm finally goes from OK to ALARM state
NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 4 seconds!
NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
10:46:53 Action Successfully executed action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
NOTE: starting the tasks took 33 seconds, which is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well.
NOTE: 3m09s between the time the load started pounding the server and the time the first new ECS tasks were up
NOTE: most of this time (3m09s) is spent waiting for the alarm to go off (2m20s)!! The rest is normal: waiting for the new tasks to start.
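For reference, the gaps called out in the notes above can be recomputed directly from the log timestamps with plain datetime arithmetic (no AWS assumptions involved):

```python
from datetime import datetime, timedelta

def gap(start: str, end: str) -> timedelta:
    """Difference between two HH:MM:SS timestamps from the logs above."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

print(gap("10:44:36", "10:46:50"))  # 0:02:14  threshold breach -> ALARM state
print(gap("10:47:08", "10:47:41"))  # 0:00:33  tasks started -> first targets registered
print(gap("10:44:32", "10:47:41"))  # 0:03:09  load start -> new targets registered
```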
QUESTION 1: Is there a way to make the alarm go off faster once the threshold is breached, and/or to make the scale-out happen faster? To me, that's 1m20s too long. It should really scale out in about 1m30s max (1m max to detect the alarm high state + 30s to start the tasks)...
NOTE: my CloudFormation stack is documented in another question I opened today: Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action
There's nothing you can do about it: ALB sends its metrics to CloudWatch in 1-minute intervals. Also, these metrics are not real-time anyway, so delays are expected, even up to a few minutes, as explained by AWS support and reported in the comments:
Some delay in metrics is expected, which is inherent for any monitoring systems- as they depend on several variables such as delay with the service publishing the metric, propagation delays and ingestion delay within CloudWatch to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.
You either have to overprovision your ECS service so it can sustain the increased load while the alarm fires and the scale-out executes, or lower your threshold.
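To see how lowering the threshold interacts with capacity: target tracking roughly sets the desired count in proportion to how far the metric is above the target. This is a simplified model of the documented behavior, and the request count of 5,100 below is a made-up illustration chosen to be consistent with the "set desired count to 6" event in the question:

```python
import math

def desired_capacity(current: int, metric: float, target: float) -> int:
    # Target tracking keeps metric ~= target by scaling capacity
    # proportionally: desired = ceil(current * metric / target).
    return math.ceil(current * metric / target)

# Hypothetical: 1 running task, ~5,100 requests per target in the
# evaluated minute, target of 1000 requests per target.
print(desired_capacity(1, 5100, 1000))  # 6
# Lowering the target (threshold) scales out earlier and harder:
print(desired_capacity(1, 5100, 500))   # 11
```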
Alternatively, you can publish your own custom metrics, e.g. from your application. These metrics can have a resolution as fine as 1 second. Your application could also trigger the alarm "manually". This would let you reduce the delays you are observing.
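A minimal sketch of what such an application-published metric could look like with boto3. All names here (namespace, metric name, dimension) are hypothetical; the key part is StorageResolution=1, which marks the datapoint as high-resolution so a sub-minute-period alarm can evaluate it:

```python
from datetime import datetime, timezone

def request_count_datapoint(count: int) -> dict:
    """Build one PutMetricData entry for a high-resolution custom metric."""
    return {
        "MetricName": "RequestCount",  # hypothetical name
        "Dimensions": [{"Name": "Service", "Value": "my-service"}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(count),
        "Unit": "Count",
        "StorageResolution": 1,  # 1 second; the default is 60 (standard resolution)
    }

# With boto3 available, the app would periodically flush counts, e.g.:
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MyApp", MetricData=[request_count_datapoint(1234)])
```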