AWS CloudWatch metric alarm does not trigger after the first time

Question

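My alarm, which looks for error messages in the logs, does go into the alarm state. But it never resets and stays in the In Alarm state. The alarm action is an SNS topic, which in turn sends an email. So after the first error I don't see any subsequent emails. What is wrong with the following template configuration?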

"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}

Answer 1

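Your problem is a combination of two factors:

Your metric is only emitted when an error is found. It is a sparse metric: a 1 is emitted when an error occurs, but no 0 is emitted when there is no error.
By default, CloudWatch alarms are configured with TreatMissingData set to missing.

The CloudWatch documentation about missing data says: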

For each alarm, you can specify CloudWatch to treat missing data points as any of the following:

notBreaching – Missing data points are treated as "good" and within the threshold

breaching – Missing data points are treated as "bad" and breaching the threshold

ignore – The current alarm state is maintained

missing – The alarm doesn't consider missing data points when evaluating whether to change state

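Adding the "TreatMissingData": "notBreaching" parameter to your alarm configuration causes CloudWatch to treat missing data points as not breaching and transition the alarm back to OK: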

"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "TreatMissingData": "notBreaching",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}
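
A complementary angle on the first factor (the sparse metric) is to make the metric filter itself report a value when nothing matches. The fragment below is only a sketch, reusing the question's AppErrorMetric resource and adding the DefaultValue property of the metric transformation. As far as I understand it, the default is reported only for periods in which log events were ingested but none matched the pattern, so a completely silent log group would still produce missing data points. It is therefore probably best seen as something to combine with TreatMissingData rather than a replacement for it.

```json
"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "DefaultValue": 0,
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
}
```

With this in place, periods that contain log traffic but no errors should report a Sum of 0, which is below the alarm's threshold of 1.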

amazon-web-services

amazon-cloudformation

amazon-cloudwatch

amazon-cloudwatchlogs