AWS CloudWatch Metric Alarm not triggering after first time

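My alarm, which looks for error messages in the logs, does fire and enter the alarm state. But it never resets; it stays in the In Alarm state. The alarm action is an SNS topic, which in turn triggers an email. So after the first error I do not see any subsequent emails. What is wrong with the template configuration below?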

"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
    "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}

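Your problem is a combination of two factors:

  1. Your metric is emitted only when an error is found. It is a sparse metric: a 1 is recorded when an error occurs, but no 0 is emitted when there are no errors.
  2. By default, CloudWatch alarms are configured with TreatMissingData set to missing.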

The CloudWatch documentation about missing data says:

For each alarm, you can specify CloudWatch to treat missing data points as any of the following:

  • notBreaching – Missing data points are treated as "good" and within the threshold,
  • breaching – Missing data points are treated as "bad" and breaching the threshold
  • ignore – The current alarm state is maintained
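  • missing – The alarm doesn't consider missing data points when evaluating whether to change state

Adding the "TreatMissingData": "notBreaching" parameter to your alarm configuration will cause CloudWatch to treat missing data points as not breaching and transition the alarm to OK. Because alarm actions run only on state changes, the next error can then move the alarm from OK back to ALARM and send another email: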

"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "TreatMissingData": "notBreaching",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}
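
The first factor (the sparse metric) can also be addressed on the metric filter itself. The fragment below is a minimal sketch, not a tested replacement: it reuses the AppErrorMetric resource from the question and adds the optional DefaultValue property of the metric transformation, assuming the filter defines no dimensions. As far as I know, DefaultValue is reported only for periods in which log events are ingested but none match the filter, so a completely silent log group would still produce missing data points and TreatMissingData would still matter.

```json
"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "DefaultValue": 0,
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
}
```

The two approaches can be combined: DefaultValue fills in zeros while the application is logging, and "TreatMissingData": "notBreaching" covers periods with no log traffic at all.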