AWS CloudWatch Metric Alarm not triggering after first time
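I have an alarm that looks for error messages in the logs, and it does go into the alarm state. But it never resets and stays in the In Alarm state. The alarm action is an SNS topic, which in turn triggers an email. So basically, after the first error I don't see any subsequent emails. What is wrong with the following template config?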
"AppErrorMetric": {
"Type": "AWS::Logs::MetricFilter",
"Properties": {
"LogGroupName": {
"Ref": "AppServerLG"
},
"FilterPattern": "[error]",
"MetricTransformations": [
{
"MetricValue": "1",
"MetricNamespace": {
"Fn::Join": [
"",
[
{
"Ref": "ApplicationEndpoint"
},
"/metrics/AppError"
]
]
},
"MetricName": "AppError"
}
]
}
},
"AppErrorAlarm": {
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
"ActionsEnabled": "true",
"AlarmName": {
"Fn::Join": [
"",
[
{
"Ref": "AppId"
},
",",
{
"Ref": "AppServerAG"
},
":",
"AppError",
",",
"MINOR"
]
]
},
"AlarmDescription": {
"Fn::Join": [
"",
[
"service is throwing error. Please check logs.",
{
"Ref": "AppServerAG"
},
"-",
{
"Ref": "AppId"
}
]
]
},
"MetricName": "AppError",
"Namespace": {
"Fn::Join": [
"",
[
{
"Ref": "ApplicationEndpoint"
},
"metrics/AppError"
]
]
},
"Statistic": "Sum",
"Period": "300",
"EvaluationPeriods": "1",
"Threshold": "1",
"AlarmActions": [{
"Fn::GetAtt": [
"VPCInfo",
"SNSTopic"
]
}],
"ComparisonOperator": "GreaterThanOrEqualToThreshold"
}
}
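Your problem is a combination of two factors:
- Your metric is only emitted when an error is found. It is a sparse metric: a 1 is recorded when an error occurs, but no 0 is emitted when there are no errors.
- By default, a CloudWatch alarm is configured with TreatMissingData set to missing.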
The CloudWatch documentation about missing data says:
For each alarm, you can specify CloudWatch to treat missing data
points as any of the following:
- notBreaching – Missing data points are treated as "good" and within the threshold,
- breaching – Missing data points are treated as "bad" and breaching the threshold
- ignore – The current alarm state is maintained
- missing – The alarm doesn't consider missing data points when evaluating whether to change state
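To make the interaction between a sparse metric and these settings concrete, here is a minimal Python sketch. It is a simplified model, not CloudWatch's actual evaluation algorithm: it assumes EvaluationPeriods of 1, represents a period with no data as None, and treats both ignore and missing as "keep the current state" (the real service also looks at a longer evaluation range and can move to INSUFFICIENT_DATA, which is not modeled here). The function names simulate_alarm and count_alarm_notifications are hypothetical helpers made up for this illustration.

```python
from typing import List, Optional


def simulate_alarm(datapoints: List[Optional[float]],
                   treat_missing: str,
                   threshold: float = 1.0) -> List[str]:
    """Return the modeled alarm state after each period.

    Simplified model (an assumption, not the real CloudWatch logic):
    one evaluation period, GreaterThanOrEqualToThreshold comparison.
    """
    state = "INSUFFICIENT_DATA"  # state before any data has arrived
    history = []
    for value in datapoints:
        if value is not None:
            # A real datapoint is compared against the threshold.
            state = "ALARM" if value >= threshold else "OK"
        elif treat_missing == "notBreaching":
            state = "OK"      # missing data counts as "good"
        elif treat_missing == "breaching":
            state = "ALARM"   # missing data counts as "bad"
        elif treat_missing in ("ignore", "missing"):
            pass              # modeled as: keep the current state
        else:
            raise ValueError(f"unknown treat_missing value: {treat_missing}")
        history.append(state)
    return history


def count_alarm_notifications(history: List[str]) -> int:
    """Count transitions into ALARM; alarm actions run on state changes."""
    count = 0
    previous = "INSUFFICIENT_DATA"
    for state in history:
        if state == "ALARM" and previous != "ALARM":
            count += 1
        previous = state
    return count


# Sparse metric: a 1 when an error is logged, nothing at all otherwise.
sparse = [1, None, None, 1, None]
for mode in ("missing", "notBreaching"):
    states = simulate_alarm(sparse, mode)
    print(mode, states, count_alarm_notifications(states))
```

In this model, the missing setting leaves the alarm in ALARM after the first error, so the second error causes no state change and therefore no second notification. With notBreaching the alarm drops back to OK in the quiet periods, and the second error produces a fresh transition into ALARM.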
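Adding the "TreatMissingData": "notBreaching" parameter to your alarm configuration will cause CloudWatch to treat missing data points as not breaching and transition the alarm back to OK: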
"AppErrorAlarm": {
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
"ActionsEnabled": "true",
"AlarmName": {
"Fn::Join": [
"",
[
{
"Ref": "AppId"
},
",",
{
"Ref": "AppServerAG"
},
":",
"AppError",
",",
"MINOR"
]
]
},
"AlarmDescription": {
"Fn::Join": [
"",
[
"service is throwing error. Please check logs.",
{
"Ref": "AppServerAG"
},
"-",
{
"Ref": "AppId"
}
]
]
},
"MetricName": "AppError",
"Namespace": {
"Fn::Join": [
"",
[
{
"Ref": "ApplicationEndpoint"
},
"metrics/AppError"
]
]
},
"Statistic": "Sum",
"Period": "300",
"EvaluationPeriods": "1",
"Threshold": "1",
"TreatMissingData": "notBreaching",
"AlarmActions": [{
"Fn::GetAtt": [
"VPCInfo",
"SNSTopic"
]
}],
"ComparisonOperator": "GreaterThanOrEqualToThreshold"
}
}