CWAgent metric alarms created using Terraform don't collect data points (stuck in INSUFFICIENT_DATA)
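I created a CloudWatch memory-utilization alarm with Terraform, but the alarm never moves to the OK state (it stays in INSUFFICIENT_DATA). However, when I create the same alarm manually from the AWS Management Console with exactly the same configuration, it moves to OK and I can see data points.
I have successfully installed the CloudWatch agent on the EC2 instance the alarm targets, and I can see the metrics in the CloudWatch Metrics section.
My Terraform code: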
resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name          = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "mem_used_percent"
  namespace           = "CWAgent"
  period              = "300"
  statistic           = "Average"
  threshold           = "${var.alarms_memory_threshold}"
  alarm_description   = "This metric monitors ec2 memory utilization"
  alarm_actions       = [ "${aws_sns_topic.sns_topic.arn}" ]
  dimensions = {
    InstanceId = "${var.instance_id}"
    ImageId    = "${var.ami_id}"
  }
  tags = {
    Environment = "${var.env}"
    Project     = "${var.project}"
    Provisioner = "cloudwatch"
    Name        = "${local.name}.memory"
  }
}
AWS CLI output describing the alarm created with Terraform:
aws cloudwatch describe-alarms --alarm-names memory-utilization-alarm-dev
{
"MetricAlarms": [
{
"EvaluationPeriods": 1,
"TreatMissingData": "missing",
"AlarmArn": "arn:aws:cloudwatch:us-west-2:289914521333:alarm:memory-utilization-alarm-dev",
"StateUpdatedTimestamp": "2019-07-12T08:45:07.020Z",
"AlarmConfigurationUpdatedTimestamp": "2019-07-12T08:45:07.020Z",
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"AlarmActions": [
"arn:aws:sns:us-west-2:289914521333:sns-topic"
],
"AlarmDescription": "This metric monitors ec2 memory utilization",
"Namespace": "CWAgent",
"Period": 300,
"StateValue": "INSUFFICIENT_DATA",
"Threshold": 80.0,
"AlarmName": "memory-utilization-alarm-dev",
"Dimensions": [
{
"Name": "InstanceId",
"Value": "i-03417f2d90d3dc6ca"
},
{
"Name": "ImageId",
"Value": "ami-09d1383e2a5ae8a93"
}
],
"Statistic": "Average",
"StateReason": "Unchecked: Initial alarm creation",
"InsufficientDataActions": [],
"OKActions": [],
"ActionsEnabled": true,
"MetricName": "mem_used_percent"
}
]
}
AWS CLI output describing the alarm created from the AWS console:
aws cloudwatch describe-alarms --alarm-names memory-utilization-alarm
{
"MetricAlarms": [
{
"Dimensions": [
{
"Name": "InstanceId",
"Value": "i-03417f2d90d3dc6ca"
},
{
"Name": "ImageId",
"Value": "ami-09d1383e2a5ae8a93"
},
{
"Name": "InstanceType",
"Value": "t3.large"
}
],
"Namespace": "CWAgent",
"DatapointsToAlarm": 1,
"ActionsEnabled": true,
"MetricName": "mem_used_percent",
"EvaluationPeriods": 1,
"StateValue": "OK",
"StateUpdatedTimestamp": "2019-07-12T09:49:28.749Z",
"AlarmConfigurationUpdatedTimestamp": "2019-07-12T09:47:55.914Z",
"AlarmActions": [
"arn:aws:sns:us-west-2:289914521333:sns-topic"
],
"InsufficientDataActions": [],
"AlarmArn": "arn:aws:cloudwatch:us-west-2:289914521333:alarm:memory-utilization-alarm",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-07-12T09:49:28.746+0000\",\"startDate\":\"2019-07-12T09:44:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[61.253520518958474],\"threshold\":80.0}",
"Threshold": 80.0,
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [61.253520518958474 (12/07/19 09:44:00)] was not greater than or equal to the threshold (80.0) (minimum 1 datapoint for ALARM -> OK transition).",
"OKActions": [],
"AlarmDescription": "memory-utilization-alarm",
"Period": 300,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"AlarmName": "memory-utilization-alarm",
"Statistic": "Average",
"TreatMissingData": "missing"
}
]
}
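The CloudWatch agent's mem_used_percent metric has three dimensions: InstanceId, ImageId, and InstanceType. CloudWatch treats every unique combination of dimensions as a separate metric, so an alarm that specifies only InstanceId and ImageId watches a metric that never receives data. The dimensions for each metric are not currently listed in the AWS user guide, but you can find them with the following AWS CLI command: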
$ aws cloudwatch list-metrics --namespace CWAgent --metric-name mem_used_percent --query 'Metrics[0].Dimensions[].Name'
[
"InstanceId",
"ImageId",
"InstanceType"
]
To fix your alarm, change the alarm definition to include the InstanceType dimension:
resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name          = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "mem_used_percent"
  namespace           = "CWAgent"
  period              = "300"
  statistic           = "Average"
  threshold           = "${var.alarms_memory_threshold}"
  alarm_description   = "This metric monitors ec2 memory utilization"
  alarm_actions       = [ "${aws_sns_topic.sns_topic.arn}" ]
  dimensions = {
    InstanceId   = "${var.instance_id}"
    ImageId      = "${var.ami_id}"
    InstanceType = "${var.instance_type}"
  }
  tags = {
    Environment = "${var.env}"
    Project     = "${var.project}"
    Provisioner = "cloudwatch"
    Name        = "${local.name}.memory"
  }
}
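If you would rather not pass the AMI ID and instance type in by hand, one option is to look them up from the instance itself. The following is only a sketch, not something taken from the question: it assumes Terraform 0.12 or later syntax, the data source name "target" is made up for illustration, and var.instance_id, var.env, var.alarms_memory_threshold and aws_sns_topic.sns_topic are assumed to be declared as in the question. It also assumes the agent publishes exactly the three dimensions returned by the list-metrics command above.

```hcl
# Hypothetical data source name "target"; looks up the instance the alarm watches.
data "aws_instance" "target" {
  instance_id = var.instance_id
}

# Variant of the alarm above with all three dimensions derived from the instance.
resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name          = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "mem_used_percent"
  namespace           = "CWAgent"
  period              = 300
  statistic           = "Average"
  threshold           = var.alarms_memory_threshold
  alarm_description   = "This metric monitors ec2 memory utilization"
  alarm_actions       = [aws_sns_topic.sns_topic.arn]

  # The dimension set must match the published metric exactly,
  # otherwise the alarm stays in INSUFFICIENT_DATA.
  dimensions = {
    InstanceId   = var.instance_id
    ImageId      = data.aws_instance.target.ami
    InstanceType = data.aws_instance.target.instance_type
  }
}
```

Deriving the values this way keeps the alarm's dimensions in step with the instance if it is later resized or rebuilt from a different AMI, at the cost of one extra lookup during plan.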