GKE: How to alert on memory request/allocatable ratio?
google-cloud-platform, google-kubernetes-engine, stackdriver, google-cloud-monitoring, google-cloud-stackdriver
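I have a GKE cluster and I want to track the ratio of total requested memory to total allocatable memory. I was able to create a chart in Google Cloud Monitoring using
metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"
and
metric.type="kubernetes.io/node/memory/allocatable_bytes" resource.type="k8s_node"
both with crossSeriesReducer set to REDUCE_SUM to get the totals for the whole cluster.
Then, when I tried to set up an alerting policy (using the Cloud Monitoring API) on the ratio of the two (following this), I got this error:
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.
It does not like that the first metric is a k8s_container and the second metric is a k8s_node.
Is there a different metric or some workaround I can use to alert on the memory request/allocatable ratio in Google Cloud Monitoring?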
Edit:
Here is the full request and response:
$ gcloud alpha monitoring policies create --policy-from-file=policy.json
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.
$ cat policy.json
{
  "displayName": "Cluster Memory",
  "enabled": true,
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Ratio: Memory Requests / Memory Allocatable",
      "conditionThreshold": {
        "filter": "metric.type=\"kubernetes.io/container/memory/request_bytes\" resource.type=\"k8s_container\"",
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": [],
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "denominatorFilter": "metric.type=\"kubernetes.io/node/memory/allocatable_bytes\" resource.type=\"k8s_node\"",
        "denominatorAggregations": [
          {
            "alignmentPeriod": "60s",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": [],
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "60s",
        "trigger": {
          "count": 1
        }
      }
    }
  ]
}
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.
The official documentation states:
groupByFields[] - parameter
The set of fields to preserve when crossSeriesReducer is specified. The groupByFields determine how the time series are partitioned into subsets prior to applying the aggregation operation. Each subset contains time series that have the same value for each of the grouping fields. Each individual time series is a member of exactly one subset. The crossSeriesReducer is applied to each subset of time series. It is not possible to reduce across different resource types, so this field implicitly contains resource.type. Fields not specified in groupByFields are aggregated away. If groupByFields is not specified and all the time series have the same resource type, then the time series are aggregated into a single output time series. If crossSeriesReducer is not defined, this field is ignored.
Please look at this part in particular:
It is not possible to reduce across different resource types, so this field implicitly contains resource.type.
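To make the quoted behavior concrete, here is a minimal Python sketch of a grouped sum in which resource.type is always part of the grouping key. It is only an illustration of the documented semantics, not how Cloud Monitoring is implemented; the function name reduce_sum, the dictionary layout, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def reduce_sum(series, group_by_fields=()):
    """Sum values per group, mimicking REDUCE_SUM with groupByFields.

    Hypothetical helper: resource.type is always prepended to the key,
    so series with different resource types never land in the same subset.
    """
    totals = defaultdict(float)
    for s in series:
        key = (s["resource.type"],) + tuple(s[f] for f in group_by_fields)
        totals[key] += s["value"]
    return dict(totals)


# Made-up sample points: two container series and one node series.
series = [
    {"resource.type": "k8s_container", "cluster_name": "c1", "value": 2.0},
    {"resource.type": "k8s_container", "cluster_name": "c1", "value": 3.0},
    {"resource.type": "k8s_node", "cluster_name": "c1", "value": 8.0},
]

# No groupByFields given: one output series per resource type, never one overall.
result = reduce_sum(series)
print(result)
```

Even with an empty groupByFields, the container and node series stay in separate subsets, which is why the two sums cannot be combined into one ratio condition.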
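The error above occurs when you try to create a policy whose numerator and denominator metrics have different resource types.
The metrics used here have the following resource types:
- kubernetes.io/container/memory/request_bytes: k8s_container
- kubernetes.io/node/memory/allocatable_bytes: k8s_node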
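As a quick way to spot this mismatch before calling the API, the following Python sketch pulls resource.type out of the filter and denominatorFilter strings of a ratio condition and compares them. The helper names and the regular expression are assumptions for illustration; it only handles simple filters of the form used in the question.

```python
import re

# Matches resource.type="..." in a simple monitoring filter string.
RESOURCE_TYPE_RE = re.compile(r'resource\.type\s*=\s*"([^"]+)"')


def resource_type(filter_str):
    """Return the resource.type named in a filter string, or None if absent."""
    match = RESOURCE_TYPE_RE.search(filter_str)
    return match.group(1) if match else None


def ratio_resource_types(condition_threshold):
    """Return (numerator type, denominator type) for a conditionThreshold dict."""
    numerator = resource_type(condition_threshold.get("filter", ""))
    denominator = resource_type(condition_threshold.get("denominatorFilter", ""))
    return numerator, denominator


# The two filters from the policy.json in the question.
condition = {
    "filter": 'metric.type="kubernetes.io/container/memory/request_bytes" '
              'resource.type="k8s_container"',
    "denominatorFilter": 'metric.type="kubernetes.io/node/memory/allocatable_bytes" '
                         'resource.type="k8s_node"',
}

types = ratio_resource_types(condition)
if types[0] != types[1]:
    print(f"Resource types differ: {types[0]} vs {types[1]}")
```

For the policy in the question this reports k8s_container versus k8s_node, which matches the INVALID_ARGUMENT error.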
You can check the Resource type by looking at the metric in GCP Monitoring.
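As a workaround, you can try creating an alerting policy that alerts you when allocatable memory utilization exceeds 85%. This indirectly tells you that requested memory is high enough to trigger the alert.
YAML example below: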
combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="GKE-CLUSTER-NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization for GKE-CLUSTER-NAME by label.cluster_name
    [SUM]
  name: projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB
creationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
displayName: alerting-policy-when-allocatable-memory-is-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
name: projects/XX-YY-ZZ/alertPolicies/
Example from the GCP Monitoring web interface:
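Please let me know if you have any questions about this.
Edit:
To create an alerting policy that shows relevant data, you need to take many factors into account, such as:
- workload type
- number of nodes and node pools
- node affinity (for example, spawning a certain type of workload on GPU nodes)
- etc.
For a more advanced alerting policy that takes the allocatable memory of each node pool into account, you could do something like this: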
combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - metadata.user_labels."cloud.google.com/gke-nodepool"
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="CLUSTER_NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization (filtered) (grouped) [SUM]
creationRecord:
  mutateTime: '2020-03-31T18:03:20.325259198Z'
  mutatedBy: XXX@YYY.ZZZ
displayName: allocatable-memory-per-node-pool-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T18:18:57.169590414Z'
  mutatedBy: XXX@YYY.ZZZ
Please note that there is a bug (Groups.google.com: Google Stackdriver discussion), and the only way to create the alerting policy above is through the command line.
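Since the policy has to be created from a file on the command line, a small generator can help avoid hand-editing. The Python sketch below builds a policy dictionary with the same fields as the YAML examples above and prints it as JSON. The function name build_utilization_policy, its parameters, and the generated display names are hypothetical; the output-only fields (name, creationRecord, mutationRecord) are left out on the assumption that the service fills them in.

```python
import json


def build_utilization_policy(cluster_name, threshold=0.85,
                             group_by_field="resource.label.cluster_name"):
    """Build an alert policy dict mirroring the YAML examples (hypothetical helper)."""
    filter_str = (
        'metric.type="kubernetes.io/node/memory/allocatable_utilization" '
        'resource.type="k8s_node" '
        f'resource.label."cluster_name"="{cluster_name}"'
    )
    percent = int(round(threshold * 100))
    return {
        # Display names are made up for this sketch.
        "displayName": f"allocatable-memory-above-{percent}",
        "enabled": True,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"Memory allocatable utilization for {cluster_name}",
                "conditionThreshold": {
                    "filter": filter_str,
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            "crossSeriesReducer": "REDUCE_SUM",
                            "groupByFields": [group_by_field],
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    "duration": "60s",
                    "trigger": {"count": 1},
                },
            }
        ],
    }


policy = build_utilization_policy("GKE-CLUSTER-NAME")
print(json.dumps(policy, indent=2))
```

Saving the printed JSON to a file and passing it to gcloud alpha monitoring policies create --policy-from-file, as in the question, should create the policy. For the per-node-pool variant, pass group_by_field='metadata.user_labels."cloud.google.com/gke-nodepool"'.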