GKE: How to alert on memory request/allocatable ratio?

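I have a GKE cluster and I want to keep track of the ratio between the total memory requested and the total memory allocatable. I was able to create a chart in Google Cloud Monitoring using

metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"

and

metric.type="kubernetes.io/node/memory/allocatable_bytes" resource.type="k8s_node"

both with crossSeriesReducer set to REDUCE_SUM in order to get the total for the whole cluster.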
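To make the target quantity concrete, here is a minimal sketch of the ratio described above: sum the per-container requests, sum the per-node allocatable memory, and divide. The byte values are made-up sample numbers, and the function name is hypothetical; in practice the inputs would come from the two metrics.

```python
# Hypothetical sample values in bytes; real values would come from
# kubernetes.io/container/memory/request_bytes (one entry per container) and
# kubernetes.io/node/memory/allocatable_bytes (one entry per node).
container_request_bytes = [512 * 2**20, 1024 * 2**20, 256 * 2**20]
node_allocatable_bytes = [2 * 2**30, 2 * 2**30]


def request_to_allocatable_ratio(requests, allocatable):
    """Cluster-wide ratio: sum of requests divided by sum of allocatable."""
    total_allocatable = sum(allocatable)
    if total_allocatable == 0:
        raise ValueError("total allocatable memory is zero")
    return sum(requests) / total_allocatable


ratio = request_to_allocatable_ratio(container_request_bytes, node_allocatable_bytes)
print(f"request/allocatable ratio: {ratio:.4f}")
```

With these sample numbers, 1792 MiB is requested out of 4096 MiB allocatable.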

Then, when I tried to set up an alerting policy (using the Cloud Monitoring API) with the ratio of the two (following this), I got this error

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

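It does not like that the first metric is k8s_container and the second metric is k8s_node. Is there a different metric or some kind of workaround I can use to alert on the memory request/allocatable ratio in Google Cloud Monitoring?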

EDIT:

Here is the full request and response

$ gcloud alpha monitoring policies create --policy-from-file=policy.json
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

$ cat policy.json
{
    "displayName": "Cluster Memory",
    "enabled": true,
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Ratio: Memory Requests / Memory Allocatable",
            "conditionThreshold": {
                "filter": "metric.type=\"kubernetes.io/container/memory/request_bytes\" resource.type=\"k8s_container\"",
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                        ],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "denominatorFilter": "metric.type=\"kubernetes.io/node/memory/allocatable_bytes\" resource.type=\"k8s_node\"",
                "denominatorAggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                        ],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "60s",
                "trigger": {
                    "count": 1
                }
            }
        }
    ]
}

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

The official documentation states:

groupByFields[] - parameter

The set of fields to preserve when crossSeriesReducer is specified. The groupByFields determine how the time series are partitioned into subsets prior to applying the aggregation operation. Each subset contains time series that have the same value for each of the grouping fields. Each individual time series is a member of exactly one subset. The crossSeriesReducer is applied to each subset of time series. It is not possible to reduce across different resource types, so this field implicitly contains resource.type. Fields not specified in groupByFields are aggregated away. If groupByFields is not specified and all the time series have the same resource type, then the time series are aggregated into a single output time series. If crossSeriesReducer is not defined, this field is ignored.

-- Cloud.google.com: Monitoring: projects.alertPolicies

Please look specifically at this part:

It is not possible to reduce across different resource types, so this field implicitly contains resource.type.

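The error above appears when you try to create a policy whose numerator and denominator have different resource types.

The metrics in question have the following Resource type: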

  • kubernetes.io/container/memory/request_bytes - k8s_container
  • kubernetes.io/node/memory/allocatable_bytes - k8s_node
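The mismatch listed above can be caught before the policy is submitted. The following is a minimal sketch, assuming the filters spell out resource.type="..." explicitly as they do in the question; the helper names are hypothetical and are not part of any Google Cloud library.

```python
import re

# Matches resource.type="..." inside a Cloud Monitoring filter string.
_RESOURCE_TYPE = re.compile(r'resource\.type\s*=\s*"([^"]+)"')


def resource_type(filter_string):
    """Return the resource.type named in a filter, or None if it is absent."""
    match = _RESOURCE_TYPE.search(filter_string)
    return match.group(1) if match else None


def ratio_resource_types(condition_threshold):
    """Return (numerator_type, denominator_type) for a conditionThreshold dict."""
    return (
        resource_type(condition_threshold.get("filter", "")),
        resource_type(condition_threshold.get("denominatorFilter", "")),
    )


# The two filters from the policy.json in the question.
condition = {
    "filter": 'metric.type="kubernetes.io/container/memory/request_bytes" '
              'resource.type="k8s_container"',
    "denominatorFilter": 'metric.type="kubernetes.io/node/memory/allocatable_bytes" '
                         'resource.type="k8s_node"',
}

numerator, denominator = ratio_resource_types(condition)
if numerator != denominator:
    print(f"resource types differ: {numerator} vs {denominator}")
```

For the policy in the question this reports k8s_container against k8s_node, which is the condition the API rejects.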

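You can check the Resource type by looking at the metric in GCP Monitoring.

As a workaround, you can try to create an alerting policy that alerts you when allocatable memory utilization is above 85%. It will indirectly tell you that the requested memory is high enough to trigger the alert.

Example YAML below: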

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="GKE-CLUSTER-NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization for GKE-CLUSTER-NAME by label.cluster_name
    [SUM]
  name: projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB
creationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
displayName: alerting-policy-when-allocatable-memory-is-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
name: projects/XX-YY-ZZ/alertPolicies/

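Example with GCP Monitoring web access:

Please let me know if you have any questions about this.

EDIT:

To correctly create an alerting policy that shows relevant data, you need to take many factors into consideration, such as: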

  • Workload type
  • Number of nodes and node pools
  • Node affinity (for example: scheduling a certain type of workload on GPU nodes)
  • Etc.

For a more advanced alerting policy that takes allocatable memory per node pool into account, you can do something like this:

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - metadata.user_labels."cloud.google.com/gke-nodepool"
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="CLUSTER_NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization (filtered) (grouped) [SUM]
creationRecord:
  mutateTime: '2020-03-31T18:03:20.325259198Z'
  mutatedBy: XXX@YYY.ZZZ
displayName: allocatable-memory-per-node-pool-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T18:18:57.169590414Z'
  mutatedBy: XXX@YYY.ZZZ

Please note that there is a bug (Groups.google.com: Google Stackdriver discussion), and the only way to create the alerting policy above is from the command line.
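Since the policy has to be created from the command line, one option is to generate the policy file programmatically. The sketch below rebuilds the per-node-pool policy shown above as a Python dict and serializes it to JSON. The function name is hypothetical, and the creationRecord, mutationRecord and name fields are left out on the assumption that the service fills them in itself.

```python
import json


def node_pool_memory_policy(cluster_name):
    """Build the per-node-pool allocatable-memory policy as a plain dict."""
    return {
        "displayName": "allocatable-memory-per-node-pool-above-85",
        "enabled": True,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Memory allocatable utilization (filtered) (grouped) [SUM]",
                "conditionThreshold": {
                    "filter": (
                        'metric.type="kubernetes.io/node/memory/allocatable_utilization" '
                        'resource.type="k8s_node" '
                        f'resource.label."cluster_name"="{cluster_name}"'
                    ),
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            "crossSeriesReducer": "REDUCE_SUM",
                            "groupByFields": [
                                'metadata.user_labels."cloud.google.com/gke-nodepool"'
                            ],
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0.85,
                    "duration": "60s",
                    "trigger": {"count": 1},
                },
            }
        ],
    }


# CLUSTER_NAME is the same placeholder used in the YAML above.
policy_json = json.dumps(node_pool_memory_policy("CLUSTER_NAME"), indent=4)
print(policy_json)
```

Saving the printed JSON as policy.json lets you pass it to the same command used in the question: gcloud alpha monitoring policies create --policy-from-file=policy.json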