PromQL query to find CPU and memory used for the last week

Question

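I am trying to write a Prometheus query that tells me what percentage of CPU (and, in separate queries, memory and network) each namespace has used over a period of time, say a week.

The metrics I am trying to use are container_spec_cpu_shares and container_memory_working_set_bytes, but I can't figure out how to sum them over time. Whatever I try either returns 0 or an error.

Any help on how to write a query for this would be greatly appreciated.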

Answer 1

To check the percentage of memory used by each namespace, you will need a query similar to the one below:

sum( container_memory_working_set_bytes{container="", namespace=~".+"} )
by (namespace) / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

The above query should produce a graph similar to this one:

Disclaimers!:

The screenshot above is from Grafana for better visibility.

This query does not acknowledge changes in available RAM (changes in nodes, autoscaling of nodes, etc.).

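To get a metric over a period of time in PromQL you will need an additional function, such as:

avg_over_time(EXPR[time])

To go back in time and calculate resources from a specific point in time, you will need:

offset TIME

Using the pointers above, the query combines into: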

avg_over_time( sum(container_memory_working_set_bytes{container="", namespace=~".+"} offset 45m) by (namespace)[120m:])  / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

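The query above averages the memory used by each namespace over a 120-minute window and divides it by the total memory in the cluster, giving the average percentage per namespace. Because of offset 45m, the window ends 45 minutes before the present time rather than at it.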

Example:

Time of running the query: 20:00
avg_over_time(EXPR[2h:])
offset 45m

The example above covers the window from 17:15 to 19:15. You can extend the range to cover a whole week :).
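As a rough sketch of that week-long variant, the range can be widened to 7d and the offset dropped so the window ends at the present time. The explicit 5m subquery resolution is an assumption added here to keep the evaluation cheaper; it is not part of the original query, and leaving it out falls back to the default evaluation interval.

```promql
# Average share of cluster memory per namespace over the last 7 days.
# The 5m subquery step is an assumed value; tune it to your scrape interval.
avg_over_time(
  (
    sum by (namespace) (
      container_memory_working_set_bytes{container="", namespace=~".+"}
    )
  )[7d:5m]
)
/ ignoring (namespace) group_left
sum(machine_memory_bytes)
* 100
```

As noted in the disclaimers, the divisor is the cluster memory at query time, so nodes added or removed during the week are not reflected.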

If you want to calculate CPU usage by namespace, you can replace the memory metrics with the ones below:

container_cpu_usage_seconds_total{} - this metric is a counter, so wrap it in rate()
machine_cpu_cores{}
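Putting those two metrics into the same shape as the memory query might look like the sketch below. It assumes the CPU series carry the same container and namespace labels as the memory series in this setup, and the 5m rate window and 5m subquery step are illustrative values rather than anything prescribed above.

```promql
# Average share of cluster CPU cores per namespace over the last 7 days.
# rate() turns the counter into CPU seconds used per second (i.e. cores in use).
avg_over_time(
  (
    sum by (namespace) (
      rate(container_cpu_usage_seconds_total{container="", namespace=~".+"}[5m])
    )
  )[7d:5m]
)
/ ignoring (namespace) group_left
sum(machine_cpu_cores)
* 100
```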

For network usage, you could also look at these metrics:

container_network_receive_bytes_total - this metric is a counter, so wrap it in rate()
container_network_transmit_bytes_total - this metric is a counter, so wrap it in rate()
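Network traffic has no fixed capacity metric comparable to machine_memory_bytes, so one possible reading of "percentage" is each namespace's share of all bytes received in the cluster. The sketch below takes that reading, which is an assumption and not something the metrics above dictate. It also leaves out the container="" filter, since the label on network series can differ between container runtimes; check your own series for duplicates first.

```promql
# Share of all received bytes per namespace over the last 7 days.
# increase() handles counter resets; swap in the transmit metric for egress.
sum by (namespace) (
  increase(container_network_receive_bytes_total{namespace=~".+"}[7d])
)
/ ignoring (namespace) group_left
sum(
  increase(container_network_receive_bytes_total{namespace=~".+"}[7d])
)
* 100
```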

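Below is a more detailed explanation with a memory example, the testing method, and a dissection of the queries used.

Let's assume:

Kubernetes cluster 1.18.6 (Kubespray) with 12GB of memory in total:
- master node with 2GB of memory
- worker-one node with 8GB of memory
- worker-two node with 2GB of memory
Prometheus and Grafana installed with: Github.com: Coreos: Kube-prometheus
Namespace kruk with a single ubuntu pod set to generate artificial load with the command below:
- $ stress-ng --vm 1 --vm-bytes <AMOUNT_OF_RAM_USED> --vm-method all -t 60m -v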

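The artificial load was generated with stress-ng twice:

60 minutes - 1GB of memory used
60 minutes - 2GB of memory used

The percentage of memory used by namespace kruk in this timespan:

1GB, which is ~8.5% of all memory in the cluster (12GB)
2GB, which is ~17.5% of all memory in the cluster (12GB)

The load from the kruk namespace, as graphed from a Prometheus query, looked like this:

Calculating with avg_over_time(EXPR[time:]) / memory in the cluster over the period in which the artificial load was generated gives a usage of about 13% ((17.5+8.5)/2). This suggests that the query is correct: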

As for the query used:

avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)[120m:]) / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

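The above query is very similar to the one at the beginning, but I've changed it to show only the kruk namespace.

I divided the query explanation into two parts (dividend/divisor).

Dividend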

container_memory_working_set_bytes{container="", namespace="kruk"}

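This metric outputs the memory usage records in namespace kruk. If you query all namespaces instead, note the following:

namespace=~".+" <- this regexp matches only when the value of the namespace label contains one or more characters. It avoids results with an empty namespace that come from aggregated metrics.
container="" <- this part filters the series. Without it you get multiple memory usage series for each container/pod, like the ones below. container="" matches only when the container label is empty (the last line in the output below).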

container_memory_working_set_bytes{container="POD",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/e249c12010a27f82389ebfff3c7c133f2a5da19799d2f5bb794bcdb5dc5f8bca",image="k8s.gcr.io/pause:3.2",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_POD_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 692224
container_memory_working_set_bytes{container="ubuntu",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/fae287e7043ff00da16b6e6a8688bfba0bfe30634c52e7563fcf18ac5850f6d9",image="ubuntu@sha256:5d1d5407f353843ecf8b16524bc5565aa332e9e6a1297c73a92d3e754b8a636d",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_ubuntu_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2186403840
container_memory_working_set_bytes{endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2187096064
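In the three series above, the pod-level value (2187096064) is exactly the sum of the two container values (692224 + 2186403840), so summing all three would count the pod's memory twice. To inspect the same breakdown in your own cluster, a query along these lines may help; the pod="ubuntu" selector simply reuses the pod from this example.

```promql
# Count the series per value of the container label for one pod.
# For the three series quoted above this yields one series each for
# container="POD", container="ubuntu", and the empty (pod-level) label.
count by (container) (
  container_memory_working_set_bytes{namespace="kruk", pod="ubuntu"}
)
```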

You can read more about pause container here:

Ianlewis.org: Almighty pause container

sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)

This query sums the results by their respective namespaces. offset 1380m shifts the query back in time, because the test was made in the past.

avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
by (namespace)[120m:])

This query averages the per-namespace memory sum over a 120-minute window that ends 1380 minutes before the present time.
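The same window can also be expressed by placing the offset on the subquery instead of on the selector, which some readers find easier to follow. This is only a sketch of an alternative form: 23h is the same duration as 1380m, and the explicit 1m resolution is an assumed value that the original query leaves at its default.

```promql
# Average over a 120-minute window ending 23h (1380m) before now.
avg_over_time(
  (
    sum by (namespace) (
      container_memory_working_set_bytes{container="", namespace="kruk"}
    )
  )[120m:1m] offset 23h
)
```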

You can read more about avg_over_time() in the Prometheus documentation on query functions.

Divisor

sum( machine_memory_bytes{})

This sums the total memory of every node in the cluster.

EXPR / ignoring (namespace) group_left 
sum( machine_memory_bytes{}) * 100

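Focusing on:

/ ignoring (namespace) group_left <- this expression lets you divide each "record" in the dividend (each namespace with its memory average over time) by the single divisor (all memory in the cluster). You can read more about it here: Prometheus.io: Vector matching
* 100 is rather self-explanatory: it multiplies the result by 100 so that it reads as a percentage.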
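Because the divisor is a single value, an alternative that avoids vector matching altogether is to convert it with scalar(). This is a sketch of a different technique than the one used above, shown only for comparison; it should give the same per-namespace percentages as long as sum(machine_memory_bytes) returns exactly one series.

```promql
# Same percentage, dividing each namespace's average by a scalar total.
avg_over_time(
  (
    sum by (namespace) (
      container_memory_working_set_bytes{container="", namespace=~".+"}
    )
  )[120m:]
)
/ scalar(sum(machine_memory_bytes))
* 100
```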
