PromQL: What is the rate() function meant for?

Question

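I have a question about PromQL and its query function rate(), and how to use it properly. In my application I have a thread running, and I use Micrometer's Timer to monitor the thread's runtime. A Timer gives you a counter with the suffix _count and another counter, with the suffix _sum, holding the total seconds spent, e.g. my_metric_sum and my_metric_count.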

My raw data looks like this (scrape interval 30 seconds, range vector 5m):

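Now, according to the documentation (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate), rate() calculates the per-second average rate of increase of the time series in the range vector (here 5m).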

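Now my question is: why would I want that? The relative change of my execution runtime seems pretty useless to me. In fact, just using sum/count looks more useful, since it gives me the average absolute duration at each point in time. At the same time, and this is what confuses me, I found this in the docs:

To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression: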

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

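Source: https://prometheus.io/docs/practices/histograms/

But as I understand the docs, it looks like this expression would calculate the per-second average rate of increase of the request duration, i.e. not how long a request takes on average, but how much the request duration has changed on average over the last 5 minutes.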

Answer 1

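Although I am not familiar with the Micrometer Timer, the metrics you describe are of the summary type. It counts "events" in _count and sums up the sizes of those events (duration, elapsed time, and so on) in _sum. If you now compute rate(metric_count[5m]), you get the per-second average rate of events over the last 5 minutes. If you want to know the average duration of those events within the 5m window, you can use rate(metric_sum[5m]) / rate(metric_count[5m]). If you divide metric_sum / metric_count instead, you get the average over all time (since the last counter reset), not the 5m average at a given point in time. In a way, using rate() for this looks a bit odd. Using increase() seems more intuitive to me, but mathematically it is exactly the same: rate() is just increase() divided by the range, so the ranges cancel out in rate(metric_sum[5m]) / rate(metric_count[5m]).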
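To make the difference concrete, here is a small Python sketch (plain arithmetic, not PromQL). The sample values and the helper name simple_rate are invented for illustration, and the sketch ignores counter resets and the extrapolation Prometheus applies at window boundaries.

```python
# Hypothetical cumulative samples for a timer (all values are invented).
# Each tuple is (timestamp_seconds, my_metric_sum, my_metric_count).
samples = [
    (0,   100.0, 100),  # history so far: runs of about 1 s each
    (300, 200.0, 200),
    (600, 250.0, 210),  # last 5 minutes: 10 runs taking 50 s in total
]

WINDOW_SECONDS = 300


def simple_rate(v_start, v_end, window_seconds):
    """Increase over the window divided by the window length (no reset handling)."""
    return (v_end - v_start) / window_seconds


(_, sum_start, count_start) = samples[1]
(_, sum_end, count_end) = samples[2]

# Analogue of rate(metric_sum[5m]) / rate(metric_count[5m]): the window length cancels out.
windowed_avg = (
    simple_rate(sum_start, sum_end, WINDOW_SECONDS)
    / simple_rate(count_start, count_end, WINDOW_SECONDS)
)

# Analogue of increase(metric_sum[5m]) / increase(metric_count[5m]).
increase_avg = (sum_end - sum_start) / (count_end - count_start)

# Analogue of metric_sum / metric_count: average since the counters started.
lifetime_avg = sum_end / count_end

print(f"windowed average: {windowed_avg:.2f} s")
print(f"increase average: {increase_avg:.2f} s")
print(f"lifetime average: {lifetime_avg:.2f} s")
```

With these made-up numbers the two windowed variants agree at about 5 seconds per run, while the lifetime average stays near 1.2 seconds because it is dominated by the older, faster runs.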

Answer 2

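First - use the tool that fits your use case.

Second - whatever you choose, validate the data. Better to do it now than during an outage or with an angry customer/user.

Third - the _count series (and _bucket, for histograms) are features of histograms and summaries. The sampling frequency does not really matter here, as long as the scrape interval is shorter than the [5m] window of the rate() function, so that the window contains at least two samples.

The rate just gives you a data point for "how often did this happen within these five minutes ([5m])", expressed as a per-second average.

General remark - the concept of rate() in Prometheus causes a lot of confusion. Too many people end up arguing about it. They probably should have called it something else.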

Answer 3

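The rate(m[d]) function calculates the increase of a counter metric m over the given lookbehind window d in square brackets, and then divides the increase by d. The calculation is performed independently for each matching time series m. For example, suppose there are http_requests_total metrics with a url label: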

http_requests_total{url="/foo"}
http_requests_total{url="/bar"}

If they have the following values at time t0:

http_requests_total{url="/foo"} 123
http_requests_total{url="/bar"} 456

... and the following values at t0 + 5 minutes:

http_requests_total{url="/foo"} 345
http_requests_total{url="/bar"} 789

Then rate(http_requests_total[5m]) at time t0 + 5 minutes is calculated as follows:

First, calculate the increase for each of these metrics between t0 and t0 + 5 minutes:

increase(http_requests_total{url="/foo"}[5m]) = 345 - 123 = 222
increase(http_requests_total{url="/bar"}[5m]) = 789 - 456 = 333

Then divide the calculated increases by 5 minutes, expressed in seconds (5*60s = 300s):

rate(http_requests_total{url="/foo"}[5m]) = 222 / 300 = 0.74
rate(http_requests_total{url="/bar"}[5m]) = 333 / 300 = 1.11

So the end result of rate(http_requests_total[5m]) is the average per-second request rate (rps) over the last 5 minutes, calculated individually for each time series with the http_requests_total name.
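The same arithmetic can be written as a short Python sketch. It only mirrors the two-sample calculation above; the function names are illustrative, and real Prometheus additionally uses every sample in the window and extrapolates to the window boundaries.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] lookbehind window, in seconds

# Counter values per url label at t0 and at t0 + 5 minutes (from the example above).
values_t0 = {"/foo": 123, "/bar": 456}
values_t1 = {"/foo": 345, "/bar": 789}


def increase(start, end):
    """Growth of the counter over the window (no counter resets assumed)."""
    return end - start


def rate(start, end, window_seconds=WINDOW_SECONDS):
    """Per-second average growth: the increase divided by the window length."""
    return increase(start, end) / window_seconds


# Each time series is handled independently, keyed here by its url label.
rates = {url: rate(values_t0[url], values_t1[url]) for url in values_t0}

for url, per_second in rates.items():
    print(f"{url}: {per_second:.2f} requests/s")
# /foo: 0.74 requests/s
# /bar: 1.11 requests/s
```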

Some notes:

Both rate() and increase() properly handle counter resets, i.e. cases where the counter is reset to zero.
Sometimes Prometheus may return unexpected results from rate() and increase() because of the chosen data model. See this issue. This issue is addressed in VictoriaMetrics - see this comment and this article.
Some PromQL-compatible query engines, such as MetricsQL, allow skipping the lookbehind window in square brackets when using the rate() function, so rate(http_requests_total) is a valid MetricsQL query. In this case it automatically adds a [$__interval] lookbehind window before query execution. See these docs for more details.
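The first note, on counter resets, can be illustrated with a simplified Python model. This is a sketch and not Prometheus source code: it assumes that any drop in value means the counter restarted from zero, and it leaves out boundary extrapolation. The function name and the sample values are invented.

```python
def increase_with_resets(values):
    """Sum the growth between consecutive samples, treating any drop as a reset to zero."""
    total = 0.0
    previous = values[0]
    for current in values[1:]:
        if current < previous:
            # Counter reset: assume it restarted at zero, so all of `current` is new growth.
            total += current
        else:
            total += current - previous
        previous = current
    return total


# Hypothetical samples within one window; the process restarts after the third sample.
samples = [100, 130, 160, 5, 25]

naive = samples[-1] - samples[0]           # last minus first, misled by the reset
adjusted = increase_with_resets(samples)   # 30 + 30 + 5 + 20

print(naive)     # -75
print(adjusted)  # 85.0
```

A plain "last minus first" would report a negative increase here, which is why reset handling matters for counters that restart with the process.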


prometheus

promql

micrometer