Grafana Prometheus - query processing would load too many samples into memory in query execution

Question

I am trying to make my query sum over an interval in Grafana, but I get this error:

"query processing would load too many samples into memory in query execution"

when I look at the past 30 days at a one-day interval.

I have a variable called intrvl that holds specific time intervals such as 1m, 1h, 12h, 24h, and 30d, and my query looks like this:

sort_desc(
sum by (backend)(sum_over_time(haproxy_backend_http_responses_total{code=~"[1,2,3,4][x][x]",tags=~".*external.*"}[$intrvl]))
/
sum by (backend)(sum_over_time(haproxy_backend_http_responses_total{code!~"\b(\w*other\w*)\b",tags=~".*external.*"}[$intrvl]))
)

I am using a line graph, and I have also set the graph's Min step to $intrvl. Is this the right way to calculate a percentage over a time range?

Answer 1

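Since your formula is computed over a lot of data, I would consider creating a Prometheus recording rule that precomputes the needed values, and then summing over the interval using the series that the rule creates.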
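A minimal sketch of such a rule file, assuming the counter from the question. The group name, the evaluation interval, the 5m rate window, and both recorded series names are hypothetical and chosen only for illustration. The file must also be listed under rule_files in prometheus.yml.

```yaml
groups:
  - name: haproxy_external            # hypothetical group name
    interval: 1m                      # assumed evaluation interval
    rules:
      # Per-backend rate of all responses (hypothetical series name).
      - record: backend:haproxy_backend_http_responses:rate5m
        expr: sum by (backend) (rate(haproxy_backend_http_responses_total{tags=~".*external.*"}[5m]))
      # Per-backend rate of 1xx-4xx responses (hypothetical series name).
      - record: backend:haproxy_backend_http_responses_1xx_4xx:rate5m
        expr: sum by (backend) (rate(haproxy_backend_http_responses_total{code=~"[1234]..",tags=~".*external.*"}[5m]))
```

Under these assumptions, the dashboard query could read one precomputed series per backend instead of every raw series:

```
sort_desc(
  sum_over_time(backend:haproxy_backend_http_responses_1xx_4xx:rate5m[$intrvl])
/
  sum_over_time(backend:haproxy_backend_http_responses:rate5m[$intrvl])
)
```

Both recorded series are evaluated at the same timestamps, so the ratio of their sums approximates the ratio of the underlying counter increases over the window. How much this helps depends on how many raw series are collapsed into each backend.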

Answer 2

The too many samples error message comes from Prometheus (promql/engine.go), not from Grafana. See issue #4513.

You can try raising the limit with the Prometheus flag --query.max-samples, introduced in Prometheus v2.5.0. (Check the default value for your version in the output of prometheus -h.)

Answer 3

Problem: Running heavy queries in Prometheus raises an error in the Grafana dashboard and the query fails:

query processing would load too many samples into memory in query execution

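Solution: Pass the --query.max-samples command-line flag to Prometheus (it is a startup flag, not a setting in the configuration file) to raise the number of samples a query may load into memory. The default value is 50000000; how far you can raise it depends on the capacity of your machine. From the documentation: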

--query.max-samples=50000000
     Maximum number of samples a single query can load into memory. 
     Note that queries will fail if they try to load more samples than this into memory,
     so this also limits the number of samples a query can return.

Example: Suppose you run your Prometheus service with docker-compose. In docker-compose.yml:

version: '3.2'

services:
  prometheus:
    image: prom/prometheus:latest
    expose:
      - 9090
    ports:
      - 9090:9090
    volumes:
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      # Raise the per-query sample limit above the default of 50000000.
      - '--query.max-samples=100000000'
      - '--web.external-url=http://prom.some-company-url.com:9090'

# Declare the named volume mounted by the service above.
volumes:
  prometheus_data:

Answer 4

The haproxy_backend_http_responses_total metric is a counter, so the increase() function should most likely be used instead of sum_over_time():

sort_desc(
sum by (backend)(increase(haproxy_backend_http_responses_total{code=~"[1234]..",tags=~".*external.*"}[$intrvl]))
/
sum by (backend)(increase(haproxy_backend_http_responses_total{tags=~".*external.*"}[$intrvl]))
)

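The sum_over_time() function calculates the sum of all raw samples over the lookbehind window given in square brackets. This function is intended for gauges.
The increase() function calculates how much a counter has increased over the given lookbehind window.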

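The too many samples error occurs because Prometheus loads into memory all raw samples of every time series that matches the given series selector over the lookbehind window in square brackets. The selector haproxy_backend_http_responses_total{tags=~".*external.*"} most likely matches a large number of time series. The following query can be used to estimate the number of time series the query needs to load into memory: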

count(
  last_over_time(
    haproxy_backend_http_responses_total{tags=~".*external.*"}[$intrvl]
  )
)

The following query can be used to estimate the number of raw samples the query needs to load into memory:

sum(
  count_over_time(
    haproxy_backend_http_responses_total{tags=~".*external.*"}[$intrvl]
  )
)

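As you can see, both the number of matching time series and the number of raw samples that Prometheus needs to load into memory grow as the lookbehind window in square brackets, [$intrvl], grows in the queries above.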

This article may help you work out the root cause of heavy PromQL queries and how to optimize them.

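The too many samples error can be fixed by passing a larger value to the --query.max-samples command-line flag, as described in the earlier answers. Note that this may increase Prometheus memory usage when it processes heavy queries.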

Another way to fix the too many samples error is to use another Prometheus-like system that may need less memory when processing heavy queries. For example, try VictoriaMetrics.

