PostgreSQL: out of memory issues with array_agg() on a heroku db server

Question

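I'm stuck with a (Postgres 9.4.6) query that a) uses too much memory (most likely due to array_agg()) and b) does not return exactly what I need, making post-processing necessary. Any input (especially regarding memory consumption) is highly appreciated.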

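Explanation: the table token_groups holds all words used in tweets I've parsed, stored in an hstore together with their respective frequencies, one row per 10 minutes (for the last 7 days, so 7*24*6 rows in total). These rows are inserted in order of tweeted_at, so I can simply order by id. I'm using row_number to identify when a word occurred.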

# \d token_groups
                                     Table "public.token_groups"
   Column   |            Type             |                         Modifiers
------------+-----------------------------+-----------------------------------------------------------
 id         | integer                     | not null default nextval('token_groups_id_seq'::regclass)
 tweeted_at | timestamp without time zone | not null
 tokens     | hstore                      | not null default ''::hstore
Indexes:
    "token_groups_pkey" PRIMARY KEY, btree (id)
    "index_token_groups_on_tweeted_at" btree (tweeted_at)

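What I'd ideally want is a list of words, each with the relative distances of its row numbers. So if, for example, the word 'hello' appears once in row 5, twice in row 8 and once in row 20, I'd want a column with the word and an array column returning {5,3,0,12}. (Meaning: first occurrence in the fifth row, next occurrence 3 rows later, next occurrence 0 rows later, next one 12 rows later.) If anyone wonders why: 'relevant' words occur in clusters, so (simplified) the higher the standard deviation of the temporal distances, the more likely a word is a keyword. See more here: http://bioinfo2.ugr.es/Publicaciones/PRE09.pdf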
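To illustrate the clustering idea, here is a minimal Ruby sketch that compares the spread of the distances for the 'hello' example with a word that appears at regular intervals. It assumes a plain population standard deviation; `std_dev` is a hypothetical helper, the evenly spaced list is made-up sample data, and the linked paper may define a more refined measure.

```ruby
# Population standard deviation of a list of row distances.
# Hypothetical helper for illustration only.
def std_dev(distances)
  mean = distances.sum.to_f / distances.size
  variance = distances.sum { |d| (d - mean)**2 } / distances.size
  Math.sqrt(variance)
end

clustered = [5, 3, 0, 12]  # the 'hello' example from the text
even      = [5, 5, 5, 5]   # made-up word appearing every 5 rows

puts std_dev(clustered)    # larger spread: occurrences are bunched up
puts std_dev(even)         # 0.0: perfectly regular spacing
```

A word with irregular, bunched-up distances gets a higher value than one that shows up at a steady rate.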

For now, I return one array with positions and one array with frequencies, and use this information to calculate the distances in Ruby.
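The post-processing step could look roughly like the following sketch. It is not the actual application code: the method name `distances` and its arguments are assumptions, and it expects the two arrays exactly as the query returns them (positions ascending, frequencies aligned by index).

```ruby
# positions:   row numbers in which the word occurs, ascending
# frequencies: how often the word occurs in each of those rows
def distances(positions, frequencies)
  last_pos = 0
  positions.zip(frequencies).flat_map do |pos, freq|
    step = pos - last_pos
    last_pos = pos
    # The first occurrence in a row carries the distance to the previous row;
    # further occurrences in the same row are 0 rows away.
    [step] + [0] * (freq - 1)
  end
end

p distances([5, 8, 20], [1, 2, 1])  # => [5, 3, 0, 12]
```

This reproduces the 'hello' example: once in row 5, twice in row 8, once in row 20.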

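Currently the primary problem is a high memory spike, which seems to be caused by array_agg(). As the (very helpful) Heroku staff told me, some of my connections use 500-700MB with very little shared memory, which causes out-of-memory errors (I'm running Standard-0, which gives me 1GB in total for all connections), so I need to find an optimization.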

The total number of hstore entries is ~100k, which are then aggregated (after skipping words with very low frequency):

SELECT COUNT(*)
FROM (SELECT row_number() over(ORDER BY id ASC) AS position,
            (each(tokens)).key, (each(tokens)).value::integer
      FROM   token_groups) subquery;

 count
--------
 106632

This is the query causing the memory load:

SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM (
  SELECT row_number() over(ORDER BY id ASC) AS pos, 
         (each(tokens)).key, 
         (each(tokens)).value::integer 
  FROM token_groups
  ) subquery 
GROUP BY key
HAVING SUM(value) > 10;

The output is:

     key     |                        positions                        |           frequencies
-------------+---------------------------------------------------------+-------------------------------
 hello       | {172,185,188,210,349,427,434,467,479}                   | {1,2,1,1,2,1,2,1,4}
 world       | {166,218,265,343,415,431,436,493}                       | {1,1,2,1,2,1,2,1}
 some        | {35,65,101,180,193,198,223,227,420,424,427,428,439,444} | {1,1,1,1,1,1,1,2,1,1,1,1,1,1}
 other       | {77,111,233,416,421,494}                                | {1,1,4,1,2,2}
 word        | {170,179,182,184,185,186,187,188,189,190,196}           | {3,1,1,2,1,1,1,2,5,3,1}
(...)

Here is the EXPLAIN output:

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12789.00..12792.50 rows=200 width=44) (actual time=309.692..343.064 rows=2341 loops=1)
   Output: ((each(token_groups.tokens)).key), array_agg((row_number() OVER (?))), array_agg((((each(token_groups.tokens)).value)::integer))
   Group Key: (each(token_groups.tokens)).key
   Filter: (sum((((each(token_groups.tokens)).value)::integer)) > 10)
   Rows Removed by Filter: 33986
   Buffers: shared hit=2176
   ->  WindowAgg  (cost=177.66..2709.00 rows=504000 width=384) (actual time=0.947..108.157 rows=106632 loops=1)
         Output: row_number() OVER (?), (each(token_groups.tokens)).key, ((each(token_groups.tokens)).value)::integer, token_groups.id
         Buffers: shared hit=2176
         ->  Sort  (cost=177.66..178.92 rows=504 width=384) (actual time=0.910..1.119 rows=504 loops=1)
               Output: token_groups.id, token_groups.tokens
               Sort Key: token_groups.id
               Sort Method: quicksort  Memory: 305kB
               Buffers: shared hit=150
               ->  Seq Scan on public.token_groups  (cost=0.00..155.04 rows=504 width=384) (actual time=0.013..0.505 rows=504 loops=1)
                     Output: token_groups.id, token_groups.tokens
                     Buffers: shared hit=150
 Planning time: 0.229 ms
 Execution time: 570.534 ms

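PS: If anyone wonders: every 10 minutes I append new data to the token_groups table and remove outdated data. That is easy when storing one row per 10 minutes. I still need to come up with a better data structure, for example one row per word. But that does not seem to be the main issue; I think it's the array aggregation.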

Answer 1

The query you presented can be simpler, evaluating each() only once per row:

SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) tg
        , each(tg.tokens) t  -- implicit LATERAL join
   ORDER  BY t.key, pos
   ) sub
GROUP  BY key
HAVING sum(value) > 10;

This also preserves the correct order of elements.

What I'd ideally want is a list of words with each the relative distances of their row numbers.

This should do it:

SELECT key, array_agg(step) AS occurrences
FROM  (
   SELECT key, CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END AS step
   FROM  (
      SELECT key, value::int, pos
           , lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos) AS last_pos
      FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) tg
           , each(tg.tokens) t
      ) t1
      , generate_series(1, t1.value) g
   ORDER  BY key, pos, g
   ) sub
GROUP  BY key
HAVING count(*) > 10;

SQL Fiddle.

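Interpreting each hstore key as a word and the respective value as the number of occurrences in that row (= the last 10 minutes), I use two cascading LATERAL joins: the first step decomposes the hstore value, the second step multiplies rows according to value. (If your value (frequency) is mostly just 1, you can simplify.)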
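To make the data flow of the two LATERAL steps concrete, and to have something to check the query result against outside the database, the same logic can be mirrored in plain Ruby. This is only an illustrative model, not a replacement for the SQL: `occurrences` is a hypothetical method, the sample rows are made up, and the HAVING filter is left out.

```ruby
# Made-up sample: one hash per token_groups row, already in id order.
rows = [
  { "hello" => 1 },                # pos 1
  { "world" => 1 },                # pos 2
  { "hello" => 2, "world" => 1 },  # pos 3
]

def occurrences(rows)
  last_pos = Hash.new(0)                     # lag(pos, 1, 0) per key
  result = Hash.new { |h, k| h[k] = [] }
  rows.each_with_index do |tokens, i|
    pos = i + 1                              # row_number() OVER (ORDER BY id)
    tokens.each do |key, value|              # step 1: each(tokens)
      (1..value).each do |g|                 # step 2: generate_series(1, value)
        # Same as: CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END
        result[key] << (g == 1 ? pos - last_pos[key] : 0)
      end
      last_pos[key] = pos
    end
  end
  result
end

p occurrences(rows)  # => {"hello"=>[1, 2, 0], "world"=>[2, 1]}
```

Each word ends up with one array element per occurrence, which is what array_agg(step) collects per key in the query.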

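Then I ORDER BY key, pos, g in the subquery before aggregating in the outer SELECT. This clause seems to be redundant, and in fact I see the same result without it in my tests. That is a collateral benefit of the window definition of lag() in the inner query, whose sort order is carried over to the next step unless some other step triggers re-ordering. However, without the clause we would depend on an implementation detail that is not guaranteed to work.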

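Ordering the whole query once should be substantially faster (and easier on the required sort memory) than per-aggregate sorting. This is not strictly according to the SQL standard either, but the simple case is documented for Postgres: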

Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not portable to other database systems.

Strictly speaking, we only need:

ORDER BY pos, g

You can experiment with that. Related:

PostgreSQL unnest() with element number

Possible alternative:

SELECT key
     , ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM (
   SELECT key, pos, value::int
        ,(pos - lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos))::text AS step
   FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
        , each(g.tokens) t
   ORDER  BY key, pos
   ) t1
GROUP  BY key;
-- HAVING sum(value) > 10;

Text concatenation instead of generate_series() might be cheaper.
