Incremental DISTINCT / GROUP BY operation

Question

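I have a simple two-stage SQL query that operates on two tables, A and B. A sub-select retrieves a number of IDs of table A that are stored as foreign keys in B, using a (possibly complex) query on table B (and possibly other joined tables). Then I simply want to return the first x IDs of A. I tried a query like this: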

SELECT sq.id
FROM  (
    SELECT a_id AS id, created_at
    FROM   B
    WHERE  ...
    ORDER  BY created_at DESC
    ) sq 
GROUP BY sq.id
ORDER BY max(sq.created_at) DESC
LIMIT 10;

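This is quite slow, because Postgres seems to perform the GROUP BY / DISTINCT operation on the whole result set before limiting it. If I LIMIT the sub-query (e.g. to 100), performance is fine (as I would expect), but of course there is no longer any guarantee that the resulting rows of sq contain at least 10 distinct a_id values.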
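For reference, the limited variant described above might look like the following sketch. The value 100 is the example figure from the text, and the filter conditions are left out because they are elided in the original query.

```sql
-- Sketch of the "LIMIT the sub-query" variant: fast, but the inner LIMIT
-- may cut off rows before 10 distinct a_id values have been seen.
SELECT sq.id
FROM  (
    SELECT a_id AS id, created_at
    FROM   b
    -- WHERE ...  (filter conditions elided in the question)
    ORDER  BY created_at DESC
    LIMIT  100            -- arbitrary cap; no guarantee of 10 distinct ids
    ) sq
GROUP  BY sq.id
ORDER  BY max(sq.created_at) DESC
LIMIT  10;
```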

Similarly, the query

SELECT a_id AS id
FROM   B
WHERE  ...
GROUP  BY id
ORDER  BY max(created_at) DESC
LIMIT  10

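is quite slow, because Postgres seems to perform a sequential scan on B instead of using the (existing) index. If I remove the GROUP BY clause, it uses the index just fine.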

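The data in table B is such that most rows contain different a_id values, so even without the GROUP BY most of the returned IDs will be different. The goal of the grouping is to ensure that the result set always contains a given number of entries from A.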

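Is there any way to perform an "incremental DISTINCT / GROUP BY"? In my naive thinking, it would suffice for Postgres to produce result rows and group them incrementally until it reaches the number specified by LIMIT. In most cases that should be nearly instantaneous, since most a_id values are different. I have tried various ways to query the data, but so far I have not found anything that works reliably.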

The Postgres version is 9.6, and the data schema is as follows:

                              Table "public.a"
 Column |       Type        |                   Modifiers                    
--------+-------------------+------------------------------------------------
 id     | bigint            | not null default nextval('a_id_seq'::regclass)
 bar    | character varying | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "ix_a_bar" btree (bar)
Referenced by:
    TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

                                      Table "public.b"
   Column   |            Type             |                    Modifiers                     
------------+-----------------------------+--------------------------------------------------
 id         | bigint                      | not null default nextval('b_id_seq'::regclass)
 foo        | character varying           | 
 a_id       | bigint                      | not null
 created_at | timestamp without time zone | 
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
    "ix_b_created_at" btree (created_at)
    "ix_b_foo" btree (foo)
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

Answer 1

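The only way the planner has a chance to avoid sorting the whole table is if you have an index on the complete ORDER BY clause.

Then an index scan can be chosen to get the correct ordering, and the first ten result rows can be found quickly.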
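As a rough illustration against the schema from the question (filter conditions omitted), the two statements below can be compared with EXPLAIN. This is only a sketch; the plan actually chosen depends on statistics and settings.

```sql
-- Ungrouped: ORDER BY created_at DESC matches the existing index
-- ix_b_created_at (scanned backward), so LIMIT can stop early.
EXPLAIN
SELECT a_id AS id
FROM   b
ORDER  BY created_at DESC
LIMIT  10;

-- Grouped: the sort key max(created_at) is an aggregate computed per group
-- and is not stored in any index shown in the question, so the groups
-- have to be built before the sort and LIMIT can be applied.
EXPLAIN
SELECT a_id AS id
FROM   b
GROUP  BY a_id
ORDER  BY max(created_at) DESC
LIMIT  10;
```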

Answer 2

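This problem is much more complex than it might seem at first glance.

If...

your criteria are not very selective (more than 10 distinct a_id values qualify)
you don't have many duplicate a_id values in table B (as you stated)

then there is a very fast way.

To simplify a bit, I assume created_at is also defined NOT NULL; otherwise you need to do more.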
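A small sketch of why NULL values would complicate things: in Postgres, DESC sorts NULLs first by default, and a row comparison returns NULL rather than true when the deciding element is NULL. The literal values here are made up for illustration.

```sql
-- Default DESC ordering places the NULL row first.
SELECT created_at
FROM  (VALUES (timestamp '2024-01-02'), (NULL), (timestamp '2024-01-01')) AS v(created_at)
ORDER  BY created_at DESC;

-- NULLS LAST moves NULL rows to the end instead.
SELECT created_at
FROM  (VALUES (timestamp '2024-01-02'), (NULL), (timestamp '2024-01-01')) AS v(created_at)
ORDER  BY created_at DESC NULLS LAST;

-- A row comparison whose first pair contains a NULL yields NULL,
-- so such a row would never satisfy a "<" join condition.
SELECT (NULL::timestamp, 1) < (timestamp '2024-01-01', 2) AS is_less;
```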

WITH RECURSIVE top10 AS (
   ( -- extra parentheses required
   SELECT a_id, ARRAY[a_id] AS id_arr, created_at
   FROM   b
   WHERE  ...  -- your other filter conditions here
   ORDER  BY created_at DESC, a_id DESC  -- both NOT NULL
   LIMIT  1
   )
   UNION ALL -- UNION ALL, not UNION, since we exclude dupes a priori
   (
   SELECT b.a_id, id_arr || b.a_id, b.created_at
   FROM   top10 t
   JOIN   b ON (b.created_at, b.a_id)
             < (t.created_at, t.a_id)  -- comparing ROW values
           AND  b.a_id <> ALL (t.id_arr)
   WHERE  ... -- repeat conditions
   ORDER  BY created_at DESC, a_id DESC
   LIMIT  1
   )
   )
SELECT a_id
FROM   top10
LIMIT  10;
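To see the recursion at work, here is a minimal self-contained sketch on toy data. The temp table b_demo and its rows are hypothetical, there are no extra filter conditions, and the limit is 3 instead of 10.

```sql
-- Hypothetical toy table standing in for b; both columns NOT NULL.
CREATE TEMP TABLE b_demo (
    a_id       bigint    NOT NULL,
    created_at timestamp NOT NULL
);

INSERT INTO b_demo (a_id, created_at) VALUES
    (1, '2024-01-05'),
    (1, '2024-01-04'),
    (2, '2024-01-03'),
    (1, '2024-01-02'),
    (3, '2024-01-01');

WITH RECURSIVE top3 AS (
   (  -- anchor: the single latest row
   SELECT a_id, ARRAY[a_id] AS id_arr, created_at
   FROM   b_demo
   ORDER  BY created_at DESC, a_id DESC
   LIMIT  1
   )
   UNION ALL
   (  -- step: the next-latest row whose a_id has not been collected yet
   SELECT b.a_id, t.id_arr || b.a_id, b.created_at
   FROM   top3 t
   JOIN   b_demo b ON (b.created_at, b.a_id) < (t.created_at, t.a_id)
                  AND b.a_id <> ALL (t.id_arr)
   ORDER  BY b.created_at DESC, b.a_id DESC
   LIMIT  1
   )
)
SELECT a_id
FROM   top3
LIMIT  3;   -- yields the a_id values 1, 2 and 3 for this data
```

Each iteration fetches exactly one row, so the duplicate rows for a_id = 1 are skipped without grouping the whole table.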

Ideally supported by an index on (created_at DESC, a_id DESC) (or just (created_at, a_id)).
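Such an index could be created as sketched here; the index name is a made-up example.

```sql
-- Hypothetical index name; the column order matches the ORDER BY of the query.
CREATE INDEX ix_b_created_at_a_id ON b (created_at DESC, a_id DESC);
```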

Depending on your other WHERE conditions, other (partial?) indexes may serve even better.
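A partial index might look like the sketch below. The predicate on foo is purely an assumption for illustration, since the actual filter conditions are not given, and the index name is made up as well.

```sql
-- Hypothetical: assumes the elided WHERE clause includes foo = 'some_value'.
-- The planner can use this index only for queries whose conditions
-- imply the index predicate.
CREATE INDEX ix_b_created_at_a_id_foo
ON     b (created_at DESC, a_id DESC)
WHERE  foo = 'some_value';
```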

This is particularly efficient for a small result set. Otherwise, and depending on various other details, other solutions may be faster.

sql

postgresql

greatest-n-per-group

postgresql-performance