Get values appearing at least N times in a table quickly

I have a table with more than 6 million rows in a Postgres 10.10 database, defined like this:

create table users (
    id              bigserial primary key,
    user_id         text      unique,
    username        text,
    first_name      text,
    last_name       text,
    language_code   text,
    gender          text,
    first_seen      timestamp with time zone,
    last_seen       timestamp with time zone,
    search_language text,
    age             text
);

create index users_language_code_idx on users (language_code);
create index users_last_seen_idx     on users (last_seen);
create index users_first_seen_idx1   on users (first_seen);
create index users_age_idx           on users (age);
create index users_last_seen_age_idx on users (last_seen, age);

I have a query to get the popular language codes that have more than 100 users:

SELECT language_code FROM users
GROUP BY language_code
HAVING count(*) > 100;

At some point this query started taking very long to complete (around 10 minutes). The btree index on language_code does not help. What else can I do to improve performance?

Here is the explain analyze output:

https://explain.depesz.com/s/j2ga

Finalize GroupAggregate  (cost=7539479.67..7539480.34 rows=27 width=3) (actual time=620744.389..620744.458 rows=24 loops=1)
  Group Key: language_code
  Filter: (count(*) > 100)
  Rows Removed by Filter: 60
  ->  Sort  (cost=7539479.67..7539479.80 rows=54 width=11) (actual time=620744.359..620744.372 rows=84 loops=1)
        Sort Key: language_code
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=7539472.44..7539478.11 rows=54 width=11) (actual time=620744.038..620744.727 rows=84 loops=1)
              Workers Planned: 2
              Workers Launched: 0
              ->  Partial HashAggregate  (cost=7538472.44..7538472.71 rows=27 width=11) (actual time=620743.596..620743.633 rows=84 loops=1)
                    Group Key: language_code
                    ->  Parallel Seq Scan on users  (cost=0.00..7525174.96 rows=2659496 width=3) (actual time=0.377..616632.155 rows=6334894 loops=1)
Planning time: 0.194 ms
Execution time: 620745.276 ms

You can make the most of the index on (language_code) by emulating an index skip scan:
WITH RECURSIVE cte AS (
   SELECT min(language_code) AS language_code
   FROM   users
   
   UNION ALL
   SELECT (SELECT language_code
           FROM   users
           WHERE  language_code > c.language_code
           ORDER  BY language_code
           LIMIT  1)
   FROM   cte c
   WHERE  c.language_code IS NOT NULL
   )
SELECT language_code
FROM   cte c
JOIN   LATERAL (
   SELECT count(*) AS ct
   FROM  (
      SELECT -- can stay empty
      FROM   users
      WHERE  language_code = c.language_code 
      LIMIT  101
      ) sub
   ) u ON ct > 100  -- "more than 100"
WHERE  language_code IS NOT NULL;

db<>fiddle here

Given your numbers (6 million rows, but only a handful of distinct language codes), this should perform faster by orders of magnitude.

The first part - the recursive CTE (rCTE) named cte - generates the set of distinct language_code in the table (excluding NULL). A table holding the distinct language codes could replace that part to make it even faster. (It might be a good idea to maintain such a table and enforce referential integrity with an FK constraint ...)
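
For illustration, a minimal sketch of such a lookup table (the table name languages and the seeding step are assumptions for this sketch, not part of the original schema):

-- hypothetical lookup table holding the distinct language codes
CREATE TABLE languages (
   language_code text PRIMARY KEY
);

-- seed it once from the existing data
INSERT INTO languages (language_code)
SELECT DISTINCT language_code
FROM   users
WHERE  language_code IS NOT NULL;

-- enforce referential integrity so the set stays complete
ALTER TABLE users
   ADD CONSTRAINT users_language_code_fkey
   FOREIGN KEY (language_code) REFERENCES languages (language_code);

With that in place, the whole rCTE in the query above can be replaced by a plain SELECT language_code FROM languages.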

The second part looks at a maximum of 101 rows (your threshold) per language code. That way we avoid an expensive sequential scan over the whole big table.
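
To see that effect in isolation, here is the inner probe for a single code ('en' is just a made-up sample value, not taken from the question); the LIMIT stops the scan as soon as the threshold is decided:

SELECT count(*) AS ct
FROM  (
   SELECT                        -- select list can stay empty
   FROM   users
   WHERE  language_code = 'en'   -- sample value
   LIMIT  101                    -- stop after proving "more than 100"
   ) sub;

Run with EXPLAIN (ANALYZE), this should read no more than 101 index tuples, no matter how common the code is.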

If your table is vacuumed enough, you should see index-only scans.
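
If autovacuum does not keep up with your write load, a manual run keeps the visibility map current (a general suggestion, not part of the original answer):

VACUUM (ANALYZE) users;

Afterwards, the probes should show up in the plan as Index Only Scan using users_language_code_idx, with few or no heap fetches.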

Upgrading to the current version, Postgres 13, should help some more, because the newly introduced index deduplication should make said index substantially smaller (as it is highly duplicative).

Sadly, automatic index skip scans did not make it into version 13. Maybe Postgres 14. But the emulation above should be almost as good.

Further reading (with detailed explanation of the query technique above):

  • Optimize GROUP BY query to retrieve latest row per user
  • Select first row in each GROUP BY group?