Get values appearing at least N times in a table quickly
I have a Postgres 10.10 database table with over 6 million rows, defined as follows:
create table users (
    id bigserial primary key,
    user_id text unique,
    username text,
    first_name text,
    last_name text,
    language_code text,
    gender text,
    first_seen timestamp with time zone,
    last_seen timestamp with time zone,
    search_language text,
    age text
);
create index users_language_code_idx on users (language_code);
create index users_last_seen_idx on users (last_seen);
create index users_first_seen_idx1 on users (first_seen);
create index users_age_idx on users (age);
create index users_last_seen_age_idx on users (last_seen, age);
I have a query to get the popular language codes that have more than 100 users:
SELECT language_code FROM users
GROUP BY language_code
HAVING count(*) > 100;
At some point, this query started taking a very long time to complete (~10 minutes). The btree index on language_code does not help. What else can I do to improve performance?
Here is the explain analyze output:
https://explain.depesz.com/s/j2ga
Finalize GroupAggregate (cost=7539479.67..7539480.34 rows=27 width=3) (actual time=620744.389..620744.458 rows=24 loops=1)
Group Key: language_code
Filter: (count(*) > 100)
Rows Removed by Filter: 60
-> Sort (cost=7539479.67..7539479.80 rows=54 width=11) (actual time=620744.359..620744.372 rows=84 loops=1)
Sort Key: language_code
Sort Method: quicksort Memory: 28kB
-> Gather (cost=7539472.44..7539478.11 rows=54 width=11) (actual time=620744.038..620744.727 rows=84 loops=1)
Workers Planned: 2
Workers Launched: 0
-> Partial HashAggregate (cost=7538472.44..7538472.71 rows=27 width=11) (actual time=620743.596..620743.633 rows=84 loops=1)
Group Key: language_code
-> Parallel Seq Scan on users (cost=0.00..7525174.96 rows=2659496 width=3) (actual time=0.377..616632.155 rows=6334894 loops=1)
Planning time: 0.194 ms
Execution time: 620745.276 ms
You can make the most of the index on (language_code) by emulating an index skip scan:
WITH RECURSIVE cte AS (
SELECT min(language_code) AS language_code
FROM users
UNION ALL
SELECT (SELECT language_code
FROM users
WHERE language_code > c.language_code
ORDER BY language_code
LIMIT 1)
FROM cte c
WHERE c.language_code IS NOT NULL
)
SELECT language_code
FROM cte c
JOIN LATERAL (
SELECT count(*) AS ct
FROM (
SELECT -- can stay empty
FROM users
WHERE language_code = c.language_code
LIMIT 101
) sub
) u ON ct > 100 -- "more than 100"
WHERE language_code IS NOT NULL;
db<>fiddle here
Given your numbers (6 million rows, but only a handful of distinct language codes), this should perform faster by orders of magnitude.
The first part - the recursive CTE (rCTE) named cte - produces the set of distinct language_code values in the table (excluding NULL). A table holding the distinct language codes could replace that part to make it faster still. (It might be a good idea to maintain such a table and enforce referential integrity with an FK constraint ...)
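A minimal sketch of that lookup-table variant (the table and constraint names are illustrative assumptions, not from the original post):

```sql
-- Hypothetical lookup table of distinct language codes.
CREATE TABLE languages (
    language_code text PRIMARY KEY
);

-- Seed it once from the existing data.
INSERT INTO languages (language_code)
SELECT DISTINCT language_code
FROM   users
WHERE  language_code IS NOT NULL;

-- Enforce referential integrity so the lookup table stays complete.
ALTER TABLE users
    ADD CONSTRAINT users_language_code_fkey
    FOREIGN KEY (language_code) REFERENCES languages (language_code);
```

With that in place, the recursive CTE in the query above could be replaced by a plain `SELECT language_code FROM languages`, avoiding the repeated index probes that enumerate distinct values.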
The second part only looks at a maximum of 101 rows (your threshold) per language code. That way we avoid an expensive sequential scan over the whole big table.
If your table is "vacuumed" enough, you should see index-only scans.
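For example, a quick way to refresh the visibility map (autovacuum normally handles this on its own) and check whether index-only scans kick in; the sample value 'en' is an assumption for illustration:

```sql
-- Update the visibility map and planner statistics so Postgres
-- can answer the per-code probes from the index alone.
VACUUM (ANALYZE) users;

-- Verify: the plan should show an "Index Only Scan using
-- users_language_code_idx" node rather than a heap scan.
EXPLAIN (ANALYZE)
SELECT count(*)
FROM   users
WHERE  language_code = 'en';
```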
Upgrading to the current version, Postgres 13, should help some more, as the newly introduced index deduplication should make said index considerably smaller (since it is highly duplicative).
Sadly, automatic index skip scans did not make it into version 13. Maybe Postgres 14. But the emulation above should be almost as good.
Further reading (with detailed explanations of the query technique above):
- Optimize GROUP BY query to retrieve latest row per user
- Select first row in each GROUP BY group?