Fast way to count distinct column values (using an index?)
Problem: query takes too long
I have a new table that looks like this, with 3e6 rows:
CREATE TABLE everything_crowberry (
    id SERIAL PRIMARY KEY,
    group_id INTEGER,
    group_type group_type_name,
    epub_id TEXT,
    reg_user_id INTEGER,
    device_id TEXT,
    campaign_id INTEGER,
    category_name TEXT,
    instance_name TEXT,
    protobuf TEXT,
    UNIQUE (group_id, group_type, reg_user_id, category_name, instance_name)
);
That generally makes sense for my context, and most queries are acceptably fast.
But a query like this is not fast:
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=8909.698..8909.699 rows=1 loops=1)
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.461..6347.272 rows=3198583 loops=1)
Planning time: 0.063 ms
Execution time: 8909.730 ms
(4 rows)
Time: 8910.110 ms
analytics_staging=> select count(distinct group_id) from everything_crowberry;
count
-------
481
Time: 8736.364 ms
I did create an index on group_id, and while that index is used in WHERE clauses, it is not used above. So I conclude that I have misunderstood how Postgres uses indexes. Note (per the query result) that there are fewer than 500 distinct group_ids.
CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);
Any pointers on what I've misunderstood, or on how to make this particular query run faster?
Update
To address questions raised in the comments, I'm adding the suggested changes here. For future readers, I've included the details to better show how this was debugged.
While experimenting, I noticed that most of the time is spent in the initial aggregate.
Seq scan
Turning off seqscan makes it much worse:
analytics_staging=> set enable_seqscan = false;
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=444062.28..444062.29 rows=1 width=4) (actual time=38927.323..38927.323 rows=1 loops=1)
-> Bitmap Heap Scan on everything_crowberry (cost=51884.99..436065.82 rows=3198583 width=4) (actual time=458.252..36167.789 rows=3198583 loops=1)
Heap Blocks: exact=35734 lossy=316446
-> Bitmap Index Scan on everything_crowberry_group (cost=0.00..51085.35 rows=3198583 width=0) (actual time=448.537..448.537 rows=3198583 loops=1)
Planning time: 0.064 ms
Execution time: 38927.971 ms
Time: 38930.328 ms
A WHERE clause makes it worse
Restricting to a fairly small set of group ids makes it worse, though I would have thought counting a smaller set of things would be easier.
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry WHERE group_id > 380;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=385954.43..385954.44 rows=1 width=4) (actual time=13438.422..13438.422 rows=1 loops=1)
-> Bitmap Heap Scan on everything_crowberry (cost=18742.95..383451.68 rows=1001099 width=4) (actual time=132.571..12673.233 rows=986572 loops=1)
Recheck Cond: (group_id > 380)
Rows Removed by Index Recheck: 70816
Heap Blocks: exact=49632 lossy=79167
-> Bitmap Index Scan on everything_crowberry_group (cost=0.00..18492.67 rows=1001099 width=0) (actual time=120.816..120.816 rows=986572 loops=1)
Index Cond: (group_id > 380)
Planning time: 1.294 ms
Execution time: 13439.017 ms
(9 rows)
Time: 13442.603 ms
explain(analyze, buffers)
analytics_staging=> explain(analyze, buffers) select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=7329.775..7329.775 rows=1 loops=1)
Buffers: shared hit=16283 read=335912, temp read=4693 written=4693
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.224..4615.015 rows=3198583 loops=1)
Buffers: shared hit=16283 read=335912
Planning time: 0.089 ms
Execution time: 7329.818 ms
Time: 7331.084 ms
work_mem is too small (see explain(analyze, buffers) above)
Increasing it from the default 4 MB to 10 MB helps, going from about 7300 ms to about 5500 ms.
Changing the SQL also helps a bit.
analytics_staging=> EXPLAIN(analyze, buffers) SELECT group_id FROM everything_crowberry GROUP BY group_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=4686.525..4686.612 rows=481 loops=1)
Group Key: group_id
Buffers: shared hit=96 read=352099
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.034..4017.122 rows=3198583 loops=1)
Buffers: shared hit=96 read=352099
Planning time: 0.094 ms
Execution time: 4686.686 ms
Time: 4687.461 ms
analytics_staging=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=5536.151..5536.262 rows=481 loops=1)
Group Key: group_id
Buffers: shared hit=128 read=352067
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.030..4946.024 rows=3198583 loops=1)
Buffers: shared hit=128 read=352067
Planning time: 0.074 ms
Execution time: 5536.321 ms
Time: 5537.380 ms
analytics_staging=> SELECT count(*) FROM (SELECT 1 FROM everything_crowberry GROUP BY group_id) ec;
count
-------
481
Time: 4927.671 ms
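The rewrites above are all logically equivalent ways of counting the distinct groups. As a quick sanity check, here is a sketch using Python's sqlite3 with made-up data (not the real table); all three queries return the same count:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE everything_crowberry (group_id INTEGER)")
# 5000 rows spread over 481 distinct group_ids
con.executemany("INSERT INTO everything_crowberry VALUES (?)",
                [(i % 481,) for i in range(5000)])

queries = [
    "SELECT count(DISTINCT group_id) FROM everything_crowberry",
    "SELECT count(*) FROM "
    "(SELECT group_id FROM everything_crowberry GROUP BY group_id) ec",
    "SELECT count(*) FROM "
    "(SELECT 1 FROM everything_crowberry GROUP BY group_id) ec",
]
results = [con.execute(q).fetchone()[0] for q in queries]
print(results)  # [481, 481, 481]
```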
Creating a view is a big win, but may create performance problems elsewhere.
analytics_production=> CREATE VIEW everything_crowberry_group_view AS select distinct group_id, group_type FROM everything_crowberry;
CREATE VIEW
analytics_production=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry_group_view;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.56..357898.89 rows=200 width=4) (actual time=0.046..1976.882 rows=447 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Subquery Scan on everything_crowberry_group_view (cost=0.56..357897.19 rows=680 width=4) (actual time=0.046..1976.616 rows=475 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Unique (cost=0.56..357890.39 rows=680 width=8) (actual time=0.044..1976.378 rows=475 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Index Only Scan using everything_crowberry_group_id_group_type_reg_user_id_catego_key on everything_crowberry (cost=0.56..343330.63 rows=2911953 width=8) (actual time=0.043..1656.409 rows=2912005 loops=1)
Heap Fetches: 290488
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
Planning time: 1.842 ms
Execution time: 1977.086 ms
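The view pattern can be sketched like this (sqlite3 for illustration; the table contents and type values here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE everything_crowberry (group_id INTEGER, group_type TEXT)")
con.executemany("INSERT INTO everything_crowberry VALUES (?, ?)",
                [(i % 10, "t%d" % (i % 3)) for i in range(300)])

# The view pre-collapses the table to its distinct (group_id, group_type) pairs,
# so DISTINCT group_id over the view touches far fewer rows.
con.execute("CREATE VIEW everything_crowberry_group_view AS "
            "SELECT DISTINCT group_id, group_type FROM everything_crowberry")

n_pairs = con.execute(
    "SELECT count(*) FROM everything_crowberry_group_view").fetchone()[0]
n_groups = con.execute(
    "SELECT count(*) FROM (SELECT DISTINCT group_id "
    "FROM everything_crowberry_group_view) v").fetchone()[0]
print(n_pairs, n_groups)  # 30 10
```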
I sometimes see problems with count(distinct) in Postgres.
How does this work?
select count(*)
from (select distinct group_id
      from everything_crowberry
     ) ec;
Or:
select count(*)
from (select distinct on (group_id) ec.*
      from everything_crowberry ec
     ) ec;
Note that NULL handling is slightly different, but the queries can easily be adjusted for that.
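To see that NULL handling difference concretely, here is a small sqlite3 sketch with made-up data (SQLite has no DISTINCT ON, so only the first form is shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (group_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,), (None,)])

# count(DISTINCT ...) skips NULL entirely: two distinct non-null values.
a = con.execute("SELECT count(DISTINCT group_id) FROM t").fetchone()[0]
# count(*) over a DISTINCT subquery also counts the NULL group: three rows.
b = con.execute("SELECT count(*) FROM "
                "(SELECT DISTINCT group_id FROM t) ec").fetchone()[0]
print(a, b)  # 2 3
```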
For relatively few distinct values in group_id (many rows per group), which seems to be your case:
3e6 rows / under 500 distinct group_id's
To make this fast, you need an index skip scan (a.k.a. loose index scan), which is not implemented up to Postgres 12. But you can work around that limitation with a recursive query:
Replace:
select count(distinct group_id) from everything_crowberry;
with:
WITH RECURSIVE cte AS (
   (SELECT group_id FROM everything_crowberry ORDER BY group_id LIMIT 1)
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM cte t
   WHERE t.group_id IS NOT NULL
)
SELECT count(group_id) FROM cte;
我使用 count(group_id)
而不是稍快的 count(*)
来方便地从最终递归中删除 NULL
值 - 因为 count(<expression>)
只计算非空值。
此外,group_id
是否可以是 NULL
并不重要,因为无论如何您的查询都不会计算在内。
可以使用已有的索引:
CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);
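The recursive CTE can be exercised against SQLite (which also supports recursive CTEs) as a quick sanity check. This is an illustration with made-up data, lightly adapted for SQLite: the first leg uses min(group_id) rather than ORDER BY ... LIMIT 1, because SQLite sorts NULLs first, unlike Postgres.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE everything_crowberry "
            "(id INTEGER PRIMARY KEY, group_id INTEGER)")
con.execute("CREATE INDEX everything_crowberry_group_id "
            "ON everything_crowberry(group_id)")
# 10,000 rows over 50 distinct group_ids, plus a few NULLs
con.executemany("INSERT INTO everything_crowberry (group_id) VALUES (?)",
                [(i % 50,) for i in range(10_000)] + [(None,)] * 5)

# Emulated loose index scan: each recursion step jumps to the next larger
# group_id, so the work scales with the number of distinct values,
# not the number of rows.
loose_scan = """
WITH RECURSIVE cte(group_id) AS (
   SELECT (SELECT min(group_id) FROM everything_crowberry)
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM cte t
   WHERE t.group_id IS NOT NULL
)
SELECT count(group_id) FROM cte
"""
fast = con.execute(loose_scan).fetchone()[0]
slow = con.execute("SELECT count(DISTINCT group_id) "
                   "FROM everything_crowberry").fetchone()[0]
print(fast, slow)  # 50 50
```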
Related:
- Optimize GROUP BY query to retrieve latest row per user
For many distinct values in group_id (few rows per group) - or for small tables - plain DISTINCT will be faster. It's typically fastest when done in a subquery, rather than as an added clause inside count():
SELECT count(group_id) -- or just count(*) to include possible NULL value
FROM (SELECT DISTINCT group_id FROM everything_crowberry) sub;