如果 IN 子句中的 table 为空,则 Postgresql 查询速度变慢
Postgresql Query slow if empty table in IN clause
我有以下SQL
WITH filtered_users_pre as (
SELECT value as username,row_number() OVER (partition by value) AS rk
FROM "user-stats".tag_table
WHERE _at_timestamp = 1626955200
AND tag in ('commercial','marketing')
),
filtered_users as (
SELECT username
FROM filtered_users_pre
WHERE rk = 2
),
valid_users as (
SELECT aa.username, aa.rank, aa.points, aa.version
FROM "users-results".ai_algo aa
WHERE aa._at_timestamp = 1626955200
AND aa.rank_timeframe = '7d'
AND aa.username IN (SELECT * FROM filtered_users)
ORDER BY aa.rank ASC
LIMIT 15
OFFSET 0
)
select * from valid_users;
"user-stats".tag_table
是一个 table,有大约 6000 万行,有适当的索引。
"users-results".ai_algo
是一个 table,大约有 1000 万行,具有适当的索引。
使用适当的索引 我的意思是上面的 WHERE 子句中出现的所有字段。
如果filtered_users
为空,查询需要4秒到运行。如果filtered_users
至少有一行,则需要400ms。
谁能解释一下为什么?有什么办法可以使查询 运行 具有相同的性能(400 毫秒)并且 filtered_users
为空?我期望通过减少 filtered_users
中的行数来获得更好的性能。这就是最多 1 行发生的情况。当行数为0时,需要10倍以上
如果我在 ai_algo
和 filtered_users
之间放置一个 INNER JOIN
而不是 WHERE 中的 IN
子句,当然会发生同样的情况
更新
这是当 filtered_users 有 0 行(执行 4 秒)
时的 EXPLAIN (ANALYZE, BUFFERS)
输出查询
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
Buffers: shared hit=7456641
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Buffers: shared hit=7456641
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
" Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
Buffers: shared hit=108
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
Filter: (filtered_users_pre.rk = 2)
Rows Removed by Filter: 2415
Buffers: shared hit=108
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
Buffers: shared hit=108
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=108
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
" Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
" Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms
这是 filtered_users 至少有 1 行(300 毫秒)
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
Buffers: shared hit=11042
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Rows Removed by Join Filter: 1544611
Buffers: shared hit=11042
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
" Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 16124
Buffers: shared hit=10937
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.003..0.174 rows=2395 loops=645)
Buffers: shared hit=105
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.895..3.680 rows=2415 loops=1)
Filter: (filtered_users_pre.rk = 1)
Buffers: shared hit=105
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.894..3.334 rows=2415 loops=1)
Buffers: shared hit=105
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.889..2.102 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=105
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.604..1.046 rows=2415 loops=1)
" Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.287..0.287 rows=4830 loops=1)
" Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms
问题是,如果没有匹配的 filtered_users
,PostgreSQL 必须遍历 all "users-results".ai_algo
而找不到 15 个结果行。如果子查询包含元素,它会快速找到 15 个匹配的 "users-results".ai_algo
行并可以终止处理。
对此你无能为力,但你可以加快 "users-results".ai_algo
的扫描速度。目前,您有
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa
... (actual time=0.085..3885.547 rows=313611 loops=1)
Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
您看到索引扫描没有达到应有的效果:它从 table 中读取了 313611 + 7793096 = 8106707 行,并丢弃了除 313611 之外与过滤条件匹配的所有行。
您可以通过创建一个只能直接查找结果行的索引来做得更好:
CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp);
那么您可以删除索引 ai_algo_rank_timeframe_rank_idx
,因为新索引可以做旧索引可以做的所有事情(甚至更多)。
我有以下SQL
WITH filtered_users_pre as (
SELECT value as username,row_number() OVER (partition by value) AS rk
FROM "user-stats".tag_table
WHERE _at_timestamp = 1626955200
AND tag in ('commercial','marketing')
),
filtered_users as (
SELECT username
FROM filtered_users_pre
WHERE rk = 2
),
valid_users as (
SELECT aa.username, aa.rank, aa.points, aa.version
FROM "users-results".ai_algo aa
WHERE aa._at_timestamp = 1626955200
AND aa.rank_timeframe = '7d'
AND aa.username IN (SELECT * FROM filtered_users)
ORDER BY aa.rank ASC
LIMIT 15
OFFSET 0
)
select * from valid_users;
"user-stats".tag_table
是一个 table,有大约 6000 万行,有适当的索引。
"users-results".ai_algo
是一个 table,大约有 1000 万行,具有适当的索引。
使用适当的索引 我的意思是上面的 WHERE 子句中出现的所有字段。
如果filtered_users
为空,查询需要4秒到运行。如果filtered_users
至少有一行,则需要400ms。
谁能解释一下为什么?有什么办法可以使查询 运行 具有相同的性能(400 毫秒)并且 filtered_users
为空?我期望通过减少 filtered_users
中的行数来获得更好的性能。这就是最多 1 行发生的情况。当行数为0时,需要10倍以上
如果我在 ai_algo
和 filtered_users
INNER JOIN
而不是 WHERE 中的 IN
子句,当然会发生同样的情况
更新 这是当 filtered_users 有 0 行(执行 4 秒)
时的EXPLAIN (ANALYZE, BUFFERS)
输出查询
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
Buffers: shared hit=7456641
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Buffers: shared hit=7456641
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
" Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
Buffers: shared hit=108
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
Filter: (filtered_users_pre.rk = 2)
Rows Removed by Filter: 2415
Buffers: shared hit=108
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
Buffers: shared hit=108
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=108
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
" Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
" Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms
这是 filtered_users 至少有 1 行(300 毫秒)
Limit (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
Buffers: shared hit=11042
-> Nested Loop Semi Join (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
Join Filter: (aa.username = filtered_users_pre.username)
Rows Removed by Join Filter: 1544611
Buffers: shared hit=11042
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
" Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 16124
Buffers: shared hit=10937
-> Materialize (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.003..0.174 rows=2395 loops=645)
Buffers: shared hit=105
-> Subquery Scan on filtered_users_pre (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.895..3.680 rows=2415 loops=1)
Filter: (filtered_users_pre.rk = 1)
Buffers: shared hit=105
-> WindowAgg (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.894..3.334 rows=2415 loops=1)
Buffers: shared hit=105
-> Sort (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.889..2.102 rows=2415 loops=1)
Sort Key: tag_table_20210722.value
Sort Method: quicksort Memory: 285kB
Buffers: shared hit=105
-> Bitmap Heap Scan on tag_table_20210722 (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.604..1.046 rows=2415 loops=1)
" Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 2415
Heap Blocks: exact=72
Buffers: shared hit=105
-> Bitmap Index Scan on tag_table_20210722_tag_idx (cost=0.00..145.57 rows=5428 width=0) (actual time=0.287..0.287 rows=4830 loops=1)
" Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms
问题是,如果没有匹配的 filtered_users
,PostgreSQL 必须遍历 all "users-results".ai_algo
而找不到 15 个结果行。如果子查询包含元素,它会快速找到 15 个匹配的 "users-results".ai_algo
行并可以终止处理。
对此你无能为力,但你可以加快 "users-results".ai_algo
的扫描速度。目前,您有
-> Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa
... (actual time=0.085..3885.547 rows=313611 loops=1)
Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
Filter: (_at_timestamp = 1626955200)
Rows Removed by Filter: 7793096
Buffers: shared hit=7456533
您看到索引扫描没有达到应有的效果:它从 table 中读取了 313611 + 7793096 = 8106707 行,并丢弃了除 313611 之外与过滤条件匹配的所有行。
您可以通过创建一个只能直接查找结果行的索引来做得更好:
CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp);
那么您可以删除索引 ai_algo_rank_timeframe_rank_idx
,因为新索引可以做旧索引可以做的所有事情(甚至更多)。