如果我使用另一个 ORDER BY,为什么 postgres 不再使用索引?
Why is postgres no longer using the index if I use another ORDER BY?
我在这里没有找到问题的答案,所以我想问一下:
我有以下 table 大约 18kk 行:
# SELECT COUNT(1) from report;
count
----------
18090892
(1 row)
# \d report
Table "public.report"
Column | Type | Collation | Nullable | Default
-------------+--------------------------+-----------+----------+---------
reporter_id | uuid | | not null |
parsed | boolean | | not null |
id | text | | not null |
request_id | uuid | | |
created | timestamp with time zone | | not null | now()
customer | text | | not null |
subject | text | | |
Indexes:
"PK_99e4d0bea58cba73c57f935a546" PRIMARY KEY, btree (id)
"idx_report_created_desc" btree (created DESC)
"idx_report_reporter_id_asc_created_desc" btree (reporter_id, created DESC)
"idx_report_request_id_asc_created_desc" btree (request_id, created DESC)
Foreign-key constraints:
"FK_5b809608bb38d119333b69f65f9" FOREIGN KEY (request_id) REFERENCES request(id)
"FK_d41df66b60944992386ed47cf2e" FOREIGN KEY (reporter_id) REFERENCES reporter(id)
如果我使用 ORDER BY created DESC LIMIT 25
则使用索引:
# EXPLAIN ANALYZE SELECT * FROM report ORDER BY created DESC LIMIT 25;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
--------------------------------------------
Limit (cost=0.44..2.49 rows=25 width=169) (actual time=0.035..0.063 rows=25 loops=1)
-> Index Scan using idx_report_created_desc on report (cost=0.44..1482912.16 rows=18090892 width=169)
(actual time=0.033..0.051 rows=25 loops=1)
Planning Time: 0.239 ms
Execution Time: 0.105 ms
(4 rows)
但是,如果我使用ORDER BY created DESC, id ASC LIMIT 25
,则不再使用该索引:
# EXPLAIN ANALYZE SELECT * FROM "report" ORDER BY "created" DESC, "id" ASC LIMIT 25;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=587891.07..587893.99 rows=25 width=169) (actual time=2719.606..2726.355 rows=25 loops=1)
-> Gather Merge (cost=587891.07..2346850.67 rows=15075744 width=169) (actual time=2711.873..2718.618 rows=25 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=586891.04..605735.72 rows=7537872 width=169) (actual time=2643.445..2643.448 rows=21 loops=3)
Sort Key: created DESC, id
Sort Method: top-N heapsort Memory: 35kB
Worker 0: Sort Method: top-N heapsort Memory: 32kB
Worker 1: Sort Method: top-N heapsort Memory: 31kB
-> Parallel Seq Scan on report (cost=0.00..374177.72 rows=7537872 width=169) (actual time=0.018..1910.204 rows=6030297 loops=3)
Planning Time: 0.396 ms
JIT:
Functions: 1
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 4.757 ms, Inlining 0.172 ms, Optimization 5.053 ms, Emission 2.003 ms, Total 11.985 ms
Execution Time: 2731.226 ms
(16 rows)
如果我理解正确,应该仍然使用索引,因为应该返回相同的结果集,只是可能按照 ORDER BY id ASC
确定的不同顺序。
所以我想知道为什么 postgres 决定进行并行 seq 扫描而不是使用索引然后按 id 对返回的 25 行进行排序?这绝对应该比并行 seq 扫描更快,不是吗?
或者我哪里错了?
这里最可能的解释是您目前只有 created DESC
上的索引。结果,当您执行以下查询时:
SELECT * FROM report ORDER BY created DESC, id
单列created
索引的叶节点没有可用的id
值。如果 Postgres 要使用这个索引,它必须为每个叶节点寻找回原始的 table(聚集索引)。这种来回查找的成本可能很高,有时可能会超过首先使用索引的好处。
如果你需要这样的两层排序,那么添加一个覆盖它的索引:
CREATE INDEX idx_new ON report (created DESC, id);
请注意,假设 id
是主键,某些数据库(例如 MySQL)会自动将 id
列标记为您当前的 created DESC
索引其中 table。但是在Postgres中好像不是这样。
PostgreSQL 并非无限聪明。有些事情它没有弄清楚,尽管理论上可以弄清楚。
但它一直在变得越来越聪明。升级到版本 13,看看会发生什么。它应该使用索引扫描加一个非常快的'incremental sort'。增量排序只需要打破“已创建”之间的联系,我认为这种情况很少见。
If I understand correctly, the index should still be used because the same set of results should be returned, only possibly in a different order determined by ORDER BY id ASC
但是在存在LIMIT的情况下,以不同的顺序返回结果意味着返回不同结果的可能性。因此,它必须采取特殊措施来解决这个问题。在 v13 中是这样。
我在这里没有找到问题的答案,所以我想问一下:
我有以下 table 大约 18kk 行:
# SELECT COUNT(1) from report;
count
----------
18090892
(1 row)
# \d report
Table "public.report"
Column | Type | Collation | Nullable | Default
-------------+--------------------------+-----------+----------+---------
reporter_id | uuid | | not null |
parsed | boolean | | not null |
id | text | | not null |
request_id | uuid | | |
created | timestamp with time zone | | not null | now()
customer | text | | not null |
subject | text | | |
Indexes:
"PK_99e4d0bea58cba73c57f935a546" PRIMARY KEY, btree (id)
"idx_report_created_desc" btree (created DESC)
"idx_report_reporter_id_asc_created_desc" btree (reporter_id, created DESC)
"idx_report_request_id_asc_created_desc" btree (request_id, created DESC)
Foreign-key constraints:
"FK_5b809608bb38d119333b69f65f9" FOREIGN KEY (request_id) REFERENCES request(id)
"FK_d41df66b60944992386ed47cf2e" FOREIGN KEY (reporter_id) REFERENCES reporter(id)
如果我使用 ORDER BY created DESC LIMIT 25
则使用索引:
# EXPLAIN ANALYZE SELECT * FROM report ORDER BY created DESC LIMIT 25;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
--------------------------------------------
Limit (cost=0.44..2.49 rows=25 width=169) (actual time=0.035..0.063 rows=25 loops=1)
-> Index Scan using idx_report_created_desc on report (cost=0.44..1482912.16 rows=18090892 width=169)
(actual time=0.033..0.051 rows=25 loops=1)
Planning Time: 0.239 ms
Execution Time: 0.105 ms
(4 rows)
但是,如果我使用ORDER BY created DESC, id ASC LIMIT 25
,则不再使用该索引:
# EXPLAIN ANALYZE SELECT * FROM "report" ORDER BY "created" DESC, "id" ASC LIMIT 25;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=587891.07..587893.99 rows=25 width=169) (actual time=2719.606..2726.355 rows=25 loops=1)
-> Gather Merge (cost=587891.07..2346850.67 rows=15075744 width=169) (actual time=2711.873..2718.618 rows=25 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=586891.04..605735.72 rows=7537872 width=169) (actual time=2643.445..2643.448 rows=21 loops=3)
Sort Key: created DESC, id
Sort Method: top-N heapsort Memory: 35kB
Worker 0: Sort Method: top-N heapsort Memory: 32kB
Worker 1: Sort Method: top-N heapsort Memory: 31kB
-> Parallel Seq Scan on report (cost=0.00..374177.72 rows=7537872 width=169) (actual time=0.018..1910.204 rows=6030297 loops=3)
Planning Time: 0.396 ms
JIT:
Functions: 1
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 4.757 ms, Inlining 0.172 ms, Optimization 5.053 ms, Emission 2.003 ms, Total 11.985 ms
Execution Time: 2731.226 ms
(16 rows)
如果我理解正确,应该仍然使用索引,因为应该返回相同的结果集,只是可能按照 ORDER BY id ASC
确定的不同顺序。
所以我想知道为什么 postgres 决定进行并行 seq 扫描而不是使用索引然后按 id 对返回的 25 行进行排序?这绝对应该比并行 seq 扫描更快,不是吗?
或者我哪里错了?
这里最可能的解释是您目前只有 created DESC
上的索引。结果,当您执行以下查询时:
SELECT * FROM report ORDER BY created DESC, id
单列created
索引的叶节点没有可用的id
值。如果 Postgres 要使用这个索引,它必须为每个叶节点寻找回原始的 table(聚集索引)。这种来回查找的成本可能很高,有时可能会超过首先使用索引的好处。
如果你需要这样的两层排序,那么添加一个覆盖它的索引:
CREATE INDEX idx_new ON report (created DESC, id);
请注意,假设 id
是主键,某些数据库(例如 MySQL)会自动将 id
列标记为您当前的 created DESC
索引其中 table。但是在Postgres中好像不是这样。
PostgreSQL 并非无限聪明。有些事情它没有弄清楚,尽管理论上可以弄清楚。
但它一直在变得越来越聪明。升级到版本 13,看看会发生什么。它应该使用索引扫描加一个非常快的'incremental sort'。增量排序只需要打破“已创建”之间的联系,我认为这种情况很少见。
If I understand correctly, the index should still be used because the same set of results should be returned, only possibly in a different order determined by ORDER BY id ASC
但是在存在LIMIT的情况下,以不同的顺序返回结果意味着返回不同结果的可能性。因此,它必须采取特殊措施来解决这个问题。在 v13 中是这样。