为什么 Postgres 更喜欢 seq 扫描而不是具有显式 where 条件的部分索引？

Question

我有一个像 select * from xxx where col is not null limit 10 这样的简单查询。我不知道为什么 Postgres 更喜欢 seq 扫描，它比部分索引慢得多（我已经分析了 table）。如何调试这样的问题？

table 有超过 400 万行。大约 350,000 行满足 pid is not null.

我认为成本估算可能有问题。 seq 扫描的成本低于索引扫描。但是如何深入研究呢？

我有一个猜测，但不确定。非空行约占总行数的 10%。这意味着当 seq 扫描 100 行时，它可能会得到 10 行不为空的行。并且它认为 seq 扫描 100 行的成本低于索引扫描 10 行然后随机获取 10 个完整行的成本。是吗？

> \d data_import
+--------------------+--------------------------+----------------------------------------------------------------------------+
| Column             | Type                     | Modifiers                                                                  |
|--------------------+--------------------------+----------------------------------------------------------------------------|
| id                 | integer                  |  not null default nextval('data_import_id_seq'::regclass) |
| name               | character varying(64)    |                                                                            |
| market_activity_id | integer                  |  not null                                                                  |
| hmsr_id            | integer                  |  not null default (-1)                                                     |
| site_id            | integer                  |  not null default (-1)                                                     |
| hmpl_id            | integer                  |  not null default (-1)                                                     |
| hmmd_id            | integer                  |  not null default (-1)                                                     |
| hmci_id            | integer                  |  not null default (-1)                                                     |
| hmkw_id            | integer                  |  not null default (-1)                                                     |
| creator_id         | integer                  |                                                                            |
| created_at         | timestamp with time zone |                                                                            |
| updated_at         | timestamp with time zone |                                                                            |
| bias               | integer                  |                                                                            |
| pid                | character varying(128)   |  default NULL::character varying                                           |
+--------------------+--------------------------+----------------------------------------------------------------------------+
Indexes:
    "data_import_pkey" PRIMARY KEY, btree (id)
    "unique_hmxx" UNIQUE, btree (site_id, hmsr_id, hmpl_id, hmmd_id, hmci_id, hmkw_id) WHERE pid IS NULL
    "data_import_pid_idx" UNIQUE, btree (pid) WHERE pid IS NOT NULL
    "data_import_created_at_idx" btree (created_at)
    "data_import_hmsr_id" btree (hmsr_id)
    "data_import_updated_at_idx" btree (updated_at)



> set enable_seqscan to false;
apollon> explain (analyse, verbose)  select * from data_import where pid is not null limit 10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
| QUERY PLAN
|-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Limit  (cost=0.42..5.68 rows=10 width=84) (actual time=0.059..0.142 rows=10 loops=1)
|   Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
|   ->  Index Scan using data_import_pid_idx on public.data_import  (cost=0.42..184158.08 rows=350584 width=84) (actual time
|         Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
|         Index Cond: (data_import.pid IS NOT NULL)
| Planning time: 0.126 ms
| Execution time: 0.177 ms
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN
Time: 0.054s

> set enable_seqscan to true;
> explain (analyse, verbose)  select * from data_import where pid is not null limit 10
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                        |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=0.00..2.37 rows=10 width=84) (actual time=407.042..407.046 rows=10 loops=1)                                                          |
|   Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid       |
|   ->  Seq Scan on public.data_import  (cost=0.00..83016.60 rows=350584 width=84) (actual time=407.041..407.045 rows=10 loops=1)  |
|         Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
|         Filter: (data_import.pid IS NOT NULL)                                                                                    |
|         Rows Removed by Filter: 3672502                                                                                                           |
| Planning time: 0.116 ms                                                                                                                           |
| Execution time: 407.078 ms                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.426s

Answer 1

你的问题是

Rows Removed by Filter: 3672502

PostgreSQL 知道值的分布以及它们如何与物理 table 布局相关，但它不知道 table 开头的行对于 pid.

如果 NULL 均匀分布，顺序扫描会很快找到 10 个匹配项并停止，但实际上，它必须读取 3672512 行才能找到 10 个匹配项。

如果您在 LIMIT 之前添加 ORDER BY pid（即使您不需要它），优化器会做正确的事情。

为什么 Postgres 更喜欢 seq 扫描而不是具有显式 where 条件的部分索引？

why Postgres prefer seq scan to partial index with explicit where condition?

database

postgresql

indexing

database-performance

postgresql-9.4