Why does PostgreSQL not use the PK index?

Question

If I select 0.5% of the rows, or even 5% of the rows, from the following table by its PK, the query planner correctly chooses to use the PK index. Here is the table:

create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id, 
50 + 50*sin(id) as temperature_in_f, 
50 + 50*sin(id) as humidity_in_percent
from numbers;

alter table weather
add constraint pk_weather primary key(id);

vacuum analyze weather;

The statistics are up to date, and the following query does use the PK index:

explain analyze select sum(w.id), sum(humidity_in_percent), count(*) 
from weather as w
where w.id between 1 and 66720;

However, suppose we need to join this table with another, much smaller one:

create table lightnings 
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;

alter table lightnings
add constraint pk_lightnings
primary key(weather_id);

analyze lightnings;

Here is my join, in four logically equivalent forms:

explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
  where l.weather_id=w.id);

explain analyze select sum(w.id), count(*) 
from weather as w
join lightnings as l
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;

explain analyze select sum(w.id), count(*) 
from lightnings as l
join weather as w
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;

-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*) 
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;

Unfortunately, the query planner resorts to scanning the whole weather table:

"Aggregate  (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
"  ->  Hash Join  (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
"        Hash Cond: (w.id = l.weather_id)"
"        ->  Seq Scan on weather w  (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
"              Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
"              Rows Removed by Filter: 1041904"
"        ->  Hash  (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 235kB"
"              ->  Seq Scan on lightnings l  (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"

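The query planner's estimate of the number of rows selected from the weather table is rows=5106, which is reasonably close to the exact value of 6672. If I select this small number of rows from the weather table by id, the PK index is used. If I select the same number of rows via a join with another table, the query planner scans the whole table instead.

What am I missing?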

select version()
"PostgreSQL 9.4.0"

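Edit: if I remove the condition on humidity, the query planner correctly recognizes that the condition on weather.id is highly selective and chooses to use the index on the PK: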

explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
  where l.weather_id=w.id);
"Aggregate  (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
"        ->  Seq Scan on lightnings l  (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
"        ->  Index Only Scan using pk_weather on weather w  (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
"              Index Cond: (id = l.weather_id)"
"              Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"

Yet adding one more condition completely confuses the query planner.

Answer 1

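I believe the difference you are seeing between the first query (which uses the index) and the other 3 (which do not) lies in the where clause.

In the first query, your where clause is on w.id, which is indexed.

In the other 3, the effective where clause is on w.humidity_in_percent. I tested the following...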

create index wh_idx on weather(humidity_in_percent);

explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
  where l.weather_id=w.id);

and got a much better plan. I tried to post the actual plan, but I could not format it to display properly, sorry.

Answer 2

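Expecting the optimizer to use the index on the PK of the larger table implies that you expect the query to be driven from the smaller table. Of course, you know that the rows the smaller table joins to in the larger one are the same rows that the predicate on the larger table selects, but the optimizer does not.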
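To see what the optimizer does know, you can look at the per-column statistics it has collected. The query below is a quick sketch against the standard pg_stats view, using the table names from the question. Nothing in these statistics links the humidity range on weather to the ids stored in lightnings, which is why that relationship is invisible to the planner:

```sql
-- Per-column statistics the planner has for the two tables.
select tablename, attname, n_distinct, correlation
from pg_stats
where tablename in ('weather', 'lightnings')
order by tablename, attname;
```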

Look at this line of the plan:

Hash Join  (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)

It expects the join to produce 32 rows, but it actually produces 6672.
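As a rough check of where the 32 comes from, and assuming the planner applies its usual equi-join estimate (outer rows times inner rows, divided by the larger distinct-value count of the two join columns), the figures in the plan reproduce it. The distinct count of 1048576 for weather.id is taken from the table definition (ids 0 to 1048575), not from the plan itself:

```sql
-- Estimated join rows = filtered weather rows * lightnings rows / max(ndistinct).
-- 5106 and 6672 are the planner's row estimates from the plan above;
-- 1048576 is the number of distinct weather.id values (0..1048575).
select round(5106.0 * 6672 / 1048576) as estimated_join_rows;
```

This comes out at about 32. In other words, the planner treats the humidity filter and the join as independent, so it expects only a tiny fraction of the 5106 filtered rows to find a match in lightnings.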

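Either way, it pretty much has a choice between:

A full scan of the smaller table with index lookups on the larger one, with the predicate used to filter out rows after the join (and expecting most rows to be filtered out at that point).
Full scans of both tables, with the predicate on the larger table removing rows, and a hash join on the result.
A scan of the larger table, with the predicate removing rows, and index lookups on the smaller table that may fail to find a value.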

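The second is estimated to have the lowest cost, and I think that is the right choice given the evidence the planner has, because hash joins are very efficient for joining a lot of rows.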
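One way to see the costs the planner assigned to the alternatives is to discourage the chosen plan for a single transaction and look at what it falls back to. This is only a sketch for experimenting on the test tables from the question. enable_hashjoin and enable_mergejoin are standard planner settings, but which plan you get, and its timings, will depend on your version and configuration:

```sql
begin;
-- Discourage hash and merge joins for this transaction only,
-- so the planner has to cost a nested-loop alternative.
set local enable_hashjoin = off;
set local enable_mergejoin = off;

explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;

-- Nothing was modified; rollback also discards the local settings.
rollback;
```

Comparing the estimated cost of that plan with the 22645.69 of the hash join shows how the planner ranked them. These settings are meant for diagnosis, not as a permanent fix.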

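Of course, in this particular case an index on weather(humidity_in_percent, id) would probably be more effective, but I suspect this is a modified version of your real situation (summing an id column?), so specific advice may not apply.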
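A minimal sketch of that index on the test tables from the question follows. The index name is made up here, and whether the planner actually picks it up should be checked with explain analyze rather than assumed:

```sql
-- Hypothetical index name; covers the filter column and the join column.
create index weather_humidity_id_idx
  on weather(humidity_in_percent, id);

-- Refresh statistics and the visibility map after building the index.
vacuum analyze weather;

explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
  on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
```

With both columns in the index, the weather side of the query can in principle be answered from the index alone, since it only needs humidity_in_percent and id.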

postgresql

query-optimization