为什么 PostgreSql 不使用 PK 索引?
Why PostgreSql does not use PK index?
如果我想通过 PK select 0.5% 的行,甚至来自以下 table 的 5% 行,查询规划器会正确选择使用 PK 索引。这是 table:
create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id,
50 + 50*sin(id) as temperature_in_f,
50 + 50*sin(id) as humidity_in_percent
from numbers;
alter table weather
add constraint pk_weather primary key(id);
vacuum analyze weather;
统计数据是最新的,以下查询确实使用了 PK 索引:
explain analyze select sum(w.id), sum(humidity_in_percent), count(*)
from weather as w
where w.id between 1 and 66720;
然而,假设我们需要将此 table 与另一个更小的连接:
create table lightnings
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;
alter table lightnings
add constraint pk_lightnings
primary key(weather_id);
analyze lightnings;
这是我的连接,有四种逻辑上等价的形式:
explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
explain analyze select sum(w.id), count(*)
from lightnings as l
join weather as w
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*)
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;
不幸的是,查询规划器求助于扫描整个天气 table:
"Aggregate (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
" -> Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
" Hash Cond: (w.id = l.weather_id)"
" -> Seq Scan on weather w (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
" Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
" Rows Removed by Filter: 1041904"
" -> Hash (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 235kB"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"
查询规划器对天气 table 中的行数的估计 selected 是行数=5106。这或多或少接近于 6672 的精确值。如果我 select weather table 中的这一小行数通过 id,则使用 PK 索引。如果我 select 通过与另一个 table 的连接获得相同的数量,查询计划程序将扫描 table.
我错过了什么?
select version()
"PostgreSQL 9.4.0"
编辑:如果我删除湿度条件,查询计划器会正确识别 weather.id 条件非常 selective,并选择在 PK 上使用索引:
explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
where l.weather_id=w.id);
"Aggregate (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
" -> Nested Loop (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
" -> Index Only Scan using pk_weather on weather w (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
" Index Cond: (id = l.weather_id)"
" Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"
然而添加一个条件完全混淆了查询规划器。
我相信您在第一个查询(使用索引)和其他 3 个不使用索引的查询之间看到的差异在 where 子句中。
在第一个查询中,您的 where 子句位于已编入索引的 w.id 上。
在另外3个中,有效的where子句在w.humidity_in_percent上。我测试了以下...
create index wh_idx on weather(humidity_in_percent);
explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
并获得更好的计划。我试图 post 返回实际计划,但我无法格式化它以正确显示,抱歉。
期望优化器在较大 table 的 PK 上使用索引意味着您希望查询从较小的 table 驱动。当然,您知道较小的 table 将连接到较大的行中的行与其上的谓词选择的行相同,但优化器不会。
看图上的线:
Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
它期望从连接中产生 32 行,但实际上产生了 6672 行。
无论如何,它几乎可以选择:
- 对较小的 table 进行全面扫描,对较大的进行索引查找,谓词用于过滤掉连接后的行(并期望随后过滤掉大部分行).
- 对两个 table 进行全面扫描,较大的 table 上的谓词删除行,并对结果进行散列连接。
- 扫描较大的 table,谓词删除了行,并对较小的 table 进行索引查找,可能找不到值。
第二个被认为是成本最低的,我认为根据它所拥有的证据这样做是正确的,因为散列连接对于连接许多行非常有效。
当然,在这种特殊情况下,在 weather(humidity_in_percent,id) 上放置索引可能会更有效,但我怀疑这是您实际情况的修改版本(总和id 列?)所以具体建议可能不适用。
如果我想通过 PK select 0.5% 的行,甚至来自以下 table 的 5% 行,查询规划器会正确选择使用 PK 索引。这是 table:
create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id,
50 + 50*sin(id) as temperature_in_f,
50 + 50*sin(id) as humidity_in_percent
from numbers;
alter table weather
add constraint pk_weather primary key(id);
vacuum analyze weather;
统计数据是最新的,以下查询确实使用了 PK 索引:
explain analyze select sum(w.id), sum(humidity_in_percent), count(*)
from weather as w
where w.id between 1 and 66720;
然而,假设我们需要将此 table 与另一个更小的连接:
create table lightnings
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;
alter table lightnings
add constraint pk_lightnings
primary key(weather_id);
analyze lightnings;
这是我的连接,有四种逻辑上等价的形式:
explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
explain analyze select sum(w.id), count(*)
from lightnings as l
join weather as w
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*)
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;
不幸的是,查询规划器求助于扫描整个天气 table:
"Aggregate (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
" -> Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
" Hash Cond: (w.id = l.weather_id)"
" -> Seq Scan on weather w (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
" Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
" Rows Removed by Filter: 1041904"
" -> Hash (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 235kB"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"
查询规划器对天气 table 中的行数的估计 selected 是行数=5106。这或多或少接近于 6672 的精确值。如果我 select weather table 中的这一小行数通过 id,则使用 PK 索引。如果我 select 通过与另一个 table 的连接获得相同的数量,查询计划程序将扫描 table.
我错过了什么?
select version()
"PostgreSQL 9.4.0"
编辑:如果我删除湿度条件,查询计划器会正确识别 weather.id 条件非常 selective,并选择在 PK 上使用索引:
explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
where l.weather_id=w.id);
"Aggregate (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
" -> Nested Loop (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
" -> Index Only Scan using pk_weather on weather w (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
" Index Cond: (id = l.weather_id)"
" Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"
然而添加一个条件完全混淆了查询规划器。
我相信您在第一个查询(使用索引)和其他 3 个不使用索引的查询之间看到的差异在 where 子句中。
在第一个查询中,您的 where 子句位于已编入索引的 w.id 上。
在另外3个中,有效的where子句在w.humidity_in_percent上。我测试了以下...
create index wh_idx on weather(humidity_in_percent);
explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
并获得更好的计划。我试图 post 返回实际计划,但我无法格式化它以正确显示,抱歉。
期望优化器在较大 table 的 PK 上使用索引意味着您希望查询从较小的 table 驱动。当然,您知道较小的 table 将连接到较大的行中的行与其上的谓词选择的行相同,但优化器不会。
看图上的线:
Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
它期望从连接中产生 32 行,但实际上产生了 6672 行。
无论如何,它几乎可以选择:
- 对较小的 table 进行全面扫描,对较大的进行索引查找,谓词用于过滤掉连接后的行(并期望随后过滤掉大部分行).
- 对两个 table 进行全面扫描,较大的 table 上的谓词删除行,并对结果进行散列连接。
- 扫描较大的 table,谓词删除了行,并对较小的 table 进行索引查找,可能找不到值。
第二个被认为是成本最低的,我认为根据它所拥有的证据这样做是正确的,因为散列连接对于连接许多行非常有效。
当然,在这种特殊情况下,在 weather(humidity_in_percent,id) 上放置索引可能会更有效,但我怀疑这是您实际情况的修改版本(总和id 列?)所以具体建议可能不适用。