PostgreSQL is not using a straightforward index

Question

I have a PostgreSQL 10.6 database on Amazon RDS. My table looks like this:

CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country varchar(100) NOT NULL,
    product_group varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    category varchar(100) NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    data_status bpchar(1) NOT NULL,
    panel_market_units numeric(38,6) NOT NULL,
    panel_market_sales_value_eur numeric(38,6) NOT NULL,
    panel_market_sales_value_usd numeric(38,6) NOT NULL,
    panel_market_sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
    CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);

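I understand that a primary key implies a unique index.

If I simply count the rows while filtering on a value that does not exist (release_key = 1 returns nothing), I can see that it uses the index: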

EXPLAIN
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
  WHERE release_key = 1

Aggregate  (cost=6.32..6.33 rows=1 width=8)
  ->  Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter  (cost=0.55..6.32 rows=1 width=0)
        Index Cond: (release_key = 1)

But if I run the same query for a value that does return data, it scans the table, which is bound to be more expensive...

EXPLAIN
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
  WHERE release_key = 2

Finalize Aggregate  (cost=47611.07..47611.08 rows=1 width=8)
  ->  Gather  (cost=47610.86..47611.07 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=46610.86..46610.87 rows=1 width=8)
              ->  Parallel Seq Scan on dfo_by_quarter  (cost=0.00..46307.29 rows=121428 width=0)
                    Filter: (release_key = 2)

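I understand that using the index when there is no data makes sense, and that the choice is driven by the statistics on the table (I ran ANALYZE before testing).

But why is my index not used when there is data?

Surely scanning part of the index (since release_key is the first column) must be faster than scanning the whole table?

I must be missing something...?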

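Update 2019-03-07

Thank you for your comments, they are very useful.

This simple query was just my attempt to understand why the index was not being used...

But I should have known better (I am new to PostgreSQL but have many years of experience with SQL Server). As you commented, it makes sense that the index is not used:

Poor selectivity, because my criterion only narrows the result down to about 20% of the rows
Bad table design (too fat, which we know about and are addressing)
The index does not "cover" the query, etc...

So let me change my question "slightly", if I may...

Our table will be normalized into facts/dimensions (no more varchars in the wrong place).

We only do inserts, never updates, and deletes are so rare that we can ignore them.

The table will not be huge (on the order of tens of millions of rows).

Our queries will always specify an exact release_key value.

The new version of our table will look like this: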

CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country_key int2 NOT NULL,
    product_group_key int2 NOT NULL,
    distribution_type_key int2 NOT NULL,
    category_key int2 NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
    CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
    CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
    CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
    CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
    CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);

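With that in mind, in a SQL Server environment I could solve this either by using a "Clustered" primary key (which sorts the whole table) or by adding an index on the primary key columns with the INCLUDE option to cover the other columns the query needs (units, values, etc.).

Question 1)

In PostgreSQL, is there an equivalent of the SQL Server clustered index? A way to physically sort the whole table? I suppose it might be difficult because PostgreSQL does not do updates "in place", which could make maintaining the sort order expensive...

Or, is there a way to create something like a SQL Server index WITH INCLUDE(units, values)?

Update: I came across the SQL CLUSTER command, which I think is the closest thing. It would be suitable for us.

Question 2)

With the query below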

EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
  SELECT
    ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
    "year",
    "main"."product_group_key" AS "productgroupkey",
    SUM("main"."units") AS "salesunits",
    SUM("main"."sales_value_eur") AS "salesvalue",
    SUM("sales_value_eur")/SUM("units") AS "asp"
  FROM "milly"."dfo_by_quarter" AS "main"

  WHERE
    "release_key" = 17 AND
    "main"."year" >= 2010
  GROUP BY
    "year",
    "main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
  "year" AS date,
  SUM("salesunits") AS "salesunits",
  SUM("salesvalue") AS "salesvalue",
  SUM("salesvalue")/SUM("salesunits") AS "asp",
  CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
  "rank_query"
GROUP BY
  "year",
  CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
  FROM BeforeLookup AS BL
  INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;

I get

Hash Join  (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
  Hash Cond: (bl.productgroupkey = pg.product_group_key)
  Buffers: shared hit=483 read=22719
  CTE rank_query
    ->  WindowAgg  (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
          Buffers: shared hit=480 read=22719
          ->  Sort  (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
                Sort Key: main.year, (sum(main.units)) DESC
                Sort Method: quicksort  Memory: 616kB
                Buffers: shared hit=480 read=22719
                ->  Finalize HashAggregate  (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
                      Group Key: main.year, main.product_group_key
                      Buffers: shared hit=480 read=22719
                      ->  Gather  (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            Buffers: shared hit=480 read=22719
                            ->  Partial HashAggregate  (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
                                  Group Key: main.year, main.product_group_key
                                  Buffers: shared hit=480 read=22719
                                  ->  Parallel Seq Scan on dfo_by_quarter main  (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
                                        Filter: ((year >= 2010) AND (release_key = 17))
                                        Rows Removed by Filter: 546602
                                        Buffers: shared hit=480 read=22719
  CTE beforelookup
    ->  HashAggregate  (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
          Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
          Buffers: shared hit=480 read=22719
          ->  CTE Scan on rank_query  (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
                Buffers: shared hit=480 read=22719
  ->  CTE Scan on beforelookup bl  (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
        Buffers: shared hit=480 read=22719
  ->  Hash  (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=3
        ->  Seq Scan on dim_dfo_product_group pg  (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
              Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms

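Does anything spring to mind?

If I read it correctly, my biggest cost is the initial scan of the table... but I cannot get it to use an index...

I created an index that I hoped would help, but it was ignored...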

CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);

ANALYZE milly.dfo_by_quarter;

I also tried clustering the table, but that had no visible effect either

CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)

ANALYZE milly.dfo_by_quarter;

Many thanks

Eric

Answer 1

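Generally, while possible, a PK spanning 7 columns, several of them varchar(100), is not optimized for performance, to say the least.

Such an index is large to begin with and tends to bloat quickly if you have updates on the columns involved.

I would use a surrogate PK, a serial (or bigserial, if you have that many rows). Or IDENTITY. See: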

Auto increment table column

And add a UNIQUE constraint on all 7 columns to enforce uniqueness (all of them are NOT NULL anyway).
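A minimal sketch of that layout, assuming PostgreSQL 10 or later for the IDENTITY syntax. The column name dfo_by_quarter_id and both constraint names are made up for illustration, and only two of the measure columns are spelled out:

```sql
-- Hypothetical variant of the table from the question: a surrogate key
-- plus a UNIQUE constraint over the seven columns of the old PK.
CREATE TABLE dpretailer.dfo_by_quarter (
    dfo_by_quarter_id bigint GENERATED ALWAYS AS IDENTITY,  -- assumed name
    release_key       int4          NOT NULL,
    country           varchar(100)  NOT NULL,
    category          varchar(100)  NOT NULL,
    product_group     varchar(100)  NOT NULL,
    distribution_type varchar(100)  NOT NULL,
    "year"            int2          NOT NULL,
    quarter           int2          NOT NULL,
    units             numeric(38,6) NOT NULL,
    sales_value_eur   numeric(38,6) NOT NULL,
    -- remaining measure columns as in the question
    CONSTRAINT pk_dfo_by_quarter_id PRIMARY KEY (dfo_by_quarter_id),
    CONSTRAINT uq_dfo_by_quarter UNIQUE
        (release_key, country, category, product_group,
         distribution_type, "year", quarter)
);
```

Note that the UNIQUE constraint is still backed by a btree index over the same seven columns, so this mainly helps if other tables reference the row through the narrow surrogate key.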

If you have many counting queries whose only predicate is on release_key, consider adding a plain btree index on just that column.
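Something along these lines could be tried; the index name is hypothetical, and the planner may still prefer a sequential scan when the value matches a large share of the table:

```sql
-- Hypothetical index name; single-column btree (the default index type)
-- on the only column used in the WHERE clause.
CREATE INDEX dfo_by_quarter_release_key_idx
    ON dpretailer.dfo_by_quarter (release_key);

-- Refresh planner statistics, then look at the plan again.
ANALYZE dpretailer.dfo_by_quarter;

EXPLAIN
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
 WHERE release_key = 2;
```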

The data type varchar(100) for so many columns may not be optimal. Some normalization might help.

More advice depends on missing information...

Answer 2

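Because release_key is not actually a unique column, it is not possible to tell from the information you have provided whether the index should be used. If a high percentage of rows have release_key = 2, or if even a smaller percentage of rows matches on a large table, using the index may not be efficient.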
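To see how selective each release_key value really is, a query like the following could help. It is only a sketch against the table from the question; the aliases row_count and pct_of_table are made up here:

```sql
-- Actual share of the table taken by each release_key value.
SELECT release_key,
       COUNT(*) AS row_count,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_table
  FROM dpretailer.dfo_by_quarter
 GROUP BY release_key
 ORDER BY row_count DESC;

-- What the planner believes after ANALYZE: the most common values
-- and their estimated frequencies.
SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE schemaname = 'dpretailer'
   AND tablename  = 'dfo_by_quarter'
   AND attname    = 'release_key';
```

If a single value accounts for a fifth of the table, as mentioned in the question's update, a sequential scan is a plausible choice.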

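In part this is because Postgres indexes are indirect: the index contains a pointer to the location on disk, in the heap, where the real tuple lives. So walking an index means reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it is often more worthwhile to scan the heap directly and avoid the penalty of indirect disk access.

Edit: You generally do not want to use CLUSTER in PostgreSQL; it is not how indexes are maintained, which is why you rarely see it in the wild.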
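For context, CLUSTER rewrites the table once in index order and does not keep later inserts in that order. One way to check how well the physical row order currently follows release_key is the correlation statistic in pg_stats, sketched here for the table from the question (run ANALYZE first so the value is current):

```sql
-- correlation close to 1 or -1 means the physical row order closely
-- follows the column order; close to 0 means rows are scattered.
SELECT attname, correlation
  FROM pg_stats
 WHERE schemaname = 'milly'
   AND tablename  = 'dfo_by_quarter'
   AND attname    = 'release_key';
```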

Your updated query, run against a table with no data, gives this plan:

                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
   Buffers: shared hit=4
   CTE rank_query
     ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
           Buffers: shared hit=4
           ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                 Sort Key: main.year, (sum(main.units)) DESC
                 Sort Method: quicksort  Memory: 25kB
                 Buffers: shared hit=4
                 ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                       Group Key: main.year, main.product_group_key
                       Buffers: shared hit=1
                       ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                             Sort Key: main.year, main.product_group_key
                             Sort Method: quicksort  Memory: 25kB
                             Buffers: shared hit=1
                             ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                   Index Cond: ((release_key = 17) AND (year >= 2010))
                                   Buffers: shared hit=1
   CTE beforelookup
     ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
           Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
           Buffers: shared hit=4
           ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                 Buffers: shared hit=4
 Planning Time: 0.723 ms
 Execution Time: 0.485 ms
(27 rows)

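So PostgreSQL is entirely capable of using the index for your query, but the planner decides that it is not worth it (i.e., the cost of using the index directly is higher than the cost of the parallel sequential scan).

If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I would expect). If you then set enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.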

With the large amount of data, you should see the plan change back if you set max_parallel_workers = 0;.
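These planner switches are diagnostic tools rather than tuning advice. A sketch of how such an experiment could be kept local to one transaction follows; note that max_parallel_workers_per_gather is a different setting from the max_parallel_workers mentioned above and is used here as an assumption about what suits the test:

```sql
BEGIN;

-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL enable_seqscan = off;                 -- penalizes seq scans heavily, does not forbid them
SET LOCAL max_parallel_workers_per_gather = 0;  -- plan without parallel workers

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
  FROM milly.dfo_by_quarter
 WHERE release_key = 17;

ROLLBACK;
```

Comparing the estimated cost and actual time of this plan with the default plan shows whether the planner's preference for the sequential scan was justified.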

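But looking at the EXPLAIN output for your query, I would very much expect using the index to be more expensive and to take longer than the parallel sequential scan. In your updated query you are still scanning a high percentage of the table and a large number of rows, and you are also forcing heap access by reading fields that are not in the index. Postgres 11 (I believe) adds covering indexes, which would in theory let you drive this query from the index alone, but I am not at all convinced that it would actually be worth it in this example.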
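For reference, a covering index in PostgreSQL 11 or later might look like the sketch below. The index name is hypothetical, and the column choice is an assumption based on the columns the updated query reads; the INCLUDE clause is not available on the 10.6 instance from the question:

```sql
-- Requires PostgreSQL 11+. Key columns serve the filter and grouping;
-- INCLUDE columns are stored in the index but are not part of the key.
CREATE INDEX dfo_by_quarter_release_cover_idx
    ON milly.dfo_by_quarter (release_key, "year", product_group_key)
    INCLUDE (units, sales_value_eur);

ANALYZE milly.dfo_by_quarter;

EXPLAIN
SELECT "year", product_group_key, SUM(units), SUM(sales_value_eur)
  FROM milly.dfo_by_quarter
 WHERE release_key = 17
   AND "year" >= 2010
 GROUP BY "year", product_group_key;
```

On versions before 11, listing the extra columns as ordinary key columns, as eric_silly_index in the question does, is the closest equivalent.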

Answer 3

The answer to my initial question, why PostgreSQL does not use my index on something like SELECT COUNT(*)..., can be found in the documentation...

Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT

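In particular: "This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row has not been deleted."

This goes a long way towards explaining why I could not get PostgreSQL to use my index when, from a SQL Server point of view, it obviously "should".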
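As a hedged side note, the quoted sentence describes the general rule. Since PostgreSQL 9.2, an index-only scan can skip the heap visit for pages that the visibility map marks as all-visible, and that map is maintained by VACUUM. A sketch for checking this on the table from the question:

```sql
-- VACUUM updates the visibility map; it cannot run inside a transaction block.
VACUUM (ANALYZE) milly.dfo_by_quarter;

-- relallvisible close to relpages means most heap pages are marked
-- all-visible, so index-only scans need few heap fetches.
SELECT relpages, relallvisible
  FROM pg_class
 WHERE oid = 'milly.dfo_by_quarter'::regclass;
```

On an insert-only table that has not been vacuumed recently, few pages are marked all-visible, which makes index-only scans look less attractive to the planner.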

postgresql

indexing

amazon-rds

postgresql-performance