PostgreSQL 如何在字段上执行带有 b 树索引的 ORDER BY?

How does PostgreSQL perform ORDER BY with a b-tree index on the field?

我有一个 table bsort:

CREATE TABLE bsort(a int, data text);

此处data可能不完整。换句话说,一些元组可能没有 data 值。

然后我在 table 上构建了一个 b-tree 索引:

CREATE INDEX ON bsort USING BTREE(a);

现在如果我执行这个查询:

SELECT * FROM bsort ORDER BY a;

PostgreSQL 是对具有 nlogn 复杂度的元组进行排序,还是直接从 b-tree 索引中获取顺序?

您需要检查执行计划。但是,Postgres 完全有能力使用索引来提高 order by 的效率。它会直接从索引中读取记录。因为您只有一列,所以不需要访问数据页。

对于像这样的简单查询,Postgres 将使用 index scan and retrieve readily sorted tuples from the index in order. Due to its MVCC model Postgres had to always visit the "heap" (data pages) additionally to verify entries are actually visible to the current transaction. Quoting the Postgres Wiki on index-only scans:

PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented.

最终发生在版本 9.2 中:index-only scans. The manual:

If the index stores the original indexed data values (and not some lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data not just the TID of the heap tuple. This will only avoid I/O if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check MVCC visibility. But that is no concern of the access method's.

visibility map决定是否可以进行仅索引扫描。如果所有涉及的列值都包含在索引中,则只有一个选项。否则,无论如何都必须(另外)访问堆。 仍然不需要排序步骤。

这就是为什么我们现在有时会将无用的列附加到索引中。就像您示例中的 data 列:

CREATE INDEX ON bsort (a, data);  -- btree is the default index type

它使索引更大(取决于)并且维护和用于其他目的的成本更高。因此,如果您从中获得仅索引扫描,则仅附加 data 列。索引中列的顺序很重要:

从 Postgres 11 开始,也有带有 INCLUDE 关键字的“覆盖索引”。喜欢:

CREATE INDEX ON bsort (a) INCLUDE (data);

参见:

仅索引扫描的好处,per documentation:

If it's known that all tuples on the page are visible, the heap fetch can be skipped. This is most noticeable on large data sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the heap, so it can easily be cached even when the heap is very large.

可见性地图由 VACUUM which happens automatically if you have autovacuum 运行ning 维护(现代 Postgres 中的默认设置)。详情:

但是 table 和下一个 VACUUM 运行 的写入操作之间存在一些延迟。要点:

  • 只读 tables 在清理后随时准备进行仅索引扫描。
  • 被修改的数据页在可见性映射中丢失它们的“全部可见”标志,直到下一个 VACUUM(并且所有旧的交易都已完成),所以这取决于写操作和VACUUM 频率。

如果 一些 相关页面被标记为全部可见,则仍然可以进行部分仅索引扫描。但是如果无论如何都要访问堆,访问方法“索引扫描”会更便宜一些。所以如果当前有太多页面是脏的,Postgres 将完全切换到更便宜的索引扫描。 The Postgres Wiki again:

As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.