Sorting a large spatial selection is not using GiST index (Postgres 11.5)

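I have a table (demo) with a sequence as its primary key (seqno) and a JSONB column (doc) that contains a geometry property. I have configured a primary key constraint on the sequence column and a GiST index on the geometry. I have gathered statistics by running VACUUM ANALYZE. It is a fairly large table (42 million rows).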

CREATE TABLE demo
(
    seqno bigint NOT NULL DEFAULT nextval('seqno'::regclass),
    doc jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT demo_pkey PRIMARY KEY (seqno)
);

CREATE INDEX demo_doc_geometry_gist
ON demo USING gist (st_geometryfromtext(doc ->> 'geometry'::text));

I want to apply a spatial filter over a fairly large area and return the first 10 rows, ordered by primary key. So I tried the following query:

SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10

This results in the following query plan:

Limit  (cost=1000.59..15169.06 rows=10 width=633) (actual time=2479.372..2496.737 rows=10 loops=1)
  ->  Gather Merge  (cost=1000.59..19780184.81 rows=13960 width=633) (actual time=2479.370..2496.732 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Parallel Index Scan using demo_pkey on demo  (cost=0.56..19777573.45 rows=5817 width=633) (actual time=2440.310..2450.101 rows=5 loops=3)
              Filter: (('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text))) AND _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text))))
              Rows Removed by Filter: 221313
Planning Time: 0.375 ms
Execution Time: 2496.786 ms

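This shows that the primary key constraint index is used to scan all rows and the spatial filter is applied to each row, which is clearly very inefficient. There are more than 5M matches for the given spatial predicate. The GiST index is not used at all.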

However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.

Limit  (cost=0.42..128.90 rows=10 width=633) (actual time=0.381..0.745 rows=10 loops=1)
  ->  Index Scan using demo_doc_geometry_gist on demo  (cost=0.42..179352.99 rows=13960 width=633) (actual time=0.380..0.742 rows=10 loops=1)
        Index Cond: ('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text)))
        Filter: _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text)))
Planning Time: 0.245 ms
Execution Time: 0.780 ms

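Is there a way to make this query faster? Can we get the query planner to combine the GiST index with the PK index to produce a sorted result? Any other suggestions?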

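You can try including the bounding-box operator ~ in the query (A ~ B is true when A's bounding box contains B's). As the documentation states: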

This operand will make use of any indexes that may be available on the geometries.

SELECT seqno, doc
FROM demo
WHERE ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))') ~ ST_GeometryFromText((doc->>'geometry'))
  AND ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10

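Otherwise, you can run the query without the LIMIT clause as a subquery with OFFSET 0, which prevents the subquery from being inlined, and then apply the ordering and the limit in the outer query.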

SELECT * FROM (
  SELECT seqno, doc
  FROM demo
  WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')),
                  ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
  OFFSET 0
) sub
ORDER BY seqno
LIMIT 10

This shows that the primary key constraint index is used to scan all rows

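It does not scan all rows; it stops after finding 10 matching rows. That appears to be about 221313 * 3 + 10 rows, or roughly 1.6% of the total rows. It is not at all obvious that this is the wrong approach. You can suppress the use of the primary key index by changing to ORDER BY seqno+0. That should use the GiST index, but I would not expect it to be faster.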
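A minimal sketch of the seqno+0 variant, wrapped in EXPLAIN (ANALYZE, BUFFERS) so the resulting plan can be compared with the two plans in the question. It reuses the table, column, and polygon from the question; whether the planner actually switches to the GiST index, and whether that is any faster, has to be measured on the real data.

```sql
-- "seqno + 0" is an expression rather than the bare indexed column, so the
-- planner cannot use demo_pkey to deliver rows already sorted by it.
EXPLAIN (ANALYZE, BUFFERS)
SELECT seqno, doc
FROM demo
WHERE ST_Within(
        ST_GeometryFromText(doc ->> 'geometry'),
        ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno + 0
LIMIT 10;
```

With more than 5M matches reported in the question, a plan of this shape has to fetch every matching row before the sort can return the first 10, which is the reason for not expecting a speedup.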

However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.

But it answers a much simpler question. Consider the difference between "find me 5 random people from Chicago" and "find me the 5 tallest people in Chicago".

As for speeding up the query, I would try the ORDER BY seqno+0 trick. I don't expect it to be faster, but I could be wrong.

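I would also try a btree index on (seqno, doc) so that you can get an index-only scan, although this would work much better if your geometry were in its own column rather than embedded in the JSONB, so that you could index just seqno and the geometry rather than the entire JSONB. In theory PostgreSQL could give you an index-only scan on an index on (seqno, ST_GeometryFromText(doc->>'geometry')), but it is not yet smart enough to do that.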
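The two btree variants could look roughly like the sketch below. The index names and the geom column are hypothetical, and the final query is only one possible way to use the narrower index. Note that btree index entries are limited to roughly a third of a page, so unusually large documents or geometries may be rejected at insert or index build time.

```sql
-- Option A: covering btree index on the existing columns, so that both the
-- seqno order and the filter input (doc) can come from the index.
CREATE INDEX demo_seqno_doc_btree ON demo (seqno, doc);

-- Option B (schema change): keep the geometry in its own column and index
-- only seqno plus that column. "geom" is a hypothetical column name.
ALTER TABLE demo ADD COLUMN geom geometry;
UPDATE demo SET geom = ST_GeometryFromText(doc ->> 'geometry');  -- rewrites every row
CREATE INDEX demo_seqno_geom_btree ON demo (seqno, geom);

-- Index-only scans depend on an up-to-date visibility map.
VACUUM ANALYZE demo;

-- With option B, find the 10 lowest matching seqno values from the index,
-- then fetch doc for just those rows through the primary key.
SELECT d.seqno, d.doc
FROM (
  SELECT seqno
  FROM demo
  WHERE ST_Within(geom, ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
  ORDER BY seqno
  LIMIT 10
) AS hits
JOIN demo AS d USING (seqno)
ORDER BY d.seqno;
```

Check with EXPLAIN whether the planner really chooses an index-only scan; that choice is not guaranteed.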

You could also try a multicolumn GiST index on (seqno, ST_GeometryFromText(doc->>'geometry')), using the btree_gist extension to allow seqno to be included.
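A sketch of that index, assuming the btree_gist extension can be installed in the database; the index name is hypothetical.

```sql
-- btree_gist supplies GiST operator classes for scalar types such as bigint,
-- which is what allows seqno to appear in a GiST index.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX demo_seqno_geometry_gist
ON demo USING gist (seqno, ST_GeometryFromText(doc ->> 'geometry'));
```

Whether the planner picks this index for the ORDER BY seqno LIMIT 10 query, and whether it helps, needs to be verified with EXPLAIN (ANALYZE).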

Finally, you could try range-partitioning the table on seqno. That requires reorganizing your data set, so it is not as simple as building an index.
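A rough sketch of what range partitioning on seqno might look like with declarative partitioning in Postgres 11. The table name demo_part, the partition names, and the boundary values are all hypothetical, and the sequence default from the original table is left out for brevity.

```sql
-- Partitioned copy of the table; the primary key must include the partition key.
CREATE TABLE demo_part
(
    seqno bigint NOT NULL,
    doc jsonb NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (seqno)
) PARTITION BY RANGE (seqno);

-- Hypothetical boundaries; choose them to suit the real seqno distribution.
CREATE TABLE demo_part_p0 PARTITION OF demo_part
    FOR VALUES FROM (MINVALUE) TO (15000000);
CREATE TABLE demo_part_p1 PARTITION OF demo_part
    FOR VALUES FROM (15000000) TO (30000000);
CREATE TABLE demo_part_p2 PARTITION OF demo_part
    FOR VALUES FROM (30000000) TO (MAXVALUE);

-- An index created on the partitioned parent is created on every partition.
CREATE INDEX demo_part_doc_geometry_gist
ON demo_part USING gist (ST_GeometryFromText(doc ->> 'geometry'));

-- Copy the existing data across, then refresh statistics.
INSERT INTO demo_part (seqno, doc)
SELECT seqno, doc FROM demo;

VACUUM ANALYZE demo_part;
```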