Sorting a large spatial selection is not using GiST index (Postgres 11.5)
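I have a table (demo) with a sequence as its primary key (seqno) and a JSONB column (doc) that contains a geometry property. I have configured a primary key constraint on the sequence column and a GiST index on the geometry. I have collected statistics by running VACUUM ANALYZE. It is a fairly large table (42 million rows).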
CREATE TABLE demo
(
seqno bigint NOT NULL DEFAULT nextval('seqno'::regclass),
doc jsonb NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT demo_pkey PRIMARY KEY (seqno)
);
CREATE INDEX demo_doc_geometry_gist
ON demo USING gist (st_geometryfromtext(doc ->> 'geometry'::text));
I want to apply a spatial filter over a fairly large area and return the first 10 rows, sorted by primary key. So I tried the following query:
SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10
This results in the following query plan:
Limit (cost=1000.59..15169.06 rows=10 width=633) (actual time=2479.372..2496.737 rows=10 loops=1)
-> Gather Merge (cost=1000.59..19780184.81 rows=13960 width=633) (actual time=2479.370..2496.732 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using demo_pkey on demo (cost=0.56..19777573.45 rows=5817 width=633) (actual time=2440.310..2450.101 rows=5 loops=3)
Filter: (('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text))) AND _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text))))
Rows Removed by Filter: 221313
Planning Time: 0.375 ms
Execution Time: 2496.786 ms
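This shows that the primary key constraint index is used to scan all rows, applying the spatial filter to each row, which is obviously inefficient. There are over 5M matches for the given spatial predicate. The GiST index is not used at all.
However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.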
Limit (cost=0.42..128.90 rows=10 width=633) (actual time=0.381..0.745 rows=10 loops=1)
-> Index Scan using demo_doc_geometry_gist on demo (cost=0.42..179352.99 rows=13960 width=633) (actual time=0.380..0.742 rows=10 loops=1)
Index Cond: ('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text)))
Filter: _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text)))
Planning Time: 0.245 ms
Execution Time: 0.780 ms
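Is there a way to make this query faster? Can we get the query planner to combine the GiST index with the PK index to get a sorted result? Any other suggestions?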
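You can try including a bounding-box operator such as ~ in the query. As the documentation says:
This operand will make use of any indexes that may be available on the geometries.
Note that A ~ B is true when A's bounding box contains B's, so the search polygon goes on the left-hand side, as in the query plans in the question.
SELECT seqno, doc
FROM demo
WHERE ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))') ~ ST_GeometryFromText((doc->>'geometry'))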
AND ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10
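Otherwise, you can run the query without the LIMIT clause in a subquery with OFFSET 0, which prevents the subquery from being inlined, and then apply the limit in the outer query.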
SELECT * FROM (
SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')),
ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
OFFSET 0
) sub
ORDER BY seqno
LIMIT 10
This shows that the primary key constraint index is used to scan all rows
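It does not scan all rows; it stops after finding 10 matching rows. That appears to be about 221313 * 3 + 10 rows (roughly 664,000), or about 1.6% of the total. It is not obvious that this is the wrong thing to do. You can suppress use of the primary key index by changing the clause to ORDER BY seqno+0. That should use the GiST index, but I would not count on it being faster.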
However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.
But it answers a much simpler question. Consider the difference between "find me 5 random people from Chicago" and "find me the 5 tallest people in Chicago".
As for speeding up the query, I would try the ORDER BY seqno+0 trick. I don't think it will be faster, but I could be wrong.
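For reference, a minimal sketch of that trick applied to the query from the question. Because seqno + 0 is an expression rather than the bare column, the planner can no longer use demo_pkey to satisfy the ordering, so it has to collect the matching rows and sort them. Wrapping it in EXPLAIN (ANALYZE, BUFFERS) is only a suggestion for comparing it against the original plan.

```sql
-- Same filter as in the question; only the ORDER BY expression changes.
-- "seqno + 0" sorts identically to "seqno" but cannot be matched to the
-- primary key index, so the ordered index scan is no longer an option.
EXPLAIN (ANALYZE, BUFFERS)
SELECT seqno, doc
FROM demo
WHERE ST_Within(
        ST_GeometryFromText(doc->>'geometry'),
        ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno + 0
LIMIT 10;
```

With more than 5M matching rows, this plan has to fetch and sort all of them before returning the first 10, which is why it may well turn out slower.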
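I would also try a btree index on (seqno, doc) so that you can get an index-only scan, although this would work much better if your geometry were in its own column rather than embedded in the JSONB, so that you could index just seqno and the geometry instead of the whole JSONB. In theory PostgreSQL could give you an index-only scan on an index over (seqno, ST_GeometryFromText(doc->>'geometry')), but it is not smart enough to do that yet.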
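A rough sketch of the separate-column variant, under several assumptions: the column name geom and the index names are made up for illustration, the geometries are small enough to fit in a btree index entry, and a full-table UPDATE on 42 million rows is acceptable. Whether the planner actually picks an index-only scan would need to be checked with EXPLAIN.

```sql
-- Hypothetical column and index names; not part of the original schema.
ALTER TABLE demo ADD COLUMN geom geometry;

-- Rewrites every row; on a table this size, consider batching.
UPDATE demo SET geom = ST_GeometryFromText(doc->>'geometry');

-- Spatial index on the plain column, plus a btree that carries both
-- the sort key and the geometry so the filter can be checked from the index.
CREATE INDEX demo_geom_gist ON demo USING gist (geom);
CREATE INDEX demo_seqno_geom ON demo (seqno, geom);

-- Index-only scans depend on an up-to-date visibility map.
VACUUM ANALYZE demo;

-- Find the 10 smallest matching seqno values using only (seqno, geom),
-- then fetch the wide doc column for just those rows.
SELECT d.seqno, d.doc
FROM demo AS d
JOIN (
    SELECT seqno
    FROM demo
    WHERE ST_Within(
            geom,
            ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
    ORDER BY seqno
    LIMIT 10
) AS hits ON hits.seqno = d.seqno
ORDER BY d.seqno;
```

The inner query references only seqno and geom, which is what makes an index-only scan possible in principle; the join back to demo then touches the heap for 10 rows only.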
You could also try a multicolumn GiST index on (seqno, ST_GeometryFromText(doc->>'geometry')), using the btree_gist extension to make it possible to include seqno.
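A minimal sketch of that index, assuming btree_gist is available on the server; the index name is hypothetical. Whether the planner uses it for the ORDER BY seqno LIMIT 10 query is something to verify with EXPLAIN rather than take for granted.

```sql
-- btree_gist supplies GiST operator classes for scalar types such as bigint.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hypothetical index name. Both the sort key and the geometry expression
-- live in one GiST index.
CREATE INDEX demo_seqno_geometry_gist
    ON demo
    USING gist (seqno, ST_GeometryFromText(doc->>'geometry'));

ANALYZE demo;
```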
Finally, you could try range partitioning the table on seqno. That requires reorganizing your data set, so it is not as simple as building an index.
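As an illustration only, declarative partitioning in PostgreSQL 11 could look roughly like the following. The table name demo_part, the partition names, and the range boundaries are all assumptions; real boundaries would depend on how seqno values are distributed.

```sql
-- Hypothetical partitioned copy of the table; names and bounds are made up.
CREATE TABLE demo_part
(
    seqno bigint NOT NULL DEFAULT nextval('seqno'::regclass),
    doc   jsonb  NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT demo_part_pkey PRIMARY KEY (seqno)
) PARTITION BY RANGE (seqno);

CREATE TABLE demo_part_p0 PARTITION OF demo_part
    FOR VALUES FROM (0) TO (10000000);
CREATE TABLE demo_part_p1 PARTITION OF demo_part
    FOR VALUES FROM (10000000) TO (20000000);
-- ... further partitions covering the rest of the seqno range ...

-- An index created on the parent is created on every partition as well.
CREATE INDEX demo_part_geometry_gist
    ON demo_part
    USING gist (ST_GeometryFromText(doc->>'geometry'));

-- Copy the data across, then refresh statistics.
-- Rows whose seqno falls outside every partition's range are rejected.
INSERT INTO demo_part SELECT seqno, doc FROM demo;
VACUUM ANALYZE demo_part;
```

The idea is that each partition holds a contiguous slice of seqno, so the first 10 matches in seqno order may be found by searching only the lowest partitions; whether the planner exploits this on 11.5 is worth confirming with EXPLAIN.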