LATERAL JOIN not using trigram index

Question

I want to do some basic geocoding of addresses with Postgres. I have an addresses table with around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain a postal code, so my first attempt was an ILIKE match in a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

This gives the expected result, but it is slow. Here is the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a trigram index on the address column, but the plan for the query above does not use it and remains unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

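I have to remove the ORDER BY and LIMIT from the lateral join query for the index to be used, and that does not give me the results I want. Here is the query plan for the query without ORDER BY or LIMIT: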

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?

Answer 1

It's a long shot, but how does the following alternative perform?

SELECT DISTINCT ON ((x.a).address) (x.a).*, l.*
FROM (
  SELECT a, l.id AS lid, LENGTH(l.postalcode) AS pclen
  FROM addresses a
  LEFT JOIN locations l ON (a.address ilike '%' || l.postalcode || '%') -- this should be fast, but produce many rows
  ) x
LEFT JOIN locations l ON (l.id = x.lid)
ORDER BY (x.a).address, pclen DESC -- this is where it will be slow, as it'll have to sort the entire results, to filter them by DISTINCT ON

Answer 2

The index can be used if you turn the lateral join inside out. But even then it may still be slow.

SELECT DISTINCT ON (address) *
FROM (
    SELECT * 
    FROM locations
       ,LATERAL(
           SELECT * FROM addresses
           WHERE address ilike '%' || postalcode || '%'
           OFFSET 0 -- force fencing, might be redundant
        ) a
) q
ORDER BY address, LENGTH(postalcode) DESC

The downside is that you can implement pagination only on postal codes, not on addresses.

Answer 3

Why?

The query cannot use the index, on principle. You need an index on the table locations, but the one you have is on the table addresses.

You can verify my claim by setting:

SET enable_seqscan = off;

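(Only in your session, and only for debugging. Never use it in production.) It is not that the index would be more expensive than a sequential scan; there is simply no way for Postgres to use it for your query at all.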
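
One way to keep the setting from outliving a single test is to use SET LOCAL inside a transaction. A minimal sketch, reusing the query from the question (the column reference is qualified as a.address only for readability):

```sql
BEGIN;

-- SET LOCAL applies only until the end of the current transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   WHERE  a.address ILIKE '%' || postalcode || '%'
   ORDER  BY length(postalcode) DESC
   LIMIT  1
   ) l ON true;

ROLLBACK;  -- the planner setting reverts here
```

If the plan still shows sequential scans, now with a very high cost estimate, the planner had no applicable index for this query shape.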

Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...
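
To illustrate the aside, the query from the question could be spelled with CROSS JOIN LATERAL instead. This is only a sketch of the equivalent syntax; it does not change the plan:

```sql
SELECT *
FROM   addresses a
CROSS  JOIN LATERAL (
   SELECT *
   FROM   locations
   WHERE  a.address ILIKE '%' || postalcode || '%'
   ORDER  BY length(postalcode) DESC
   LIMIT  1
   ) l;  -- no ON clause: a cross join has no join condition
```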

Why is the index used after removing `ORDER` and `LIMIT`?

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

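You will see the exact same query plan. (At least I did in my tests on Postgres 9.5.)

Solution

You need an index on locations.postalcode. And when using LIKE or ILIKE, you also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~*, and this operator has no COMMUTATOR (a logical necessity), so it is not possible to flip the operands around. Detailed explanations in these related answers: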

Can PostgreSQL index array columns?
Is there a way to usefully index a text column containing regex patterns?
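
The commutator claim can be inspected in the system catalog. A hedged sketch, assuming the pg_trgm extension is installed (otherwise the rows for % and <-> on text will be missing); an oprcom of 0 means the operator has no commutator:

```sql
SELECT oprname,
       oprleft::regtype    AS left_type,
       oprright::regtype   AS right_type,
       oprcom,                              -- 0 = no commutator registered
       oprcom::regoperator AS commutator
FROM   pg_operator
WHERE  oprname IN ('~~*', '%', '<->')
AND    oprleft  = 'text'::regtype
AND    oprright = 'text'::regtype;
```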

One solution is to use the trigram similarity operator % or its inverse, the distance operator <->, in a nearest neighbour query (each is its own commutator, so the operands can switch places freely):

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

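Find the most similar postalcode for each address, and then check whether that postalcode actually matches fully.

This way, a longer postalcode is preferred automatically, since it is more similar (smaller distance) than a shorter postalcode that also matches.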
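
To see that preference on a concrete case, one could compare a shorter and a longer candidate against the same address string. A small sketch with made-up sample values (the postal codes and the address are hypothetical), assuming pg_trgm is installed:

```sql
-- CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- needed once per database
SELECT p.postalcode,
       similarity(p.postalcode, s.address) AS sim,   -- higher = more similar
       p.postalcode <-> s.address          AS dist   -- 1 - similarity
FROM  (VALUES ('1234'), ('12345')) AS p(postalcode)
CROSS  JOIN (VALUES ('Some Street 1, 12345 Sometown')) AS s(address)
ORDER  BY dist;
```

The row listed first is the one the LIMIT 1 in the lateral subquery would pick.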

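A bit of uncertainty remains. Depending on the possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.

Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.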

The manual:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

So:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);
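
Once the index exists, it may help to refresh statistics and look at the plan again. A sketch using the nearest-neighbour query from above, with column references qualified for clarity; the plan you get depends on your Postgres version and data:

```sql
ANALYZE locations;

EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON a.address ILIKE '%' || l.postalcode || '%';
```

What to look for is an index scan on locations_postalcode_trgm_gist_idx inside the nested loop, with an Order By line on the distance expression instead of a Sort over a sequential scan.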
