Multi-column index with unaccent and pg_trgm (matching dirty data)

Question

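I have a table with customer data that holds over 12 million records. I want to query it by several fields, for example first_name, last_name, birth_place. But the data is really dirty, so I also want records that do not match exactly. For this I am using the modules unaccent and pg_trgm.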

I followed this question to be able to use unaccent in an index, hence f_unaccent() instead of unaccent() in the queries.
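The linked question is not reproduced here, so the exact definition of f_unaccent() is unknown. A minimal sketch of the usual approach, assuming both extensions live in the public schema: wrap the two-argument form of unaccent() in a function declared IMMUTABLE, because the stock unaccent() is only STABLE and cannot be used in an index expression.

```sql
-- Assumption: the extensions are installed in schema "public".
CREATE EXTENSION IF NOT EXISTS unaccent SCHEMA public;
CREATE EXTENSION IF NOT EXISTS pg_trgm SCHEMA public;

-- Hypothetical definition of the f_unaccent() wrapper used in the question.
-- The dictionary is schema-qualified so the result does not depend on the
-- search_path, which is what makes declaring it IMMUTABLE defensible.
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT public.unaccent('public.unaccent'::regdictionary, $1)
$func$;
```

If the unaccent rules file is changed later, indexes built on this function would need to be rebuilt.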

Indexes:

CREATE INDEX first_name_idx ON customer USING gist(f_unaccent(coalesce(first_name, '')) gist_trgm_ops);
CREATE INDEX last_name_idx ON customer USING gist(f_unaccent(coalesce(last_name, '')) gist_trgm_ops);
CREATE INDEX birthplace_idx ON customer USING gist(f_unaccent(coalesce(birthplace, '')) gist_trgm_ops);

SELECT:

WITH t AS (
    SELECT id, first_name, f_unaccent(coalesce(first_name, '')) <-> unaccent('Oliver') as first_name_distance,
        last_name, f_unaccent(coalesce(last_name, '')) <-> unaccent('Twist') as last_name_distance,
        birthplace, f_unaccent(coalesce(birthplace, '')) <-> unaccent('London') as birthplace_distance
    FROM customer
),
s AS (
SELECT t.id, t.first_name_distance + t.last_name_distance + t.birthplace_distance as total FROM t
)

select * from t join s on (t.id = s.id);

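When I run EXPLAIN ANALYZE on it, it does a sequential scan. It does not use the indexes. I know the first SELECT reads the whole table, so maybe that is fine. I am using <-> rather than the similarity(text, text) function because I also want records where some fields have a similarity of 0; the sum of the similarities is what I care about.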

On real data this query (with 6 fields instead of 3) takes about 12 minutes (without indexes; I did not create them because I saw on test data that they are not even used...).

How can I make this query run faster? Thanks.

Answer 1

Since the query fetches all rows from customer, a sequential scan is the fastest option.
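For an index to help at all, the query has to stop asking for every row. One possible sketch, not part of the original answer: pre-filter with the pg_trgm % operator so that only rows where at least one field is reasonably similar are scored, then rank those by total distance. This changes the semantics slightly (rows where every field falls below pg_trgm.similarity_threshold are dropped), and the LIMIT of 50 is an arbitrary illustrative value. Whether the planner actually picks the indexes depends on how selective the filter is.

```sql
-- Assumes the three GiST trigram indexes from the question exist and that
-- f_unaccent() is the same immutable wrapper used in those index expressions.
SELECT c.id, c.first_name, c.last_name, c.birthplace,
       d.first_name_distance, d.last_name_distance, d.birthplace_distance,
       d.first_name_distance + d.last_name_distance + d.birthplace_distance AS total
FROM customer AS c
CROSS JOIN LATERAL (
    -- Compute each distance once per candidate row.
    SELECT f_unaccent(coalesce(c.first_name, '')) <-> f_unaccent('Oliver') AS first_name_distance,
           f_unaccent(coalesce(c.last_name, ''))  <-> f_unaccent('Twist')  AS last_name_distance,
           f_unaccent(coalesce(c.birthplace, '')) <-> f_unaccent('London') AS birthplace_distance
) AS d
-- The left-hand expressions match the indexed expressions exactly, so each
-- condition is a candidate for an index scan combined via BitmapOr.
WHERE f_unaccent(coalesce(c.first_name, '')) % f_unaccent('Oliver')
   OR f_unaccent(coalesce(c.last_name, ''))  % f_unaccent('Twist')
   OR f_unaccent(coalesce(c.birthplace, '')) % f_unaccent('London')
ORDER BY total
LIMIT 50;  -- illustrative cutoff, not from the question
```

This version also computes the total in a single pass, so the self-join between the two CTEs in the question is not needed.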

Tags: postgresql, indexing, unaccent, trigram