Optimize PostgreSQL query with levenshtein() function
I have a table with about 7 million records. The table has first_name and last_name columns, and I want to search them using the levenshtein() distance function.
select levenshtein('JOHN', first_name) as fn_distance,
levenshtein('DOE', last_name) as ln_distance,
id,
first_name as "firstName",
last_name as "lastName"
from person
where first_name is not null
and last_name is not null
and levenshtein('JOHN', first_name) <= 2
and levenshtein('DOE', last_name) <= 2
order by 1, 2
limit 50;
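The search above is slow (4 - 5 seconds). What can I do to improve performance? Should I create an index on the two columns, or do something else?
After I added the indexes below: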
create index first_name_idx on person using gin (first_name gin_trgm_ops);
create index last_name_idx on person using gin(last_name gin_trgm_ops);
the query now takes about 11 seconds. :(
New query:
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and similarity('JOHN', first_name) >= 0.2
and similarity('DOW', last_name) >= 0.2
order by 1 desc, 2 desc, npi
limit 50;
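There is no built-in index type that supports edit distance. I am also not aware of any third-party index implementation that does.
Another string similarity measure, trigram similarity, does have an index method to support it. Perhaps you can switch to that measure instead.
You need to write the query using the % operator rather than the similarity function. So it would look like this: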
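Even without an index, the cost of each row in the scan can probably be trimmed. The following is only a sketch, assuming the fuzzystrmatch extension (the one that provides levenshtein()) is installed. It uses that extension's levenshtein_less_equal(), which can stop early once the distance exceeds the given bound, plus a cheap length check: a string within edit distance 2 of a 4-character name must be 2 to 6 characters long. The table and column names are taken from the question; this is still a sequential scan, so treat any speedup as something to measure rather than assume.

```sql
-- Sketch only: reduces per-row work, does not add index support.
select levenshtein_less_equal('JOHN', first_name, 2) as fn_distance,
       levenshtein_less_equal('DOE', last_name, 2) as ln_distance,
       id,
       first_name as "firstName",
       last_name as "lastName"
from person
where length(first_name) between 2 and 6  -- |length - 4| <= 2; also drops NULLs
  and length(last_name) between 1 and 5   -- |length - 3| <= 2; also drops NULLs
  and levenshtein_less_equal('JOHN', first_name, 2) <= 2
  and levenshtein_less_equal('DOE', last_name, 2) <= 2
order by 1, 2
limit 50;
```

The length predicates are evaluated first in the text only for readability; the planner is free to reorder them, so the benefit depends on the data.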
set pg_trgm.similarity_threshold TO 0.2;
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and 'JOHN' % first_name
and 'DOW' % last_name
order by 1 desc, 2 desc, npi
limit 50;
Note, however, that 0.2 is a very low cutoff, and the lower the cutoff, the less efficient the index becomes.
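One way to see how the cutoff affects the plan is to run the query under EXPLAIN with a higher threshold. The sketch below assumes the two gin_trgm_ops indexes from the question exist, and uses 0.3 (the documented default for pg_trgm.similarity_threshold) purely as an illustrative value, not a tuned one. Keep in mind that short names have only a handful of trigrams, so a single differing letter moves the score a lot, and raising the cutoff may drop matches you want.

```sql
-- Assumption: 0.3 is only an example value; adjust and compare plans.
set pg_trgm.similarity_threshold = 0.3;

-- ANALYZE actually runs the query and reports real timings.
explain (analyze, buffers)
select similarity('JOHN', first_name) as fnsimilarity,
       similarity('DOW', last_name) as lnsimilarity,
       first_name as "firstName",
       last_name as "lastName",
       npi
from person
where 'JOHN' % first_name
  and 'DOW' % last_name
order by 1 desc, 2 desc, npi
limit 50;
```

If the indexes are being used, the plan should show a Bitmap Index Scan on first_name_idx and/or last_name_idx; a Seq Scan on person would suggest the planner judged the filter too unselective at that threshold.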