PostgreSQL、三元组和相似性
PostgreSQL, trigrams and similarity
刚刚在我的 Mac 上测试 PostgreSQL 9.6.2 并使用 Ngrams。
假设在winery字段上有一个GIN八卦索引。
相似度限制(我知道这已被弃用):
SELECT set_limit(0.5);
我正在对 2,3M 行构建三元组搜索 table。
我的select代码:
SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity
FROM usr_wines
WHERE status=1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;
我的结果(我的 mac 上有 329 毫秒):
Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64
嗯,在这种情况下,我不明白 "Chateau blanc" 与 "Chateau Cheval Blanc" 有什么相似之处?我知道这两个词完全相同 "chateau" 和 "blanc",但没有其他词 "cheval".
另外为什么 "Chateau ChevL Blanc" 是第一个?缺少字母 "a" !
好吧,我的目标是匹配所有可能重复的酒厂名称,即使拼写错误。我错过了什么?
三元组相似度的概念依赖于将任何句子分成 "trigrams"(三个连续字母的序列),并将结果视为一个 SET(即:顺序无关紧要,你不'有重复值)。考虑句子之前,开头加两个空格,结尾加一个,单空格换成双空格。
八卦是N-grams.
的特例
"Chateau blanc"对应的三卦集合是通过找到其上出现的三个字母的所有序列得到的:
chateau blanc
--- => ' c'
--- => ' ch'
--- => 'cha'
--- => 'hat'
--- => 'ate'
--- => 'tea'
--- => 'eau'
--- => 'au '
--- => 'u '
--- => ' b'
--- => ' bl'
--- => 'bla'
--- => 'lan'
--- => 'anc'
--- => 'nc '
对它们进行排序并去除重复项可以让您:
' b'
' c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
这可以由 PostgreSQL 通过函数 show_trgm
:
计算得出
SELECT show_trgm('Chateau blanc') AS A
A = [ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... 有 14 个八卦。 (检查 pg_trgm)。
而"Chateau Cheval Blanc"对应的卦集合为:
SELECT show_trgm('Chateau Cheval Blanc') AS B
B = [ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
... 有 19 个卦
如果你数一数有多少八卦具有这两个集合,你会发现它们有以下几个:
A intersect B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
他们总共有:
A union B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
即两个句子共有14个八卦,一共19个
相似度计算如下:
similarity = 14 / 19
您可以通过以下方式查看:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg
你会看到你得到:0.736842
...这解释了如何计算相似度,以及为什么 你得到你得到的值。
注意:您可以通过以下方式计算交集和并集:
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
这是一种探索不同句子对相似度的方法:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
您可以在Rextester
查看
三元算法应该越精确,比较字符串的长度差异越小。您可以修改算法以补偿长度差异的影响。
以下示例函数将字符串长度中 1 个字符的差异降低 1% 的相似度。这意味着它偏爱相同(相似)长度的字符串。
create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;
select
winery,
similarity(winery, 'chateau chevla blanc') as similarity,
corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines
where winery % 'chateau chevla blanc'
order by corrected_similarity desc;
winery | similarity | corrected_similarity
--------------------------+------------+----------------------
Chateau ChevL Blanc | 0.85 | 0.8415
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Blanc, | 0.736842 | 0.692632
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Cheval Blanc (7) | 0.666667 | 0.64
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Cheval Blanc Cbo | 0.64 | 0.6144
(14 rows)
以类似的方式,您可以通过例如有多少个初始字符相同来校正标准相似度(认为函数会稍微复杂一些)。
有时候你想要klin答案的反面。在某些应用程序中,字符串长度的巨大差异不应导致如此显着的分数损失。
例如,想象一个自动完成的结果表单,其中包含在您键入时改进的三元组匹配建议。
这是另一种对匹配进行评分的方法,它仍然使用三字母组,但比平时更倾向于子字符串匹配。
相似度公式以
开头
the number of common trigrams
-------------------------------------------
the number of trigrams in the shortest word <-- key difference
并且可以根据良好的标准相似性得分从那里上升。
CREATE OR REPLACE FUNCTION substring_similarity(string_a TEXT, string_b TEXT) RETURNS FLOAT4 AS $$
DECLARE
a_trigrams TEXT[];
b_trigrams TEXT[];
a_tri_len INTEGER;
b_tri_len INTEGER;
common_trigrams TEXT[];
max_common INTEGER;
BEGIN
a_trigrams = SHOW_TRGM(string_a);
b_trigrams = SHOW_TRGM(string_b);
a_tri_len = ARRAY_LENGTH(a_trigrams, 1);
b_tri_len = ARRAY_LENGTH(b_trigrams, 1);
IF (NOT (a_tri_len > 0) OR NOT (b_tri_len > 0)) THEN
IF (string_a = string_b) THEN
RETURN 1;
ELSE
RETURN 0;
END IF;
END IF;
common_trigrams := ARRAY(SELECT UNNEST(a_trigrams) INTERSECT SELECT UNNEST(b_trigrams));
max_common = LEAST(a_tri_len, b_tri_len);
RETURN COALESCE(ARRAY_LENGTH(common_trigrams, 1), 0)::FLOAT4 / max_common::FLOAT4;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION corrected_similarity(string_a TEXT, string_b TEXT)
RETURNS FLOAT4 AS $$
DECLARE
base_score FLOAT4;
BEGIN
base_score := substring_similarity(string_a, string_b);
-- a good standard similarity score can raise the base_score
RETURN base_score + ((1.0 - base_score) * SIMILARITY(string_a, string_b));
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION is_minimally_substring_similar(string_a TEXT, string_b TEXT) RETURNS BOOLEAN AS $$
BEGIN
RETURN corrected_similarity(string_a, string_b) >= 0.5;
END;
$$ LANGUAGE plpgsql;
CREATE OPERATOR %%% (
leftarg = TEXT,
rightarg = TEXT,
procedure = is_minimally_substring_similar,
commutator = %%%
);
现在您可以像使用标准相似性查询一样使用它:
SELECT * FROM table WHERE name %%% 'chateau'
ORDER BY corrected_similarity(name, 'chateau') DESC;
性能
搜索 space 10 万条记录的性能可以接受,但搜索 space 数百万条记录的性能可能不太好。为此,您可能需要使用 modified build of the pg_trgm module, code on github.
刚刚在我的 Mac 上测试 PostgreSQL 9.6.2 并使用 Ngrams。 假设在winery字段上有一个GIN八卦索引。
相似度限制(我知道这已被弃用):
SELECT set_limit(0.5);
我正在对 2,3M 行构建三元组搜索 table。
我的select代码:
SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity
FROM usr_wines
WHERE status=1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;
我的结果(我的 mac 上有 329 毫秒):
Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64
嗯,在这种情况下,我不明白 "Chateau blanc" 与 "Chateau Cheval Blanc" 有什么相似之处?我知道这两个词完全相同 "chateau" 和 "blanc",但没有其他词 "cheval".
另外为什么 "Chateau ChevL Blanc" 是第一个?缺少字母 "a" !
好吧,我的目标是匹配所有可能重复的酒厂名称,即使拼写错误。我错过了什么?
三元组相似度的概念依赖于将任何句子分成 "trigrams"(三个连续字母的序列),并将结果视为一个 SET(即:顺序无关紧要,你不'有重复值)。考虑句子之前,开头加两个空格,结尾加一个,单空格换成双空格。
八卦是N-grams.
的特例"Chateau blanc"对应的三卦集合是通过找到其上出现的三个字母的所有序列得到的:
chateau blanc
--- => ' c'
--- => ' ch'
--- => 'cha'
--- => 'hat'
--- => 'ate'
--- => 'tea'
--- => 'eau'
--- => 'au '
--- => 'u '
--- => ' b'
--- => ' bl'
--- => 'bla'
--- => 'lan'
--- => 'anc'
--- => 'nc '
对它们进行排序并去除重复项可以让您:
' b'
' c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
这可以由 PostgreSQL 通过函数 show_trgm
:
SELECT show_trgm('Chateau blanc') AS A
A = [ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... 有 14 个八卦。 (检查 pg_trgm)。
而"Chateau Cheval Blanc"对应的卦集合为:
SELECT show_trgm('Chateau Cheval Blanc') AS B
B = [ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
... 有 19 个卦
如果你数一数有多少八卦具有这两个集合,你会发现它们有以下几个:
A intersect B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
他们总共有:
A union B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
即两个句子共有14个八卦,一共19个
相似度计算如下:
similarity = 14 / 19
您可以通过以下方式查看:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg
你会看到你得到:0.736842
...这解释了如何计算相似度,以及为什么 你得到你得到的值。
注意:您可以通过以下方式计算交集和并集:
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
这是一种探索不同句子对相似度的方法:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
您可以在Rextester
查看三元算法应该越精确,比较字符串的长度差异越小。您可以修改算法以补偿长度差异的影响。
以下示例函数将字符串长度中 1 个字符的差异降低 1% 的相似度。这意味着它偏爱相同(相似)长度的字符串。
create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;
select
winery,
similarity(winery, 'chateau chevla blanc') as similarity,
corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines
where winery % 'chateau chevla blanc'
order by corrected_similarity desc;
winery | similarity | corrected_similarity
--------------------------+------------+----------------------
Chateau ChevL Blanc | 0.85 | 0.8415
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Blanc, | 0.736842 | 0.692632
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Cheval Blanc (7) | 0.666667 | 0.64
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Cheval Blanc Cbo | 0.64 | 0.6144
(14 rows)
以类似的方式,您可以通过例如有多少个初始字符相同来校正标准相似度(认为函数会稍微复杂一些)。
有时候你想要klin答案的反面。在某些应用程序中,字符串长度的巨大差异不应导致如此显着的分数损失。
例如,想象一个自动完成的结果表单,其中包含在您键入时改进的三元组匹配建议。
这是另一种对匹配进行评分的方法,它仍然使用三字母组,但比平时更倾向于子字符串匹配。
相似度公式以
开头the number of common trigrams
-------------------------------------------
the number of trigrams in the shortest word <-- key difference
并且可以根据良好的标准相似性得分从那里上升。
CREATE OR REPLACE FUNCTION substring_similarity(string_a TEXT, string_b TEXT) RETURNS FLOAT4 AS $$
DECLARE
a_trigrams TEXT[];
b_trigrams TEXT[];
a_tri_len INTEGER;
b_tri_len INTEGER;
common_trigrams TEXT[];
max_common INTEGER;
BEGIN
a_trigrams = SHOW_TRGM(string_a);
b_trigrams = SHOW_TRGM(string_b);
a_tri_len = ARRAY_LENGTH(a_trigrams, 1);
b_tri_len = ARRAY_LENGTH(b_trigrams, 1);
IF (NOT (a_tri_len > 0) OR NOT (b_tri_len > 0)) THEN
IF (string_a = string_b) THEN
RETURN 1;
ELSE
RETURN 0;
END IF;
END IF;
common_trigrams := ARRAY(SELECT UNNEST(a_trigrams) INTERSECT SELECT UNNEST(b_trigrams));
max_common = LEAST(a_tri_len, b_tri_len);
RETURN COALESCE(ARRAY_LENGTH(common_trigrams, 1), 0)::FLOAT4 / max_common::FLOAT4;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION corrected_similarity(string_a TEXT, string_b TEXT)
RETURNS FLOAT4 AS $$
DECLARE
base_score FLOAT4;
BEGIN
base_score := substring_similarity(string_a, string_b);
-- a good standard similarity score can raise the base_score
RETURN base_score + ((1.0 - base_score) * SIMILARITY(string_a, string_b));
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION is_minimally_substring_similar(string_a TEXT, string_b TEXT) RETURNS BOOLEAN AS $$
BEGIN
RETURN corrected_similarity(string_a, string_b) >= 0.5;
END;
$$ LANGUAGE plpgsql;
CREATE OPERATOR %%% (
leftarg = TEXT,
rightarg = TEXT,
procedure = is_minimally_substring_similar,
commutator = %%%
);
现在您可以像使用标准相似性查询一样使用它:
SELECT * FROM table WHERE name %%% 'chateau'
ORDER BY corrected_similarity(name, 'chateau') DESC;
性能
搜索 space 10 万条记录的性能可以接受,但搜索 space 数百万条记录的性能可能不太好。为此,您可能需要使用 modified build of the pg_trgm module, code on github.