Postgres word_similarity not comparing words

"Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string)."

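That is the definition of word_similarity(a, b). As I understand it, the function looks for the word a inside the text b, splitting b into words and returning the score of the best-matching word.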

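However, I am seeing inconsistent results where the word matching is not really word matching; it looks as if all the trigrams are pooled together and compared?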

Example:

select word_similarity('sage', 'message sag')

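This returns 1. Clearly neither 'message' nor 'sag' should match 'sage', but if we combine all the possible trigrams of 'message sag', we find that every trigram of 'sage' is matched. That is not what should happen, since the function description says the comparison is word by word... Is it because the two words are next to each other?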

The following returns 0.6:

select word_similarity('sage', 'message test sag') 

Edit: Fiddle to play around with: http://sqlfiddle.com/#!17/b4bab/1

The function does not match its description

See the related thread on the pgsql-bugs mailing list.

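The substring similarity algorithm described by the author compares the trigram arrays of the query string and of the text. The problem is that the trigram array is optimized (duplicate trigrams are eliminated), so the information about the individual words of the text is lost.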

This query illustrates the issue:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select 
    t as "text", 
    show_trgm(t) as "text trigrams", 
    show_trgm('sage') as "string trigrams", 
    cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;

       text       |                       text trigrams                       |       string trigrams       | common trgms 
------------------+-----------------------------------------------------------+-----------------------------+--------------
 message          | {"  m"," me",age,ess,"ge ",mes,sag,ssa}                   | {"  s"," sa",age,"ge ",sag} |            3
 message s        | {"  m","  s"," me"," s ",age,ess,"ge ",mes,sag,ssa}       | {"  s"," sa",age,"ge ",sag} |            4
 message sag      | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sag  | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sage | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
(5 rows)    

The trigram arrays in the last three rows are identical and contain all the trigrams of the query string.
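The same observation can be checked without a helper function. The following is a minimal sketch, assuming the pg_trgm extension is installed; it uses the built-in array containment operator `@>` on the arrays returned by show_trgm(). The column alias is only illustrative.

```sql
-- Assumes: create extension if not exists pg_trgm;
-- For each text, check whether its (deduplicated) trigram array
-- contains every trigram of the query string 'sage'.
select
    t as "text",
    show_trgm(t) @> show_trgm('sage') as "contains all query trigrams"
from (values
    ('message s'),
    ('message sag')
) as data(t);
```

Going by the trigram arrays listed in the table above, 'message s' lacks the trigram " sa" and should give false, while 'message sag' contains all five trigrams of 'sage' and should give true, even though no single word of the text is 'sage'.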

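Clearly, the implementation is not consistent with the function description (the description was changed in later versions of the documentation):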

Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring.


The function I used in the above query:

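create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
    -- return the elements common to both arrays;
    -- if the first array is null, return the second one unchanged
    select case 
        when $1 is null then $2
        else
            array(
                select unnest($1)
                intersect
                select unnest($2)
            )
        end;
$$;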

Workaround

You can easily write your own function to get results closer to what you expect:

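create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
    -- split the second argument on non-alphanumeric characters
    -- and return the best similarity() score against the first argument
    select max(similarity($1, word))
    from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;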

Comparison:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;

        t         | word_similarity | my_word_similarity
------------------+-----------------+--------------------
 message          |             0.6 |                0.3
 message s        |             0.8 |                0.3
 message sag      |               1 |                0.5
 message sag sag  |               1 |                0.5
 message sag sage |               1 |                  1
(5 rows)
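
To see where the per-word maximum comes from, the split can be inspected directly. This is a minimal sketch, assuming the pg_trgm extension is installed; it uses only the built-in similarity() and regexp_split_to_table() functions, with the same separator pattern as the workaround.

```sql
-- Assumes: create extension if not exists pg_trgm;
-- Show the similarity() score of 'sage' against each word of the text.
select
    word,
    similarity('sage', word) as word_score
from regexp_split_to_table('message sag sage', '[^[:alnum:]]') as word;
```

Consistent with the comparison table above, the scores should be about 0.3 for 'message', 0.5 for 'sag' and 1 for 'sage'; taking the maximum of these gives the per-word result.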

New function in Postgres 11+

Postgres 11+ has a new function, strict_word_similarity(), which gives the results the author of the question expected:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;

        t         | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
 message          |             0.6 |                    0.3
 message s        |             0.8 |             0.36363637
 message sag      |               1 |                    0.5
 message sag sag  |               1 |                    0.5
 message sag sage |               1 |                      1
(5 rows)