Full text search returning too many irrelevant results and causing poor performance

Question

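I am using Postgres' full text search, and for the most part it works fine.

I have a column in my database table called documentFts that is basically the tsvector version of the body field, which is a text column, and it is indexed with a GIN index.

Here's my query: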

select
      count(*) OVER() AS full_count,
      id,
      url, 
      (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery(, ))) as "finalScore",
      ts_headline('english_unaccent', title, websearch_to_tsquery(, )) as title,
      ts_headline('english_unaccent', body, websearch_to_tsquery(, )) as body,
      "possibleEncoding",
      "responseYear"
    from "Entries"
    where 
      "language" =  and 
      "documentFts" @@ websearch_to_tsquery(, )
    order by (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery(, ))) desc limit 20 offset ;

The dictionary is english_unaccent because I created a text search configuration based on english that uses the unaccent extension:

CREATE TEXT SEARCH CONFIGURATION english_unaccent (
  COPY = english
);

ALTER TEXT SEARCH CONFIGURATION english_unaccent
  ALTER MAPPING FOR hword, hword_part, word WITH unaccent,
  english_stem;

I did the same for the other languages.

Then I did this to my Entries table:

ALTER TABLE "Entries"
  ADD COLUMN "documentFts" tsvector;

UPDATE
  "Entries"
SET
  "documentFts" = (setweight(to_tsvector('english_unaccent', coalesce(title, '')), 'A') || setweight(to_tsvector('english_unaccent', coalesce(body, '')), 'C'))
WHERE
  "language" = 'english';

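I have a column in my table with the language of the entry, hence the "language" = 'english'.

So, the problem I'm having is that words like animal, anime, or animation all go into the vector as anim, which means that if I search for any of those words I get results with all of those variations.

That returns a huge dataset, which makes the query really slow compared to searches that return fewer items. Also, if I search for Anime, my first results contain Animal and Animated, and the first result that has the word Anime is the 12th one.

Shouldn't animation be transformed to animat in the vector, while animal stays just animal, since its other variations are animals or animalia?

I have been searching for a solution to this without much luck. Is there any way I can improve this? I'm happy to install extensions, reindex the column, or whatever else it takes.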
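
The stemming described here can be reproduced directly. A minimal sketch, using the built-in english_stem dictionary that the english configuration relies on (the word list is only illustrative); according to the behavior described above, each of these words should come back as anim:

```sql
-- Ask the Snowball stemmer what it does with each word, without touching any table.
SELECT w                            AS word,
       ts_lexize('english_stem', w) AS lexemes
FROM   unnest(ARRAY['animal', 'animals', 'anime', 'animation', 'animated']) AS w;
```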

Answer 1

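There are a lot of little details to this. The best solution depends on the exact situation and exact requirements.

Two simple options:

Simple tweak 1

If you want to sort rows first that contain a word starting with 'Anime' (exactly) in title or body, case-insensitively, add an ORDER BY expression like this (the test uses !~*, so matching rows yield false, which sorts before true):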

ORDER  BY unaccent(concat_ws(' ', title, body)) !~* ('\m' || f_regexp_escape())
        , (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery(, ))) DESC

Where the auxiliary function f_regexp_escape() escapes special regular expression characters and is defined here:

Escape function for regular expression or LIKE patterns

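The expression is rather expensive, but since it is only applied to the filtered results, the effect is limited. You may have to fine-tune, as other search terms present other difficulties. Think of 'body' / 'bodies' stemming to 'bodi' ...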
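
The body of the helper is not reproduced in this answer. The following is only a sketch of what such a function could look like, assuming standard_conforming_strings is on (the default in modern Postgres); the exact set of characters to escape is an assumption and may need adjusting:

```sql
-- Hypothetical definition: prefix every regexp metacharacter with a backslash
-- so that user input is matched literally inside a regular expression.
CREATE OR REPLACE FUNCTION f_regexp_escape(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$;

-- Example call: build a "word starts with ..." pattern from raw input.
SELECT '\m' || f_regexp_escape('c++ (anime)') AS pattern;
```

Without the escaping, a search term containing characters like ( or * would be interpreted as regexp syntax and could raise an error or match unintended rows.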

Simple tweak 2

To remove English stemming completely, base your configuration on the 'simple' TEXT SEARCH CONFIGURATION:

CREATE TEXT SEARCH CONFIGURATION simple_unaccent (
  COPY = simple
);

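Etc.

Then the actual language of the text is irrelevant. The index gets bigger, and the search is done on literal spellings. You can now widen the search with prefix matching, like: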
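
The remaining steps might look roughly like this. It is a sketch that mirrors the setup from the question, assuming the unaccent extension is installed and simple_unaccent has been created as shown:

```sql
-- Run non-ASCII word tokens through unaccent first, then the non-stemming 'simple' dictionary.
ALTER TEXT SEARCH CONFIGURATION simple_unaccent
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, simple;

-- Rebuild the search column without stemming.
UPDATE "Entries"
SET    "documentFts" = setweight(to_tsvector('simple_unaccent', coalesce(title, '')), 'A')
                    || setweight(to_tsvector('simple_unaccent', coalesce(body,  '')), 'C');
```

Queries then have to build their tsquery with the same configuration, otherwise stemmed query terms are compared against unstemmed lexemes.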

WHERE  "documentFts" @@ to_tsquery('simple_unaccent', ( || ':*'))

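Again, you'll have to fine-tune. This simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.

See: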

Get partial match from GIN indexed TSVECTOR column
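
To see what prefix matching does and does not match, here is a small self-contained sketch. It uses the built-in 'simple' configuration and literal strings, so it runs without any of the custom objects above:

```sql
-- 'simple' only lower-cases tokens; ':*' turns a query term into a prefix match.
SELECT to_tsvector('simple', 'Anime and animals') @@ to_tsquery('simple', 'anime:*') AS word_itself,  -- true
       to_tsvector('simple', 'Anime and animals') @@ to_tsquery('simple', 'anim:*')  AS short_prefix, -- true
       to_tsvector('simple', 'animals only')      @@ to_tsquery('simple', 'anime:*') AS other_word;   -- false
```

The shorter the prefix, the more rows qualify, which brings back the original problem of oversized result sets.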

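Proper solution: synonym dictionary

You need access to the installation drive of your Postgres server for this, so it is typically not possible with most hosted services.

To overrule some of the stemmer's decisions, supply your own set of synonyms (rules). Create a mapping file in $SHAREDIR/tsearch_data/my_synonyms.syn. That's /usr/share/postgresql/13/tsearch_data/my_synonyms.syn in my Linux installation.

Let it contain (case-insensitive by default):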

anime anime

Then:

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

There is a chapter with instructions in the manual. One quote:

A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary.

Then:

CREATE TEXT SEARCH CONFIGURATION my_english_unaccent (
  COPY = english
);

ALTER TEXT SEARCH CONFIGURATION my_english_unaccent
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,   -- ASCII-only tokens such as "anime"
                    hword, hword_part, word
  WITH unaccent, my_synonym, english_stem;   -- added my_synonym!

You have to update the column "documentFts" with my_english_unaccent. While you are at it, use a proper lower-case column name like document_fts, and consider a GENERATED column. See:

Computed / calculated / virtual / derived columns in PostgreSQL
Are PostgreSQL column names case-sensitive?
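
The column update could look like this. It is a sketch modeled on the UPDATE from the question, keeping the existing column name and weights:

```sql
-- Recompute the stored tsvector for English rows with the new configuration.
-- The existing GIN index on "documentFts" is maintained automatically.
UPDATE "Entries"
SET    "documentFts" = setweight(to_tsvector('my_english_unaccent', coalesce(title, '')), 'A')
                    || setweight(to_tsvector('my_english_unaccent', coalesce(body,  '')), 'C')
WHERE  "language" = 'english';
```

The query then has to pass my_english_unaccent to websearch_to_tsquery() and ts_headline() as well, so that search terms are normalized the same way as the stored lexemes.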

Now, searching for Anime (or ánime, for that matter) won't find animal any more. And searching for animal won't find Anime.
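
One way to check this on your own installation is ts_debug(), which reports the token type of each word and the dictionary that finally handled it. A minimal sketch, assuming the objects above exist. Note that the default parser classifies all-ASCII words as asciiword, so the synonym dictionary only applies to them if that token type is mapped to it:

```sql
-- Which token type is each word, which dictionary recognized it, and what lexeme came out?
SELECT alias, token, dictionary, lexemes
FROM   ts_debug('my_english_unaccent', 'Anime ánime animals')
WHERE  alias <> 'blank';

-- End-to-end check: does a document about animals still match a search for Anime?
SELECT to_tsvector('my_english_unaccent', 'animals')
       @@ websearch_to_tsquery('my_english_unaccent', 'Anime') AS animals_match_anime;
```

If the synonym rule is in effect, the rows for Anime and ánime should show my_synonym as the dictionary and anime as the lexeme, and the second query should return false.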
