Elasticsearch spell check suggestions even if the first letter is missing

I create the index like this:

curl --location --request PUT 'http://127.0.0.1:9200/test/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "properties" : {
            "word" : { "type" : "text" }
        }
    }
}'

Then I create a document:

curl --location --request POST 'http://127.0.0.1:9200/test/_doc/' \
--header 'Content-Type: application/json' \
--data-raw '{ "word":"organic" }'

Finally, I search with a deliberately misspelled word:

curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "001" : {
      "text" : "rganic",
      "term" : {
        "field" : "word"
      }
    }
  }
}'

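The word 'organic' is missing its first letter, and ES never returns a suggestion for this kind of misspelling (it works fine for other misspellings such as 'orgnic', 'oragnc' and 'organi'). What am I missing?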

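This happens because of the prefix_length parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html. It defaults to 1, meaning that at least the first letter of the term must match for a candidate to be suggested. You can set prefix_length to 0, but that has a performance cost. Only your hardware, your setup and your dataset can show you what that cost really is in your case, so try it :). Keep in mind, though, that the Elasticsearch and Lucene developers set the default to 1 for a reason.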

Here is a query that, for me, returns the suggestion you are after on Elasticsearch 7.4.0, after I followed your setup steps.

curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "001" : {
      "text" : "rganic",
      "term" : {
        "field" : "word",
        "prefix_length": 0
      }
    }
  }
}'

You need to use the phrase suggester together with candidate generators. From the book Elasticsearch in Action, page 444:

Having multiple generators and filters lets you do some neat tricks. For instance, if typos are likely to happen both at the beginning and end of words, you can use multiple generators to avoid expensive suggestions with low prefix lengths by using the reverse token filter, as shown in figure F.4. You’ll implement what’s shown in figure F.4 in listing F.4:

■ First, you’ll need an analyzer that includes the reverse token filter.

■ Then you’ll index the correct product description in two fields: one analyzed with the standard analyzer and one with the reverse analyzer.
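The two steps in that excerpt could look roughly like this when applied to the index from the question. This is an untested sketch, not the book's listing F.4: the index name test_reverse and the subfield name word.reverse are made up for illustration, and the custom analyzer is named reverse only to mirror the example in the Elasticsearch phrase suggester documentation.

```shell
# Sketch only: "test_reverse" is a hypothetical index name, chosen so the
# original "test" index stays untouched.
# The custom analyzer "reverse" lowercases each token and then reverses it,
# so "organic" is stored as "cinagro" in the word.reverse subfield.
curl --location --request PUT 'http://127.0.0.1:9200/test_reverse/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings" : {
        "number_of_shards" : 1,
        "analysis" : {
            "analyzer" : {
                "reverse" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "reverse" ]
                }
            }
        }
    },
    "mappings" : {
        "properties" : {
            "word" : {
                "type" : "text",
                "fields" : {
                    "reverse" : { "type" : "text", "analyzer" : "reverse" }
                }
            }
        }
    }
}'

# Index the same document as in the question; "word" is analyzed twice,
# once normally and once through the reverse analyzer.
curl --location --request POST 'http://127.0.0.1:9200/test_reverse/_doc/' \
--header 'Content-Type: application/json' \
--data-raw '{ "word":"organic" }'
```

With the reversed copy in place, a missing first letter becomes a missing last letter in word.reverse, which the default prefix_length of 1 can tolerate.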

From the Elasticsearch documentation:

The following example shows a phrase suggest call with two generators: the first one is using a field containing ordinary indexed terms, and the second one uses a field that uses terms indexed with a reverse filter (tokens are index in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.

So you can achieve this by using a reverse analyzer together with the pre_filter and post_filter options.
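As a rough illustration, a phrase suggester request with two direct generators might look like the following. It is an untested sketch adapted from the example in the Elasticsearch documentation, and it rests on assumptions that are not part of the question: a hypothetical index named test_reverse in which the word field has a subfield word.reverse, analyzed by a custom analyzer named reverse (standard tokenizer, then the lowercase and reverse token filters).

```shell
# Sketch only: assumes the hypothetical index "test_reverse" with a
# "word.reverse" subfield and a custom analyzer named "reverse".
# The first generator looks up candidates in the normal field.
# The second generator reverses the input ("rganic" becomes "cinagro" minus
# the final letter), looks it up in word.reverse where the required prefix
# now matches, and reverses the candidate back via post_filter.
curl --location --request POST 'http://127.0.0.1:9200/test_reverse/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "text" : "rganic",
    "001" : {
      "phrase" : {
        "field" : "word",
        "size" : 1,
        "direct_generator" : [
          {
            "field" : "word",
            "suggest_mode" : "always"
          },
          {
            "field" : "word.reverse",
            "suggest_mode" : "always",
            "pre_filter" : "reverse",
            "post_filter" : "reverse"
          }
        ]
      }
    }
  }
}'
```

The documentation example points the phrase suggester at a shingle-analyzed field for better phrase scoring; the plain word field is used here only to keep the sketch short.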

As you can see, they say:

This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions.

Take a look at this figure from the book Elasticsearch in Action; I think it makes the idea clearer.

[Image: a screenshot from the book explaining how Elasticsearch arrives at the correct phrase]

For more information, see the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html#:~:text=The%20phrase%20suggester%20uses%20candidate,individual%20term%20in%20the%20text.

Explaining the full idea would make this a very long answer, but I have given you the key: look into using the phrase suggester with multiple generators.