Elasticsearch: Does edgeNGram token filter work on non-English tokens?
I am trying to set up a new mapping for an index. It will support partial keyword search and autocomplete requests, powered by ES.
The edgeNGram token filter combined with a whitespace tokenizer seems like a viable approach. My setup so far looks like this:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customNgram": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "customNgram"]
          }
        },
        "filter": {
          "customNgram": {
            "type": "edgeNGram",
            "min_gram": "3",
            "max_gram": "18",
            "side": "front"
          }
        }
      }
    }
  }
}'
There is a problem with Japanese words! Do NGrams work on Japanese characters?
For example:
【11月13日13时まで、フォロー&RTで応募!】
There is no whitespace here, so the document cannot be found with a partial keyword search. Is that expected?
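To make the behavior concrete, here is a rough Python sketch of what a whitespace tokenizer followed by a front-side edgeNGram filter (min_gram=3, max_gram=18) produces. This is a simplified approximation, not the actual Lucene implementation:

```python
# Simplified sketch of the whitespace tokenizer + edgeNGram filter
# (side=front, min_gram=3, max_gram=18). NOT the Lucene code,
# just an approximation of its output.

def whitespace_tokenize(text):
    return text.split()

def edge_ngrams(token, min_gram=3, max_gram=18):
    # Prefixes of the token, from min_gram up to max_gram chars.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def analyze(text):
    grams = []
    for token in whitespace_tokenize(text.lower()):
        grams.extend(edge_ngrams(token))
    return grams

# English: each word becomes its own stream of prefixes.
print(analyze("partial keyword search"))

# Japanese: no whitespace, so the whole string is ONE token and only
# prefixes of the entire string are produced; a query for an inner
# word such as フォロー can never match any gram.
print(analyze("【11月13日13时まで、フォロー&RTで応募!】"))
```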
You may want to look at the icu_tokenizer, which adds support for foreign languages: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
Note that to use it in your index you need to install the appropriate plugin:
bin/elasticsearch-plugin install analysis-icu
Adding this to your code:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customNgram": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["lowercase", "customNgram"]
          }
        },
        "filter": {
          "customNgram": {
            "type": "edgeNGram",
            "min_gram": "3",
            "max_gram": "18",
            "side": "front"
          }
        }
      }
    }
  }
}'
Usually you would search an autocomplete like this with a standard analyzer that uses the icu_tokenizer (but not the edgeNGram filter): add that analyzer to your mapping and apply it to your query at search time, or set it explicitly as the search_analyzer of the field to which you apply customNgram.
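The reasoning behind a separate search_analyzer can be illustrated with a toy model of token matching (a simplified sketch where a "match" means the query and the indexed field share at least one token; this is not how Elasticsearch actually scores queries):

```python
# Toy model of index-time vs. search-time analysis for autocomplete.
# A "match" here means query tokens and indexed tokens overlap.
# This is a simplification, not real Elasticsearch matching.

def edge_ngrams(token, min_gram=3, max_gram=18):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def ngram_analyze(text):
    """Index-time analyzer: lowercase, split on whitespace, edge n-grams."""
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token))
    return grams

def plain_analyze(text):
    """Search-time analyzer: lowercase and split on whitespace only."""
    return text.lower().split()

indexed = set(ngram_analyze("searching"))

# With a plain search analyzer, only genuine prefixes match:
assert set(plain_analyze("sear")) & indexed          # prefix -> match
assert not set(plain_analyze("season")) & indexed    # unrelated -> no match

# If the query is run through the SAME n-gram analyzer, "season" also
# emits the gram "sea", which collides with a gram of "searching",
# producing a spurious match:
assert set(ngram_analyze("season")) & indexed
print("spurious overlap:", set(ngram_analyze("season")) & indexed)
```

In real Elasticsearch this corresponds to setting the n-gram analyzer as the field's "analyzer" and the plain one as its "search_analyzer" in the mapping.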