Elasticsearch: Does edgeNGram token filter work on non-English tokens?
I am trying to set up a new mapping for an index. It will support partial keyword search and autocomplete requests, powered by ES.
The edgeNGram token filter combined with a whitespace tokenizer seems like a viable approach. My setup so far looks like this:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customNgram": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "customNgram"]
          }
        },
        "filter": {
          "customNgram": {
            "type": "edgeNGram",
            "min_gram": "3",
            "max_gram": "18",
            "side": "front"
          }
        }
      }
    }
  }
}'
There is a problem with Japanese words! Do NGrams work on Japanese characters?
For example:
【11月13日13时まで、フォロー&RTで応募!】
There is no whitespace here, so the document cannot be found with a partial keyword search. Is that expected?
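To make the behavior concrete, here is a rough Python sketch of what a whitespace tokenizer followed by a front-side edgeNGram filter (min_gram=3, max_gram=18) produces. This is a simplified approximation, not the actual Lucene implementation:

```python
# Simplified sketch of the whitespace tokenizer + edgeNGram filter
# (side=front, min_gram=3, max_gram=18). NOT the Lucene code,
# just an approximation of its output.

def whitespace_tokenize(text):
    return text.split()

def edge_ngrams(token, min_gram=3, max_gram=18):
    # Prefixes of the token, from min_gram up to max_gram chars.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def analyze(text):
    grams = []
    for token in whitespace_tokenize(text.lower()):
        grams.extend(edge_ngrams(token))
    return grams

# English: each word becomes its own stream of prefixes.
print(analyze("partial keyword search"))

# Japanese: no whitespace, so the whole string is ONE token and only
# prefixes of the entire string are produced; a query for an inner
# word such as フォロー can never match any gram.
print(analyze("【11月13日13时まで、フォロー&RTで応募!】"))
```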
You may want to look at the icu_tokenizer, which adds support for foreign languages: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
Note that to use it in your index you need to install the appropriate plugin:
bin/elasticsearch-plugin install analysis-icu
Adding this to your code:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "customNgram": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["lowercase", "customNgram"]
          }
        },
        "filter": {
          "customNgram": {
            "type": "edgeNGram",
            "min_gram": "3",
            "max_gram": "18",
            "side": "front"
          }
        }
      }
    }
  }
}'
Usually you would search an autocomplete like this with a standard analyzer that uses the icu_tokenizer (but not the edgeNGram filter): add that analyzer to your mapping and apply it to your query at search time, or set it explicitly as the search_analyzer of the field to which you apply customNgram.
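The reasoning behind a separate search_analyzer can be illustrated with a toy model of token matching (a simplified sketch where a "match" means the query and the indexed field share at least one token; this is not how Elasticsearch actually scores queries):

```python
# Toy model of index-time vs. search-time analysis for autocomplete.
# A "match" here means query tokens and indexed tokens overlap.
# This is a simplification, not real Elasticsearch matching.

def edge_ngrams(token, min_gram=3, max_gram=18):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def ngram_analyze(text):
    """Index-time analyzer: lowercase, split on whitespace, edge n-grams."""
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token))
    return grams

def plain_analyze(text):
    """Search-time analyzer: lowercase and split on whitespace only."""
    return text.lower().split()

indexed = set(ngram_analyze("searching"))

# With a plain search analyzer, only genuine prefixes match:
assert set(plain_analyze("sear")) & indexed          # prefix -> match
assert not set(plain_analyze("season")) & indexed    # unrelated -> no match

# If the query is run through the SAME n-gram analyzer, "season" also
# emits the gram "sea", which collides with a gram of "searching",
# producing a spurious match:
assert set(ngram_analyze("season")) & indexed
print("spurious overlap:", set(ngram_analyze("season")) & indexed)
```

In real Elasticsearch this corresponds to setting the n-gram analyzer as the field's "analyzer" and the plain one as its "search_analyzer" in the mapping.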