Elasticsearch - how to use a language analyzer with a UTF-8 filter?
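I have a question about the Elasticsearch language analyzers. I am working with Lithuanian, so I am using the lithuanian analyzer. The analyzer works fine and I get all the grammatical cases of a word that I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
  "name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" written with plain Latin letters only ("Klaipeda"), in all Lithuanian cases:
- Nominative: "Klaipeda"
- Genitive: "Klaipedos"
- ...
- Locative: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index: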
PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "fields": {
            "folded": {
              "type": "string",
              "analyzer": "md_folded_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "md_folded_analyzer": {
          "type": "lithuanian",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer"
          ]
        }
      }
    }
  }
}
And the search query:
GET /cities/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "klaipeda",
      "fields": [ "name", "name.folded" ]
    }
  }
}
What am I doing wrong? Thanks for your help.
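The technique you are using here is called multi-fields. The limitation of the underlying name.folded field is that you cannot search against it; you can only sort by name.folded and aggregate on it.
To work around this, I came up with the following setup:
A separate field (to avoid duplicating the data, just specify copy_to):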
curl -XPUT http://localhost:9200/cities -d '
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "copy_to": "folded"
        },
        "folded": {
          "type": "string",
          "analyzer": "md_folded_analyzer"
        }
      }
    }
  }
}'
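With copy_to in place, the search from the question would presumably target the top-level folded field rather than name.folded. The following is only a sketch in Python that builds the request body; the field list is an assumption based on the mapping above, and the body would still have to be sent to GET /cities/_search.

```python
import json

# Hypothetical query body: the same multi_match as in the question,
# but pointing at the top-level "folded" field that copy_to fills.
search_body = {
    "query": {
        "multi_match": {
            "type": "most_fields",
            "query": "klaipeda",
            "fields": ["name", "folded"],
        }
    }
}

# Print the JSON that would be sent as the request body.
print(json.dumps(search_body, indent=2))
```

Keeping name in the field list means exact Lithuanian spellings still match through the lithuanian analyzer, while folded is meant to catch the plain Latin spellings.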
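Change the analyzer type to custom, as described in the documentation; otherwise the asciifolding filter is not picked up by the configuration. More importantly, asciifolding should come after all the Lithuanian stop-word and stemming filters, because a word that has already been folded may no longer match what those filters expect: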
curl -XPUT http://localhost:9200/my_cities -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "lithuanian_stop": {
          "type": "stop",
          "stopwords": "_lithuanian_"
        },
        "lithuanian_stemmer": {
          "type": "stemmer",
          "language": "lithuanian"
        }
      },
      "analyzer": {
        "md_folded_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "lithuanian_stop",
            "lithuanian_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}'
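To make the effect of the last filter in that chain concrete, here is a rough Python approximation of what ASCII folding does to the city name. It uses Unicode decomposition from the standard library, not the actual Elasticsearch asciifolding filter, and the helper name fold_to_ascii is made up for this illustration. It does not attempt to reproduce the Lithuanian stemmer.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The three Lithuanian forms from the question, lowercased and folded.
forms = ["Klaipėda", "Klaipėdos", "Klaipėdoje"]
folded = [fold_to_ascii(form).lower() for form in forms]
print(folded)  # ['klaipeda', 'klaipedos', 'klaipedoje']
```

The folded forms are exactly the plain Latin spellings that failed to match in the question, which is why the folded field needs an analyzer that really applies this step.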
Sorry, I have removed lithuanian_keywords from the analyzer - it requires additional setup, which I have skipped here. But I hope you get the idea.
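One practical note: the mapping refers to md_folded_analyzer by name, so the analyzer has to be defined in the same index as the mapping. A minimal sketch, assuming a single index called cities on a local node at http://localhost:9200 (both are assumptions), is to put the settings and the mappings into one request body. The "string" type and the "city" mapping type are kept as in the snippets above; newer Elasticsearch versions use "text" and have no mapping types. The create_index helper is hypothetical and is not called automatically.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local single-node cluster
INDEX = "cities"                  # assumption: one index for settings and mappings

# Settings and mappings from the answer, merged into one create-index body.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "lithuanian_stop": {"type": "stop", "stopwords": "_lithuanian_"},
                "lithuanian_stemmer": {"type": "stemmer", "language": "lithuanian"},
            },
            "analyzer": {
                "md_folded_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # asciifolding stays last, after stop words and stemming
                    "filter": [
                        "lowercase",
                        "lithuanian_stop",
                        "lithuanian_stemmer",
                        "asciifolding",
                    ],
                }
            },
        }
    },
    "mappings": {
        "city": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "lithuanian",
                    "copy_to": "folded",
                },
                "folded": {"type": "string", "analyzer": "md_folded_analyzer"},
            }
        }
    },
}


def create_index(url: str = ES_URL, index: str = INDEX) -> dict:
    """Send the combined body with PUT /<index>; needs a running cluster."""
    request = urllib.request.Request(
        f"{url}/{index}",
        data=json.dumps(index_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only print the body; call create_index() yourself against a test cluster.
    print(json.dumps(index_body, indent=2))
```

After the index exists, running the _analyze API with md_folded_analyzer on "Klaipėdoje" and on "Klaipedoje" is a quick way to check whether both spellings end up as the same token.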