How to match query terms containing hyphens or trailing space in elasticsearch
The mapping char_filter section of the elasticsearch reference is a bit vague, and I'm having a hard time understanding whether and how to use a char_filter in an analyzer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically, the data we store in the index are IDs of type String that look like this: "008392342000". I would like to be able to search for such an ID even when the query term actually contains a hyphen or a trailing space, like this: "008392342-000 ".
How would you suggest I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
And here are the settings of the index, containing the analyzers, etc.:
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
}
You didn't provide your actual analyzers, the data you are indexing, or what your expectations are, but based on what you did provide I would start with this:
{
    "settings": {
        "analysis": {
            "char_filter": {
                "my_mapping": {
                    "type": "mapping",
                    "mappings": [
                        "-=>"
                    ]
                }
            },
            "analyzer": {
                "autocomplete_search": {
                    "tokenizer": "keyword",
                    "char_filter": [
                        "my_mapping"
                    ],
                    "filter": [
                        "trim"
                    ]
                },
                "autocomplete_index": {
                    "tokenizer": "keyword",
                    "filter": [
                        "trim"
                    ]
                }
            }
        }
    },
    "mappings": {
        "test": {
            "properties": {
                "ucn": {
                    "type": "multi_field",
                    "fields": {
                        "ucn_autoc": {
                            "type": "string",
                            "index": "analyzed",
                            "index_analyzer": "autocomplete_index",
                            "search_analyzer": "autocomplete_search"
                        },
                        "ucn": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}
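To try it out, the whole body above can be sent when creating the index (a minimal sketch, assuming a local node on localhost:9200 and the index name my_index used in the tests below; the JSON above is assumed to be saved as settings.json):

curl -XPUT 'http://localhost:9200/my_index' -d @settings.json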
The char_filter will replace - with nothing: -=>. I would also use the trim filter to get rid of any trailing or leading whitespace. Not knowing what your autocomplete_index analyzer is, I just used the keyword one.
Testing the analyzer:
GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000
Result:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
Meaning it does get rid of the - and the spaces.
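For comparison, the index-side analyzer can be checked the same way (a sketch; since autocomplete_index above is just the keyword tokenizer plus trim, the stored ID should come back as a single unchanged token):

GET /my_index/_analyze?analyzer=autocomplete_index&text=012334742000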
And a typical query would be:
{
    "query": {
        "match": {
            "ucn.ucn_autoc": " 0123-34742-000 "
        }
    }
}
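For completeness, a hedged end-to-end sketch of the round trip, assuming the index was created as my_index with the settings above (the document ID 1 is hypothetical): index the clean ID, refresh, then search with the hyphenated, space-padded term; the search analyzer strips the hyphens and spaces, so the match query finds the document.

curl -XPUT 'http://localhost:9200/my_index/test/1' -d '{"ucn": "012334742000"}'
curl -XPOST 'http://localhost:9200/my_index/_refresh'
curl -XPOST 'http://localhost:9200/my_index/test/_search' -d '{
  "query": {
    "match": {
      "ucn.ucn_autoc": " 0123-34742-000 "
    }
  }
}'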