如何在 elasticsearch 中匹配包含连字符或尾随 space 的查询词

How to match query terms containing hyphens or trailing space in elasticsearch

在 elasticsearch 映射的映射 char_filter 部分,它有点模糊,我很难理解是否以及如何使用 charfilter 分析器:http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

基本上,我们存储在索引中的数据是 String 类型的 ID,如下所示:"008392342000"。我希望能够在查询字词实际包含连字符或尾随 space 时搜索此类 ID,如下所示:"008392342-000 ".

你建议我如何设置分析仪? 目前这是字段的定义:

"mappings": {
    "client": {
        "properties": {
            "ucn": {
                "type": "multi_field",
                "fields": {
                    "ucn_autoc": {
                        "type": "string",
                        "index": "analyzed",
                        "index_analyzer": "autocomplete_index",
                        "search_analyzer": "autocomplete_search"
                    },
                    "ucn": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

这里是包含分析器等索引的设置

 "settings": {
        "analysis": {
            "filter": {
                "autocomplete_ngram": {
                    "max_gram": 15,
                    "min_gram": 1,
                    "type": "edge_ngram"
                },
                "ngram_filter": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "lowercase_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_index": {
                    "filter": [
                        "lowercase",
                        "autocomplete_ngram"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_index": {
                    "filter": [
                        "ngram_filter",
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                }
            },
            "index": {
                "number_of_shards": 6,
                "number_of_replicas": 1
            }
        }
    }

您没有提供您的实际分析仪、输入的数据以及您的期望是什么,但根据您提供的信息,我将从以下开始:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            "-=>"
          ]
        }
      },
      "analyzer": {
        "autocomplete_search": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_mapping"
          ],
          "filter": [
            "trim"
          ]
        },
        "autocomplete_index": {
          "tokenizer": "keyword",
          "filter": [
            "trim"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "ucn": {
          "type": "multi_field",
          "fields": {
            "ucn_autoc": {
              "type": "string",
              "index": "analyzed",
              "index_analyzer": "autocomplete_index",
              "search_analyzer": "autocomplete_search"
            },
            "ucn": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

char_filter 将用空值替换 --=>。我还会使用 trim 过滤器去除任何尾随或前导空格。不知道你的 autocomplete_index 分析仪是什么,我只用了 keyword 一个。

测试分析器 GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 结果:

"tokens": [
      {
         "token": "012334742000",
         "start_offset": 0,
         "end_offset": 17,
         "type": "word",
         "position": 1
      }
   ]

这意味着它确实消除了 - 和空格。 典型的查询是:

{
  "query": {
    "match": {
      "ucn.ucn_autoc": " 0123-34742-000  "
    }
  }
}