Elasticsearch Similar Text Query

Given the following documents in an index (let's call it addresses):

{
    ADDRESS: {
        ID: 1,
        LINE1: "steet 1",
        CITY: "kuala lumpur",
        COUNTRY: "MALAYSIA",
        ...
    } 
}
{
    ADDRESS: {
        ID: 2,
        LINE1: "steet 1",
        CITY: "kualalumpur city",
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 3,
        LINE1: "steet 1",
        CITY: "kualalumpur",        
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 4,
        LINE1: "steet 1",
        CITY: "kuala lumpur city",      
        COUNTRY: "MALAYSIA",
        ...
    }
}

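So far I have a query that, given the search text "kualalumpur", retrieves "kualalumpur", "kuala lumpur", and "kualalumpur city".
However, "kuala lumpur city" is missing from the results, even though it is almost identical to "kualalumpur city".

This is my query so far: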

{
  "query": {
    "bool": {
      "should": [
          {"match": {"ADDRESS.STREET": {"query": "street 1", "fuzziness": 1, "operator": "AND"}}},
          {
            "bool": {
              "should": [
                {"match": {"ADDRESS.CITY": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}},
                {"match": {"ADDRESS.CITY.keyword": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}}
              ]
            }
          }
        ],
      "filter": {
        "bool": {
          "must": [
            {"term": {"ADDRESS.COUNTRY.keyword": "MALAYSIA"}}
          ]
        }
      },
      "minimum_should_match": 2
    }
  }
}

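Given these conditions, is it possible for Elasticsearch to return all four documents for the search text "kualalumpur"?

You can get all four documents by using an edge n-gram tokenizer on the country field (which holds the city values in this example). I tried it locally and have added a working example below.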
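For context, here is a rough sketch of why the fuzzy query in the question misses "kuala lumpur city". It assumes that ADDRESS.CITY is split into whitespace-separated tokens (roughly what the standard analyzer does with these values) and that ADDRESS.CITY.keyword stores the whole value as one term. It models only the two city clauses, not the street clause or the country filter, and it uses plain Levenshtein distance as a simplification (Elasticsearch also counts a transposition as one edit by default). The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_city_hit(query: str, city: str, max_edits: int = 1) -> bool:
    """True if any indexed term is within max_edits of the query term."""
    text_terms = city.split()   # assumed tokens of the analyzed text field
    keyword_term = [city]       # the keyword sub-field keeps the full string
    return any(levenshtein(query, term) <= max_edits
               for term in text_terms + keyword_term)


CITIES = ["kuala lumpur", "kualalumpur city", "kualalumpur", "kuala lumpur city"]
RESULTS = {city: fuzzy_city_hit("kualalumpur", city) for city in CITIES}

if __name__ == "__main__":
    for city, hit in RESULTS.items():
        print(f"{city!r}: {hit}")
```

Under these assumptions, "kuala lumpur" is found only through the keyword sub-field (one inserted space), while "kuala lumpur city" has no term within one edit on either field.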

Create a custom analyzer and apply it to your field:

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "ngram_analyzer": {
                        "type": "custom",
                        "filter": [
                            "lowercase"
                        ],
                        "tokenizer": "edgeNGramTokenizer"
                    }
                },
                "tokenizer": {
                    "edgeNGramTokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "1",
                        "type": "edge_ngram",
                        "max_gram": "40"
                    }
                }
            },
            "max_ngram_diff": "50"
        }
    },
    "mappings": {
        "properties": {
            "country": {
                "type": "text",
                "analyzer" : "ngram_analyzer"
            }
        }
    }
}

Index all four sample documents, for example:

{
  "country" : "kuala lumpur"
}
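If you prefer to load all four documents in one request, the following sketch builds a newline-delimited body for the `_bulk` endpoint. The index name `fuzzy` and the document IDs are taken from the search response in this answer; the helper name `build_bulk_body` is hypothetical, and sending the body (with the `Content-Type: application/x-ndjson` header) is left to whatever HTTP client you use.

```python
import json

# IDs and values as they appear in the search response of this answer.
DOCS = {
    "1": "kuala lumpur",
    "2": "kualalumpur city",
    "3": "kualalumpur",
    "4": "kuala lumpur city",
}


def build_bulk_body(index: str, docs: dict) -> str:
    """Return an NDJSON body: one action line followed by one source line per doc."""
    lines = []
    for doc_id, city in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"country": city}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


BULK_BODY = build_bulk_body("fuzzy", DOCS)

if __name__ == "__main__":
    print(BULK_BODY, end="")
```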

A search query for the term kualalumpur matches all four documents:

{
    "query": {
        "match" : {
            "country" : "kualalumpur"
        }
    }
}

 "hits": [
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "3",
        "_score": 5.0003963,
        "_source": {
          "country": "kualalumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.4082437,
        "_source": {
          "country": "kualalumpur city"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5621849,
        "_source": {
          "country": "kuala lumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.4956103,
        "_source": {
          "country": "kuala lumpur city"
        }
      }
    ]
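
To see why this works, here is a small Python approximation of the analyzer above. It is only a sketch: it assumes the tokenizer splits on anything that is not a letter or digit, emits every prefix of each word from min_gram to max_gram characters, and that the same analyzer is applied to the query text (no separate search analyzer is configured in the mapping). The function name `edge_ngrams` is hypothetical.

```python
import re


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 40) -> list:
    """Approximate an edge n-gram tokenizer followed by a lowercase filter."""
    tokens = []
    # Split on characters that are neither letters nor digits.
    for word in re.findall(r"[^\W_]+", text.lower()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])  # prefix of length n
    return tokens


CITIES = ["kuala lumpur", "kualalumpur city", "kualalumpur", "kuala lumpur city"]
QUERY_TOKENS = set(edge_ngrams("kualalumpur"))

# Tokens that the query shares with each indexed value, shortest first.
SHARED = {
    city: sorted(QUERY_TOKENS & set(edge_ngrams(city)), key=len)
    for city in CITIES
}

if __name__ == "__main__":
    for city, shared in SHARED.items():
        print(f"{city!r}: {len(shared)} shared tokens {shared}")
```

Every document shares at least the prefixes "k" through "kuala" with the query, so the default OR match finds all four. The documents that contain the full word "kualalumpur" share more tokens, which is consistent with their higher scores above. Note that with min_gram set to 1, any value containing a word that starts with "k" would also match; raising min_gram is one way to reduce such noise.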