How to build an Elasticsearch query that will take into account the distance between words?

I am running elasticsearch:7.6.2

I have an index with 4 simple documents:

    PUT demo_idx/_doc/1
    {
      "content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
    }

    PUT demo_idx/_doc/2
    {
      "content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
    }

    PUT demo_idx/_doc/3
    {
      "content": "Distributed nature, simple REST APIs, speed, and scalability"
    }

    PUT demo_idx/_doc/4
    {
      "content": "Distributed tmp tmp nature"
    }

I want to search for the text "distributed nature" and get the results in the following order:

Doc id: 3 
Doc id: 1
Doc id: 2
Doc id: 4

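That is, documents with an exact match (doc 3 and doc 1) should appear before documents that match with a small slop (doc 2), and documents that match only with a large slop (doc 4) should appear last.

I read this post: How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word, but it did not help me.

I tried the following search query: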

"query": {
  "bool": {
    "must": [
      {
        "match_phrase": {
          "content": {
            "query": query,
            "slop": 2
          }
        }
      }
    ]
  }
}

But it did not give me the desired results.

I got the following results:

Doc id: 3, Score: 0.22949813
Doc id: 4, Score: 0.15556586
Doc id: 1, Score: 0.15401536
Doc id: 2, Score: 0.14397088

How should I write the query to get the results I want?

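You can use a bool query with should clauses to rank documents that exactly match "Distributed nature" higher. The first clause boosts the score of documents that match the phrase "Distributed nature" exactly, with no slop; the second clause still matches documents within a slop of 2.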

POST demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature",
              "slop": 2
            }
          }
        }
      ]
    }
  }
}

The search response will be:

"hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.45899627,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.30803072,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.15556586,
        "_source" : {
          "content" : "Distributed tmp tmp nature"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.14397088,
        "_source" : {
          "content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
        }
      }
    ]
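Since the question passes the search text in as a variable, the two-clause query can also be assembled from client code. The following is a minimal Python sketch, not part of the original answer; the helper name `build_proximity_query` and its parameters are hypothetical, and it only builds the request body without contacting a cluster.

```python
import json


def build_proximity_query(text, slop=2, field="content"):
    """Build a bool/should body with an exact phrase clause and a sloppy one.

    Hypothetical helper: the name and the default values are assumptions
    made for illustration, mirroring the query shown in the answer.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact phrase (default slop of 0): adds extra score
                    # to documents where the words are adjacent.
                    {"match_phrase": {field: {"query": text}}},
                    # Sloppy phrase: keeps documents whose words are
                    # up to `slop` positions apart in the result set.
                    {"match_phrase": {field: {"query": text, "slop": slop}}},
                ]
            }
        }
    }


body = build_proximity_query("Distributed nature")
print(json.dumps(body, indent=2))
```

The printed body can be sent to `POST demo_idx/_search` with whatever client you already use.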

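Update 1:

To avoid the effect of "field length" on the score of the search query, you need to disable the "norms" parameter for the "content" field, using the update mapping API: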

PUT demo_idx/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "norms": false
    }
  }
}

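After this, reindex the documents, because norms are not removed immediately.

Now run the search query again; the search response will be in the order you expect.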
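To see why field length matters here, the scores quoted above can be approximated with a rough Python sketch. It is not Elasticsearch code. It assumes a BM25-style formula with k1 = 1.2 and b = 0.75, an overall factor of (k1 + 1), a sloppy phrase frequency of 1 / (1 + distance), and a simplified tokenizer; these assumptions were chosen because they reproduce the numbers in this thread. Disabled norms are modelled, as a further assumption, by giving every document the same length.

```python
import math
import re

# Document contents copied from the question.
DOCS = {
    "1": "Distributed nature, simple REST APIs, speed, and scalability, "
         "Elasticsearch is the central component of the Elastic Stack, the end",
    "2": "Distributed tmp nature, simple REST APIs, speed, and scalability",
    "3": "Distributed nature, simple REST APIs, speed, and scalability",
    "4": "Distributed tmp tmp nature",
}
K1, B = 1.2, 0.75  # assumed BM25 parameters


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumerics,
    # a stand-in for the real analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_score(tokens, avg_len, slop, use_norms=True):
    # Number of extra tokens between "distributed" and "nature".
    distance = tokens.index("nature") - tokens.index("distributed") - 1
    if distance > slop:
        return 0.0  # the phrase clause does not match
    freq = 1.0 / (1.0 + distance)  # assumed sloppy phrase frequency
    # With norms "disabled", pretend every document has average length.
    length = len(tokens) if use_norms else avg_len
    norm = K1 * (1 - B + B * length / avg_len)
    n_docs = df = len(DOCS)  # both terms occur in every document
    idf = 2 * math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # two terms
    return (K1 + 1) * idf * freq / (freq + norm)


def rank(slops, use_norms=True):
    """Score each document as the sum of one phrase clause per slop value."""
    tokens = {doc_id: tokenize(text) for doc_id, text in DOCS.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    scores = {
        doc_id: sum(phrase_score(t, avg_len, s, use_norms) for s in slops)
        for doc_id, t in tokens.items()
    }
    return sorted(scores.items(), key=lambda item: -item[1])


for label, result in [
    ("slop 2 only", rank([2])),
    ("exact + slop 2", rank([0, 2])),
    ("exact + slop 2, equal lengths", rank([0, 2], use_norms=False)),
]:
    print(label, [(doc_id, round(score, 4)) for doc_id, score in result])
```

In this model, doc 4 outranks doc 2 only because it is much shorter (4 tokens versus 9), which outweighs its larger word distance. Once length is taken out of the formula, docs 1 and 3 tie at the top, followed by doc 2 and then doc 4.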