AWS OpenSearch: Why are similar sets of data ranked differently?

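I have set up an AWS OpenSearch instance with almost everything left at its default value. I then inserted some hotel data. When a user searches for Good Morning B, the resulting query POST request looks like this: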

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "good morning b*",
                        "fields": ["name"],
                        "default_operator": "and"
                    }
                },
                {
                    "match": {
                        "provider": "SomeProvider"
                    }
                }
            ]
        }
    },
    "sort": {
        "_score": {
            "order": "desc"
        },
        "name.keyword": {
            "order": "asc"
        }
    }
}

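The result contains 4 entries for 2 different hotels. Apart from the ID, the name and all other data in the index are identical. Here is an excerpt of the response: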

{
  "took": 442,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "1",
        "_score": 11.143229,
        "_source": {
          "id": "1",
          "name": "Good Morning + Berlin City East",
          "provider": "SomeProvider"
        },
        "sort": [
          11.143229,
          "Good Morning + Berlin City East"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "2",
        "_score": 10.455675,
        "_source": {
          "id": "2",
          "name": "Good Morning Bad Oldesloe",
          "provider": "SomeProvider"
        },
        "sort": [
          10.455675,
          "Good Morning Bad Oldesloe"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "3",
        "_score": 10.455675,
        "_source": {
          "id": "3",
          "name": "Good Morning Bad Oldesloe",
          "provider": "SomeProvider"
        },
        "sort": [
          10.455675,
          "Good Morning Bad Oldesloe"
        ]
      },
      {
        "_index": "hotels",
        "_type": "_doc",
        "_id": "4",
        "_score": 9.6945305,
        "_source": {
          "id": "4",
          "name": "Good Morning + Berlin City East",
          "provider": "SomeProvider"
        },
        "sort": [
          9.6945305,
          "Good Morning + Berlin City East"
        ]
      }
    ]
  }
}

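You can see that the entries for "Good Morning + Berlin City East" have two different ranks. As I said, the data they contain is exactly the same. Since the names are identical, I expected them to be ranked the same, just like the two "Good Morning Bad Oldesloe" entries.

I ran the same query with the explain=true parameter and got this for the Berlin entries (I am only posting the relevant parts here to keep it reasonably compact):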

// ID = 1
{
  "sort": [
    11.143229,
    "Good Morning + Berlin City East"
  ],
  "_explanation": {
    "value": 11.143229,
    "description": "sum of:",
    "details": [
      {
        "value": 9.302926,
        "description": "sum of:",
        "details": [
          {
            "value": 4.151463,
            "description": "weight(name:good in 1) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 4.151463,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                  {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                  },
                  {
                    "value": 4.811831,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                      {
                        "value": 11,
                        "description": "n, number of documents containing term",
                        "details": []
                      },
                      {
                        "value": 1413,
                        "description": "N, total number of documents with field",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.3921644,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                      {
                        "value": 1.0,
                        "description": "freq, occurrences of term within document",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                      },
                      {
                        "value": 5.0,
                        "description": "dl, length of field",
                        "details": []
                      },
                      {
                        "value": 3.6001415,
                        "description": "avgdl, average length of field",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 4.151463,
            "description": "weight(name:morning in 1) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 4.151463,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                  {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                  },
                  {
                    "value": 4.811831,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                      {
                        "value": 11,
                        "description": "n, number of documents containing term",
                        "details": []
                      },
                      {
                        "value": 1413,
                        "description": "N, total number of documents with field",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.3921644,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                      {
                        "value": 1.0,
                        "description": "freq, occurrences of term within document",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                      },
                      {
                        "value": 5.0,
                        "description": "dl, length of field",
                        "details": []
                      },
                      {
                        "value": 3.6001415,
                        "description": "avgdl, average length of field",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 1.0,
            "description": "name:b*",
            "details": []
          }
        ]
      },
      {
        "value": 1.840302,
        "description": "weight(provider:hob in 1) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 1.840302,
            "description": "score(freq=1.0), computed as boost * idf * tf from:",
            "details": [
              {
                "value": 2.2,
                "description": "boost",
                "details": []
              },
              {
                "value": 1.8403021,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 224,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 1413,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.45454544,
                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details": [
                  {
                    "value": 1.0,
                    "description": "freq, occurrences of term within document",
                    "details": []
                  },
                  {
                    "value": 1.2,
                    "description": "k1, term saturation parameter",
                    "details": []
                  },
                  {
                    "value": 0.75,
                    "description": "b, length normalization parameter",
                    "details": []
                  },
                  {
                    "value": 1.0,
                    "description": "dl, length of field",
                    "details": []
                  },
                  {
                    "value": 1.0,
                    "description": "avgdl, average length of field",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

// ID = 2
{
  "sort": [
      9.6945305,
      "Good Morning + Berlin City East"
  ],
  "_explanation": {
      "value": 9.6945305,
      "description": "sum of:",
      "details": [
          {
              "value": 7.975009,
              "description": "sum of:",
              "details": [
                  {
                      "value": 3.4875045,
                      "description": "weight(name:good in 380) [PerFieldSimilarity], result of:",
                      "details": [
                          {
                              "value": 3.4875045,
                              "description": "score(freq=1.0), computed as boost * idf * tf from:",
                              "details": [
                                  {
                                      "value": 2.2,
                                      "description": "boost",
                                      "details": []
                                  },
                                  {
                                      "value": 4.0562115,
                                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                      "details": [
                                          {
                                              "value": 24,
                                              "description": "n, number of documents containing term",
                                              "details": []
                                          },
                                          {
                                              "value": 1414,
                                              "description": "N, total number of documents with field",
                                              "details": []
                                          }
                                      ]
                                  },
                                  {
                                      "value": 0.39081526,
                                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                      "details": [
                                          {
                                              "value": 1.0,
                                              "description": "freq, occurrences of term within document",
                                              "details": []
                                          },
                                          {
                                              "value": 1.2,
                                              "description": "k1, term saturation parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 0.75,
                                              "description": "b, length normalization parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 5.0,
                                              "description": "dl, length of field",
                                              "details": []
                                          },
                                          {
                                              "value": 3.5749645,
                                              "description": "avgdl, average length of field",
                                              "details": []
                                          }
                                      ]
                                  }
                              ]
                          }
                      ]
                  },
                  {
                      "value": 3.4875045,
                      "description": "weight(name:morning in 380) [PerFieldSimilarity], result of:",
                      "details": [
                          {
                              "value": 3.4875045,
                              "description": "score(freq=1.0), computed as boost * idf * tf from:",
                              "details": [
                                  {
                                      "value": 2.2,
                                      "description": "boost",
                                      "details": []
                                  },
                                  {
                                      "value": 4.0562115,
                                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                      "details": [
                                          {
                                              "value": 24,
                                              "description": "n, number of documents containing term",
                                              "details": []
                                          },
                                          {
                                              "value": 1414,
                                              "description": "N, total number of documents with field",
                                              "details": []
                                          }
                                      ]
                                  },
                                  {
                                      "value": 0.39081526,
                                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                      "details": [
                                          {
                                              "value": 1.0,
                                              "description": "freq, occurrences of term within document",
                                              "details": []
                                          },
                                          {
                                              "value": 1.2,
                                              "description": "k1, term saturation parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 0.75,
                                              "description": "b, length normalization parameter",
                                              "details": []
                                          },
                                          {
                                              "value": 5.0,
                                              "description": "dl, length of field",
                                              "details": []
                                          },
                                          {
                                              "value": 3.5749645,
                                              "description": "avgdl, average length of field",
                                              "details": []
                                          }
                                      ]
                                  }
                              ]
                          }
                      ]
                  },
                  {
                      "value": 1.0,
                      "description": "name:b*",
                      "details": []
                  }
              ]
          },
          {
              "value": 1.719521,
              "description": "weight(provider:hob in 380) [PerFieldSimilarity], result of:",
              "details": [
                  {
                      "value": 1.719521,
                      "description": "score(freq=1.0), computed as boost * idf * tf from:",
                      "details": [
                          {
                              "value": 2.2,
                              "description": "boost",
                              "details": []
                          },
                          {
                              "value": 1.719521,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                  {
                                      "value": 253,
                                      "description": "n, number of documents containing term",
                                      "details": []
                                  },
                                  {
                                      "value": 1414,
                                      "description": "N, total number of documents with field",
                                      "details": []
                                  }
                              ]
                          },
                          {
                              "value": 0.45454544,
                              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                              "details": [
                                  {
                                      "value": 1.0,
                                      "description": "freq, occurrences of term within document",
                                      "details": []
                                  },
                                  {
                                      "value": 1.2,
                                      "description": "k1, term saturation parameter",
                                      "details": []
                                  },
                                  {
                                      "value": 0.75,
                                      "description": "b, length normalization parameter",
                                      "details": []
                                  },
                                  {
                                      "value": 1.0,
                                      "description": "dl, length of field",
                                      "details": []
                                  },
                                  {
                                      "value": 1.0,
                                      "description": "avgdl, average length of field",
                                      "details": []
                                  }
                              ]
                          }
                      ]
                  }
              ]
          }
      ]
  }
}

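The main difference, and the reason for the different ranks, seems to be "n, number of documents containing term", which is 11 for the higher-ranked id = 1 and 24 for the lower-ranked id = 2. But since every data field is identical (apart from the id), shouldn't that number be the same? The search terms are the same for both entries.

Can someone explain to me (in simple language without much math, please) why there is a difference for this hotel but not for the one in Bad Oldesloe (there, as one would expect, the numbers in the explanation are the same)?

Thanks in advance.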

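The document counts are not computed by Elasticsearch across the whole index. They are computed by the underlying Lucene engine, per shard (each shard is a complete Lucene index). Your response shows that the index has 5 shards, and since your two documents are (probably) located on different shards, each one is scored with its own shard's term statistics, so their scores differ slightly. The two Bad Oldesloe entries presumably ended up on the same shard, which is why their numbers match.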
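
To make this concrete, here is a minimal sketch in Python that recomputes the score for the term "good" from the two explain outputs in the question. The formulas are the ones printed in the explain descriptions; the function names (`bm25_idf`, `bm25_tf`, `term_score`) and the labels `shard_a` / `shard_b` are made up for illustration, and the assumption is that the two sets of statistics come from two different shards.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as shown in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """tf as shown in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n: int, N: int, dl: float, avgdl: float,
               freq: float = 1.0, boost: float = 2.2) -> float:
    """Score of a single term: boost * idf * tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Statistics copied from the explain output of the higher-ranked entry
# (assumed here to live on one shard).
shard_a = term_score(n=11, N=1413, dl=5.0, avgdl=3.6001415)

# Statistics copied from the explain output of the lower-ranked entry
# (assumed here to live on a different shard).
shard_b = term_score(n=24, N=1414, dl=5.0, avgdl=3.5749645)

print(f"score for 'good' with shard A statistics: {shard_a:.4f}")
print(f"score for 'good' with shard B statistics: {shard_b:.4f}")
```

The document itself (freq and dl) is the same in both calls. Only n, N and avgdl change, and those describe the shard, not the document. If identical scores matter, the search API has a `search_type=dfs_query_then_fetch` option that collects term statistics from all shards before scoring, at the cost of an extra round trip. For an index this small, using a single primary shard would be another way to avoid the discrepancy.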