AWS OpenSearch: Why are similar sets of data ranked differently?
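I have set up an AWS OpenSearch instance with almost everything left at its default value, and then inserted some data about hotels. When a user searches for Good Morning B, the resulting query POST request looks like this: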
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "good morning b*",
"fields": ["name"],
"default_operator": "and"
}
},
{
"match": {
"provider": "SomeProvider"
}
}
]
}
},
"sort": {
"_score": {
"order": "desc"
},
"name.keyword": {
"order": "asc"
}
}
}
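The result contains 4 entries for 2 different hotels. Apart from the ID, the name and all other data in the index are identical. Here is an excerpt of the response: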
{
"took": 442,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "hotels",
"_type": "_doc",
"_id": "1",
"_score": 11.143229,
"_source": {
"id": "1",
"name": "Good Morning + Berlin City East",
"provider": "SomeProvider"
},
"sort": [
11.143229,
"Good Morning + Berlin City East"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "2",
"_score": 10.455675,
"_source": {
"id": "2",
"name": "Good Morning Bad Oldesloe",
"provider": "SomeProvider"
},
"sort": [
10.455675,
"Good Morning Bad Oldesloe"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "3",
"_score": 10.455675,
"_source": {
"id": "3",
"name": "Good Morning Bad Oldesloe",
"provider": "SomeProvider"
},
"sort": [
10.455675,
"Good Morning Bad Oldesloe"
]
},
{
"_index": "hotels",
"_type": "_doc",
"_id": "4",
"_score": 9.6945305,
"_source": {
"id": "4",
"name": "Good Morning + Berlin City East",
"provider": "SomeProvider"
},
"sort": [
9.6945305,
"Good Morning + Berlin City East"
]
}
]
}
}
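You can see that the entries for "Good Morning + Berlin City East" have two different ranks. As I said, the data they contain is exactly the same. Since the names are identical, I expected them to be ranked the same, just like the "Good Morning Bad Oldesloe" hotels.
I ran the same query with the explain=true parameter and got this for the Berlin entries (I am only posting the relevant parts here to keep it somewhat compact):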
// ID = 1
{
"sort": [
11.143229,
"Good Morning + Berlin City East"
],
"_explanation": {
"value": 11.143229,
"description": "sum of:",
"details": [
{
"value": 9.302926,
"description": "sum of:",
"details": [
{
"value": 4.151463,
"description": "weight(name:good in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 4.151463,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.811831,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 11,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.3921644,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.6001415,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 4.151463,
"description": "weight(name:morning in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 4.151463,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.811831,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 11,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.3921644,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.6001415,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 1.0,
"description": "name:b*",
"details": []
}
]
},
{
"value": 1.840302,
"description": "weight(provider:hob in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.840302,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 1.8403021,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 224,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1413,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.45454544,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 1.0,
"description": "dl, length of field",
"details": []
},
{
"value": 1.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}
// ID = 2
{
"sort": [
9.6945305,
"Good Morning + Berlin City East"
],
"_explanation": {
"value": 9.6945305,
"description": "sum of:",
"details": [
{
"value": 7.975009,
"description": "sum of:",
"details": [
{
"value": 3.4875045,
"description": "weight(name:good in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.4875045,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.0562115,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 24,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.39081526,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.5749645,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 3.4875045,
"description": "weight(name:morning in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.4875045,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 4.0562115,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 24,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.39081526,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 5.0,
"description": "dl, length of field",
"details": []
},
{
"value": 3.5749645,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 1.0,
"description": "name:b*",
"details": []
}
]
},
{
"value": 1.719521,
"description": "weight(provider:hob in 380) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.719521,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 1.719521,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 253,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1414,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.45454544,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 1.0,
"description": "dl, length of field",
"details": []
},
{
"value": 1.0,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}
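The main difference, and the reason for the different ranks, seems to be "n, number of documents containing term", which is 11 for the higher-ranked id = 1 and 24 for the lower-ranked id = 2.
But since every data field is identical (except the id), shouldn't it be the same number? The search terms are the same for both entries.
Can someone explain to me (in simple language without much math, please) why there is a difference for this hotel but not for the one in Bad Oldesloe (where, as one would expect, the numbers in the explanation are the same)?
Thanks in advance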
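The document counts are not computed by Elasticsearch across the whole index. They are computed by the underlying Lucene engine, per shard (each shard is a complete Lucene index). Since your documents are (probably) on different shards, each one is scored with its own shard's values for n and N, so their scores differ slightly. The two Bad Oldesloe entries presumably ended up on the same shard, which is why their numbers match.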
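To make the effect concrete, here is a minimal Python sketch that plugs the per-shard numbers from the two explain outputs into the BM25 formulas quoted in their "description" fields. The function names (bm25_idf, bm25_tf, term_score) are made up for this illustration and are not part of any OpenSearch or Lucene API; the natural logarithm is assumed because it reproduces the reported idf values.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as quoted in the explain output."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as quoted in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n, N, dl, avgdl, boost=2.2, freq=1.0):
    """Score of one term in one document: boost * idf * tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Shard statistics for the term "good", copied from the two explain outputs.
# The documents are identical (freq=1, dl=5); only n, N and avgdl differ.
higher = term_score(n=11, N=1413, dl=5.0, avgdl=3.6001415)
lower = term_score(n=24, N=1414, dl=5.0, avgdl=3.5749645)

print(f"idf with n=11: {bm25_idf(11, 1413):.4f}")
print(f"idf with n=24: {bm25_idf(24, 1414):.4f}")
print(f"score for 'good' on the first shard:  {higher:.4f}")
print(f"score for 'good' on the second shard: {lower:.4f}")
```

The only inputs that change between the two calls are shard-level statistics, which is why two identical documents can receive different scores. If consistent scores matter more than latency, one commonly suggested option is to run the search with search_type=dfs_query_then_fetch, which collects term statistics from all shards before scoring at the cost of an extra round trip; whether that trade-off is worthwhile for this index is an assumption to verify rather than a recommendation.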