Elasticsearch fuzzy queries ignore boost factor?
When I run this query:
GET /index_for_test/_search
{
"query": {
"multi_match": {
"query": "Italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ]
}
}
}
it shows this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.04012554,
"hits": [
{
"_index": "index_for_test",
"_type": "business",
"_id": "1269493995",
"_score": 0.04012554,
"_source": {
"name": "Bono Italian Restaurant",
"categories": [
"Pizza"
]
}
},
{
"_index": "index_for_test",
"_type": "business",
"_id": "2017788160",
"_score": 0.014542127,
"_source": {
"name": "Pizza Perperook",
"categories": [
"Italian Food"
]
}
}
]
}
}
But when I add fuzziness to this query:
GET /index_for_test/_search
{
"query": {
"multi_match": {
"query": "Italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ],
"fuzziness":2
}
}
}
it ignores the boost factor and shows this result:
{
"took": 28,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.095891505,
"hits": [
{
"_index": "index_for_test",
"_type": "business",
"_id": "2017788160",
"_score": 0.095891505,
"_source": {
"name": "Pizza Perperook",
"categories": [
"Italian Food"
]
}
},
{
"_index": "index_for_test",
"_type": "business",
"_id": "1269493995",
"_score": 0.076713204,
"_source": {
"name": "Bono Italian Restaurant",
"categories": [
"Pizza"
]
}
}
]
}
}
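Since I boost the name field by a factor of 2 (by using name^2 in fields), it should return the results in the same order as the first query, but it seems to ignore the boost factor.
I tried other query types (query_string, fuzzy_like_this) and ran into the same problem.
Edited: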
GET /index_for_test/_search?explain=true
{
"query": {
"multi_match": {
"query": "پیتزا",
"type": "most_fields",
"fields": [ "name^2", "categories" ]
}
}
}
Result of the fuzzy search with ?explain=true:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.05015693,
"hits": [
{
"_shard": 1,
"_node": "ZTZ37EpAR1W9e4Qqwk0O5Q",
"_index": "index_for_test",
"_type": "business",
"_id": "2017788160",
"_score": 0.05015693,
"_source": {
"name": "پیتزا پرپروک",
"categories": [
"غذای ایتالیایی"
]
},
"_explanation": {
"value": 0.05015693,
"description": "product of:",
"details": [
{
"value": 0.10031386,
"description": "sum of:",
"details": [
{
"value": 0.10031386,
"description": "weight(name:پیتزا^2.0 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.10031386,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.5230591,
"description": "queryWeight, product of:",
"details": [
{
"value": 2,
"description": "boost"
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 0.8522964,
"description": "queryNorm"
}
]
},
{
"value": 0.19178301,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
}
},
{
"_shard": 2,
"_node": "ZTZ37EpAR1W9e4Qqwk0O5Q",
"_index": "index_for_test",
"_type": "business",
"_id": "1269493995",
"_score": 0.023267403,
"_source": {
"name": "رستوران ایتالیایی بونو",
"categories": [
"پیتزا"
]
},
"_explanation": {
"value": 0.023267403,
"description": "product of:",
"details": [
{
"value": 0.046534806,
"description": "sum of:",
"details": [
{
"value": 0.046534806,
"description": "weight(categories:پیتزا in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.046534806,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.15165187,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 0.49421698,
"description": "queryNorm"
}
]
},
{
"value": 0.30685282,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 1,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
}
},
{
"_shard": 3,
"_node": "ZTZ37EpAR1W9e4Qqwk0O5Q",
"_index": "index_for_test",
"_type": "business",
"_id": "1203656733",
"_score": 0.023267403,
"_source": {
"name": "چمن",
"categories": [
"پیتزا"
]
},
"_explanation": {
"value": 0.023267403,
"description": "product of:",
"details": [
{
"value": 0.046534806,
"description": "sum of:",
"details": [
{
"value": 0.046534806,
"description": "weight(categories:پیتزا in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.046534806,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.15165187,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 0.49421698,
"description": "queryNorm"
}
]
},
{
"value": 0.30685282,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 1,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
}
}
]
}
}
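As a worked check of the explain output above, the factors listed for each hit can be multiplied out by hand. The sketch below does this in Python; the function name `tfidf_score` and its parameters are made up for illustration and are not part of any Elasticsearch API. It only reproduces the arithmetic shown in the dump (queryWeight × fieldWeight × coord).

```python
def tfidf_score(idf, query_norm, field_norm, tf=1.0, boost=1.0, coord=1.0):
    """Multiply out the factors that ?explain=true lists for one term match.

    Hypothetical helper: it mirrors the structure of the dump above,
    nothing more.
    """
    query_weight = boost * idf * query_norm  # "queryWeight, product of:"
    field_weight = tf * idf * field_norm     # "fieldWeight in 0, product of:"
    return query_weight * field_weight * coord


# Top hit: the term matched in name, so boost=2 appears inside queryWeight.
name_hit = tfidf_score(
    idf=0.30685282, query_norm=0.8522964, field_norm=0.625, boost=2.0, coord=0.5
)

# Other two hits: the term matched in categories, with no boost entry listed.
categories_hit = tfidf_score(
    idf=0.30685282, query_norm=0.49421698, field_norm=1.0, coord=0.5
)

print(f"name hit:       {name_hit:.8f}")
print(f"categories hit: {categories_hit:.8f}")
```

The two products agree with the reported `_score` values (0.05015693 and 0.023267403) to within rounding, and the factor of 2 for name is visibly part of the first one. Note that queryNorm differs between the hits, which come from different shards, so the individual factors are not directly comparable across hits.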
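Boost is not being ignored... you are just adding a fuzzy component to the score, which changes the overall ordering. If you run the query with ?explain=true, you will get a debug dump of how the score is built.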
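Your first query requires exact matches. Combined with most_fields, the scoring is fairly simple: find the documents that match exactly in the most fields.
Your second query introduces fuzziness with up to two edits. This means that any term within two character edits of the query term will match, which can greatly change the number of matching tokens.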
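To make "within two character edits" concrete, here is a minimal sketch of plain Levenshtein distance in Python. It is an illustration only, not how Elasticsearch implements fuzzy matching internally: by default Elasticsearch also counts a swap of two adjacent characters as a single edit, which this sketch does not model, and the lowercase candidate terms below are made-up examples rather than terms from the index.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def within_fuzziness(term: str, candidate: str, fuzziness: int = 2) -> bool:
    """True if candidate is at most `fuzziness` edits away from term."""
    return levenshtein(term, candidate) <= fuzziness


# Hypothetical indexed terms compared against the query term.
for candidate in ["italian", "italia", "itallian", "indian", "pizza"]:
    print(candidate, levenshtein("italian", candidate),
          within_fuzziness("italian", candidate))
```

Every candidate that passes the check becomes an extra matching term, and each one brings its own term statistics into the score, which is how the ordering can shift even though the field boost is still applied.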
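If you post the explain debug output, I can help analyze it and give you a clearer explanation, but basically the answer is: boosting still works; your scores are just different because of the fuzzy matching.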
Following Zach's suggestion, I changed my query to this to get the results I wanted:
GET /index_for_test/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "Italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ],
"boost":10
}
},
{
"multi_match": {
"query": "Italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ],
"fuzziness":2
}
}
]
}
}
}
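If the same request body has to be built from application code, a small helper can keep the exact and fuzzy clauses in sync. This is only a sketch: the function name `exact_plus_fuzzy_query` and its parameters are hypothetical, and sending the body to the cluster is left out on purpose since it depends on the client in use.

```python
import json


def exact_plus_fuzzy_query(text, fields, exact_boost=10, fuzziness=2):
    """Build a bool/should body with a boosted exact clause and a fuzzy clause.

    Hypothetical helper mirroring the query above; it only builds a dict.
    """
    def clause(**extra):
        return {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": list(fields),
                **extra,
            }
        }

    return {
        "query": {
            "bool": {
                "should": [
                    clause(boost=exact_boost),   # exact matches, weighted up
                    clause(fuzziness=fuzziness)  # fuzzy matches as a fallback
                ]
            }
        }
    }


body = exact_plus_fuzzy_query("Italian", ["name^2", "categories"])
print(json.dumps(body, indent=2))
```

The idea is the same as in the query above: documents that match exactly collect score from both clauses, with the exact clause weighted by `boost`, while documents that only match fuzzily score from the second clause alone.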