How to make Elasticsearch scoring take field-length into account
I created a very simple test index containing the following 5 entries:
{ "tags": [ { "topics": "music festival dance techno germany"} ]}
{ "tags": [ { "topics": "music festival dance techno"} ]}
{ "tags": [ { "topics": "music festival dance"} ]}
{ "tags": [ { "topics": "music festival"} ]}
{ "tags": [ { "topics": "music"} ]}
Then I ran the following query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "tags.topics": "music festival" } }
      ]
    }
  }
}
I expected to get the following order in the results:
1) "music festival"
2) "music festival dance"
3) "music festival dance techno"
4) "music festival dance techno germany"
5) "music"
taking field-length normalization into account.
However, this is what I got:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance techno germany"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "4",
        "_score": 0.42221835,
        "_source": {
          "tags": [
            {
              "topics": "music festival"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "2",
        "_score": 0.32088596,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance techno"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            {
              "topics": "music"
            }
          ]
        }
      }
    ]
  }
}
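The order looks completely random, except that the lowest score belongs to the entry that matched only one term.
What causes this, and what can I change (in the mapping, at index time, or at search time) to get the expected order?
Note: the same applies to queries that are not a perfect match. A search for "music dance" should still return the 3-word entry as the first result, so using term queries or boosting them seems to be out of the question.
As I have described elsewhere, scoring/relevance is not the easiest topic in Elasticsearch.
I have been trying to work out a solution for you, and so far I have something like the following.
Documents: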
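To make one source of that complexity concrete, here is a minimal Python sketch that recomputes the scores from the question by hand. It rests on assumptions that are not stated in the thread: the default BM25 similarity of Elasticsearch 6.x (k1 = 1.2, b = 0.75), whitespace tokenization, and a hypothetical shard layout in which most documents sit alone on one of the 5 shards while "music festival" and "music festival dance techno" share one. Under those assumptions each shard uses its own document count and average field length, which would explain the seemingly random order. The function name `bm25_scores` is made up for this illustration.

```python
import math

K1, B = 1.2, 0.75  # default BM25 parameters (assumed, not confirmed by the thread)

def bm25_scores(shard_docs, query_terms):
    """Score each document of one shard using only that shard's statistics."""
    docs = [text.split() for text in shard_docs]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs) / n_docs
    scores = []
    for tokens in docs:
        total = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue  # term absent: contributes nothing
            df = sum(1 for other in docs if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * len(tokens) / avg_len))
            total += idf * norm
        scores.append(total)
    return scores

query = ["music", "festival"]

# Hypothetical shard layout: these two documents are alone on their shards.
alone_long = bm25_scores(["music festival dance techno germany"], query)  # ~0.5754
alone_one = bm25_scores(["music"], query)                                 # ~0.2877
# Hypothetical shard layout: these two documents share a shard.
shared = bm25_scores(["music festival", "music festival dance techno"], query)
# shared is roughly [0.4222, 0.3209]

# The same five documents scored with one common set of statistics.
all_docs = [
    "music festival dance techno germany",
    "music festival dance techno",
    "music festival dance",
    "music festival",
    "music",
]
global_scores = bm25_scores(all_docs, query)
ranking = [doc for _, doc in sorted(zip(global_scores, all_docs), reverse=True)]
for doc in ranking:
    print(doc)
```

With shared statistics the sketch ranks the documents in exactly the order the question expects. If per-shard statistics are indeed the cause, creating the test index with a single primary shard (`index.number_of_shards: 1`) or searching with `search_type=dfs_query_then_fetch` should give consistent scores without changing the query.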
{ "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"]} ], "topics_count": 5 }
{ "tags": [ { "topics": ["music", "festival", "dance", "techno"]} ], "topics_count": 4 }
{ "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
{ "tags": [ { "topics": ["music", "festival"]} ], "topics_count": 2 }
{ "tags": [ { "topics": ["music"]} ], "topics_count": 1 }
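The documents above can be derived from the original space-separated strings. A minimal sketch, assuming whitespace splitting is good enough and that each document has a single `tags` entry; the helper name `to_topic_doc` is hypothetical:

```python
import json

def to_topic_doc(topics_text):
    """Turn a space-separated topics string into the array-plus-count form."""
    topics = topics_text.split()
    return {"tags": [{"topics": topics}], "topics_count": len(topics)}

originals = [
    "music festival dance techno germany",
    "music festival dance techno",
    "music festival dance",
    "music festival",
    "music",
]
converted = [to_topic_doc(text) for text in originals]

# Print each document as one JSON line, ready to be indexed.
for doc in converted:
    print(json.dumps(doc))
```

Storing `topics_count` at index time keeps the length available to scripts as a doc value, which is what the scripts in the query rely on.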
And the query:
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "params.num_terms"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        },
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "doc['topics_count'].value"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        }
      ]
    }
  }
}
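It is not perfect and still needs some improvement. In this example it works well for ["music", "festival"] and ["music", "dance"] (tested on ES 6.2), but I suspect it will not behave 100% as you expect for other queries, mainly because of the complexity of relevance/scoring. Still, you can now read more about the features I used and try to improve on it.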
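To reason about which documents each `should` clause of the query can match, the two `minimum_should_match_script` conditions and the length factor can be modeled in plain Python. This is only a sketch of the matching logic as I read it, not of the scores Elasticsearch actually produces; the function names are hypothetical and the query is assumed to contain no duplicate terms.

```python
import math

def clause_matches(doc_topics, query_terms):
    """Return (first clause matches, second clause matches) for one document."""
    matched = sum(1 for term in query_terms if term in doc_topics)
    # First clause: params.num_terms, so every query term must be present.
    all_query_terms = matched >= len(query_terms)
    # Second clause: doc['topics_count'].value, so every topic of the
    # document must appear among the query terms.
    all_doc_topics = matched >= len(doc_topics)
    return all_query_terms, all_doc_topics

def length_factor(topics_count):
    """Multiplier applied by the script: sqrt(1 / topics_count)."""
    return math.sqrt(1.0 / topics_count)

docs = [
    ["music", "festival", "dance", "techno", "germany"],
    ["music", "festival", "dance", "techno"],
    ["music", "festival", "dance"],
    ["music", "festival"],
    ["music"],
]

for query in (["music", "festival"], ["music", "dance"]):
    print(query)
    for topics in docs:
        print("  ", topics, clause_matches(topics, query), length_factor(len(topics)))
```

In this model, a document that matches neither clause (for example ["music", "festival"] for the query ["music", "dance"]) is not returned at all, which is one of the places where the behavior may differ from what you expect.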