Boolean similarity - is there a way to remove duplicates
Given the following index:
PUT /test_index
{
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
},
"field2": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
}
}
}
}
and the following data:
POST /test_index/_bulk?refresh=true
{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}
For the following boolean-similarity query:
POST /test_index/_search
{
"size": 10,
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy":{
"field1":{
"value":"foo",
"fuzziness":"AUTO",
"boost": 1
}
}
},
{
"fuzzy":{
"field2":{
"value":"bar",
"fuzziness":"AUTO",
"boost": 1
}
}
}
]
}
}
}
}
}
I always get ["foo1 foo2 foo3", "bar1 bar2 bar3"] ranked first, even though the index contains an exact match (the first document):
{
"took": 114,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 3.9999998,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "bXw8eXUBCTtfNv84bNPr",
"_score": 3.9999998,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "bHw8eXUBCTtfNv84bNPr",
"_score": 2.6666665,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "a3w8eXUBCTtfNv84bNPr",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
}
]
}
}
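As far as I can tell, this is simply the sum of the per-term boosts: the fuzzy "foo" expands to foo1/foo2/foo3, each one edit away from a 3-character term and therefore weighted at about 0.67, so the third document scores roughly 2 fields * 3 terms * 0.67 ≈ 4.0, the second one 2 * 2 * 0.67 ≈ 2.67, and the exact match only 2 * 1 * 1.0 = 2.0.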
I know that boolean similarity matches as many terms as possible this way, and I know I could use rescoring here, but that is not an option because I don't know how many top-N results to fetch.
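(For reference, a rescore along these lines is what I mean - just a sketch, since window_size is exactly the top-N I would have to guess:)

POST /test_index/_search
{
  "query": {
    "fuzzy": { "field1": { "value": "foo", "fuzziness": "AUTO" } }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": { "term": { "field1": "foo" } },
      "query_weight": 1,
      "rescore_query_weight": 2
    }
  }
}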
Are there any other options here? Perhaps creating my own similarity plugin based on boolean similarity that removes the duplicates and keeps only the best-matching token, but I don't know where to start; I have only found examples of scripting and rescoring.
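The closest starting point I have found is a scripted similarity (the index and similarity names below are just placeholders, and this is untested), but as far as I understand a similarity only controls the per-term score, so it would not stop the bool/fuzzy query from summing over the expanded terms:

PUT /test_index_scripted
{
  "settings": {
    "index": {
      "similarity": {
        "my_bool_sim": {
          "type": "scripted",
          "script": {
            "source": "return query.boost * 1.0;"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "whitespace",
        "similarity": "my_bool_sim"
      }
    }
  }
}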
Update: updating the answer based on the clarification provided in the comments section of my previous answer.
The query below returns the expected results:
{
"min_score": 0.4,
"size":10,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": { --> used for boosting the exact terms
"field1": {
"value": "foo",
"boost": 1.5 --> further boosting the exact match.
}
}
}
]
}
}
}
}
}
and the search results:
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
Another query, without explicitly boosting the exact term, also returns the expected results:
{
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": {
"field1": {
"value": "foo" --> notice there is no boost
}
}
}
]
}
}
}
}
}
and the search results:
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 1.5,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
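This works because of how the clause scores add up (roughly, as I read the fuzzy scoring): for the exact document the fuzzy clause contributes its full boost of 0.5 and the term clause adds 1.0 (or 1.5 when boosted), while for "foo1 foo2 foo3" only the fuzzy clause matches, and each expanded term, being one edit away from a 3-character term, contributes about 0.5 * 0.67 ≈ 0.33, i.e. roughly 1.0 in total. Lowering the fuzzy boost and adding an exact term clause is what pushes the exact match back to the top, and min_score: 0.4 still keeps all three documents.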