Boolean similarity: is there a way to remove duplicates?

Given the following index:

PUT /test_index
{
    "mappings": {
        "properties": {
            "field1": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            },
            "field2": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            }
        }
    }
}

and the following data:

POST /test_index/_bulk?refresh=true
{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}

For this query against the boolean-similarity fields:

POST /test_index/_search
{
    "size": 10,
    "min_score": 0.4,
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "fuzzy": {
                                "field1": {
                                    "value": "foo",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        },
                        {
                            "fuzzy": {
                                "field2": {
                                    "value": "bar",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

I always get ["foo1 foo2 foo3", "bar1 bar2 bar3"] as the top hit, even though the index contains an exact match (the first document):

{
    "took": 114,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 3.9999998,
        "hits": [
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "bXw8eXUBCTtfNv84bNPr",
                "_score": 3.9999998,
                "_source": {
                    "field1": "foo1 foo2 foo3",
                    "field2": "bar1 bar2 bar3"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "bHw8eXUBCTtfNv84bNPr",
                "_score": 2.6666665,
                "_source": {
                    "field1": "foo1 foo2",
                    "field2": "bar1 bar2"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "a3w8eXUBCTtfNv84bNPr",
                "_score": 2.0,
                "_source": {
                    "field1": "foo",
                    "field2": "bar"
                }
            }
        ]
    }
}

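I know that boolean similarity works this way, rewarding as many matching terms as possible, and I know I could rescore here, but that is not an option because I don't know how many top-N results to fetch.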
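To make the ranking above concrete, here is a minimal Python sketch of the scoring. It rests on assumptions that are not stated in the original post: with boolean similarity every matching term contributes its query boost, the fuzzy query expands to every indexed token within the allowed edit distance, and each expanded token is weighted by 1 - edits / min(query length, token length). That model is consistent with the scores in the response above, but it is only an approximation of what Elasticsearch does internally. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance. Elasticsearch also counts transpositions
    by default, which makes no difference for the tokens used here."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Fuzziness AUTO: 0 edits for 1-2 characters, 1 for 3-5, 2 for longer terms."""
    n = len(term)
    return 0 if n < 3 else 1 if n < 6 else 2


def fuzzy_field_score(query_term: str, field_text: str, boost: float = 1.0) -> float:
    """Assumed model: sum one weighted boost per distinct matching token."""
    max_edits = auto_max_edits(query_term)
    score = 0.0
    # The whitespace analyzer splits on spaces; boolean similarity ignores
    # term frequency, so each distinct token is counted once.
    for token in set(field_text.split()):
        d = edit_distance(query_term, token)
        if d <= max_edits:
            score += boost * (1 - d / min(len(query_term), len(token)))
    return score


docs = [
    {"field1": "foo", "field2": "bar"},
    {"field1": "foo1 foo2", "field2": "bar1 bar2"},
    {"field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"},
]
query = {"field1": "foo", "field2": "bar"}

# One score per document, in the same order as docs.
scores = [
    sum(fuzzy_field_score(term, doc[field]) for field, term in query.items())
    for doc in docs
]

for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(round(score, 4), doc)
```

Under this model each near-duplicate token (foo1, foo2, foo3) adds another 2/3 to the field's score, so the document with three near matches per field outscores the one with a single exact match per field.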

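Are there any other options? Perhaps I could write my own similarity plugin, based on boolean similarity, that removes the duplicates and keeps only the best-matching token, but I don't know where to start; I have only seen examples for scripts and rescoring.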

Update: I have revised the answer based on the clarification provided in the comments on my previous answer.

The query below returns the expected results. The added term clause is used to boost exact terms, and its boost of 1.5 raises the exact match further:

{
    "min_score": 0.4,
    "size": 10,
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "fuzzy": {
                                "field1": {
                                    "value": "foo",
                                    "fuzziness": "AUTO",
                                    "boost": 0.5
                                }
                            }
                        },
                        {
                            "term": {
                                "field1": {
                                    "value": "foo",
                                    "boost": 1.5
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

and the search results:

"hits": [
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "zdMEvHUBlo4-1mHbtvNH",
                "_score": 2.0,
                "_source": {
                    "field1": "foo",
                    "field2": "bar"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "z9MEvHUBlo4-1mHbtvNH",
                "_score": 0.99999994,
                "_source": {
                    "field1": "foo1 foo2 foo3",
                    "field2": "bar1 bar2 bar3"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "ztMEvHUBlo4-1mHbtvNH",
                "_score": 0.6666666,
                "_source": {
                    "field1": "foo1 foo2",
                    "field2": "bar1 bar2"
                }
            }
        ]
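The scores in this result can be checked with a little arithmetic. The sketch below assumes (this is not stated in the answer) that with boolean similarity each matching clause contributes its boost, and that a one-edit fuzzy match on a three-character term is weighted 2/3, which is consistent with the numbers shown. The function name and its parameters are hypothetical.

```python
from fractions import Fraction

# Assumed weight of a token that is one edit away from a 3-character query term.
ONE_EDIT_WEIGHT = Fraction(2, 3)


def field1_score(exact: bool, near_tokens: int,
                 fuzzy_boost: Fraction, term_boost: Fraction) -> Fraction:
    """Score of one document for the fuzzy + term should-pair on field1.

    exact       -- the field contains the query term itself
    near_tokens -- number of distinct tokens one edit away from the query term
    """
    fuzzy = fuzzy_boost * (int(exact) + near_tokens * ONE_EDIT_WEIGHT)
    term = term_boost if exact else Fraction(0)
    return fuzzy + term


FUZZY, TERM = Fraction("0.5"), Fraction("1.5")

exact_doc = field1_score(True, 0, FUZZY, TERM)     # "foo"
two_near = field1_score(False, 2, FUZZY, TERM)     # "foo1 foo2"
three_near = field1_score(False, 3, FUZZY, TERM)   # "foo1 foo2 foo3"

print(float(exact_doc), float(three_near), float(two_near))
```

Under the same assumptions, the fuzzy contribution still grows with the number of distinct near matches, so the boosts only reorder the results; they do not deduplicate them. A field holding seven or more distinct one-edit tokens would again outscore the exact match, so the boost values may need tuning against real data.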

Another query, which does not explicitly boost the exact term, also returns the expected results (note that the term clause carries no boost):

{
    "min_score": 0.4,
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "fuzzy": {
                                "field1": {
                                    "value": "foo",
                                    "fuzziness": "AUTO",
                                    "boost": 0.5
                                }
                            }
                        },
                        {
                            "term": {
                                "field1": {
                                    "value": "foo"
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

and the search results:

"hits": [
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "zdMEvHUBlo4-1mHbtvNH",
                "_score": 1.5,
                "_source": {
                    "field1": "foo",
                    "field2": "bar"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "z9MEvHUBlo4-1mHbtvNH",
                "_score": 0.99999994,
                "_source": {
                    "field1": "foo1 foo2 foo3",
                    "field2": "bar1 bar2 bar3"
                }
            },
            {
                "_index": "test_index",
                "_type": "_doc",
                "_id": "ztMEvHUBlo4-1mHbtvNH",
                "_score": 0.6666666,
                "_source": {
                    "field1": "foo1 foo2",
                    "field2": "bar1 bar2"
                }
            }
        ]
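The question searched both field1 and field2, while the queries above only cover field1. As a sketch, and assuming the same fuzzy-plus-term pattern simply repeats per field (the answer does not show this), the request body could be generated like this. The helper names are hypothetical, and the boosts are the ones used in the first query above.

```python
import json


def exact_preferring_clauses(field, value, fuzzy_boost=0.5, term_boost=1.5):
    """One down-weighted fuzzy clause plus one boosted exact term clause."""
    return [
        {"fuzzy": {field: {"value": value, "fuzziness": "AUTO", "boost": fuzzy_boost}}},
        {"term": {field: {"value": value, "boost": term_boost}}},
    ]


def build_search_body(terms, min_score=0.4, size=10):
    """Build a search body with the clause pair repeated for every field."""
    should = []
    for field, value in terms.items():
        should.extend(exact_preferring_clauses(field, value))
    return {
        "min_score": min_score,
        "size": size,
        # The function_score wrapper mirrors the structure of the queries above.
        "query": {"function_score": {"query": {"bool": {"should": should}}}},
    }


body = build_search_body({"field1": "foo", "field2": "bar"})
print(json.dumps(body, indent=4))
```

The printed JSON can be sent as the body of POST /test_index/_search.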