totalEstimatedMatches behavior with Microsoft (Bing) Cognitive Search API (v5)

Question

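I recently converted some Bing Search API v2 code to v5. It works, but I am curious about the behavior of "totalEstimatedMatches". Here is an example to illustrate my question:

A user on our site searches for a particular word. The API query returns 10 results (our page size setting) and totalEstimatedMatches is set to 21. We therefore indicate 3 pages of results and let the user page through them.

When they get to page 3, totalEstimatedMatches returns 22 rather than 21. It seems odd that, for such a small result set, it would not already know the total is 22, but okay, I can live with that. All results are displayed correctly.

Now if the user goes back from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This surprises me a little, because once the result set has been paged through, the API probably ought to know that there are 22 results and not 21.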

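I have been a professional software developer since the 80s, so I understand that this is one of those detail issues tied to API design. Evidently it is not caching the exact number of results, or anything else. I just don't remember that kind of behavior from the V2 search API (which I realize was third-party code). It was pretty reliable about the number of results.

Does this strike anyone besides me as a little unexpected?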

Answer 1

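It turns out that this is why the response JSON field totalEstimatedMatches contains the word ...Estimated... and is not simply called totalMatches:

"...search engine index does not support an accurate estimation of total match."

Taken from: News Search API V5 paging results with offset and count

As one might expect, the fewer results you get back, the larger the percentage error you are likely to see in the totalEstimatedMatches value. Likewise, the more complex your query is (for example, a compound query such as ../search?q=(foo OR bar OR foobar)&... is effectively 3 searches packed into 1), the more variation this value seems to show.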
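
Since the estimate can move up and down between requests for the same query, one client-side option is to remember the largest value seen so far, so the pager never shrinks while the user moves back and forth. The sketch below is only an illustration of that idea: EstimateTracker and its methods are hypothetical names of mine, not part of the Bing API, and the numbers are the ones from the question.

```python
import math


class EstimateTracker:
    """Hypothetical helper: remember the largest totalEstimatedMatches per query."""

    def __init__(self, page_size=10):
        self.page_size = page_size
        self._highest = {}  # query -> largest estimate seen so far

    def update(self, query, total_estimated_matches):
        """Record a freshly returned estimate and return the one to display."""
        best = max(self._highest.get(query, 0), total_estimated_matches)
        self._highest[query] = best
        return best

    def page_count(self, query):
        """Number of pages implied by the largest estimate seen for this query."""
        return math.ceil(self._highest.get(query, 0) / self.page_size)


tracker = EstimateTracker(page_size=10)
# Estimates from the question: page 1 -> 21, page 3 -> 22, back to page 2 -> 21.
for estimate in (21, 22, 21):
    shown = tracker.update("some term", estimate)
print(shown, tracker.page_count("some term"))  # 22 3
```

This does not make the number accurate; it only keeps the displayed total from jumping backwards.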

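That said, I have managed (at least tentatively) to compensate for this by setting offset == totalEstimatedMatches and writing a simple equality check function.

Here is a simple example in Python: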

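# set_new_offset_and_call_api() is a func that does what it says: it sets
# offset to the latest estimate, calls the API and returns the new estimate.
while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        # The estimate grew, so keep it (plain assignment; ints have no .copy()).
        original_totalEstimatedMatches = new_totalEstimatedMatches
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        # The estimate stopped growing, so treat it as settled.
        break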
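
For anyone who wants to try the loop above without hitting the real service, here is a self-contained sketch of the same idea. The names settle_estimate, fetch_estimate and fake_fetch_estimate are hypothetical, and the fake function only imitates the 21/22 behavior described in the question; it is not how the API is documented to behave.

```python
def settle_estimate(fetch_estimate, first_estimate, max_rounds=20):
    """Re-query at offset == current estimate until the estimate stops growing.

    fetch_estimate(offset) is a hypothetical callable that returns the
    totalEstimatedMatches reported by a request made at that offset.
    max_rounds is a safety cap so a misbehaving source cannot loop forever.
    """
    current = first_estimate
    for _ in range(max_rounds):
        newer = fetch_estimate(current)
        if newer <= current:
            break  # no growth: accept the current estimate
        current = newer
    return current


def fake_fetch_estimate(offset):
    """Stand-in for the API: reports 21 for early offsets, 22 from offset 20 on."""
    return 22 if offset >= 20 else 21


print(settle_estimate(fake_fetch_estimate, first_estimate=21))  # 22
```

In real use, fetch_estimate would wrap whatever HTTP call you already make and read totalEstimatedMatches from the JSON response.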

Answer 2

After revisiting the API, I came up with a way to paginate efficiently without using the "totalEstimatedMatches" return value:

class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)  # <== abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:
            # Nothing new: we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True
        else:
            # Advance by the number of results in this response.
            self.offset += len(resp_urls)

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...<call_logic>...
            self.calc_next_offset(new_resp_urls)
            ...<save logic>...
        print(f'All unique results for q={self.q} have been obtained.')

This ^ stops paging as soon as a response made up entirely of duplicates comes back.
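
Because the class above leaves the call and save logic elided, here is a separate, runnable sketch of the same stop-on-duplicates idea against a fake data source. collect_unique_urls, fetch_page and fake_fetch_page are hypothetical names, and the fake source assumes that a request past the end just returns the last page again, which is an assumption for the demo and not documented API behavior.

```python
def collect_unique_urls(fetch_page, page_size=10, max_pages=100):
    """Page until a response adds no URL that has not been seen before.

    fetch_page(offset, count) is a hypothetical callable returning a list of
    result URLs; max_pages is a safety cap on the number of requests.
    """
    seen = set()
    ordered = []
    offset = 0
    for _ in range(max_pages):
        urls = fetch_page(offset, page_size)
        added = 0
        for url in urls:
            if url not in seen:
                seen.add(url)
                ordered.append(url)
                added += 1
        if added == 0:
            break  # empty or fully duplicate response: stop paging
        offset += len(urls)
    return ordered


FAKE_RESULTS = [f"https://example.com/result/{i}" for i in range(22)]


def fake_fetch_page(offset, count):
    """Stand-in for the API: past the end it repeats the last page (assumption)."""
    page = FAKE_RESULTS[offset:offset + count]
    return page if page else FAKE_RESULTS[-count:]


urls = collect_unique_urls(fake_fetch_page)
print(len(urls))  # 22
```

Storing the URLs themselves instead of their hashes costs a little more memory but keeps the results in order and avoids any chance of hash collisions.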
