Really huge query or optimizing an elasticsearch update

Question

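I'm working on document visualization, doing binary classification on a large number of documents (about 150,000). The challenge is how to present general visual information to the end user so that they can get an idea of the main "concepts" in each category (positive/negative). Since each document has a set of associated topics, I thought about asking Elasticsearch, via an aggregation, for the top 20 topics of the positively classified documents, and then doing the same for the negative ones.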

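I wrote a Python script that downloads the data from Elasticsearch and classifies the documents, but the problem is that the predictions are not stored in Elasticsearch, so I can't ask for the top 20 topics of a particular category. First, I considered creating a query in Elasticsearch that requests the aggregation and passes in a match.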

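Since I have the IDs of the positive/negative documents, I can write a query that retrieves the topic aggregation, but the query would have to list a huge number of document IDs to select, for example, only the positive documents. That is not possible, because the endpoint has a size limit and I can't pass 50,000 IDs like this: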

"query": {
    "bool": {
      "should": [
           {"match": {"id_str": "939490553510748161"}},
           {"match": {"id_str": "939496983510742348"}}
           ...
        ],
      "minimum_should_match" : 1
    }
},
"aggs" : { ... }

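So I tried storing the predicted category in the Elasticsearch index instead, but because the number of documents is so large, it takes half an hour (compared with less than a minute to run the classification). That is a lot of time just to store the predictions, and afterwards I still need to query the index to get the right data for the visualization. To update the documents, I use: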

for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )

Do you know of a faster way to update the predictions?

Answer 1

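You can use a bulk query, which lets you serialize your requests and send them to Elasticsearch in a single call instead of issuing a large number of individual requests. Try: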

from elasticsearch import Elasticsearch, helpers

query_list = []
list_ids = ["1", "2", "3"]
es = Elasticsearch("myurl")
for id in list_ids:
    # one update action per document; the bulk helper sends them together
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)

helpers.bulk(client=es, actions=query_list)
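
With around 150,000 documents, it may be preferable not to hold every action in a list at once. The following is a minimal sketch of the same idea using a generator. The function name `build_update_actions` and the index name `my_index` are hypothetical and chosen only for illustration; the `prediction` field comes from the question. It only builds the action dictionaries and does not contact Elasticsearch.

```python
def build_update_actions(doc_ids, index, category, doc_type=None):
    """Yield one bulk 'update' action per document ID (hypothetical helper)."""
    for doc_id in doc_ids:
        action = {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "doc": {"prediction": category},
        }
        # Only include a mapping type if the cluster still uses them (assumption).
        if doc_type is not None:
            action["_type"] = doc_type
        yield action


# Inspect the actions generated for three placeholder IDs.
actions = list(build_update_actions(["1", "2", "3"], "my_index", "positive"))
print(len(actions))
print(actions[0])
```

The generator could then be passed directly to the helper, for example `helpers.bulk(es, build_update_actions(ids, index, category), chunk_size=1000)`, where `chunk_size` sets how many actions go into each request. The best value likely depends on document size and cluster capacity.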

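Please read more here. As for querying a list of IDs: to get a faster response, you should not match on the id_str values as you do in the question, but use the _id field instead. That lets you use a multi-get query, which is the bulk equivalent of the get operation and is available in the Python library (see here). Try: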

my_ids_list = ["1", "2", "3"]  # replace with your own document IDs
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})
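
Once the `prediction` field has been stored, the top-20 topics per category described in the question could be requested with a single filtered terms aggregation, without passing any ID list. This is only a sketch under assumptions: the field name `topics`, the aggregation name `top_topics`, and the function name are hypothetical, and the `term` filter assumes `prediction` is indexed so that exact values match. The code only builds the request body.

```python
import json


def top_topics_query(category, topic_field="topics", size=20):
    """Build a search body for the most frequent topics in one category."""
    return {
        "size": 0,  # return aggregation results only, no document hits
        "query": {"term": {"prediction": category}},
        "aggs": {
            "top_topics": {
                "terms": {"field": topic_field, "size": size}
            }
        },
    }


body = top_topics_query("positive")
print(json.dumps(body, indent=2))
```

The resulting body could then be sent with something like `es.search(index=kwargs["index"], body=body)`, once for "positive" and once for "negative".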

elasticsearch

elasticsearch-py