ElasticSearch terms and cardinality performance with high cardinality fields

Question

TL;DR

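My ElasticSearch query performs poorly compared to the same query on an SQL server.
Am I doing something wrong? Is there any way to improve my query performance?
Is this just one of those things an RDBMS does better than NoSQL?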

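Premise

Let's say I have a business that takes orders and delivers the requested items.

I would like to know the average number of unique items per order.
My order data is arranged per ordered item: each order has one or more records containing the order ID, item ID, and so on.
I have a single-node setup for development purposes.
The results (performance-wise) are the same whether I have 4 GB of heap space (on a 12 GB machine) or 16 GB of heap space (on a 32 GB machine).
The index holds billions of records, but the query filters it down to about 300,000 records.
The order and item IDs are of type keyword (textual by nature), and I cannot change that.
In this particular case the average unique item count is 1.65: many orders contain only one unique item, others contain 2, and a few contain up to 25 unique items.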

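The problem

With ElasticSearch I have to use a Terms Aggregation to group the documents by order ID, a Cardinality Aggregation to get the unique item count, and an Average Bucket aggregation to get the average item count per order.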
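To make the intended result concrete, here is a minimal in-memory sketch of what the three aggregations compute together. The sample rows and the function name are hypothetical, and the sketch counts distinct items exactly with sets, whereas the Cardinality Aggregation returns an approximate count.

```python
from collections import defaultdict

# Hypothetical sample rows: one record per ordered item, as described above.
records = [
    {"orderID": "A1", "itemID": "X"},
    {"orderID": "A1", "itemID": "X"},  # same item ordered twice, counted once
    {"orderID": "A1", "itemID": "Y"},
    {"orderID": "B2", "itemID": "X"},
    {"orderID": "C3", "itemID": "Z"},
]


def average_unique_items(rows):
    """Average number of distinct itemIDs per orderID, or None if there are no rows."""
    # Step 1 (terms aggregation): group the rows by order ID.
    items_by_order = defaultdict(set)
    for row in rows:
        # Step 2 (cardinality aggregation): a set keeps only distinct item IDs.
        items_by_order[row["orderID"]].add(row["itemID"])
    if not items_by_order:
        return None
    # Step 3 (avg_bucket): average the per-order distinct counts.
    counts = [len(items) for items in items_by_order.values()]
    return sum(counts) / len(counts)


average = average_unique_items(records)
print(average)
```

Here order A1 has two distinct items and the other two orders have one each, so the average is 4/3.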

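This takes about 23 seconds on both of my setups. The same query on the same dataset takes less than 2 seconds on SQL server.

Additional information

ElasticSearch query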

{
   "size":0,
   "query":{
      "bool":{
         "filter":[
            {
               ...
            }
         ]
      }
   },
   "aggs":{
      "OrdersBucket":{
         "terms":{
            "field":"orderID",
            "execution_hint":"global_ordinals_hash",
            "size":10000000
         },
         "aggs":{
            "UniqueItems":{
               "cardinality":{
                  "field":"itemID"
               }
            }
         }
      },
      "AverageItemCount":{
         "avg_bucket":{
            "buckets_path":"OrdersBucket>UniqueItems"
         }
      }
   }
}

At first my query produced an OutOfMemoryException, which brought my server down.
Issuing the same request on my higher-RAM setup tripped the following circuit breaker:

[request] Data too large, data for [<reused_arrays>] would be
[14383258184/13.3gb], which is larger than the limit of
[10287002419/9.5gb]

The ElasticSearch GitHub repository has several (currently) open issues on this matter:

Cardinality aggregation should not reserve a fixed amount of memory per bucket #15892

global_ordinals execution mode for the terms aggregation has an adversarially impact on children aggregations that expect dense buckets #24788

Heap Explosion on even small cardinality queries in ES 5.3.1 / Kibana 5.3.1 #24359

All of this led me to use the execution hint "global_ordinals_hash", which allowed the query to complete successfully (albeit slowly).
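The first issue listed above (#15892) suggests why memory grows so quickly: the cardinality aggregation reserves memory for every bucket of the parent terms aggregation. The following back-of-the-envelope sketch rests on assumptions: it uses the rule of thumb from the Elasticsearch documentation of roughly precision_threshold * 8 bytes per counter, the documented default precision_threshold of 3000, and 300,000 buckets, which is only an upper bound taken from the filtered record count. The function name is hypothetical, and the real footprint also depends on how ElasticSearch allocates its arrays, so treat the figures as orders of magnitude only.

```python
def estimated_cardinality_memory_bytes(bucket_count, precision_threshold=3000):
    """Rough estimate: one counter of about precision_threshold * 8 bytes per bucket.

    Assumption: this mirrors the rule of thumb in the Elasticsearch docs,
    not the exact allocation performed by the server.
    """
    return bucket_count * precision_threshold * 8


# Upper bound on buckets: every one of the ~300,000 filtered records is its own order.
default_bytes = estimated_cardinality_memory_bytes(300_000)
lowered_bytes = estimated_cardinality_memory_bytes(300_000, precision_threshold=100)

print(f"default threshold: {default_bytes / 1e9:.2f} GB")
print(f"threshold of 100:  {lowered_bytes / 1e9:.2f} GB")
```

Under these assumptions the default threshold costs several gigabytes for counters that, in this dataset, never need to hold more than 25 distinct values each.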

Analogous SQL query

SELECT AVG(CAST(uniqueCount.amount AS FLOAT)) FROM 
(   SELECT o.OrderID, COUNT(DISTINCT o.ItemID) AS amount 
    FROM Orders o
    WHERE ...
    GROUP BY o.OrderID 
) uniqueCount

As I said, this is very, very fast.

orderID field mapping

{
   "orderID":{
      "full_name":"orderID",
      "mapping":{
         "orderID":{
            "type":"keyword",
            "boost":1,
            "index":true,
            "store":false,
            "doc_values":true,
            "term_vector":"no",
            "norms":false,
            "index_options":"docs",
            "eager_global_ordinals":true,
            "similarity":"BM25",
            "fields":{
               "autocomplete":{
                  "type":"text",
                  "boost":1,
                  "index":true,
                  "store":false,
                  "doc_values":false,
                  "term_vector":"no",
                  "norms":true,
                  "index_options":"positions",
                  "eager_global_ordinals":false,
                  "similarity":"BM25",
                  "analyzer":"autocomplete",
                  "search_analyzer":"standard",
                  "search_quote_analyzer":"standard",
                  "include_in_all":true,
                  "position_increment_gap":-1,
                  "fielddata":false
               }
            },
            "null_value":null,
            "include_in_all":true,
            "ignore_above":2147483647,
            "normalizer":null
         }
      }
   }
}

I set eager_global_ordinals to try to improve performance, but to no avail.

Sample document

{
            "_index": "81cec0acbca6423aa3c2feed5dbccd98",
            "_type": "order",
            "_id": "AVwpLZ7GK9DJVcpvrzss",
            "_score": 0,
            "_source": {
        ...
               "orderID": "904044A",
               "itemID": "23KN",
        ...
            }
}

Irrelevant fields were trimmed for brevity, along with content that cannot be disclosed.

Sample output

{
   "OrdersBucket":{
      "doc_count_error_upper_bound":0,
      "sum_other_doc_count":0,
      "buckets":[
         {
            "key":"910117A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
         {
            "key":"910966A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
        ...
         {
            "key":"912815A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         },
         {
            "key":"912816A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         }
      ]
   },
   "AverageItemCount":{
      "value":1.3975020363833832
   }
}

Any help would be much appreciated :)

Answer 1

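Apparently SQL server does a good job of caching those results.
Further investigation showed that the initial (uncached) SQL query takes the same time as the ElasticSearch query.

I will look into why those results are not being cached properly by ElasticSearch.

I also managed to convert the order ID into an integer, which improved performance dramatically (though SQL server saw the same gain).

Also, as advised by Mark Harwood on the Elastic Forum, specifying precision_threshold on the cardinality aggregation lowered memory consumption a lot!

So the answer is that, for this particular kind of query, ES performs at least as well as SQL server.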
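As a sketch of the precision_threshold advice, the request body from the question could be built like this with that one parameter added. The answer does not say which value was used, so the 100 here is an assumption, chosen only because it sits above the 25 unique items per order mentioned in the question. The filter clause is elided in the question, so it is left as a caller-supplied argument, and the "status" filter in the usage line is purely hypothetical.

```python
import json


def build_query(filters, precision_threshold=100, max_orders=10000000):
    """Build the question's aggregation request with precision_threshold set.

    `filters` is the list of filter clauses (elided in the question).
    `precision_threshold=100` is an assumed value, not one given in the answer.
    """
    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "OrdersBucket": {
                "terms": {
                    "field": "orderID",
                    "execution_hint": "global_ordinals_hash",
                    "size": max_orders,
                },
                "aggs": {
                    "UniqueItems": {
                        "cardinality": {
                            "field": "itemID",
                            # Counts below this threshold are expected to be
                            # close to exact; lower values use less memory.
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            },
            "AverageItemCount": {
                "avg_bucket": {"buckets_path": "OrdersBucket>UniqueItems"}
            },
        },
    }


# Hypothetical filter clause, for illustration only.
body = build_query(filters=[{"term": {"status": "delivered"}}])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index.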


performance

nosql

query-performance

elasticsearch