Control number of buckets created in an aggregation

In Elasticsearch, there is a limit on the number of buckets an aggregation can create. If an aggregation creates more buckets than the configured limit, you get a warning message in ES 6.x, and later versions throw an error.

The warning message reads:

This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

As of ES 7.x, the limit defaults to 10000, but it can be adjusted.
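The limit can be changed at runtime through the cluster settings API; for example (20000 here is just an illustrative value):

```json
PUT /_cluster/settings
{
   "persistent":{
      "search.max_buckets":20000
   }
}
```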

The problem is that I cannot actually calculate (or even estimate) how many buckets an aggregation will create.

Consider the following request:

GET /zone_stats_hourly/_search
{
   "aggs":{
      "apps":{
         "terms":{
            "field":"appId",
            "size":<NUM_TERM_BUCKETS>,
            "min_doc_count":1,
            "shard_min_doc_count":0,
            "show_term_doc_count_error":false,
            "order":[
               {
                  "_count":"desc"
               },
               {
                  "_key":"asc"
               }
            ]
         },
         "aggregations":{
            "histogram":{
               "date_histogram":{
                  "field":"processTime",
                  "time_zone":"UTC",
                  "interval":"1d",
                  "offset":0,
                  "order":{
                     "_key":"asc"
                  },
                  "keyed":false,
                  "min_doc_count":0
               },
               "aggregations":{
                  "requests":{
                     "sum":{
                        "field":"requests"
                     }
                  },
                  "filled":{
                     "sum":{
                        "field":"filledRequests"
                     }
                  },
                  "matched":{
                     "sum":{
                        "field":"matchedRequests"
                     }
                  },
                  "imp":{
                     "sum":{
                        "field":"impressions"
                     }
                  },
                  "cv":{
                     "sum":{
                        "field":"completeViews"
                     }
                  },
                  "clicks":{
                     "sum":{
                        "field":"clicks"
                     }
                  },
                  "installs":{
                     "sum":{
                        "field":"installs"
                     }
                  },
                  "actions":{
                     "sum":{
                        "field":"actions"
                     }
                  },
                  "earningsIRT":{
                     "sum":{
                        "field":"earnings.inIRT"
                     }
                  },
                  "earningsUSD":{
                     "sum":{
                        "field":"earnings.inUSD"
                     }
                  },
                  "earningsEUR":{
                     "sum":{
                        "field":"earnings.inEUR"
                     }
                  },
                  "dealBasedEarnings":{
                     "nested":{
                        "path":"dealBasedEarnings"
                     },
                     "aggregations":{
                        "types":{
                           "terms":{
                              "field":"dealBasedEarnings.type",
                              "size":4,
                              "min_doc_count":1,
                              "shard_min_doc_count":0,
                              "show_term_doc_count_error":false,
                              "order":[
                                 {
                                    "_count":"desc"
                                 },
                                 {
                                    "_key":"asc"
                                 }
                              ]
                           },
                           "aggregations":{
                              "dealBasedEarningsIRT":{
                                 "sum":{
                                    "field":"dealBasedEarnings.amount.inIRT"
                                 }
                              },
                              "dealBasedEarningsUSD":{
                                 "sum":{
                                    "field":"dealBasedEarnings.amount.inUSD"
                                 }
                              },
                              "dealBasedEarningsEUR":{
                                 "sum":{
                                    "field":"dealBasedEarnings.amount.inEUR"
                                 }
                              }
                           }
                        }
                     }
                  }
               }
            }
         }
      }
   },
   "size":0,
   "_source":{
      "excludes":[]
   },
   "stored_fields":["*"],
   "docvalue_fields":[
      {
         "field":"eventTime",
         "format":"date_time"
      },
      {
         "field":"processTime",
         "format":"date_time"
      },
      {
         "field":"postBack.time",
         "format":"date_time"
      }
   ],
   "query":{
      "bool":{
         "must":[
            {
               "range":{
                  "processTime":{
                     "from":1565049600000,
                     "to":1565136000000,
                     "include_lower":true,
                     "include_upper":false,
                     "boost":1.0
                  }
               }
            }
         ],
         "adjust_pure_negative":true,
         "boost":1.0
      }
   }
}

If I set <NUM_TERM_BUCKETS> to 2200 and execute the request, I get a warning message telling me I'm creating more than 10000 buckets (how?!).

A sample response from ES:

#! Deprecation: 299 Elasticsearch-6.7.1-2f32220 "This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests."
{
  "took": 6533,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 103456,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "apps": {
      "doc_count_error_upper_bound": 9,
      "sum_other_doc_count": 37395,
      "buckets":[...]
    }
  }
}

Even more interesting: after reducing <NUM_TERM_BUCKETS> to 2100, I get no warning message, which means the number of buckets created stays below 10000.

I've tried hard to find the reason behind this, and came up with nothing.

Is there any formula, or anything else, to calculate or estimate the number of buckets an aggregation will create before actually executing the request?

I want to know whether an aggregation will throw an error against the configured search.max_buckets in ES 7.x or later, so that I can decide whether to use a composite aggregation instead.
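As a sketch of that alternative, the apps/date_histogram pair from the request above could be rewritten as a composite aggregation and paginated (the field names come from the original request; the page size of 1000 and the agg names are arbitrary):

```json
GET /zone_stats_hourly/_search
{
   "size":0,
   "aggs":{
      "apps_per_day":{
         "composite":{
            "size":1000,
            "sources":[
               { "app":{ "terms":{ "field":"appId" } } },
               { "day":{ "date_histogram":{ "field":"processTime", "interval":"1d" } } }
            ]
         },
         "aggregations":{
            "requests":{ "sum":{ "field":"requests" } }
         }
      }
   }
}
```

Each response includes an after_key; passing it back as "after" inside the composite block fetches the next page, so no single request ever has to hold more than size buckets at once.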

UPDATE

I tried a much simpler aggregation, with no nested or sub-aggregations, against an index containing about 80000 documents.

Here's the request:

GET /my_index/_search
{
   "size":0,
   "query":{
      "match_all":{}
   },
   "aggregations":{
      "unique":{
         "terms":{
            "field":"_id",
            "size":<NUM_TERM_BUCKETS>
         }
      }
   }
}

If I set <NUM_TERM_BUCKETS> to 7000, I get this error response in ES 7.3:
{
   "error":{
      "root_cause":[
         {
            "type":"too_many_buckets_exception",
            "reason":"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
            "max_buckets":10000
         }
      ],
      "type":"search_phase_execution_exception",
      "reason":"all shards failed",
      "phase":"query",
      "grouped":true,
      "failed_shards":[
         {
            "shard":0,
            "index":"my_index",
            "node":"XYZ",
            "reason":{
               "type":"too_many_buckets_exception",
               "reason":"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
               "max_buckets":10000
            }
         }
      ]
   },
   "status":503
}

If I reduce <NUM_TERM_BUCKETS> to 6000, it runs successfully.

Seriously, I'm confused. How on earth does this aggregation create more than 10000 buckets? Can anyone answer this?

According to the Terms Aggregation documentation:

The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.

The default shard_size is (size * 1.5 + 10).

To address accuracy problems in a distributed system, Elasticsearch asks each shard for a number of terms greater than size.

Therefore, the maximum NUM_TERM_BUCKETS for a simple terms aggregation can be calculated with this formula:

maxNumTermBuckets = (search.max_buckets - 10) / 1.5

which yields 6660 for search.max_buckets = 10000.
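That calculation can be sketched in Python. The shard_size default of size * 1.5 + 10 is taken from the docs quoted above; treating it as the per-request bucket budget is my reading of the observed behaviour, not an officially documented formula:

```python
import math

def default_shard_size(size):
    """Default shard_size for a terms aggregation: size * 1.5 + 10 (per the docs)."""
    return int(size * 1.5 + 10)

def max_term_buckets(max_buckets=10000):
    """Largest terms `size` whose default shard_size stays within search.max_buckets."""
    return math.floor((max_buckets - 10) / 1.5)

print(default_shard_size(7000))  # 10510 -> exceeds the 10000 limit, hence the error
print(default_shard_size(6000))  # 9010  -> fits within the limit
print(max_term_buckets())        # 6660
```

This matches the observations in the update: size=7000 blows past the limit because its implicit shard_size is 10510, while size=6000 stays under it.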