Cardinality aggregation vs Terms aggregation with calculating bucket size

I'm using Elasticsearch 2.4 and want to get distinct counts of various entities in my data. I've experimented with many queries, including two ways of computing a distinct count: one via a cardinality aggregation, and the other via a terms aggregation, where the distinct count is the number of buckets returned. With the former approach I found the counts can be wrong and inaccurate, but it is faster and relatively simple. My data is large and will keep growing over time, so I don't know how the cardinality aggregation will behave as it grows, whether it will become more or less accurate. I wanted some advice from people who have run into this problem before, and which approach they chose.
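For reference, the two approaches can be sketched in a single search request like this (the index field `entity_id` is a placeholder for whatever field you are counting; `"size": 0` on a terms aggregation meant "return all buckets" in ES 2.x):

```json
{
  "size": 0,
  "aggs": {
    "distinct_count_approx": {
      "cardinality": { "field": "entity_id" }
    },
    "distinct_count_exact": {
      "terms": { "field": "entity_id", "size": 0 }
    }
  }
}
```

The first aggregation returns a single approximate number; with the second, you count the buckets in the response yourself, which is exact but can be expensive on high-cardinality fields.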

Cardinality aggregation takes an additional parameter, precision_threshold:

The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.

  • Configurable precision, which decides how to trade memory for accuracy,
  • Excellent accuracy on low-cardinality sets,
  • Fixed memory usage: whether there are hundreds or billions of unique values, memory usage only depends on the configured precision.

In short, cardinality can give you an exact count up to a cardinality of 40000, after which it gives an approximate count. The higher the precision_threshold, the higher the memory cost and the higher the accuracy. For very high cardinalities, it can only give you an approximate count.
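A minimal sketch of raising the threshold to its maximum (again, `entity_id` is a placeholder field name):

```json
{
  "size": 0,
  "aggs": {
    "distinct_entities": {
      "cardinality": {
        "field": "entity_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

With this setting, counts are exact up to 40000 unique values, at the cost of extra memory per aggregation bucket.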

To add to what Rahul said in the answer below: cardinality will give you an approximate count, yes, but if you set the precision threshold to its maximum value of 40000, it will give you accurate results up to 40000. Above that the error rate increases, but more importantly it never goes above 1%, even up to 10 million documents.
See the screenshot below.
Source: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-metrics-cardinality-aggregation.html

Also, if we look at it from the user's perspective: if the count of 10 million documents, or for that matter even a million documents, is off by 1%, it will not make much of a difference and will go unnoticed. And when the user wants to look at the actual data, he will do a search anyway, which will return accurate results.