distinct count on hive does not match cardinality count on elasticsearch

I have loaded data from Hive into my Elasticsearch cluster using the elasticsearch-hadoop plugin.

I need to get the count of unique account numbers. I wrote the following queries in HQL and in the Query DSL, but they return different counts.

Hive query:

select count(distinct account) from <tableName> where capacity="550";

// Returns --> 71132

Similarly, the Elasticsearch query looks like this:

{
    "query": {
        "bool": {
            "must": [
                {"match": { "capacity": "550"}}
            ]
        }
    },
    "aggs": {
        "unique_account": {
            "cardinality": {
                "field": "account"
            }
        }
    }
}

// Returns --> 71607

Am I doing something wrong? What can I do to make these two queries match?

Note: the record counts in Hive and Elasticsearch are exactly the same.

"the first approximate aggregation provided by Elasticsearch is the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

For the OP:

Try the `precision_threshold` parameter of the cardinality aggregation:

"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.
...
Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
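Applied to the query above, the threshold goes directly inside the `cardinality` aggregation (a sketch reusing the OP's field names):

```json
{
    "query": {
        "bool": {
            "must": [
                {"match": { "capacity": "550"}}
            ]
        }
    },
    "aggs": {
        "unique_account": {
            "cardinality": {
                "field": "account",
                "precision_threshold": 40000
            }
        }
    }
}
```

Raising the threshold trades memory for accuracy, but as quoted above it is capped at 40,000.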

You may also want to take a look at "Support for precise cardinality aggregation #15876".

For the OP, part 2:

"I have tried several numbers..."

You have 71,132 distinct values, and the precision threshold is capped at 40,000, so the cardinality is over the threshold, which means accuracy is traded for memory savings.
That is how the chosen implementation (based on the HyperLogLog++ algorithm) works.

Even with a `precision_threshold` of 40,000, cardinality cannot guarantee an accurate count. There is another way to get an exact distinct count for a field.

The article "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, along with its accuracy relative to cardinality.
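As a sketch of one exact alternative (not necessarily the exact method the article uses, and requiring Elasticsearch 6.1+): a composite aggregation can page through every unique account value, and the client sums the bucket counts itself. The query below reuses the OP's filter and field names:

```json
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"match": { "capacity": "550"}}
            ]
        }
    },
    "aggs": {
        "unique_accounts": {
            "composite": {
                "size": 10000,
                "sources": [
                    { "account": { "terms": { "field": "account" } } }
                ]
            }
        }
    }
}
```

Each response returns up to 10,000 buckets plus an `after_key`; resubmit the request with `"after": <after_key>` inside the `composite` block until no buckets come back, counting buckets as you go. This is exact, unlike HyperLogLog++, but costs more round trips and server work than a single cardinality aggregation.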