Distinct count on Hive does not match cardinality count on Elasticsearch
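I have loaded data from Hive into my Elasticsearch cluster using Elastic's elasticsearch-hadoop plugin.
I need to get a count of unique account numbers. I wrote the following queries in HQL and Query DSL, but they return different counts.
Hive query: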
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch, the query looks like this:
{
  "query": {
    "bool": {
      "must": [
        {"match": {"capacity": "550"}}
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}
// Returns --> 71607
Am I doing something wrong? What can I do to make the two queries match?
Note:
The number of records in Hive and in Elasticsearch is exactly the same.
"the first approximate aggregation provided by Elasticsearch is
the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an
approximate algorithm. It is based on the HyperLogLog++ (HLL)
algorithm."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
For the OP:
precision_threshold
"precision_threshold accepts a number from 0–40,000. Larger values are
treated as equivalent to 40,000.
...
Although not guaranteed by the
algorithm, if a cardinality is under the threshold, it is almost
always 100% accurate. Cardinalities above this will begin to trade
accuracy for memory savings, and a little error will creep into the
metric."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
You may also want to look at "Support for precise cardinality aggregation #15876".
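To make the precision_threshold setting quoted above concrete, here is a minimal Python sketch that builds the question's search body with the threshold set explicitly. The helper name build_cardinality_query is made up for illustration, and "size": 0 is an addition that only suppresses the hit list. The question's query does not set precision_threshold, so Elasticsearch falls back to its default, which the reference documentation gives as 3000.

```python
import json


def build_cardinality_query(capacity, field, precision_threshold=40000):
    """Build a search body like the one in the question, with an explicit
    precision_threshold on the cardinality aggregation.

    This helper is hypothetical; it only assembles a dict and talks to no cluster.
    """
    return {
        "size": 0,  # assumption: we only want the aggregation, not the hits
        "query": {"bool": {"must": [{"match": {"capacity": capacity}}]}},
        "aggs": {
            "unique_account": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_cardinality_query("550", "account"), indent=2))
```

Raising the threshold should narrow the gap, but with roughly 71,000 distinct accounts the result is still an estimate.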
For the OP, 2:
"I have tried several numbers..."
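You have 71,132 distinct values, while the precision threshold is capped at 40,000, so the cardinality is over the threshold, which means accuracy is being traded for memory savings.
This is how the chosen implementation (based on the HyperLogLog++ algorithm) works.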
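To put the size of the discrepancy in perspective, the two counts reported in the question can be compared directly. This is just arithmetic on the numbers from the post; the function name relative_error is an illustrative choice.

```python
def relative_error(approx, exact):
    """Relative difference between an approximate and an exact count."""
    return abs(approx - exact) / exact


hive_count = 71132  # exact count(distinct account) from Hive
es_count = 71607    # approximate cardinality from Elasticsearch

err = relative_error(es_count, hive_count)
print(f"absolute difference: {es_count - hive_count}")  # 475
print(f"relative error: {err:.2%}")                     # 0.67%
```

An overestimate of well under 1% is the kind of small error the quoted documentation describes for cardinalities above the threshold.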
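Even with a precision_threshold of 40000, the cardinality aggregation does not guarantee an exact count. There is another way to get an exact distinct count for a field.
The article "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, along with its accuracy compared to the cardinality aggregation.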
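The article's exact method is not reproduced here. As one possible way to get an exact distinct count, the sketch below pages through a composite aggregation and counts the buckets. It assumes an Elasticsearch version that supports composite aggregations and a field that can be used in a terms source (for example a keyword field). The search argument is a hypothetical callable that takes a request body and returns the parsed JSON response, so the function stays independent of any particular client.

```python
def exact_distinct_count(search, field, query, page_size=1000):
    """Count distinct values of `field` by paging through a composite aggregation.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    Each bucket corresponds to one distinct value, so the bucket total is exact.
    """
    total = 0
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume after the last key of the previous page.
            composite["after"] = after
        body = {
            "size": 0,  # only the aggregation is needed, not the hits
            "query": query,
            "aggs": {"distinct": {"composite": composite}},
        }
        agg = search(body)["aggregations"]["distinct"]
        buckets = agg["buckets"]
        total += len(buckets)
        after = agg.get("after_key")
        # Stop once a page comes back empty or no resume key is returned.
        if not buckets or after is None:
            return total
```

For the question's filter, query would be the same bool/match clause on capacity. The trade-off is several round trips and more work on the cluster in exchange for an exact number.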