Elasticsearch cardinality issue
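The cardinality aggregation calculates an approximate count of distinct values. But why does it return an incorrect value even for an index that is stored in a single shard?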
GET /jobs/_settings
{
  "jobs": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        ...
The position_id field is mapped as long.
GET /jobs/_search
{
  "size": 0,
  "aggs": {
    "count_position_id": {
      "value_count": {
        "field": "position_id"
      }
    },
    "unique_position_id": {
      "cardinality": {
        "field": "position_id",
        "precision_threshold": 40000
      }
    }
  }
}
{
  "took": 44,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 52836,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique_position_id": {
      "value": 52930
    },
    "count_position_id": {
      "value": 52836
    }
  }
}
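This has more to do with the algorithm used to calculate the cardinality than with the fact that only a single shard is involved.
The ES cardinality agg is built on HLL (HyperLogLog), an approximate counting algorithm: it hashes each value and estimates the number of unique values from patterns observed in the binary representation of those hashes.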
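To make the idea concrete, here is a minimal sketch of the classic HyperLogLog estimator in Python. It is not the Elasticsearch implementation (ES uses a tuned variant with its own hash function and bias correction); the function name `hll_estimate`, the SHA-1 hash, and the register count `p=14` are assumptions chosen only for illustration.

```python
import hashlib
import math


def hll_estimate(values, p=14):
    """Estimate the number of distinct values with 2**p registers (illustrative only)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                  # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)         # standard constant for large m
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    n = 52836  # same order of magnitude as the index in the question
    print(round(hll_estimate(range(n))), "estimated vs", n, "actual")
```

The estimate lands close to the true count but generally not exactly on it, and it can come out slightly above as well as slightly below, which is the same behaviour seen in the response above.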
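You can control the accuracy with precision_threshold: counts below the threshold are expected to be close to exact, while counts above it become fuzzier. Note that 40000, the value already used in the query, is the maximum supported value, and the roughly 52,000 values in this index exceed it, so some error is expected. By definition this is an "approximate count", so the result is not really incorrect.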
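As a quick sanity check on how far off the result is, the two numbers from the response can be compared directly. This is only arithmetic on the posted values; the helper name `relative_error` is made up for the example, and the document count is used as an upper bound on the true distinct count, since the real number of distinct position_id values is not known from the post.

```python
def relative_error(estimate, reference):
    """Relative difference between an estimate and a reference value."""
    return abs(estimate - reference) / reference


reported_unique = 52930   # "unique_position_id" (cardinality) from the response
document_count = 52836    # "count_position_id" (value_count) from the response

# The distinct count cannot exceed the number of values, so this is a lower
# bound on the actual error of the cardinality estimate.
overshoot = relative_error(reported_unique, document_count)
print(f"overshoot relative to value_count: {overshoot:.2%}")
```

An overshoot of well under one percent is in line with what an HLL-based estimate is expected to produce at this scale.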