How to count the number of unique documents by a nested field in Elasticsearch?
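I'm trying to count the documents that have a unique nested field value (and, as a next step, fetch the documents themselves). Fetching the unique documents seems to work.
But when I try to execute the count request, I get the following error: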
Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/package/_count?ignore_throttled=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216}],"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216},"status":400}
Code:
BoolQueryBuilder innerTemplNestedBuilder = QueryBuilders.boolQuery();
NestedQueryBuilder templatesNestedQuery = QueryBuilders.nestedQuery("attachment", innerTemplNestedBuilder, ScoreMode.None);
BoolQueryBuilder mainQueryBuilder = QueryBuilders.boolQuery().must(templatesNestedQuery);
if (!isEmpty(templateName)) {
    innerTemplNestedBuilder.filter(QueryBuilders.termQuery("attachment.name", templateName));
}
SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.searchSource()
    .collapse(new CollapseBuilder("attachment.uuid"))
    .query(mainQueryBuilder);
// The next line causes the error
long count = client.count(new CountRequest("package").source(searchSourceBuilder), RequestOptions.DEFAULT).getCount();
// This works
SearchResponse searchResponse = client.search(
    new SearchRequest(
        new String[] {"package"},
        searchSourceBuilder.timeout(new TimeValue(20, TimeUnit.SECONDS)).from(offset).size(limit)
    ).indices("package").searchType(SearchType.DFS_QUERY_THEN_FETCH),
    RequestOptions.DEFAULT
);
return ....;
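The overall intent of this approach is to get a page of documents together with the total number of such documents. There may already be another way to satisfy this need. If I try to get the count with aggregations and cardinality, the result is zero, so it looks like that does not work on nested fields.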
Count request:
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "bool": {
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            "path": "attachment",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "collapse": {
    "field": "attachment.uuid"
  }
}
How the mapping is created:
curl -X DELETE "localhost:9200/package?pretty"
curl -X PUT "localhost:9200/package?include_type_name=true&pretty" -H 'Content-Type: application/json' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}'
curl -X PUT "localhost:9200/package/_mappings?pretty" -H 'Content-Type: application/json' -d'
{
  "dynamic": false,
  "properties" : {
    "attachment": {
      "type": "nested",
      "properties": {
        "uuid" : { "type" : "keyword" },
        "name" : { "type" : "text" }
      }
    },
    "uuid" : {
      "type" : "keyword"
    }
  }
}
'
The query generated by the code should look like this:
curl -X POST "localhost:9200/package/_count?&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "bool": {
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            "path": "attachment",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "collapse": {
    "field": "attachment.uuid"
  }
}'
Collapse can only be used in the _search context, not in _count.
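A related point, stated with some caution: as far as I recall from the field collapsing documentation, the total hit count of a collapsed _search still reports the matching documents before collapsing, so it does not give the number of unique groups either. The plain-Java sketch below only illustrates that distinction; it does not call Elasticsearch, and the class name `CollapseSketch`, the method `collapse`, and the document ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: mimics "one hit per collapse key" on an in-memory list.
final class CollapseSketch {

    // Each hit is {docId, collapseKey}; the first hit seen per key is kept.
    static List<String[]> collapse(List<String[]> hits) {
        Map<String, String[]> firstPerKey = new LinkedHashMap<>();
        for (String[] hit : hits) {
            firstPerKey.putIfAbsent(hit[1], hit);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        List<String[]> hits = List.of(
            new String[] {"doc-1", "04144e14-62c3-11ea-bc55-0242ac130003"},
            new String[] {"doc-2", "04144e14-62c3-11ea-bc55-0242ac130003"},
            new String[] {"doc-3", "100b9632-62c3-11ea-bc55-0242ac130003"}
        );
        // Number of matching documents, before collapsing.
        System.out.println("matching documents: " + hits.size());      // 3
        // Number of groups left after collapsing by key.
        System.out.println("collapsed groups: " + collapse(hits).size()); // 2
    }
}
```

The two numbers differ as soon as two documents share a key, which is why the unique count needs an aggregation rather than _count or the search total.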
Secondly, what does your query even do? It carries a lot of redundant parameters, such as boost: 1. You might as well write:
POST /package/_count?&pretty
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "attachment",
            "query": {
              "match_all": {}
            }
          }
        }
      ]
    }
  }
}
This doesn't really do anything :)
To answer your original question about "counting documents with unique nested field value", suppose there are 3 documents, 2 of which share the same attachment.uuid value:
[
  {
    "attachment": {
      "uuid": "04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment": {
      "uuid": "04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment": {
      "uuid": "100b9632-62c3-11ea-bc55-0242ac130003"
    }
  }
]
To get a terms breakdown of the uuid values, run
GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "subagg": {
          "terms": {
            "field": "attachment.uuid"
          }
        }
      }
    }
  }
}
yielding
...
{
  "aggregations": {
    "nested_uniques": {
      "doc_count": 3,
      "subagg": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "04144e14-62c3-11ea-bc55-0242ac130003",
            "doc_count": 2
          },
          {
            "key": "100b9632-62c3-11ea-bc55-0242ac130003",
            "doc_count": 1
          }
        ]
      }
    }
  }
}
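To make the buckets above concrete, here is a small plain-Java sketch of the grouping that the terms sub-aggregation performs over the nested attachment objects. It runs on an in-memory list rather than against Elasticsearch, and the names `TermsBucketsSketch` and `buckets` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: counts how many nested attachment objects carry each uuid.
final class TermsBucketsSketch {

    static Map<String, Integer> buckets(List<String> attachmentUuids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String uuid : attachmentUuids) {
            // Start at 1 for a new key, otherwise add 1 to the existing count.
            counts.merge(uuid, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The attachment.uuid values of the three sample documents.
        List<String> uuids = List.of(
            "04144e14-62c3-11ea-bc55-0242ac130003",
            "04144e14-62c3-11ea-bc55-0242ac130003",
            "100b9632-62c3-11ea-bc55-0242ac130003"
        );
        buckets(uuids).forEach((key, docCount) ->
            System.out.println(key + " -> " + docCount));
    }
}
```

The number of keys in the resulting map (2 here) is the unique count; the terms aggregation itself reports the per-key counts, not that total.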
To get the count of parent documents with a unique nested field value, we have to be a bit more clever:
GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "scripted_uniques": {
          "scripted_metric": {
            "init_script": "state.my_map = [:];",
            "map_script": """
              if (doc.containsKey('attachment.uuid')) {
                state.my_map[doc['attachment.uuid'].value.toString()] = 1;
              }
            """,
            "combine_script": """
              def sum = 0;
              for (c in state.my_map.entrySet()) {
                sum += 1
              }
              return sum
            """,
            "reduce_script": """
              def sum = 0;
              for (agg in states) {
                sum += agg;
              }
              return sum;
            """
          }
        }
      }
    }
  }
}
which returns
...
{
  "aggregations": {
    "nested_uniques": {
      "doc_count": 3,
      "scripted_uniques": {
        "value": 2
      }
    }
  }
}
And this scripted_uniques: 2 is exactly what you're after.
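One caveat, offered tentatively: the reduce_script above adds up per-shard counts. With the single-shard index from the question (number_of_shards is 1) that equals the number of distinct uuids, but if the same uuid lived on two shards it would be counted twice. The plain-Java sketch below simulates the sum and a variant in which each shard hands back its key set and the reduce step takes the union. The variant is my assumption about how the scripts could be adapted, not something run against a cluster, and all class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: simulates the combine/reduce phases on in-memory "shards".
final class ScriptedMetricSketch {

    // map + combine as in the request above: distinct uuids seen on one shard.
    static int combineCount(List<String> shardUuids) {
        return new HashSet<>(shardUuids).size();
    }

    // reduce as in the request above: add up the per-shard counts.
    static int reduceBySum(List<List<String>> shards) {
        int sum = 0;
        for (List<String> shard : shards) {
            sum += combineCount(shard);
        }
        return sum;
    }

    // Assumed variant: merge the per-shard key sets, then count the union.
    static int reduceByUnion(List<List<String>> shards) {
        Set<String> all = new HashSet<>();
        for (List<String> shard : shards) {
            all.addAll(shard);
        }
        return all.size();
    }

    public static void main(String[] args) {
        String a = "04144e14-62c3-11ea-bc55-0242ac130003";
        String b = "100b9632-62c3-11ea-bc55-0242ac130003";

        List<List<String>> oneShard = List.of(List.of(a, a, b));
        List<List<String>> twoShards = List.of(List.of(a, b), List.of(a));

        System.out.println(reduceBySum(oneShard));    // 2
        System.out.println(reduceBySum(twoShards));   // 3, uuid a is counted on both shards
        System.out.println(reduceByUnion(twoShards)); // 2
    }
}
```

If an approximate count is acceptable, a cardinality aggregation on attachment.uuid placed inside the same nested aggregation may be the simpler built-in route; I have not verified it against this mapping.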
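Note: I solved this use case with a nested scripted metric aggregation, but if anyone knows a cleaner approach, I'd be more than happy to learn it!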