Elasticsearch: common document counts of two aggregations
I want to find the common document counts between the aggregation of top authors and the aggregation of top co-authors, two fields in the bibliographic source data in my index.
What I am currently doing is:
1. Calculating the aggregation of the top 10 authors (A, B, C, D, ...).
2. Calculating the aggregation of the top 10 co-authors (X, Y, Z, ...).
3. Calculating the document counts of the intersection, i.e. the common document counts between these pairs:
[(A, X), (B, Y), ...]   <----- result
I tried a sub-bucket aggregation, but it gives me:
[A: (top 10 for A), B: (top 10 for B), ...]
OK, so continuing from the comment above as an answer, to make it easier to read and avoid the character limit.
Comment
I don't think you can use a pipeline aggregation to achieve this.
It's not a lot to process on the client side, I guess: only 20 records (10 for authors and 10 for co-authors), and it would be a simple aggregate query.
Another option would be to just get the top 10 across both fields, also a simple agg query.
But if you really need the intersection of both top 10s on the ES side, go with a Scripted Metric Aggregation; you can lay out your logic in the code.
The first option is straightforward:
GET index_name/_search
{
  "size": 0,
  "aggs": {
    "firstname_dupes": {
      "terms": {
        "field": "authorFullName.keyword",
        "size": 10
      }
    },
    "lastname_dupes": {
      "terms": {
        "field": "coauthorFullName.keyword",
        "size": 10
      }
    }
  }
}
Then intersect the results on the client side.
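As a concrete illustration of that client-side step, here is a minimal Python sketch that intersects the two top-10 bucket lists from the query above. The bucket layout (aggregations.<name>.buckets with key and doc_count) is the standard terms-aggregation response; the function name common_top10 and the assumption that resp is the already-parsed JSON response (from whichever client or HTTP library you use) are mine.

# Minimal sketch: intersect the two top-10 lists on the client side.
# "resp" is assumed to be the parsed JSON body returned by the
# firstname_dupes / lastname_dupes query above.
def common_top10(resp: dict) -> dict:
    author_buckets = resp["aggregations"]["firstname_dupes"]["buckets"]
    coauthor_buckets = resp["aggregations"]["lastname_dupes"]["buckets"]

    # Map each top-10 name to its document count.
    author_counts = {b["key"]: b["doc_count"] for b in author_buckets}
    coauthor_counts = {b["key"]: b["doc_count"] for b in coauthor_buckets}

    # Names appearing in both top-10 lists, with both document counts kept.
    return {
        name: {"as_author": author_counts[name], "as_coauthor": coauthor_counts[name]}
        for name in author_counts.keys() & coauthor_counts.keys()
    }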
The second one would look like:
GET index_name/_search
{
  "size": 0,
  "aggs": {
    "name_dupes": {
      "terms": {
        "script": {
          "source": "return [doc['authorFullName.keyword'].value, doc['coauthorFullName.keyword'].value]"
        },
        "size": 10
      }
    }
  }
}
But this is not the intersection of the top 10 authors and the top 10 co-authors: it ranks all names across both fields together and then takes the top 10 of that combined ranking.
The third option is to write a Scripted Metric Aggregation. I haven't spent time on the algorithmic side of things (it should be optimized), but it could look like the following. Java skills will certainly help you here. Also make sure you understand all the phases a scripted metric aggregation goes through, and the performance issues you may run into when using it.
GET index_name/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "profit": {
      "scripted_metric": {
        "init_script": "state.fnames = [:]; state.lnames = [:];",
        "map_script": """
          // Per shard: count each document once under its author name and once under its co-author name.
          def key = doc['authorFullName.keyword'];
          def value = '';
          if (key != null && key.value != null) {
            value = state.fnames[key.value];
            if (value == null) value = 0;
            state.fnames[key.value] = value + 1;
          }
          key = doc['coauthorFullName.keyword'];
          if (key != null && key.value != null) {
            value = state.lnames[key.value];
            if (value == null) value = 0;
            state.lnames[key.value] = value + 1;
          }
        """,
        "combine_script": "return state",
        "reduce_script": """
          // Across shards: take each shard's local top 10 authors and top 10 co-authors,
          // union them into two global sets, then return the names present in both.
          def intersection = [];
          def f10_global = new HashSet();
          def l10_global = new HashSet();
          for (state in states) {
            def f10_local = state.fnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e -> e.getKey()).collect(Collectors.toList());
            def l10_local = state.lnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e -> e.getKey()).collect(Collectors.toList());
            for (name in f10_local) { f10_global.add(name); }
            for (name in l10_local) { l10_global.add(name); }
          }
          for (name in f10_global) {
            if (l10_global.contains(name)) intersection.add(name);
          }
          return intersection;
        """
      }
    }
  }
}
Note that the queries here assume you have keyword sub-fields for these attributes. If not, just adjust them to your situation.
Update
PS: just noticed you mentioned that you need the common counts, not the common names. Not sure exactly which case you mean, but then use map(e->e.getValue().toString()) instead of map(e->e.getKey()). See the other answer to a similar question.
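For completeness, whichever variant you use (names via getKey() or counts via getValue()), the reduce_script's return value comes back in the search response under the aggregation's value key. A minimal Python sketch for reading it, assuming resp is the already-parsed response dict and scripted_metric_result is an illustrative helper name:

# Sketch: read the scripted_metric result from the parsed _search response.
# The reduce_script's return value is exposed under aggregations.<agg_name>.value.
def scripted_metric_result(resp: dict) -> list:
    return resp["aggregations"]["profit"]["value"]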