Elasticsearch group by field

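I want to group search results by a field. For example, my data has multiple usernames for each userId, so in the search results I want every userId grouped together with its usernames.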

I am currently using an aggregation and can group by userId, but I cannot retrieve the corresponding list of usernames. This is what I get:

"aggregations" : {
"by_user_id" : {
  "after_key" : {
    "group_by_search" : 2335
  },
  "buckets" : [
    {
      "key" : {
        "group_by_search" : 2
      },
      "doc_count" : 2
    },
    {
      "key" : {
        "group_by_search" : 1000
      },
      "doc_count" : 4
    },
    {
      "key" : {
        "group_by_search" : 2335
      },
      "doc_count" : 2
    }
  ]
}
}

What I want is:

"aggregations" : {
"by_corp_id" : {
  "after_key" : {
    "group_by_search" : 2335
  },
  "buckets" : [
    {
      "key" : {
        "group_by_search" : 2,
        "usernames":[1111,222] // list of usernames having the same userId
      },
      "doc_count" : 2
    },
    {
      "key" : {
        "group_by_search" : 1000,
        "usernames":[11 ,0101,1199,222] // list of usernames having the same userId
      },
      "doc_count" : 4
    },
    {
      "key" : {
        "group_by_search" : 2335,
        "usernames":[1111,222] // list of usernames having the same userId
      },
      "doc_count" : 2
    }
  ]
}
}

Is there a way to achieve this directly with aggregations in Elasticsearch?

Update: I am using the following aggregation:

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        }
    }
}

Thanks.

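You can use a top hits aggregation as a sub-aggregation to get the list of all usernames that share the same ID. Note that top_hits returns only the top three hits per bucket by default, so set its size if a user can have more usernames than that.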

Here is a working example.

Index data:

{
  "usernames": 3,
  "user_id": 2
}
{
  "usernames": 1,
  "user_id": 1
}
{
  "usernames": 2,
  "user_id": 1
}

Search query:

{
  "size": 0,
  "aggregations": {
    "by_user_id": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "group_by_search": {
              "terms": {
                "field": "user_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "list_names": {
          "top_hits": {
            "_source": {
              "includes": [
                "usernames"
              ]
            }
          }
        }
      }
    }
  }
}

Search result:

"aggregations": {
    "by_user_id": {
      "after_key": {
        "group_by_search": 2      
      },
      "buckets": [
        {
          "key": {
            "group_by_search": 1        // note this
          },
          "doc_count": 2,
          "list_names": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "1",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 1             // note this
                  }
                },
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "2",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 2           // note this
                  }
                }
              ]
            }
          }
        },
        {
          "key": {
            "group_by_search": 2
          },
          "doc_count": 1,
          "list_names": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "3",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 3       
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
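
To turn that response into the userId-to-usernames mapping the question asks for, the hits can be flattened on the client side. This is only a minimal sketch: the function name `usernames_from_top_hits` is made up for illustration, and `sample` is a trimmed copy of the response above rather than live output.

```python
def usernames_from_top_hits(response):
    """Map each composite bucket's user_id to the usernames in its top_hits."""
    grouped = {}
    for bucket in response["aggregations"]["by_user_id"]["buckets"]:
        user_id = bucket["key"]["group_by_search"]
        hits = bucket["list_names"]["hits"]["hits"]
        # One entry per returned hit, so duplicates are kept as-is.
        grouped[user_id] = [hit["_source"]["usernames"] for hit in hits]
    return grouped


# Trimmed copy of the search result shown above (assumed shape).
sample = {
    "aggregations": {
        "by_user_id": {
            "buckets": [
                {
                    "key": {"group_by_search": 1},
                    "list_names": {"hits": {"hits": [
                        {"_source": {"usernames": 1}},
                        {"_source": {"usernames": 2}},
                    ]}},
                },
                {
                    "key": {"group_by_search": 2},
                    "list_names": {"hits": {"hits": [
                        {"_source": {"usernames": 3}},
                    ]}},
                },
            ]
        }
    }
}

print(usernames_from_top_hits(sample))  # {1: [1, 2], 2: [3]}
```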

All you need to do is add a terms sub-aggregation on the username field, so that each bucket gets the list of all its unique usernames:

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        },
        "aggs": {
            "username": {
                "terms": {
                    "field": "username",
                    "size": 1000
                }
            }
        }
    }
}
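
With this query, each composite bucket should carry a nested `username` object whose own `buckets` hold one entry per distinct username. The sketch below shows one way to read that on the client; the function name and the `sample` response are illustrative assumptions (the values are borrowed from the question, not from a real run). It also flags buckets where `sum_other_doc_count` is non-zero, which suggests the inner terms `size` cut off some usernames.

```python
def usernames_from_terms(response):
    """Return ({user_id: [usernames]}, [user_ids whose username list was cut off])."""
    grouped = {}
    truncated = []
    for bucket in response["aggregations"]["by_user_id"]["buckets"]:
        user_id = bucket["key"]["group_by_search"]
        sub = bucket["username"]
        grouped[user_id] = [b["key"] for b in sub["buckets"]]
        # Documents left outside the returned terms buckets mean the list is incomplete.
        if sub.get("sum_other_doc_count", 0) > 0:
            truncated.append(user_id)
    return grouped, truncated


# Hypothetical response in the shape a terms sub-aggregation returns.
sample = {
    "aggregations": {
        "by_user_id": {
            "buckets": [
                {
                    "key": {"group_by_search": 2},
                    "doc_count": 2,
                    "username": {
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": 222, "doc_count": 1},
                            {"key": 1111, "doc_count": 1},
                        ],
                    },
                },
                {
                    "key": {"group_by_search": 1000},
                    "doc_count": 4,
                    "username": {
                        "sum_other_doc_count": 2,
                        "buckets": [
                            {"key": 11, "doc_count": 1},
                            {"key": 1199, "doc_count": 1},
                        ],
                    },
                },
            ]
        }
    }
}

grouped, truncated = usernames_from_terms(sample)
print(grouped)
print(truncated)
```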

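top_hits is also possible, but you would get a lot of duplicates, and you would need to return a large number of hits to make sure you have every distinct username.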

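If your username field has high cardinality (more than 1000 distinct usernames), it is better to move the terms aggregation on username into the composite sources array and page through all the buckets yourself, like this: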

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                },
                {
                    "group_by_username": {
                        "terms": {
                            "field": "username",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        }
    }
}
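
Paging through those buckets means resending the request with the previous response's `after_key` as the composite `after` parameter until no buckets come back. Below is a rough sketch of that loop. `search` is a hypothetical callable that takes a request body and returns the response as a dict (in practice it would wrap whatever client you use), and `fake_search` is only an in-memory stand-in so the loop can be run without a cluster.

```python
from collections import defaultdict


def collect_usernames(search, page_size=1000):
    """Page through (user_id, username) composite buckets and group them by user_id."""
    sources = [
        {"group_by_search": {"terms": {"field": "user_id", "order": "asc"}}},
        {"group_by_username": {"terms": {"field": "username", "order": "asc"}}},
    ]
    grouped = defaultdict(list)
    after_key = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after_key is not None:
            composite["after"] = after_key  # resume after the last bucket seen
        body = {"size": 0, "aggs": {"by_user_id": {"composite": composite}}}
        agg = search(body)["aggregations"]["by_user_id"]
        for bucket in agg["buckets"]:
            key = bucket["key"]
            grouped[key["group_by_search"]].append(key["group_by_username"])
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break
    return dict(grouped)


def fake_search(body):
    """In-memory stand-in for a real search call (illustration only)."""
    pairs = [(1, "alice"), (1, "bob"), (2, "carol")]
    composite = body["aggs"]["by_user_id"]["composite"]
    after = composite.get("after")
    if after is not None:
        last = (after["group_by_search"], after["group_by_username"])
        pairs = [p for p in pairs if p > last]
    page = pairs[: composite["size"]]
    buckets = [
        {"key": {"group_by_search": u, "group_by_username": n}, "doc_count": 1}
        for u, n in page
    ]
    agg = {"buckets": buckets}
    if buckets:
        agg["after_key"] = buckets[-1]["key"]
    return {"aggregations": {"by_user_id": agg}}


print(collect_usernames(fake_search, page_size=2))
```

Because each bucket is one distinct (user_id, username) pair, the resulting lists contain no duplicates, and the inner terms `size` limit no longer applies.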