How to group documents in Elasticsearch and get the documents in each group?

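My Elasticsearch index contains products with a denormalized m:n relationship to categories.

My goal is to derive a categories index from it that holds the same information with the relationship inverted.

The index looks like this: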

PUT /products
{
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "article_id": {
                "type": "keyword"
            },
            "categories": {
                "type": "nested",
                "properties": {
                    "cat_name": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

It contains documents created like this:

POST /products/_doc
{
  "name": "radio",
  "article_id": "1001",
  "categories": [
    { "cat_name": "audio" },
    { "cat_name": "electronics" }
  ]
}

POST /products/_doc
{
  "name": "fridge",
  "article_id": "1002",
  "categories": [
    { "cat_name": "appliances" },
    { "cat_name": "electronics" }
  ]
}

I would like to get something like this out of Elasticsearch:

{
  "name": "appliances",
  "products": [
    { 
      "name": "fridge",
      "article_id": "1002"
    }
  ]
},
{
  "name": "audio",
  "products": [
    { 
      "name": "radio",
      "article_id": "1001"
    }
  ]
},
{
  "name": "electronics",
  "products": [
    { 
      "name": "fridge",
      "article_id": "1002"
    },
    { 
      "name": "radio",
      "article_id": "1001"
    }
  ]
}

The result would eventually be put into an index such as:

PUT /categories
{
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "products": {
                "type": "nested",
                "properties": {
                    "name": {
                        "type": "keyword"
                    },
                    "article_id": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

I cannot figure out how to do this without loading and grouping all products programmatically. This is what I have tried:

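  1. Bucket aggregation on the field categories.cat_name

    This gives me the document count per category, but not the product documents. Using a top_hits sub-aggregation seems to be limited to 100 documents.

  2. Grouping using field collapsing with expansion

    Collapsing is only possible on a single-valued field.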

I am using Elasticsearch 8.1.
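
For reference, the client-side grouping that the question hopes to avoid could look roughly like the sketch below. It is only an illustration of the intended inversion, working on product documents that have already been loaded into memory; the function name `invert_products` is made up for this example.

```python
from collections import defaultdict


def invert_products(products):
    """Turn product documents into category documents (hypothetical helper)."""
    grouped = defaultdict(list)
    for product in products:
        # Each product may belong to several categories (m:n relationship).
        for cat in product.get("categories", []):
            grouped[cat["cat_name"]].append(
                {"name": product["name"], "article_id": product["article_id"]}
            )
    # Sort categories and their products by name for a stable result.
    return [
        {"name": name, "products": sorted(items, key=lambda p: p["name"])}
        for name, items in sorted(grouped.items())
    ]
```

This needs every product in memory at once, which is why a server-side aggregation is attractive for larger indices.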

The query you need is this:

POST products/_search
{
  "size": 0,
  "aggs": {
    "cats": {
      "nested": {
        "path": "categories"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "categories.cat_name",
            "size": 10
          },
          "aggs": {
            "root": {
              "reverse_nested": {},
              "aggs": {
                "products": {
                  "terms": {
                    "field": "name",
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
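
Both terms aggregations above use "size": 10, so at most 10 categories and 10 products per category come back. The sketch below builds the same request body as a Python dict with those limits as parameters. It also adds a sub-aggregation on article_id as one possible way to recover the ID. The names `build_category_query` and `article_ids` are assumptions for illustration, and the `size` of 1 on `article_ids` assumes each product name maps to exactly one article ID.

```python
import json


def build_category_query(max_categories=10, max_products=10):
    """Build the nested / reverse_nested aggregation body (hypothetical helper)."""
    return {
        "size": 0,  # no search hits needed, only aggregations
        "aggs": {
            "cats": {
                "nested": {"path": "categories"},
                "aggs": {
                    "categories": {
                        "terms": {
                            "field": "categories.cat_name",
                            "size": max_categories,
                        },
                        "aggs": {
                            "root": {
                                # Step back from the nested category to the product.
                                "reverse_nested": {},
                                "aggs": {
                                    "products": {
                                        "terms": {
                                            "field": "name",
                                            "size": max_products,
                                        },
                                        "aggs": {
                                            # Assumption: one article_id per name.
                                            "article_ids": {
                                                "terms": {
                                                    "field": "article_id",
                                                    "size": 1,
                                                }
                                            }
                                        },
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted after POST products/_search.
    print(json.dumps(build_category_query(100, 1000), indent=2))
```

Very large `size` values make the request more expensive, so it is worth sizing them to the actual number of categories and products.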

It produces what you need (minus the article ID, but that is easy to add):

    "buckets" : [
      {
        "key" : "electronics",
        "doc_count" : 2,
        "root" : {
          "doc_count" : 2,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fridge",
                "doc_count" : 1
              },
              {
                "key" : "radio",
                "doc_count" : 1
              }
            ]
          }
        }
      },
      {
        "key" : "appliances",
        "doc_count" : 1,
        "root" : {
          "doc_count" : 1,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fridge",
                "doc_count" : 1
              }
            ]
          }
        }
      },
      {
        "key" : "audio",
        "doc_count" : 1,
        "root" : {
          "doc_count" : 1,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "radio",
                "doc_count" : 1
              }
            ]
          }
        }
      }
    ]
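
To get from these buckets to the category documents shown in the question, the response still has to be reshaped on the client. The sketch below is one way to do that, assuming the full response nests the buckets under aggregations.cats.categories, as the aggregation names in the query imply. The function name `buckets_to_categories` is made up, and the optional `article_ids` sub-aggregation is hypothetical: it is only read if the query was extended with a terms sub-aggregation of that name under `products`.

```python
def buckets_to_categories(response):
    """Reshape the aggregation response into category documents (hypothetical helper)."""
    buckets = response["aggregations"]["cats"]["categories"]["buckets"]
    docs = []
    for cat in buckets:
        products = []
        for bucket in cat["root"]["products"]["buckets"]:
            product = {"name": bucket["key"]}
            # Hypothetical sub-aggregation; absent in the response shown above.
            id_buckets = bucket.get("article_ids", {}).get("buckets", [])
            if id_buckets:
                product["article_id"] = id_buckets[0]["key"]
            products.append(product)
        docs.append({"name": cat["key"], "products": products})
    # The terms aggregation orders by doc_count; sort by name instead.
    return sorted(docs, key=lambda d: d["name"])
```

Each returned dict has the shape of the categories mapping from the question and could then be indexed, for example through the bulk API.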