MongoDB: group by multiple fields, return the newest document with all fields

Question

I have a collection with data like the following:

project  | environment | timestamp
----------------------------------------
project1 | dev         | 1644515845
project1 | dev         | 1644513211
project1 | qa          | 1644515542
project2 | dev         | 1644513692
project2 | qa          | 1644514822

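There are multiple projects, and each project has multiple environments. Each (project, environment) pair has multiple timestamps, each corresponding to a time the project was changed.

Is there a query to group by (project, environment), get the entry with the newest timestamp for each (project, environment) combination, and then return the entire document?

Something like: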

db.collection.aggregate([
  {
    "$group": {
      "_id": {
        "env": "$env",
        "project": "$project"
      },
      "timestamp": {
        "$max": "$timestamp"
      }
    }
  }
])

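However, it should return the entire document.

My attempts can be found here and here.

The first attempt does not return the entire document. The second attempt returns a document whose timestamp is wrong: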

  {
    "_id": {
      "env": "dev",
      "project": "project1"
    },
    "doc": {
      "_id": 1,
      "env": "dev",
      "project": "project1",
      "timestamp": 1.644515845e+09
    },
    "timestamp": 1.644519211e+09
  },

A possible working solution is here, though I would like to know whether there is a better way.

Answer 1

{ "_id" : 1, "project" : "project1", "env" : "dev", "timestamp" : 1644515845 }

Is there a query to group by (project, environment), get the entry with the newest timestamp for each combination of (project, environment), and then return the entire document?

Here is an aggregation query that produces the desired result. The query runs in the newer mongosh client or in the legacy mongo shell.

// Define the aggregation pipeline with various stages.
var pipeline = [

// Sorting by project+env+timestamp gives the last document for each group (project+env)
// as the latest (highest) timestamp.
{ 
    $sort: { 
        project: 1, 
        env: 1, 
        timestamp: 1 
    } 
},

// Grouping on project+env, and get the last document for the group -
// this is the latest of the group - use the "$last" operator. 
// The aggregation system variable "$$ROOT" references 
// the current top level document (with all fields).
{ 
    $group: { 
        _id: { project: "$project", env: "$env" },
        // newest_timestamp: { "$last": "$timestamp" },
        newest_document: { "$last": "$$ROOT" },
    }
},

// Make "newest_document" the root (top-level) document.
{ 
    $replaceWith: "$newest_document" 
},

// Optionally, sort the documents by project+ env
{ 
    $sort: { 
        project: 1, 
        env: 1 
    } 
}
]

// Run the query using the pipeline
db.collection.aggregate(pipeline)
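
To make the expected result concrete, here is a minimal plain JavaScript sketch that mimics the $sort, $group with $last, and $replaceWith stages in memory, without a MongoDB server. The sample values come from the question's table; every _id except 1 is an assumption, and the field is named env as in the pipeline. This only illustrates the semantics and is not how the server executes the pipeline.

```javascript
// Sample documents from the question (only _id 1 appears in the thread; the rest are assumed).
const docs = [
  { _id: 1, project: "project1", env: "dev", timestamp: 1644515845 },
  { _id: 2, project: "project1", env: "dev", timestamp: 1644513211 },
  { _id: 3, project: "project1", env: "qa", timestamp: 1644515542 },
  { _id: 4, project: "project2", env: "dev", timestamp: 1644513692 },
  { _id: 5, project: "project2", env: "qa", timestamp: 1644514822 },
];

// Mimic { $sort: { project: 1, env: 1, timestamp: 1 } }.
const sorted = [...docs].sort(
  (a, b) =>
    a.project.localeCompare(b.project) ||
    a.env.localeCompare(b.env) ||
    a.timestamp - b.timestamp
);

// Mimic $group with { $last: "$$ROOT" }: for each (project, env) key,
// a later document in sort order overwrites an earlier one.
const newestByGroup = new Map();
for (const doc of sorted) {
  newestByGroup.set(`${doc.project}|${doc.env}`, doc);
}

// Mimic $replaceWith: the stored documents themselves become the results.
const newest = [...newestByGroup.values()];
console.log(newest);
```

Each (project, env) pair ends up with exactly one document, the one carrying the highest timestamp in its group.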

Answer 2

A few corrections:

First, sort by timestamp in descending order.
Second, use the $first operator in the $group stage to select the newest document.

db.collection.aggregate([
  { "$sort": { "timestamp": -1 } },
  {
    "$group": {
      "_id": {
        "env": "$env",
        "project": "$project"
      },
      "doc": { "$first": "$$ROOT" }
    }
  }
])

Playground

As another optional stage, use a $project or $unset stage to remove the _id field from the result, since it is not needed:

{ "$project": { "_id": 0 } }

Playground

or

{ "$unset": "_id" }

Playground

For better performance, you can create an index on the timestamp field in descending order.
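
A minimal sketch of that index, assuming the collection is named collection as in the queries above. The variable name timestampIndex is only for illustration, and the typeof guard just lets the snippet load outside mongosh without error. Whether the index actually speeds up this pipeline depends on the data and the server version, so it is worth checking with explain.

```javascript
// Index specification: single field, descending, as suggested above.
const timestampIndex = { timestamp: -1 };

// "db" exists only inside mongosh / the mongo shell, so guard the call.
if (typeof db !== "undefined") {
  db.collection.createIndex(timestampIndex);
}
```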

Answer 3

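Query

We can use $max on documents, as below, without sorting and without creating an index (documents can be compared too: they are compared field by field, in field order).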

Test code here

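db.collection.aggregate([
  {"$group":
    {"_id": {"env": "$env", "project": "$project"},
     // "timestamp" comes first so that $max compares it before "doc"
     "latestDoc": {"$max": {"timestamp": "$timestamp", "doc": "$$ROOT"}}}},
  // keep only the original document of the winning pair
  {"$set": {"latestDoc": "$latestDoc.doc"}}
])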
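
The idea behind the $max query above can be sketched in plain JavaScript: wrap each document as { timestamp, doc }, keep the wrapper with the largest timestamp, then unwrap it. The helper name maxByTimestamp and the second sample document are assumptions for illustration. The sketch compares only the first field; on equal timestamps MongoDB would go on to compare the doc field, which is not modeled here.

```javascript
// Hypothetical helper: pick the wrapper with the larger leading field (timestamp).
function maxByTimestamp(a, b) {
  return b.timestamp > a.timestamp ? b : a;
}

// One (project, env) group from the question's sample data (_id 2 is assumed).
const devGroup = [
  { _id: 2, project: "project1", env: "dev", timestamp: 1644513211 },
  { _id: 1, project: "project1", env: "dev", timestamp: 1644515845 },
];

// Wrap, reduce with the max comparison, then unwrap (the $set stage in the query).
const latestDoc = devGroup
  .map((d) => ({ timestamp: d.timestamp, doc: d }))
  .reduce(maxByTimestamp).doc;

console.log(latestDoc);
```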

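A fast way to do this using an index is described here, but in that example "_id" has only one field and the accumulator uses the other field of the compound index (here "_id" has 2 fields and the accumulator uses $$ROOT), so the group did not use the index.

I tried all the answers, but all of them were slow, taking 9-10 seconds for 1 million documents. Test it yourself to be sure, and if you find a way to make the group use the index, please send some feedback if you can.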

MongoDB 5.2 and later also provide the $top and $topN accumulators for getting the top document or the top N documents of each group.
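
A sketch of what that could look like with the $top accumulator, assuming a server version that supports it and the same field names as the queries above. The names topPipeline and newest are only for illustration, and this variant was not part of the timings mentioned above.

```javascript
const topPipeline = [
  {
    $group: {
      _id: { project: "$project", env: "$env" },
      // $top sorts within each group and outputs the first element,
      // here the whole document with the highest timestamp.
      newest: { $top: { sortBy: { timestamp: -1 }, output: "$$ROOT" } },
    },
  },
  // Promote the selected document to the top level.
  { $replaceWith: "$newest" },
];

// "db" exists only inside mongosh, so guard the call.
if (typeof db !== "undefined") {
  db.collection.aggregate(topPipeline);
}
```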

mongodb

mongodb-query