MongoDB

Question

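I have a collection in MongoDB (4.4, though the version doesn't matter to me) in which one of the document fields is an array of URLs. There will be many documents, each with multiple URLs in the array, and some of those URLs will already exist in other documents. For each URL, I want to select the document in which it appears earliest (the goal is to mark that document as 'origin').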

MongoPlayground link to a sample collection - https://mongoplayground.net/p/ZAgCqr517-8

[
  {
    "title": "story1_first",
    "isoDate": "2022-01-01T00:00:00.000Z",
    "links": [
      "www.first.com/article1",
      "www.anotherdomain.com"
    ]
  },
  {
    "title": "story1_mention",
    "isoDate": "2022-01-10T00:00:00.000Z",
    "links": [
      "www.first.com/article1",
      "www.somesite.com"
    ]
  },
  {
    "title": "story2_first",
    "isoDate": "2022-01-20T00:00:00.000Z",
    "links": [
      "www.newstory.com/article2",
      "www.anothercompany.com"
    ]
  },
  {
    "title": "story2_mention",
    "isoDate": "2022-01-20T00:00:00.000Z",
    "links": [
      "www.newstory.com/article2",
      "www.anothercompany.com"
    ]
  }
]

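In this example, I'd like a query/aggregation that returns the two documents with "first" in the title, because they are the documents that share a common URL in the links array and have the earliest date. This is similar to how search engines rank a site by the number of other sites that link to it.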

Answer 1

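You can do the following in an aggregation pipeline:

1. $unwind links so the documents are at the link level
2. $sort on isoDate so the earliest document comes first
3. $group by links to get the count within each group and the identifier of the first document. In your example, title is treated as the unique identifier.
4. $match on count > 1 to keep only the links shared by more than one title
5. $group to deduplicate the unique identifiers found in step 3
6. $lookup back to the original document and promote it with $replaceRoot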

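db.collection.aggregate([
  // One output document per (document, link) pair.
  {
    $unwind: "$links"
  },
  // Earliest documents first, so $first below picks the earliest per link.
  {
    $sort: {
      isoDate: 1
    }
  },
  // Per link: title of the earliest document and number of occurrences.
  {
    $group: {
      _id: "$links",
      first: {
        $first: "$title"
      },
      count: {
        $sum: 1
      }
    }
  },
  // Keep only links that appear in more than one document.
  {
    $match: {
      count: {
        $gt: 1
      }
    }
  },
  // Deduplicate: a title may be the earliest for several links.
  {
    $group: {
      _id: "$first"
    }
  },
  // Fetch the original document by title.
  {
    $lookup: {
      from: "collection",
      localField: "_id",
      foreignField: "title",
      as: "rawDocument"
    }
  },
  {
    $unwind: "$rawDocument"
  },
  // Return the original document as the result.
  {
    $replaceRoot: {
      newRoot: "$rawDocument"
    }
  }
])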
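
Since the stated goal is to mark these documents as 'origin', the same stages could feed a write-back instead of a read. The following is a minimal sketch under some assumptions: the flag is a boolean field named `origin` (the question does not name the field), documents are identified by `_id` rather than `title`, and the server is MongoDB 4.4 or later so that `$merge` may write into the collection being aggregated. `buildOriginPipeline` is a helper name introduced here for illustration only.

```javascript
// Sketch only: builds a pipeline that flags the earliest document per shared link.
// The "origin" field name and this helper are assumptions, not part of the original answer.
function buildOriginPipeline(collectionName) {
  return [
    { $unwind: "$links" },
    { $sort: { isoDate: 1 } },
    // $first keeps the _id of the earliest document for each link.
    { $group: { _id: "$links", firstId: { $first: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } },
    // Deduplicate: one output document per earliest document.
    { $group: { _id: "$firstId" } },
    { $set: { origin: true } },
    // Merge the flag into the existing documents; never insert new ones.
    {
      $merge: {
        into: collectionName,
        on: "_id",
        whenMatched: "merge",
        whenNotMatched: "discard"
      }
    }
  ];
}

// Usage in mongosh (not executed here):
// db.collection.aggregate(buildOriginPipeline("collection"))
```

Grouping on `_id` instead of `title` avoids relying on titles being unique, and `whenNotMatched: "discard"` keeps the stage from creating documents that were not already in the collection.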

Here is the Mongo playground for your reference.
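
To check the expected result without a database, the pipeline stages can be mirrored in plain JavaScript. This is a rough sketch and not how MongoDB executes the pipeline; `findOrigins` is a hypothetical helper, and it relies on `Array.prototype.sort` being stable (guaranteed since ES2019), so documents with an equal `isoDate` keep their input order.

```javascript
// In-memory mirror of the pipeline stages; illustrative only.
function findOrigins(docs) {
  // $unwind: one row per (document, link) pair.
  const rows = docs.flatMap((doc) => doc.links.map((link) => ({ doc, link })));

  // $sort: ISO-8601 UTC strings in one format compare chronologically as text.
  rows.sort((a, b) =>
    a.doc.isoDate < b.doc.isoDate ? -1 : a.doc.isoDate > b.doc.isoDate ? 1 : 0
  );

  // $group by link: remember the earliest document and count occurrences.
  const groups = new Map();
  for (const { doc, link } of rows) {
    const group = groups.get(link);
    if (group) {
      group.count += 1;
    } else {
      groups.set(link, { first: doc, count: 1 });
    }
  }

  // $match count > 1, then deduplicate the earliest documents.
  const origins = new Set();
  for (const group of groups.values()) {
    if (group.count > 1) origins.add(group.first);
  }
  return [...origins];
}

// Sample data copied from the question.
const sample = [
  { title: "story1_first", isoDate: "2022-01-01T00:00:00.000Z",
    links: ["www.first.com/article1", "www.anotherdomain.com"] },
  { title: "story1_mention", isoDate: "2022-01-10T00:00:00.000Z",
    links: ["www.first.com/article1", "www.somesite.com"] },
  { title: "story2_first", isoDate: "2022-01-20T00:00:00.000Z",
    links: ["www.newstory.com/article2", "www.anothercompany.com"] },
  { title: "story2_mention", isoDate: "2022-01-20T00:00:00.000Z",
    links: ["www.newstory.com/article2", "www.anothercompany.com"] }
];

// Titles returned for the sample: story1_first, story2_first
console.log(findOrigins(sample).map((d) => d.title));
```

Note that the two story2 documents in the sample share the same `isoDate`, so which one counts as earliest depends on how ties are broken; in this sketch the input order decides.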

MongoDB - Find documents with earliest occurrence of duplicate value

arrays

mongodb-query

aggregation-framework