Optimized way to update large embedded array in mongo document with an indexed key

Question

I have a user collection with 52 million records. Each user document has a list of comments, and comment_id has a unique index on it.

{
  _id:123, 
  user_name:"xyz",
  comments:[
    {
      comment_id:123,
      text:"sd"
    },
    {
      comment_id:234,
      text:"sdf"
    }
    ......,
    (63000 elements)
  ]
}

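The totalIndexSize of the comment_id index is 104GB. About 100 of the 52M documents have 63,000 elements in their comments array.

My aim is to delete the old comments and reduce the size of the comments array by over 80%. Earlier, I tried to update a document using this query: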

db.user.updateOne({_id: _id}, {$set: {"comments": newCommentsArray}}, {upsert: true})

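Here newCommentsArray has around 400 elements. The operation took around 130 seconds to execute.

My questions are:

1) What could be the reason the update query above took 130 seconds? Is it because of the huge size of the unique index on the comment_id field? (I believe that replacing the comments array with the new array makes MongoDB rearrange the index entries for all 63,000 deleted elements and insert the new elements into the index.)

2) I have another approach using $pull, which basically pulls 100 comments from the comments array, waits 5 seconds, and then runs the next batch of 100 comments. What do you think of this solution?

3) If the above solution is no good, can you suggest a good way to reduce the comments array by over 80%?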

Answer 1

You have a huge comment_id index because it is a Multikey Index:

MongoDB creates an index key for each element in the array.

In your case, the _id index is ~1GB and each document holds on average ~100 comment_id keys, which gives ~104GB.

1) What could be the reason update query above took 130sec

MongoDB stores indexes in a B-tree structure. B-tree properties:

Algorithm   Average     Worst case
Space       O(n)        O(n)
Search      O(log n)    O(log n)
Insert      O(log n)    O(log n)
Delete      O(log n)    O(log n)

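This means that, to insert an index entry for a single comment, MongoDB needs O(log n) steps in the worst case (~25 iterations per item, taking log2 of the 52M documents). Replacing a 63,000-element array repeats that work for every index key that is removed or added, which is likely where most of the 130 seconds went.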
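The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope model: it assumes a binary tree (real B-tree nodes hold many keys, so actual page visits are far fewer) and it assumes, as an upper bound, that every old key is removed and every new key is inserted. The function name `estimateIndexWork` is made up for this illustration.

```javascript
// Rough upper-bound estimate of index work for replacing one comments array.
// Assumption: cost per key ~ log2(n), with n = number of documents (as in the answer).
function estimateIndexWork(totalDocs, oldArraySize, newArraySize) {
  const stepsPerKey = Math.floor(Math.log2(totalDocs)); // ~25 for 52M
  const keysTouched = oldArraySize + newArraySize;      // deletes + inserts
  return {
    stepsPerKey: stepsPerKey,
    keysTouched: keysTouched,
    estimatedSteps: stepsPerKey * keysTouched,
  };
}

const estimate = estimateIndexWork(52e6, 63000, 400);
console.log(estimate);
```

Even under this crude model, a single update touches tens of thousands of index keys, which is why the cost scales with the size of the array rather than with the 400 elements being kept.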

2) I had an other approach use $pull which is basically pulling 100 comments from the comments array and waiting for 5 sec and then execute for next batch of 100 comments.

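Since the comments are indexed, it will be fast (remember the O(log n) property). There is no need to wait 5 seconds: since MongoDB 3.0 it uses multi-granularity locking, which means only the affected documents are locked.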
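A minimal sketch of the batched $pull approach, assuming you already have the list of comment_id values to delete. The helpers `chunk` and `buildPullOps` are hypothetical names introduced for this example; only `$pull`, `$in`, and `updateOne` are MongoDB's own.

```javascript
// Hypothetical helper: split the ids to delete into fixed-size batches.
function chunk(ids, size) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Build one { filter, update } pair per batch. Each update removes every
// element of the comments array whose comment_id is in that batch.
function buildPullOps(userId, idsToRemove, batchSize) {
  return chunk(idsToRemove, batchSize).map(function (batch) {
    return {
      filter: { _id: userId },
      update: { $pull: { comments: { comment_id: { $in: batch } } } },
    };
  });
}

// In the mongo shell, the batches would then be applied one after another:
// buildPullOps(123, oldIds, 100).forEach(function (op) {
//   db.user.updateOne(op.filter, op.update);
// });
```

Building the operations separately from running them makes it easy to inspect a batch before it touches the collection, and the batch size of 100 from the question can be tuned without changing the rest of the code.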

In addition, you can use the $push operator to trim the array like this:

db.user.update({ },{$push: {comments: {$each: [ ], $slice: -400}}})

This pushes [ ] (0 items in this case) and then slices the array so that only the last 400 items are kept.
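To make the effect of `$each` combined with `$slice` concrete, here is a plain-JavaScript model of the documented semantics. It does not talk to MongoDB; `pushWithSlice` is a hypothetical function written only for this illustration.

```javascript
// Plain-JavaScript model of {$push: {comments: {$each: items, $slice: n}}}.
// Negative n keeps the last |n| elements, positive n keeps the first n,
// and 0 empties the array.
function pushWithSlice(array, items, slice) {
  const combined = array.concat(items);
  if (slice === 0) return [];
  return slice < 0 ? combined.slice(slice) : combined.slice(0, slice);
}

// Simulate a 63,000-element comments array and trim it to the last 400.
const comments = [];
for (let i = 1; i <= 63000; i++) {
  comments.push({ comment_id: i, text: "comment " + i });
}
const trimmed = pushWithSlice(comments, [], -400);
console.log(trimmed.length, trimmed[0].comment_id);
```

Note that the slice works by array position. It keeps the newest comments only if comments were always appended in chronological order, which is an assumption about your data.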

3) If the above solution is no good can you suggest a good way to reduce the comments array by over 80%.

Even if you shrink the comments array, WiredTiger will not release the unneeded disk space to the operating system.

Run dropIndex

db.user.dropIndex({ "comment_id" : 1 })

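Warning: Starting in v4.2, this command obtains an exclusive lock on the specified collection for the duration of the operation. All subsequent operations on the collection must wait until db.collection.dropIndex() releases the lock.

Prior to v4.2, this command would obtain a write lock on the affected database and block other operations until it completed.

Or run compact

Warning: compact blocks operations for the database it is currently operating on. Only use compact during scheduled maintenance periods. Also, you must authenticate as a user with the compact privilege action on the target collection.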

mongodb

embedded-documents