Mongodb 服务器端与客户端处理

Question

我有一个 shell 脚本，它在一个集合上创建一个游标，然后用另一个集合中的数据更新每个文档。
当我在本地数据库上运行它在大约 15 秒内完成，但在托管数据库上，它运行超过 45 分钟。

db.col1.find().forEach(function(doc) {
    db.col2.findAndModify(
        {
            query: {attr1: doc.attr1},
            update: { $push: {attr2: doc.attr2},
            upsert: true
        });
    });

因此，为了处理此脚本，客户端和服务器之间显然存在网络开销。有没有办法保持所有服务器端的处理？我查看了服务器端 javascript，但根据我的阅读 here，这不是推荐的做法。

Answer 1

在本地，您几乎没有网络开销。没有干扰，没有路由器，没有交换机，没有带宽限制。另外，在大多数情况下，您的大容量存储，无论是 SSD 还是 HDD，或多或少都会闲置（除非您倾向于在开发时玩游戏。）因此，当需要大量 IO 功能的操作开始时，它是可用的。

当您从本地 shell 针对服务器运行脚本时，会发生以下情况。

db.col1.find().forEach 整个集合将从未知介质（很可能是 HDD，其可用 IO 可以在 many 实例之间共享）读取。然后文件将被传送到您本地shell。与到本地主机的连接相比，每个文档检索都经过几十个跃点，每个跃点都增加了少量的延迟。大概有相当多的文件，这加起来。不要忘记 complete 文档是通过网络发送的，因为您没有使用投影将返回的字段限制为 attr1 和 attr2。外部带宽当然比 localhost.
db.col2.findAndModify 对于每个文档，进行一次查询。同样，共享 IO 可能会降低性能。
{ query: {attr1: doc.attr1}, update: { $push: {attr2: doc.attr2}, upsert: true} 顺便问一下，您确定 attr1 已编入索引吗？即使是这样，也不确定索引当前是否在 RAM 中。我们正在谈论共享实例，对吗？很可能你的写操作必须等到它们甚至被 mongod 处理，根据默认的写关注，数据必须 成功应用 到 in内存数据在确认之前设置，但是如果将大量操作发送到共享实例，则很可能您的操作是第一个 bazillion 并且在队列中。并且第二次增加了网络延迟，因为传输到本地 shell 的值需要发回。

你能做什么

首先，请确保您

使用 projection:
将返回值限制为您需要的值
```
db.col1.find({},{ "_id":0, "attr1":1, "attr2":1 })
```
确保你有 attr1 索引
```
db.col2.ensureIndex( { "attr1":1 } )
```

使用bulk operations。它们的执行速度要快得多，但代价是出现问题时反馈会减少。

// We can use unordered here, because the operations
// each apply to only a single document
var bulk = db.col2.initializeUnorderedBulkOp()

// A counter we will use for intermediate commits
// We do the intermediate commits in order to keep RAM usage low
var counter = 0

// We limit the result to the values we need
db.col1.find({}.{"_id":0, "attr1":1, "attr2":1 }).forEach(
  function(doc){

    // Find the matching document
    // Update exactly that
    // and if it does not exist, create it
    bulk
      .find({"attr1": doc.attr1})
      .updateOne({ $push: {"attr2": doc.attr2})
      .upsert()

    counter++

    // We have queued 1k operations and can commit them
    // MongoDB would split the bulk ops in batches of 1k operations anyway
    if( counter%1000 == 0 ){
      bulk.execute()
      print("Operations committed: "+counter)
      // Initialize a new batch of operations
      bulk = db.col2.initializeUnorderedBulkOp()
    }

  }
)
// Execute the remaining operations not committed yet.
bulk.execute()
print("Operations committed: "+counter)

Mongodb 服务器端与客户端处理

Mongodb server side vs client side processing

mongodb

database-performance

你能做什么