How to remove duplicate embedded documents

Question

I have a users collection in which each document contains a list of many subdocuments. The schema looks like this:

   {
    _id: ObjectId(),
    name: "aaa",
    age: 20,
    transactions:[
        {
         trans_id: 1,
         product: "mobile",
         price: 30
        },
        {
         trans_id: 2,
         product: "tv",
         price: 10
        },
        ...]
    ...
   }

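So I have one doubt. The trans_id in the transactions list is unique across all products, but I may have copied the same transaction again with the same trans_id (due to bad ETL programming). Now I want to remove those duplicate subdocuments. I have indexed trans_id, but not as a unique index. I read about the dropDups option. Will it remove only the particular duplicate that exists in the database, or will it remove the whole document (which I definitely don't want)? If it won't do what I need, how can I do this?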

PS: I am using MongoDB version 2.6.6.

Answer 1

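The nearest case we can see here is that you need a way of defining the "distinct" items within the array, where some items are in fact an "exact copy" of other items in the same array.

The best case is to use $addToSet along with the $each modifier within a looping operation for the collection. Ideally you use the Bulk Operations API to take advantage of the reduced traffic when doing so: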

// Ordered, so the "$set" that empties the array always runs before the "$addToSet"
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

// Read the docs
db.collection.find({}).forEach(function(doc) {
    // Blank the array
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": [] } });
    // Resend as a "set", which keeps only one copy of identical sub-documents
    bulk.find({ "_id": doc._id })
        .updateOne({
            "$addToSet": {
                "transactions": { "$each": doc.transactions }
            }
        });
    count++;

    // Execute once every 500 documents ( actually 1000 statements, two per document )
    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});

// If a remainder then execute the remaining stack
if ( count % 500 != 0 )
    bulk.execute();

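So as long as the "duplicate" content is "entirely the same", this approach will work. If the only thing that is actually "duplicated" is the "trans_id" field, then you need an entirely different approach, since none of the "whole documents" are "duplicated", and this means you need more logic in place to do that.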
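
For the second case, where only "trans_id" repeats, one possible shape for that extra logic is sketched below. This is an illustration, not something from the question: the helper name dedupeByTransId is hypothetical, and it assumes that the first sub-document seen for each trans_id is the one to keep and that trans_id values can be compared as strings (so 1 and "1" would be treated as the same id). The write-back loop only runs where a mongo shell "db" object exists, and it reuses the placeholder name db.collection from the code above.

```javascript
// Hypothetical helper: keep the first sub-document seen for each trans_id.
function dedupeByTransId(transactions) {
    var seen = {};     // trans_id values already kept, stored as string keys
    var result = [];
    (transactions || []).forEach(function (t) {
        var key = String(t.trans_id);
        if (!Object.prototype.hasOwnProperty.call(seen, key)) {
            seen[key] = true;
            result.push(t);
        }
    });
    return result;
}

// Only in the mongo shell: rewrite the array for documents that had repeats.
if (typeof db !== "undefined") {
    db.collection.find({}).forEach(function (doc) {
        var cleaned = dedupeByTransId(doc.transactions);
        if (cleaned.length !== (doc.transactions || []).length) {
            db.collection.update(
                { "_id": doc._id },
                { "$set": { "transactions": cleaned } }
            );
        }
    });
}
```

Because the array is read on the client and then overwritten with "$set", a transaction pushed by another writer between the read and the update would be lost, so this is probably best run while the ETL job is stopped.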
