Querying and sorting a large amount of data in mongo, by fields which might not exist

Question

I'm relatively new to mongo and I have a collection that looks like this:

[
    {
        "stored": {
            "box": [
                {
                    "parcelId": "uwb1",
                    "status": "ACTIVE"
                }
            ]
        },
        "checked": {
            "box": [
                {
                    "parcelId": "uwb1",
                    "status": "ACTIVE"
                }
            ]
        }
    },
    {
        "stored": {
            "box": [
                {
                    "parcelId": "aqrf123",
                    "status": "PENDING"
                }
            ]
        },
        "checked": {
            "box": [
                {
                    "parcelId": "aqrf123",
                    "status": "PENDING"
                }
            ]
        }
    },
    {
        "checked": {
            "box": [
                {
                    "parcelId": "zuz873",
                    "status": "ACTIVE"
                }
            ]
        }
    }
]

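Some observations regarding the data:

A document will always have a checked field but might not have a stored field.
The checked and stored fields have the same schema.
Both will always have a box field, and we can assume the box array always holds exactly 1 element (not more, not less).
The number of documents in this collection is relatively high (~100 million).

What I'm trying to achieve is to have the documents sorted by the status field, which works like an enum with 3 possible values: ACTIVE, PENDING and REJECTED.

If the stored field exists for a document, I'll take the status from there and disregard the checked field.
Otherwise I'll have to take it from the checked field, which is guaranteed to be there as mentioned above.
An important requirement is that the entire document is returned to the consumer/client, so I can't use projection to reduce the amount of data in the documents (which would probably make the whole operation faster).

I tried to achieve this with an aggregation that looks like this: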

db.getCollection('entries')
    .aggregate([{
            $addFields: {
                sortStatus: {
                    $ifNull: [{
                        $let: {
                            vars: {
                                box: {
                                    $arrayElemAt: [
                                        "$stored.box", 0
                                    ]
                                }
                            },
                            in: "$$box.status"
                        }
                    }, {
                        $let: {
                            vars: {
                                box: {
                                    $arrayElemAt: [
                                        "$checked.box", 0
                                    ]
                                }
                            },
                            in: "$$box.status"
                        }
                    }]
                }
            }
        },
        {
            $sort: {
                sortStatus: 1
            }
        }
    ], {
        allowDiskUse: true
    })

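This seems to do the job, but it feels rather slow. The allowDiskUse option also makes me a bit uncomfortable. If I leave it out, I get the error message: Sort exceeded memory limit of x bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.

So my questions are:

(1) Are there faster alternatives, be it with or without aggregation?
(2) Are there any risks in using the allowDiskUse option when doing an aggregation?
(3) Would it be better (or is it the "mongo" way) to alter a bit the document structure and add that sortable field to the root of the document, add an index for it and just use .sort({"statusField": 1})? This would be the last resort option, as I'd have to migrate the existing data.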

Answer 1

The value of your sortStatus field can be obtained more simply, like this:

{ $addFields: { sortStatus: { $ifNull: [ "$stored.box.status", "$checked.box.status" ] } } },

Does this make the query faster? No, but the code is simpler.
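One detail about the shorter form: because box is an array, "$stored.box.status" resolves to an array such as ["ACTIVE"] rather than a plain string. As far as I know, an ascending $sort compares an array by its smallest element, so the ordering should come out the same with one element per box. If you would rather have a scalar, a minimal sketch follows. The variable name `pipeline` and the `typeof db` guard are my own additions, not part of the question:

```javascript
// Sketch only: `pipeline` is an illustrative name, not from the question.
const pipeline = [
    {
        $addFields: {
            // "$stored.box.status" yields an array like ["ACTIVE"];
            // $ifNull falls back to checked when stored is missing,
            // and $arrayElemAt picks element 0 to get the plain string.
            sortStatus: {
                $arrayElemAt: [
                    { $ifNull: ["$stored.box.status", "$checked.box.status"] },
                    0
                ]
            }
        }
    },
    { $sort: { sortStatus: 1 } }
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
    db.getCollection('entries').aggregate(pipeline, { allowDiskUse: true });
}
```

This relies on the guarantee from the question that box always holds exactly one element. It is still a full sort of a computed field, so it is not expected to be any faster.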

(1) Are there faster alternatives, be it with or without aggregation?

I don't know of any at the moment.

(2) Are there any risks in using the allowDiskUse option when doing an aggregation?

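Using the allowDiskUse:true option means that when the sort operation's memory (RAM) use exceeds its limit, the sort uses disk as an additional resource. Disk I/O is very slow compared to memory, so the "risk" is a much slower sort operation. The option becomes required when the sort needs more memory than the 100 MB limit (see the documentation on Sort and Memory Restrictions in Aggregation).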

(3) Would it be better (or is it the "mongo" way) to alter a bit the document structure and add that sortable field to the root of the document, add an index for it and just use .sort({"statusField": 1})? This would be the last resort option, as I'd have to migrate the existing data.

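Creating a new status field and an index on that field brings new considerations:

The new "status" field has to be populated when a document is written (and possibly during updates as well).
The index on this new field is additional overhead during writes too. Note that the index size grows with the number of documents.

Both of these affect the application's write performance.

However, the query becomes a simple sort. Because the collection holds a large number of documents, the index used for the sort may or may not fit in memory at run time. You cannot be sure how much this option helps without actually trying it.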
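If you want to try this, a rough sketch of the migration and the resulting query could look like the following. It assumes MongoDB 4.2 or newer (pipeline-style updates) and uses the statusField name from the question. The variable names and the `typeof db` guard are mine, and a single updateMany over ~100 million documents would probably need to be run in batches in practice:

```javascript
// Sketch only: computes the root-level field from stored, else checked.
const setStatusField = [
    {
        $set: {
            statusField: {
                // box holds exactly one element, so take element 0.
                $arrayElemAt: [
                    { $ifNull: ["$stored.box.status", "$checked.box.status"] },
                    0
                ]
            }
        }
    }
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
    const entries = db.getCollection('entries');
    // One-off backfill of existing documents (needs MongoDB 4.2+).
    entries.updateMany({}, setStatusField);
    // Index the new field so the sort can use it.
    entries.createIndex({ statusField: 1 });
    // Returns a cursor over whole documents, ordered by statusField.
    entries.find().sort({ statusField: 1 });
}
```

The application would also have to keep statusField in sync whenever stored or checked changes, which is the write overhead mentioned above.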

Here is some documentation on Indexing Strategies.

projection

mongodb

mongodb-query