What is the best way to query an array of subdocuments in MongoDB?

Question

Suppose I have a collection like this:

{
    "id": "2902-48239-42389-83294",
    "data": {
        "location": [
            {
                "country": "Italy",
                "city": "Rome"
            }
        ],
        "time": [
            {
                "timestamp": "1626298659",
                "data":"2020-12-24 09:42:30"
            }
        ],
        "details": [
            {
                "timestamp": "1626298659",
                "data": {
                    "url": "https://example.com",
                    "name": "John Doe",
                    "email": "john@doe.com"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "https://www.myexample.com",
                    "name": "John Doe",
                    "email": "doe@john.com"    
                }
            },
            {
                "timestamp": "1626298652",
                "data": {
                    "url": "http://example.com/sub/directory",
                    "name": "John Doe",
                    "email": "doe@johnson.com"    
                }
            }
        ]
    }
}

Now focus on the array of subdocuments ("data.details"): I want the output to contain only the relevant matches, e.g.:

db.info.find({"data.details.data.url": "example.com"})

How can I match all entries that contain "example.com" but not "myexample.com"? When I use $regex I get too many results: if I query for "example.com", it also returns "myexample.com".

Even when I do get partial results (using $match), it is very slow. I tried these aggregation stages:

   { $unwind: "$data.details" },

   {
     $match: {
       "data.details.data.url": /.*example.com.*/,
     },
   },
   {
     $project: {
       id: 1,
       "data.details.data.url": 1,
       "data.details.data.email": 1,
     },
   },

I really don't understand the pattern: with $match, Mongo sometimes recognizes prefixes such as "https://" or "https://www", and sometimes it doesn't.

More info: my collection is tens of GB in size, and I created two indexes:

Compound index: "data.details.data.url": 1, "data.details.data.email": 1
Text index: "data.details.data.url": "text", "data.details.data.email": "text"

This did improve query performance, but not enough, and I still have the problem with $match and $regex. Thanks for any help!

Answer 1

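Your mistake is in the regular expression. It matches every URL because the substring example.com occurs in all of them. For example, in https://www.myexample.com the pattern matches the trailing "example.com".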
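One way to see this is to run the pattern from the question against the three sample URLs in plain JavaScript. This is only a sketch outside MongoDB; the `urls` array simply repeats the values from the example document, and the variable names are made up for illustration.

```javascript
// URLs copied from the example document in the question.
const urls = [
  "https://example.com",
  "https://www.myexample.com",
  "http://example.com/sub/directory",
];

// Pattern from the question: nothing restricts what comes before
// "example", and the unescaped dot matches any character.
const loose = /.*example.com.*/;

// Collect every URL the loose pattern accepts.
const looseMatches = urls.filter((u) => loose.test(u));

console.log(looseMatches.length); // 3: every URL matches, including myexample.com
```

The leading and trailing `.*` add nothing here: an unanchored regex already matches anywhere in the string.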

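To avoid this, you have to use a different regular expression, for example one that requires the domain to come right after the scheme or the www. prefix.

For example: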

(http[s]?:\/\/|www\.)YOUR_SEARCH

This checks that what you are searching for comes directly after http://, https:// or www. See https://regex101.com/r/M4OLw1/1
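As a rough check, the prefix pattern can be tried against the sample URLs from the question in plain JavaScript. The `urls` and `domainPattern` names are illustrative only, and `YOUR_SEARCH` is assumed to be `example\.com`.

```javascript
// URLs copied from the example document in the question.
const urls = [
  "https://example.com",
  "https://www.myexample.com",
  "http://example.com/sub/directory",
];

// Prefix pattern with YOUR_SEARCH replaced by the escaped domain.
const domainPattern = /(http[s]?:\/\/|www\.)example\.com/;

// Keep only URLs where the domain directly follows the scheme or "www.".
const matches = urls.filter((u) => domainPattern.test(u));

console.log(matches);
// [ 'https://example.com', 'http://example.com/sub/directory' ]
```

Note that this sketch does not constrain what follows the domain, so a host such as example.community would still match. If that matters, appending something like `([\/:?#]|$)` to the pattern is one possible tightening.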

Here is the full query:

[
  {
    '$unwind': {
      'path': '$data.details'
    }
  }, {
    '$match': {
      'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
    }
  }
]
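Since the question mentions that the query is slow, one possible variation (not part of the original answer) is to put the same `$match` in front of `$unwind` as well, so that documents with no matching array element are discarded before they are unwound. A leading `$match` may also let MongoDB use the existing index on "data.details.data.url", although a regex without a `^` anchor generally cannot be turned into a tight index range. The sketch below only builds the pipeline; `urlPattern` and `pipeline` are illustrative names, and the `$project` fields are taken from the question.

```javascript
// Same pattern as in the query above.
const urlPattern = /(http[s]?:\/\/|www\.)example\.com/;

const pipeline = [
  // Drop documents without any matching array element before unwinding.
  { $match: { "data.details.data.url": urlPattern } },
  // Emit one document per element of data.details.
  { $unwind: "$data.details" },
  // Keep only the unwound elements that actually match.
  { $match: { "data.details.data.url": urlPattern } },
  // Fields taken from the $project stage in the question.
  {
    $project: {
      id: 1,
      "data.details.data.url": 1,
      "data.details.data.email": 1,
    },
  },
];

// In mongosh, where `db` is defined, this would be run as:
// db.info.aggregate(pipeline)
```

Whether this helps on a collection of tens of GB depends on how selective the pattern is, so it is worth checking the plan with `explain` before relying on it.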

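Note: you have to escape special characters in the regular expression. An unescaped dot matches any character, and an unescaped slash would close the regex literal and cause an error.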
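If the search term comes from a variable, the escaping can be done programmatically. The following is a minimal sketch: `escapeRegex` and `buildUrlRegex` are hypothetical helper names, not MongoDB or driver functions, and the prefix is the one suggested above.

```javascript
// Hypothetical helper: put a backslash in front of every regex
// metacharacter (and the slash) so the term is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

// Hypothetical helper: build the prefix pattern for a given domain.
// With the RegExp constructor a slash cannot terminate the pattern,
// but backslashes must be doubled inside the string literal.
function buildUrlRegex(domain) {
  return new RegExp("(http[s]?:\\/\\/|www\\.)" + escapeRegex(domain));
}

const pattern = buildUrlRegex("example.com");

console.log(pattern.test("https://example.com"));       // true
console.log(pattern.test("https://www.myexample.com")); // false
console.log(pattern.test("https://exampleXcom"));       // false, the dot is literal
```

The resulting `RegExp` object can then be used as the value in the `$match` stage in place of the regex literal.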

