How does the order of compound indexes matter in MongoDB performance-wise?

Question

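We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?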

Imagine we have a collection of all humans on earth, with an index on sex (99.9% of the time "male" or "female", but a string nonetheless, not a binary value) and an index on name.

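If we want to be able to select all people of a certain sex with a certain name, e.g. all "male"s named "John", is it better to have a compound index with sex first or with name first? Why (not)?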

Answer 1

Redsandro,

You must consider index cardinality and selectivity.

1. Index Cardinality

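Index cardinality refers to how many possible values a field has. The sex field has only two possible values, so its cardinality is very low. Other fields such as names, usernames, phone numbers and emails have a more unique value for every document in the collection, which is considered high cardinality.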

Greater Cardinality

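The greater the cardinality of a field, the more helpful an index on it will be, because the index narrows the search space down to a much smaller set.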

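Suppose you have an index on sex and you are looking for men named John. If you indexed by sex first, you would only narrow the result space down by approximately 50%. Conversely, if you indexed by name, you would immediately narrow the result set down to the minute fraction of users named John, and then refer to those documents to check the sex.

Rule of Thumb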

Try to create indexes on high-cardinality keys, or put high-cardinality keys first in the compound index. You can read more about it in the section on compound indexes in the book:

MongoDB The Definitive Guide

2. Selectivity

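You also want to use indexes selectively, and write queries that limit the number of possible documents through the indexed field. To keep it simple, consider the following collection. If your index is {name:1} and you run the query { name: "John", sex: "male"}, you will have to scan 1 document, because you allowed MongoDB to be selective.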

{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}

Consider the same collection again. If your index is {sex:1} and you run the query {sex: "male", name: "John"}, you will have to scan 4 documents.

{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}

Imagine the possible differences on a larger data set.
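The two scan counts above can be reproduced with a small sketch. This is plain JavaScript over an in-memory array, not MongoDB itself; `buildIndex` and `find` are hypothetical helpers that only model a single-field index as a map from value to documents.

```javascript
// Toy model (assumption): a single-field index maps each value to the
// documents that hold it, so an equality lookup examines exactly those docs.
const people = [
  { name: "John", sex: "male" },
  { name: "Rich", sex: "male" },
  { name: "Mose", sex: "male" },
  { name: "Sami", sex: "male" },
  { name: "Cari", sex: "female" },
  { name: "Mary", sex: "female" },
];

// Build the toy index: value -> array of matching documents.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    if (!index.has(doc[field])) index.set(doc[field], []);
    index.get(doc[field]).push(doc);
  }
  return index;
}

// Fetch candidates through the indexed field, then check the remaining
// query fields by examining each candidate document.
function find(docs, indexedField, query) {
  const candidates = buildIndex(docs, indexedField).get(query[indexedField]) || [];
  const results = candidates.filter((doc) =>
    Object.keys(query).every((key) => doc[key] === query[key])
  );
  return { docsExamined: candidates.length, nReturned: results.length };
}

const query = { name: "John", sex: "male" };
const byName = find(people, "name", query);
const bySex = find(people, "sex", query);
console.log(byName); // { docsExamined: 1, nReturned: 1 }
console.log(bySex);  // { docsExamined: 4, nReturned: 1 }
```

Note that this only models single-field indexes; it says nothing yet about how the field order inside one compound index behaves.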

A Little Explanation of Compound Indexes

It is easy to make wrong assumptions about compound indexes. According to the MongoDB docs on Compound Indexes:

MongoDB supports compound indexes, where a single index structure holds references to multiple fields within a collection’s documents. The following diagram illustrates an example of a compound index on two fields:

When you create a compound index, one index holds multiple fields. So if we index a collection by {"sex" : 1, "name" : 1}, the index will look roughly like this:

["male","Rick"] -> 0x0c965148
["male","John"] -> 0x0c965149
["male","Sean"] -> 0x0cdf7859
["male","Bro"] -> 0x0cdf7859
...
["female","Kate"] -> 0x0c965134
["female","Katy"] -> 0x0c965126
["female","Naji"] -> 0x0c965183
["female","Joan"] -> 0x0c965191
["female","Sara"] -> 0x0c965103

If we index a collection by {"name" : 1, "sex" : 1}, the index will look roughly like this:

["John","male"] -> 0x0c965148
["John","female"] -> 0x0c965149
["John","male"] -> 0x0cdf7859
["Rick","male"] -> 0x0cdf7859
...
["Kate","female"] -> 0x0c965134
["Katy","female"] -> 0x0c965126
["Naji","female"] -> 0x0c965183
["Joan","female"] -> 0x0c965191
["Sara","female"] -> 0x0c965103

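Having {name:1} as the prefix will serve you much better when using compound indexes. There is much more that can be read on this topic; I hope this offers some clarity.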
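To make the layout of the two listings concrete, here is a minimal sketch of how the entries of a compound index are ordered. It assumes a toy model in which the index is just an array of key tuples sorted field by field; `compareKeys` and `buildCompoundIndex` are hypothetical helpers, and the four documents are invented for illustration.

```javascript
// Toy model (assumption): a compound index is a list of { key, id } entries
// sorted by the first indexed field, then by the second.
const docs = [
  { _id: 1, name: "Rick", sex: "male" },
  { _id: 2, name: "John", sex: "male" },
  { _id: 3, name: "Kate", sex: "female" },
  { _id: 4, name: "John", sex: "female" },
];

// Compare two key tuples element by element (ascending order).
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Extract the key tuple for each document and sort the entries.
function buildCompoundIndex(documents, fields) {
  return documents
    .map((doc) => ({ key: fields.map((f) => doc[f]), id: doc._id }))
    .sort((x, y) => compareKeys(x.key, y.key));
}

const sexFirst = buildCompoundIndex(docs, ["sex", "name"]).map((e) => e.key.join(","));
const nameFirst = buildCompoundIndex(docs, ["name", "sex"]).map((e) => e.key.join(","));
console.log(sexFirst);  // [ 'female,John', 'female,Kate', 'male,John', 'male,Rick' ]
console.log(nameFirst); // [ 'John,female', 'John,male', 'Kate,female', 'Rick,male' ]
```

Both orders hold the same four combinations; only their position in the sorted list differs.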

Answer 2

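I ran an experiment on this myself and found that there seems to be no performance penalty for putting the poorly differentiated index key first. (I used MongoDB 3.4 with WiredTiger, which may behave differently from MMAP.) I inserted 250 million documents into a new collection called items. Each document looked like this: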

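{
    field1:"bob",
    field2:i + "",
    field3:i + ""
}

"field1" was always equal to "bob". "field2" was equal to i, so it was completely unique. First, I ran a search on field2, and it took over a minute to scan the 250 million documents. Then I created an index like this: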

`db.items.createIndex({field1:1,field2:1})`

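Since field1 is "bob" on every single document, one would expect the index to have to search through many entries before finding the desired document. However, that is not the result I got.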

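After the index had finished building, I ran another search on the collection. This time I got the results listed below. You will see that "totalKeysExamined" is 1 each time. So perhaps with WiredTiger they have figured out how to do this better. I have read that WiredTiger actually compresses index prefixes, so that may have something to do with it.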

db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")

{
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 4,
    "totalKeysExamined" : 1,
    "totalDocsExamined" : 1,
    "executionStages" : {
        "stage" : "FETCH",
        "nReturned" : 1,
        "executionTimeMillisEstimate" : 0,
        "works" : 2,
        "advanced" : 1,
        ...
        "docsExamined" : 1,
        "inputStage" : {
            "stage" : "IXSCAN",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 0,
            ...
            "indexName" : "field1_1_field2_1",
            "isMultiKey" : false,
            ...
            "indexBounds" : {
                "field1" : [
                    "[\"bob\", \"bob\"]"
                ],
                "field2" : [
                    "[\"250888000\", \"250888000\"]"
                ]
            },
            "keysExamined" : 1,
            "seeks" : 1
        }
    }
}

Then I created an index on field3 (which has the same value as field2) and searched:

db.items.find({field3:"250888000"});

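It took the same 4 ms as with the compound index. I repeated this several times with different values for field2 and field3 and got insignificant differences each time. This suggests that with WiredTiger there is no performance penalty for poor differentiation on the first field of an index.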

Answer 3

Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past however it is erroneous due to the nature of B-Tree indexes and how in leaf pages, a B-Tree will store combinations of all field’s values. As such, there is exactly the same number of combinations regardless of key order.

https://www.alexbevi.com/blog/2020/05/16/optimizing-mongodb-compound-indexes-the-equality-sort-range-esr-rule/

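This blog article disagrees with the accepted answer. The benchmark in the other answer also shows that it does not matter. The author of that article is a "Senior Technical Services Engineer at MongoDB", which sounds like a credible person on this topic to me, so I guess the order really does not affect performance on equality fields after all. I will follow the ESR rule instead.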
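The quoted argument can be checked with a small sketch. It assumes a toy model, not MongoDB's actual B-tree: the index is a sorted array of key tuples, an equality query seeks to the first matching key and scans while keys match, and only matching keys are counted. The helpers and the generated data set are hypothetical.

```javascript
// Toy model (assumption): sorted array of key tuples, seek + scan on equality.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Binary search for the first position whose key is >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareKeys(keys[mid], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Count the index keys that match an equality query on every indexed field.
function keysExamined(docs, fields, query) {
  const keys = docs.map((d) => fields.map((f) => d[f])).sort(compareKeys);
  const target = fields.map((f) => query[f]);
  let i = lowerBound(keys, target);
  let examined = 0;
  while (i < keys.length && compareKeys(keys[i], target) === 0) {
    examined++;
    i++;
  }
  return examined;
}

// 1000 invented people: 100 distinct names, 2 distinct sexes.
const people = [];
for (let i = 0; i < 1000; i++) {
  people.push({ name: "name" + (i % 100), sex: i % 2 === 0 ? "male" : "female" });
}

const query = { sex: "male", name: "name4" };
const sexFirst = keysExamined(people, ["sex", "name"], query);
const nameFirst = keysExamined(people, ["name", "sex"], query);
console.log(sexFirst, nameFirst); // 10 10
```

In this model the seek lands directly on the matching (sex, name) combination either way, which is consistent with the "totalKeysExamined" : 1 reported in the benchmark answer.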

Also consider prefixes. Filtering on { a: 1234 } will not work with an index of { b: 1, a: 1 }: https://docs.mongodb.com/manual/core/index-compound/#prefixes
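A rough way to reason about the prefix rule is sketched below. `usablePrefixLength` is a hypothetical helper, and the rule it encodes is a simplification that assumes equality filters only; the real query planner considers more cases.

```javascript
// Simplified rule (assumption): an equality filter can seek into a compound
// index only through a leading run of index fields that the query filters on.
function usablePrefixLength(indexFields, queryFields) {
  let n = 0;
  while (n < indexFields.length && queryFields.includes(indexFields[n])) n++;
  return n;
}

console.log(usablePrefixLength(["b", "a"], ["a"]));      // 0: no usable prefix
console.log(usablePrefixLength(["a", "b"], ["a"]));      // 1: prefix { a: 1 }
console.log(usablePrefixLength(["b", "a"], ["a", "b"])); // 2: whole index
```

So while the order of equality fields does not change how many keys are examined when all of them are in the query, it still decides which subsets of fields the index can serve.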
