MongoDb - Duplicate index fields for sharding purposes?

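I want to set up a cluster to store log data. Each document has several fields, but these are the key ones:

user_id (medium cardinality)
identifier (extremely high cardinality, but not guaranteed to be unique across users; it could be a UPC, for example)
channel (low cardinality)
timestamp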

The collection is expected to hold more than 1 billion documents, so sharding and performance matter here.

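Almost all high-frequency queries against the collection will include user_id, because the logs are displayed per user in the UI. Most queries will be on user_id + identifier. Some queries are time-bounded. Some queries also use channel, but not all. user_id is a monotonically increasing field.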

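I would like to shard on hashed(user_id). An ideal index would be {"user_id": 1, "identifier": 1, "timestamp": 1}, so I created that. I then tried to shard on hashed(user_id), but it doesn't work in this case, and I realized the user_id has to be the same type. However, creating an index of {"user_id": "hashed", "identifier": 1, "timestamp": 1} is not possible either, because compound keys with a hash are not allowed.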

What's my best option?

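Create one index with just hashed(user_id) so I can shard on it, and then another index with {"user_id": 1, "identifier": 1, "timestamp": 1}? I would incur a storage penalty here.
Don't hash the user_id even though it's monotonically increasing, and instead shard on {"user_id": 1, "identifier": 1}? I'm not sure if there are disadvantages here compared to simply sharding on hashed(user_id).
Is there another option?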

Note that MongoDB 4.4 allows compound indexes with a single hashed field: https://docs.mongodb.com/manual/core/hashed-sharding/

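If you can't easily upgrade to 4.4, then given the high storage pressure, the large number of documents, and the fact that most queries will include both user_id and identifier, sharding on {"user_id": 1, "identifier": 1} sounds like your best option here. It will make those queries faster, at the expense of your other queries: the ones that need to search across all of a user's identifiers, and the time-based ones.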

I'm not sure there is a better solution on versions below MongoDB 4.4.
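To make the two shard keys discussed above concrete, here is a minimal sketch of the command documents involved. The namespace logs.events is an assumption (the question never names the database or collection), and the helper functions are hypothetical; only the shardCollection admin command and its key field come from MongoDB itself.

```javascript
// Hypothetical namespace: the question does not name the database or collection.
const NAMESPACE = "logs.events";

// MongoDB 4.4+: a compound shard key with exactly one hashed field.
const compoundHashedKey = { user_id: "hashed", identifier: 1, timestamp: 1 };

// Fallback for older versions suggested above: a ranged compound shard key.
const rangedKey = { user_id: 1, identifier: 1 };

// Counts the hashed fields in a key specification.
function hashedFieldCount(key) {
  return Object.values(key).filter((type) => type === "hashed").length;
}

// Builds the document for the shardCollection admin command.
// Rejects keys with more than one hashed field, which MongoDB does not allow.
function shardCollectionCommand(namespace, key) {
  if (hashedFieldCount(key) > 1) {
    throw new Error("a shard key may contain at most one hashed field");
  }
  return { shardCollection: namespace, key };
}

// In mongosh, connected to a mongos, the command could be sent with:
//   db.adminCommand(shardCollectionCommand(NAMESPACE, compoundHashedKey));
```

If the collection already contains data, an index supporting the shard key has to exist before the collection is sharded; on an empty collection MongoDB creates that index itself.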

create one index with just hashed(user_id) so I can shard on it and then another index with { "user_id": 1, "identifier": 1, "timestamp": 1 }? I would incur a storage penalty here.

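You can have only one shard key (it must be an indexed field, either single or compound). As of MongoDB v4.2, a hashed index used for the shard key can only be a single-field index.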
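For reference, the two-index setup quoted from the question could look roughly like the sketch below on a pre-4.4 cluster. The collection name events and the helper functions are assumptions made for illustration; createIndexes is the real database command, and the generated names follow MongoDB's default field_type naming pattern.

```javascript
// Hypothetical collection name: the question does not give one.
const COLLECTION = "events";

// Single-field hashed index: usable as the shard key index before 4.4.
const shardKeyIndex = { user_id: "hashed" };

// Compound index serving the user_id + identifier (+ timestamp) queries.
const queryIndex = { user_id: 1, identifier: 1, timestamp: 1 };

// Derives an index name in the default style, e.g. "user_id_hashed".
function indexName(key) {
  return Object.entries(key)
    .map(([field, type]) => `${field}_${type}`)
    .join("_");
}

// Builds the document for the createIndexes database command.
function createIndexesCommand(collection, keys) {
  return {
    createIndexes: collection,
    indexes: keys.map((key) => ({ key, name: indexName(key) })),
  };
}

// In mongosh, with the target database selected:
//   db.runCommand(createIndexesCommand(COLLECTION, [shardKeyIndex, queryIndex]));
```

Both indexes lead with user_id, which is the storage duplication the question is worried about.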

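Querying a sharded collection with a condition on the shard key (or on a prefix of a compound shard key) is a targeted query: mongos accesses only the shards it needs. That makes it an efficient query.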

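Querying without the shard key in the query condition results in a scatter-gather operation: every shard in the cluster is accessed. Even if the queried fields are indexed, it is still a scatter-gather operation.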

Choosing the shard key is therefore probably the most important part of setting up a sharded cluster.

See Targeted Operations vs. Broadcast Operations.
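The targeted versus scatter-gather rule can be approximated in a few lines. This is a simplified sketch, not how mongos actually routes: the routing function is hypothetical, it inspects only the leading shard key field, and it ignores $or, $and and other top-level operators. It also assumes that a range condition on a hashed field cannot be targeted, whereas an equality or $in condition can.

```javascript
// Treats literals, $eq and $in as equality; any other operator as a range.
function isEquality(condition) {
  if (condition === undefined) return false;
  if (condition === null || typeof condition !== "object") return true;
  const operators = Object.keys(condition).filter((k) => k.startsWith("$"));
  if (operators.length === 0) return true; // plain value such as a Date
  return operators.every((op) => op === "$eq" || op === "$in");
}

// Classifies a query filter as "targeted" or "broadcast" for a shard key.
function routing(shardKey, filter) {
  const [leadField, leadType] = Object.entries(shardKey)[0];
  const condition = filter[leadField];
  // No condition on the shard key prefix: every shard must be asked.
  if (condition === undefined) return "broadcast";
  // Hashing scatters adjacent values, so ranges cannot be narrowed down.
  if (leadType === "hashed" && !isEquality(condition)) return "broadcast";
  return "targeted";
}

console.log(routing({ user_id: "hashed" }, { user_id: 42, identifier: "0123" })); // targeted
console.log(routing({ user_id: "hashed" }, { identifier: "0123" }));              // broadcast
console.log(routing({ user_id: "hashed" }, { user_id: { $gte: 10 } }));           // broadcast
console.log(routing({ user_id: 1, identifier: 1 }, { user_id: { $gte: 10 } }));   // targeted
```

Since nearly every high-frequency query in the question includes user_id, any of the candidate shard keys that lead with user_id would keep those queries targeted under this rule.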

don't hash the user_id even if it's monotonically increasing and instead shard on {"user_id": 1, "identifier": 1}? I'm not sure if there are disadvantages here compared to simply sharding on hashed(user_id)

Your query requirements should drive your choice of shard key (I've already covered shard keys above).

MongoDB v4.4 (the latest version at the time of writing) allows Hashed Sharding on a Compound Hashed Index.

