The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation, Managing indexing in Azure Cosmos DB's API for MongoDB, states:

Azure Cosmos DB's API for MongoDB server version 3.6 automatically indexes the _id field, which can't be dropped. It automatically enforces the uniqueness of the _id field per shard key.

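I am confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all": if I understand correctly, with a Guid _id field as the unique field and a userId field as the partition key, I can have 2 elements with the same _id, provided they happen to belong to 2 different users.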

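Is it that I fail to pick the right partition key? In my understanding, the partition key should be the field that is most frequently used for filtering the data. But what if I only need to select data from the database by its ID? Or query the data for all users?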

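Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?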

TL;DR

Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?

Yes. Mostly.

Longer version

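While id not being globally unique is not intuitive at first sight, it actually makes sense, considering that CosmosDB aims for unlimited scale on pinpoint GET/PUT operations. That requires partitions to act independently, and this is where a lot of the magic comes from. If the uniqueness of id (or of any other unique constraint) were enforced globally, then every document change would have to coordinate with all other partitions, and performance would no longer be optimal or predictable at unlimited scale.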
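To make the "partitions act independently" point concrete, here is a minimal toy model in Python. It is not the Cosmos DB API; `ToyCollection` and its fields are hypothetical names used only for illustration. Each partition keeps its own _id index, so a uniqueness check only ever looks inside one partition.

```python
# Toy in-memory model (NOT the Cosmos DB API): each partition keeps its own
# _id index, so a uniqueness check never has to consult another partition.
from collections import defaultdict


class ToyCollection:
    def __init__(self, partition_key_field="userId"):
        self.partition_key_field = partition_key_field
        # partition key value -> {_id -> document}
        self.partitions = defaultdict(dict)

    def insert(self, doc):
        key = doc[self.partition_key_field]
        partition = self.partitions[key]
        # Only this one partition is checked for a duplicate _id.
        if doc["_id"] in partition:
            raise KeyError(f"duplicate _id {doc['_id']!r} in partition {key!r}")
        partition[doc["_id"]] = doc


coll = ToyCollection()
coll.insert({"_id": "42", "userId": "alice"})
coll.insert({"_id": "42", "userId": "bob"})  # accepted: different partition

try:
    coll.insert({"_id": "42", "userId": "alice"})  # rejected: same partition
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
```

A globally unique _id would instead force every insert to consult every partition, which is exactly the coordination cost described above.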

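I also think this design decision to keep data separated is in line with the schemaless, distributed mindset of CosmosDB. If you use CosmosDB, embrace it and avoid trying to force cross-document relational constraints onto it. Manage them in your data/API design and client logic layer instead. For example, by using a guid for id.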
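As a small sketch of handling this on the client side, the ids can be generated as random GUIDs before the document is written. The helper name `new_document` and the document shape are assumptions made for illustration only.

```python
import uuid


def new_document(user_id, payload):
    """Build a document whose _id is a client-generated random GUID."""
    # The generated fields are written last so they win over any
    # "_id" or "userId" keys that happen to be present in payload.
    return {**payload, "_id": str(uuid.uuid4()), "userId": user_id}


first = new_document("alice", {"title": "hello"})
second = new_document("bob", {"title": "hello"})
```

With random GUIDs, two documents in different partitions sharing an _id is possible in principle but practically never happens, so the per-partition constraint is rarely a problem in practice.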

About the partition key..

Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.

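It depends ;). You also have to think about worst-case query performance, not only about the "most frequently" used queries. Make sure that MOST queries can go directly to the correct partition, which means you MUST know the exact target partition key before making those queries, even for "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.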
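The difference between a targeted lookup and a cross-partition one can be sketched with a toy in-memory model. This is not the Cosmos DB API; the `partitions` dict and both function names are hypothetical, and "partitions touched" is only a stand-in for real query cost.

```python
# Toy model (NOT the Cosmos DB API): partition key value -> {_id -> document}.
partitions = {
    "alice": {"1": {"_id": "1", "userId": "alice"}},
    "bob": {
        "1": {"_id": "1", "userId": "bob"},
        "2": {"_id": "2", "userId": "bob"},
    },
    "carol": {"3": {"_id": "3", "userId": "carol"}},
}


def get_by_key_and_id(user_id, doc_id):
    """Targeted lookup: the partition key is known, so one partition is touched."""
    doc = partitions.get(user_id, {}).get(doc_id)
    return doc, 1


def get_by_id_only(doc_id):
    """Fan-out lookup: every partition is asked, and several may match."""
    matches = [p[doc_id] for p in partitions.values() if doc_id in p]
    return matches, len(partitions)


doc, touched = get_by_key_and_id("bob", "1")
matches, touched_all = get_by_id_only("1")
```

Here the id-only lookup touches all three partitions and returns two documents, because _id "1" exists under both "alice" and "bob". That is why the identifier is best treated as the pair (userId, _id).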

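It is hard to say whether userId is a good key. It is most likely known in advance and can be included in get-by-id queries, so it is good in that sense. But you should also consider: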

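hot partitions - all queries for a single user go to a single partition, so there is no scale-out there.
partition size - a single user's data most likely grows-and-grows-and-grows. Partitions have a maximum size limit, and working within those target partitions will become costlier over time.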

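So, if possible, I would define smaller partitions to distribute the load further. Maybe consider using a composite partition key or similar tactics to split a user partition into multiple smaller ones. Or go to the very extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.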
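One way to split a user partition into smaller ones is a synthetic key that combines userId with a bucket derived from the document id. The sketch below is an assumption about how that could look, not something prescribed by Cosmos DB: the function names, the `userId-bucket` format and the bucket count of 8 are all hypothetical.

```python
import hashlib

BUCKETS_PER_USER = 8  # hypothetical value; tune to the expected data volume


def synthetic_partition_key(user_id, doc_id, buckets=BUCKETS_PER_USER):
    """Derive a stable partition key of the form '<userId>-<bucket>'."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would therefore not be stable across runs.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{user_id}-{bucket}"


def all_partition_keys(user_id, buckets=BUCKETS_PER_USER):
    """Every partition key that may hold data for this user."""
    return [f"{user_id}-{b}" for b in range(buckets)]


key = synthetic_partition_key("alice", "doc-1")
user_keys = all_partition_keys("alice")
```

The trade-off: a get-by-id can still recompute the exact key from userId and _id, but listing everything for one user now has to visit all of that user's buckets.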

.. just always make sure to have the chosen partition key at hand.
