Designing a Cosmos DB partition key for a single container that handles large nightly queries and large data

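I currently have five Cosmos DB containers, each containing up to about 800K documents (and growing). Document size varies widely, from 1 to 40 KB. User traffic against the database is very low: perhaps 200-500 lookups a day using id + partition key. The partition key for each container is essentially the id modulo 100, which spreads the documents over 100 logical partitions.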
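To make the current scheme concrete, here is a minimal sketch of deriving the partition key from the id. It assumes ids are integers and that the key is stored as a string; `partition_key_for` and `point_read_args` are hypothetical helper names, not part of any SDK.

```python
LOGICAL_PARTITIONS = 100  # from the post: id modulo 100 logical partitions


def partition_key_for(doc_id: int, buckets: int = LOGICAL_PARTITIONS) -> str:
    """Return the logical-partition bucket for a document id (assumes integer ids)."""
    return str(doc_id % buckets)


def point_read_args(doc_id: int) -> dict:
    """Bundle the two values a lookup needs: the id and the partition key derived from it."""
    return {"id": str(doc_id), "partition_key": partition_key_for(doc_id)}


if __name__ == "__main__":
    print(point_read_args(12345))
```

Because the key is a pure function of the id, any caller that knows the id can compute the partition key and do a single-partition lookup.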

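The reason I don't use id itself as the partition key is that there are some cross-partition queries. Specifically, one container gets about 100 lookups a day on a field for which the partition key is unknown. In addition, Azure Search indexes one container hourly, using _ts to find all changed documents. Finally, three containers each take part in a nightly ingestion process (at different times): every (partial) document is downloaded into a completely separate system, which detects changes and upserts them back into the container.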

Summary of the current container layout:

Container 1
- About 800K documents; partition key is id modulo 100
- About 200-500 lookups a day using id + partition key
- About 100 lookups a day using a field for which the partition key is unknown
- Indexed hourly by Azure Search
- Nightly, every partial document is downloaded and potentially upserted
Container 2
- About 800K documents; partition key is id modulo 100
- About 200-500 lookups a day using id + partition key
- Nightly, every partial document is downloaded and potentially upserted
Container 3
- About 800K documents; partition key is id modulo 100
- About 200-500 lookups a day using id + partition key
- Nightly, every partial document is downloaded and potentially upserted
Container 4
- About 800K documents; partition key is id modulo 100
- About 200-500 lookups a day using id + partition key
Container 5
- About 100K documents; partition key is id modulo 100
- About 200-500 lookups a day using id + partition key
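A rough back-of-the-envelope sizing from the figures above shows how much data a container, or one of its logical partitions, might hold. This is only a sketch: it uses decimal units, treats 1 KB and 40 KB as the per-document bounds, and assumes a 20 GB cap per logical partition, which should be checked against the current Cosmos DB limits. All names here are hypothetical.

```python
# Document counts taken from the container summary above.
DOC_COUNTS = {
    "Container 1": 800_000,
    "Container 2": 800_000,
    "Container 3": 800_000,
    "Container 4": 800_000,
    "Container 5": 100_000,
}
MIN_DOC_KB, MAX_DOC_KB = 1, 40           # document size range from the post
KB_PER_GB = 1_000_000                    # decimal units, for a rough estimate
ASSUMED_LOGICAL_PARTITION_LIMIT_GB = 20  # assumption: verify against current limits


def size_range_gb(doc_count, logical_partitions=1):
    """Return (low, high) GB per logical partition, assuming an even spread."""
    low = doc_count * MIN_DOC_KB / KB_PER_GB / logical_partitions
    high = doc_count * MAX_DOC_KB / KB_PER_GB / logical_partitions
    return low, high


def may_exceed_limit(doc_count, logical_partitions=1,
                     limit_gb=ASSUMED_LOGICAL_PARTITION_LIMIT_GB):
    """True if the worst-case size per logical partition is above the assumed limit."""
    return size_range_gb(doc_count, logical_partitions)[1] > limit_gb


if __name__ == "__main__":
    for name, count in DOC_COUNTS.items():
        for partitions in (1, 10, 100):
            low, high = size_range_gb(count, partitions)
            print(f"{name}, {partitions} partition(s): {low:.3f}-{high:.3f} GB each")
```

Under these assumptions an 800K-document container holds somewhere between about 0.8 GB and 32 GB, so only the worst case with everything in a single logical partition goes past the assumed limit.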

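My current multi-container design works well. But given the low user traffic, the cost is too high, so I want to merge the five containers into one. If I do that, the question is how to design a new partitioning scheme that keeps lookups fast and keeps the large queries from consuming a lot of time and RUs.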

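My main concern is making sure the large queries only touch partitions that contain relevant documents. Each existing container is already spread over 100 logical partitions, and because the existing queries are container-wide (they fetch all documents), I don't need to worry about fan-out. Once everything is merged into one container, though, I want each query to target only the partitions I care about, so the scan doesn't touch partitions I'm not interested in. The only options I've come up with so far are: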

1) Keep the existing 100 logical partition design per "container" (namespace
   of documents) and have the queries use "IN" to target all 100 partitions.
- Unfortunately, a range-style filter such as STARTSWITH on the partition key will not prevent fan-out.
- Having so many partition keys in an "IN" clause may make the query very
  long, and I don't know the consequences of that. In my test it seems to work
  fine -- the query length just adds about 10 to 20 RUs to the query.
- If there are no problems with large queries, this probably would just work
  fine and keep good performance.
2) Have one logical partition per "container" (namespace of documents).
- Because of the low usage, performance is probably still acceptable.
- May exceed the maximum data size permitted for a single logical partition.
3) Have two to ten logical partitions per "container" and have the queries use "IN".
- This makes the "IN" usage of #1 more palatable.
- Won't have the look-up performance of #1, but better than #2.
- Logical partitions are still very large.
4) Just deal with the fan-out and the high-RU queries.
- Database may be unusable at some points during the night.
- The Azure Search _ts-based queries don't seem to have much impact on the
  performance.
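Option 1 can be sketched as follows: prefix each of the 100 bucket numbers with a namespace for the former container, then build a parameterized "IN" query over those keys. The `"<namespace>-<bucket>"` key format, the `pk` property name, and both helper names are assumptions for illustration. The parameter list uses the name/value shape the Cosmos DB SDKs accept for parameterized queries.

```python
from typing import Dict, List, Tuple


def namespaced_keys(namespace: str, buckets: int = 100) -> List[str]:
    """All partition key values for one namespace, e.g. 'c1-0' .. 'c1-99' (hypothetical format)."""
    return [f"{namespace}-{bucket}" for bucket in range(buckets)]


def build_in_query(namespace: str, buckets: int = 100,
                   pk_field: str = "pk") -> Tuple[str, List[Dict[str, str]]]:
    """Build a parameterized query limited to one namespace's partition keys."""
    keys = namespaced_keys(namespace, buckets)
    names = [f"@pk{i}" for i in range(len(keys))]
    query = f"SELECT * FROM c WHERE c.{pk_field} IN ({', '.join(names)})"
    parameters = [{"name": name, "value": key} for name, key in zip(names, keys)]
    return query, parameters


if __name__ == "__main__":
    query, parameters = build_in_query("c1", buckets=3)
    print(query)
    print(parameters)
```

With the default of 100 buckets the query text carries 100 parameter placeholders, which is the query-length cost noted under option 1. Whether the engine routes such a query only to the listed partitions should be confirmed by checking the RU charge, as in the test mentioned there.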

I'm leaning toward #1, but I'd like some feedback before I go ahead and design the schema.

A few hours after I asked this question, I discovered a new Cosmos DB feature that allows all containers within a database to share throughput. That resolves all of my concerns.