How do DynamoDB streams distribute records to shards?

My goal is to ensure that records published by a DynamoDB stream are processed in the "correct" order. My table contains customer activity. The hash key is an event ID and the range key is a timestamp. "Correct" order means that events for the same customer ID are processed in sequence; events for different customer IDs can be processed in parallel.

I'm consuming the stream with a Lambda function. Consumers are spawned automatically per shard. So if the runtime decides to split the stream into shards, consumption happens in parallel (if I have this right), and I run the risk of processing, say, a CustomerAddressChanged event before the CustomerCreated event for the same customer.

The docs imply that there is no way to influence the sharding, but they don't say so explicitly. Is there a way, e.g., by using a combination of customer ID and timestamp for the range key?

DynamoDB streams consist of stream records, which are grouped into shards. A shard can spawn child shards in response to high levels of write activity on the DynamoDB table, so you can end up with a parent shard and possibly multiple child shards. To ensure that your application processes records in the correct order, the parent shard must always be processed before its child shards. This is described in detail in the docs.
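The parent-before-child rule can be sketched as a small ordering pass over the shard list that a `DescribeStream` call returns. This is an illustration of the rule only, not the Kinesis Adapter's actual implementation; the shard dicts below merely mimic the `Shards` entries of a `DescribeStream` response.

```python
def order_shards(shards):
    """Return shard IDs ordered so every parent precedes its children.

    `shards` mimics the `Shards` list of a DescribeStream response:
    dicts with a 'ShardId' and an optional 'ParentShardId'.
    """
    by_id = {s["ShardId"]: s for s in shards}
    ordered, done = [], set()

    def visit(shard):
        parent = shard.get("ParentShardId")
        # A trimmed parent may no longer appear in the response, so
        # only recurse if the parent is still part of the stream.
        if parent in by_id and parent not in done:
            visit(by_id[parent])
        if shard["ShardId"] not in done:
            done.add(shard["ShardId"])
            ordered.append(shard["ShardId"])

    for s in shards:
        visit(s)
    return ordered
```

Processing shards in the returned order guarantees that no child shard's records are consumed before its parent's.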

Unfortunately, DynamoDB Streams records sent to an AWS Lambda function are strictly serialized only per shard; the ordering of records across different shards is not guaranteed.

From the AWS Lambda FAQ:

Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?

The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
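The per-shard guarantee quoted above means a Lambda handler sees each batch's records already in stream order; what it cannot assume is any ordering across shards. A minimal handler sketch, assuming the standard DynamoDB Streams event shape and that the table's hash key attribute is `CustomerId` (the schema this question ultimately settles on; `apply_event` is a hypothetical downstream function):

```python
def apply_event(customer_id, event_name, image):
    """Hypothetical downstream processor -- replace with real logic."""
    # e.g. update a read model, publish to a queue, etc.
    pass

def handler(event, context):
    # All records in one invocation come from a single shard and arrive
    # already in stream order, so processing them sequentially preserves
    # the per-shard ordering guarantee quoted above.
    seen = []
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # Assumes the customer ID is the table's hash key; adjust the
        # attribute name to match your own schema.
        customer_id = keys["CustomerId"]["S"]
        apply_event(customer_id, record["eventName"],
                    record["dynamodb"].get("NewImage"))
        seen.append((customer_id, record["eventName"]))
    return seen
```

If the handler raises, Lambda retries the batch before moving on, which is what keeps the per-shard sequence intact.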

If you use the DynamoDB Streams Kinesis Adapter, your application will process shards and stream records in the correct order, according to the DynamoDB documentation here. For more information on the DynamoDB Streams Kinesis Adapter, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.

So ordering is not guaranteed when using a DynamoDB Lambda trigger. Your other options are the DynamoDB Streams Kinesis Adapter or the DynamoDB Streams Low-Level API, both of which mean more work.

The assumption that sharding is determined by the table's keys appears to be correct. My solution is to use the customer ID as the hash key and a timestamp (or event ID) as the range key.
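DynamoDB does not expose its internal partition hash function, but the effect of this key choice can be illustrated with a toy model (the MD5 hash and the partition count below are illustrative assumptions, not DynamoDB internals): every record that shares a hash key value maps to the same partition, and therefore into the same shard lineage, which is what makes per-customer ordering possible.

```python
import hashlib

def partition_for(hash_key, num_partitions=4):
    # Toy stand-in for DynamoDB's internal (undocumented) partition
    # hashing: the same hash key value always maps to the same partition.
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Both events for customer 42 map to the same partition, so their
# stream records share a shard lineage and keep their relative order.
created = partition_for("cust-42")
address_changed = partition_for("cust-42")
```

With the event ID as the hash key, by contrast, two events for the same customer could hash to different partitions and hence different shards, where no cross-shard ordering is guaranteed.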

This AWS blog post says:

The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.

This slide confirms it. I still wish the DynamoDB docs said it explicitly...

I have just received an answer from AWS Support. It seems to confirm @EagleBeak's assumption about partitions being mapped to shards, or, as I understand it, one partition maps to one shard tree.

My questions were about REMOVE events caused by TTL expiry, but the answers apply to all other kinds of operations as well.

  1. Is a shard created per Primary Partition Key? and then if there are too many items in the same partition, the shard gets split into children?

A shard is created per partition in your DynamoDB table. If a partition split is required due to too many items in the same partition, the shard gets split into children as well. A shard might split in response to high levels of write activity on its parent table, so that applications can process records from multiple shards in parallel.

  2. If 100 deleted items all have the same partition key, will they all be put into just one shard?

Assuming all 100 items have the same partition key value (but different sort key values), they would have been stored on the same partition. Therefore, they would be removed from the same partition and be put in the same shard.

  3. Since "the records sent to your AWS Lambda function are strictly serialized," how does this serialization work in the case of TTL? Is the order within a shard established by partition/sort key, by TTL expiry, etc.?

DynamoDB Streams captures a time-ordered sequence of item-level modifications in your DynamoDB table. This time-ordered sequence is preserved at a per shard level. In other words, the order within a shard is established based on the order in which items were created, updated or deleted.