Indexing notifications table in DynamoDB

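I am about to implement a notification system, and I am trying to figure out a good way to store notifications in a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common to all types of notifications. I was therefore thinking that a NoSQL database is probably a better fit than trying to normalize a schema in a relational database, which would be quite tricky.

My application is hosted on Amazon Web Services (AWS), and I have been looking at DynamoDB for storing the notifications, because it is managed and I would not have to deal with operating it. Ideally I would have liked to use MongoDB, but I would really prefer not to handle database operations myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, so I have a few questions.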

Suppose that I want to store the following data for each notification:

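Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I would like to fetch the number of unread notifications for a particular user. I am trying to figure out a way to index my table so that I can do this efficiently.

I can rule out using just a hash primary key, as I would not be doing lookups simply by a hash key. I don't know whether a "hash and range primary key" would help me here, as I don't know which attribute to use as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me sort by the timestamp, if that is even possible.

I also looked at Global Secondary Indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index, and since I want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.

How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?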

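I am an active DynamoDB user, and here is what I would do... First, I am assuming that you need to access notifications individually (e.g. to mark them as read/seen), in addition to getting the latest notifications by user_id.

Table design: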

NotificationsTable
id - Hash key
user_id
timestamp
...

UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id

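When you query the UserNotificationsIndex, you set the user_id to the user whose notifications you want and set ScanIndexForward to false, and DynamoDB will return that user's notification ids in reverse chronological order. You can optionally set a limit on how many results you want returned, or get up to the maximum of 1 MB.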
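As a rough illustration of the query just described, the sketch below only builds the request parameters; it does not talk to AWS. The parameter names (`TableName`, `IndexName`, `KeyConditionExpression`, `ExpressionAttributeValues`, `ScanIndexForward`, `Limit`) are those of the DynamoDB Query API. The helper name and the choice of a string-typed `user_id` are assumptions made for this example.

```python
from typing import Any, Dict, Optional


def build_latest_notifications_query(user_id: str,
                                     limit: Optional[int] = None) -> Dict[str, Any]:
    """Build Query parameters for the latest notifications of one user.

    Assumption: user_id is stored as a string attribute (type "S").
    """
    params: Dict[str, Any] = {
        "TableName": "NotificationsTable",
        "IndexName": "UserNotificationsIndex",
        # A Query always needs an equality condition on the hash key.
        "KeyConditionExpression": "user_id = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        # False = descending order of the range key (timestamp), newest first.
        "ScanIndexForward": False,
    }
    if limit is not None:
        params["Limit"] = limit
    return params


params = build_latest_notifications_query("user-42", limit=10)
```

With boto3, a dictionary like this could be passed to the low-level client as `client.query(**params)`.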

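Regarding projected attributes, you either have to project the attributes you need into the index, or you can simply project the id and then write "hydrate" functionality in your code that looks up each ID and returns the specific fields you need.

If you really don't like that, here is an alternate solution... Set your id to be your timestamp. For example, I would use the number of milliseconds since a custom epoch (e.g. January 1, 2015). Here is an alternate table design: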
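One possible shape for the "hydrate" step is to turn the ids returned by the index query into BatchGetItem requests against the base table. The sketch below only builds the request payloads. BatchGetItem accepts at most 100 keys per call, so the ids are chunked; the helper name and the string-typed `id` are assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List

MAX_BATCH_GET_KEYS = 100  # BatchGetItem accepts at most 100 keys per request


def build_hydrate_requests(ids: List[str],
                           table_name: str = "NotificationsTable"
                           ) -> Iterator[Dict[str, Any]]:
    """Yield one BatchGetItem payload per chunk of up to 100 notification ids.

    Assumption: the table's hash key "id" is a string attribute (type "S").
    """
    for start in range(0, len(ids), MAX_BATCH_GET_KEYS):
        chunk = ids[start:start + MAX_BATCH_GET_KEYS]
        yield {
            "RequestItems": {
                table_name: {"Keys": [{"id": {"S": i}} for i in chunk]}
            }
        }


requests = list(build_hydrate_requests([f"n-{i}" for i in range(250)]))
```

Note that BatchGetItem does not return items in the order the keys were given and may report `UnprocessedKeys`, so real code would need to re-sort the results by timestamp and retry any unprocessed keys.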

NotificationsTable
user_id - Hash key
id/timestamp - Range key

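Now you can query the NotificationsTable directly, setting the user_id appropriately and setting ScanIndexForward to false to sort on the Range key in descending order. Of course, this assumes that you will never have a collision where a user gets 2 notifications in the same millisecond. That should be unlikely, but I don't know the scale of your system.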
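A minimal sketch of the timestamp-as-id idea, assuming the custom epoch is midnight UTC on January 1, 2015 (the answer gives only the date, so the time zone is an assumption) and using a hypothetical helper name:

```python
from datetime import datetime, timezone

# Assumption: the custom epoch is 2015-01-01 00:00:00 UTC.
CUSTOM_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def notification_id(moment: datetime) -> int:
    """Return whole milliseconds elapsed between CUSTOM_EPOCH and `moment`."""
    delta = moment - CUSTOM_EPOCH
    # Integer arithmetic avoids floating-point rounding on large values.
    return (delta.days * 86_400_000
            + delta.seconds * 1_000
            + delta.microseconds // 1_000)


example_id = notification_id(
    datetime(2015, 1, 2, 0, 0, 0, 500_000, tzinfo=timezone.utc)
)  # one day and half a second after the epoch
```

Because the value doubles as the range key, two notifications for the same user in the same millisecond would overwrite each other, which is the collision risk mentioned above.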

Motivation Note: When using a cloud storage service like DynamoDB, we have to be aware of the Storage Model, because it directly impacts performance, scalability, and financial costs. It is different from working with a local database because you pay not only for the data that you store but also for the operations that you perform against the data. Deleting a record is a WRITE operation, for example, so if you don't have an efficient plan for cleanup (and your case, being Time Series Data, especially needs one), you will pay the price. Your Data Model will not show problems when dealing with small data volumes, but it can definitely ruin your plans when you need to scale. That being said, decisions like creating (or not creating) an index, defining proper attributes for your keys, and segmenting tables will make all the difference down the road.

Choosing DynamoDB (or, more generically speaking, a Key-Value store), like any other architectural decision, comes with a trade-off: you need to clearly understand certain concepts about the Storage Model to be able to use the tool efficiently. Choosing the right keys is indeed important, but it is only the tip of the iceberg. For example, if you overlook the fact that you are dealing with Time Series Data, then no matter what primary keys or indexes you define, your provisioned throughput will not be optimized, because it is spread throughout your entire table (and its partitions) and NOT ONLY THE DATA THAT IS FREQUENTLY ACCESSED, meaning that unused data directly impacts your throughput just because it is part of the same table. This leads to cases where the ProvisionedThroughputExceededException is thrown "unexpectedly" when you know for sure that your provisioned throughput should be enough for your demand; however, the TABLE PARTITION that is being unevenly accessed has reached its limits (more details here).

The post below has more details, but I wanted to give you some motivation to read through it and understand that, although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit the wall (the "wall" might come as high financial costs, limitations on performance and scalability, or a combination of all of these).

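Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?

A: DynamoDB is a Key-Value store, meaning that the most efficient queries use the entire Key (Hash or Hash-Range). A Query always requires the hash key, so a lookup by range key alone would have to fall back to a Scan. Using the Scan operation to actually perform a query just because you don't have your Key is definitely a sign of a deficiency in your Data Model with regard to your requirements. There are a few things to consider and many options to avoid this problem (more details below).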

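Now, before moving on, I would suggest that you read this quick post to clearly understand the difference between a Hash Key and a Hash+Range Key:

Your case is a typical Time Series Data scenario, where your records become obsolete as time goes by. There are two main factors you need to be careful about:

  • Make sure your tables have even access patterns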

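If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently. You should group the most frequently accessed items in a single table so that the provisioned throughput can be adjusted properly for the required access. Additionally, make sure you properly define a Hash Key that distributes your data evenly across partitions.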

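  • Delete obsolete data in the most efficient way (in terms of effort, performance, and cost)

The documentation suggests segmenting the data into different tables so that you can delete or back up an entire table once its records become obsolete (see more details below).

Here is the section from the documentation that explains best practices related to Time Series Data: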

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns

For example, you could have your tables segmented by month:

Notifications_April, Notifications_May, etc
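To route reads and writes to the right monthly table, the application has to derive the table name from a date. A small sketch, following the `Notifications_<Month>` naming above (the helper name is hypothetical):

```python
from datetime import date

# Explicit English month names, so the result does not depend on the locale.
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def notifications_table_name(day: date) -> str:
    """Return the monthly table that holds notifications created on `day`."""
    return f"Notifications_{MONTH_NAMES[day.month - 1]}"


table_for_april = notifications_table_name(date(2015, 4, 17))
```

With names that carry only the month, a table has to be deleted or archived before the same month comes around again; adding the year to the name would be one way to avoid that, but it goes beyond the naming shown here.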

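Q: I would like to be able to query for the most recent X notifications for a given user.

A: I would suggest using the Query operation and querying using only the Hash Key (UserId), with a Range Key that sorts the notifications by Timestamp (date and time).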

Hash Key: UserId
Range Key: Timestamp

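Note: A better solution would be for the Hash Key to contain not only the UserId but also another piece of concatenated information that you can compute before querying, to make sure your Hash Key gives you even access patterns across your data. For example, you can start to have hot partitions if notifications from specific users are accessed more than others... having additional information in the Hash Key would mitigate this risk.

Q: I'd like to fetch the number of unread notifications for a particular user.

A: Create a Global Secondary Index as a Sparse Index, with UserId as the Hash Key and Unread as the Range Key.

Example: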

Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread

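When you query this index by Hash Key (UserId), you automatically get all unread notifications, with no unnecessary scanning through notifications that are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so if you need more information about a notification, you can always use those attributes to perform a GetItem or BatchGetItem on the original table.

Note: You can explore the idea of using attributes other than the 'Unread' flag; the important thing to keep in mind is that a Sparse Index can help you with this use case (more details below).

Detailed explanation:

I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" that flags whether the notification has been read, and use that attribute to create the Sparse Index. When the user reads the notification, you simply remove that attribute from the notification so that it no longer shows up in the index. Here are some guidelines from the documentation that clearly apply to your scenario: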
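A hedged sketch of the unread count against the sparse index: `Select: "COUNT"` asks DynamoDB to return only the number of matching items, and because a single Query response is capped at 1 MB of data, the counts of all pages are summed by following `LastEvaluatedKey`. The helper names and the `run_query` callable are hypothetical; with boto3, `run_query` could be the client's `query` method. A stand-in is used here so the example runs without AWS.

```python
from typing import Any, Callable, Dict


def build_unread_count_query(user_id: str) -> Dict[str, Any]:
    """Query parameters that count items in the sparse unread index.

    Assumption: UserId is stored as a string attribute (type "S").
    """
    return {
        "TableName": "Notifications_April",
        "IndexName": "Notifications_April_Unread",
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "Select": "COUNT",  # return only the count, not the items
    }


def count_unread(run_query: Callable[..., Dict[str, Any]], user_id: str) -> int:
    """Sum the Count of every result page for the given user."""
    params = build_unread_count_query(user_id)
    total = 0
    while True:
        page = run_query(**params)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return total
        params["ExclusiveStartKey"] = last_key  # continue after this key


# Stand-in for a real client call: two pages holding 3 and 2 unread items.
_fake_pages = iter([
    {"Count": 3, "LastEvaluatedKey": {"UserId": {"S": "user-42"}}},
    {"Count": 2},
])
unread_total = count_unread(lambda **kwargs: next(_fake_pages), "user-42")
```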

Take Advantage of Sparse Indexes

For any item in a table, DynamoDB will only write a corresponding index entry if the index range key attribute value is present in the item. If the range key attribute does not appear in every table item, the index is said to be sparse. [...]

To track open orders, you can create an index on CustomerId (hash) and IsOpen (range). Only those orders in the table with IsOpen defined will appear in the index. Your application can then quickly and efficiently find the orders that are still open by querying the index. If you had thousands of orders, for example, but only a small number that are open, the application can query the index and return the OrderId of each open order. Your application will perform significantly fewer reads than it would take to scan the entire CustomerOrders table. [...]

Instead of writing an arbitrary value into the IsOpen attribute, you can use a different attribute that will result in a useful sort order in the index. To do this, you can create an OrderOpenDate attribute and set it to the date on which the order was placed (and still delete the attribute once the order is fulfilled), and create the OpenOrders index with the schema CustomerId (hash) and OrderOpenDate (range). This way when you query your index, the items will be returned in a more useful sort order.[...]

Such a query can be very efficient, because the number of items in the index will be significantly fewer than the number of items in the table. In addition, the fewer table attributes you project into the index, the fewer read capacity units you will consume from the index.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
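Marking a notification as read, as described in the detailed explanation, amounts to removing the `Unread` attribute so that the item drops out of the sparse index. A sketch of the UpdateItem parameters, assuming the UserId (string) / Timestamp (number) key schema suggested earlier; the helper name is hypothetical:

```python
from typing import Any, Dict


def build_mark_as_read_update(user_id: str, timestamp_ms: int,
                              table_name: str = "Notifications_April"
                              ) -> Dict[str, Any]:
    """UpdateItem parameters that remove the Unread attribute from one item.

    Assumptions: UserId is a string hash key and Timestamp a numeric range key.
    """
    return {
        "TableName": table_name,
        "Key": {
            "UserId": {"S": user_id},
            # DynamoDB transmits numbers as strings in the low-level API.
            "Timestamp": {"N": str(timestamp_ms)},
        },
        # A placeholder name sidesteps any clash with reserved words.
        "UpdateExpression": "REMOVE #unread",
        "ExpressionAttributeNames": {"#unread": "Unread"},
    }


update = build_mark_as_read_update("user-42", 86400500)
```

Once the attribute is gone, the item no longer has a value for the index range key, so DynamoDB drops its entry from the index and the unread count shrinks accordingly.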

Find below some references to the operations that you will need to programmatically create and delete tables:

Create Table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html

Delete Table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html
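For orientation, here is a sketch of the request payloads these two operations take for one monthly table with its sparse unread index. The field names come from the CreateTable and DeleteTable APIs. The attribute types, the `KEYS_ONLY` projection, and the capacity figures are placeholder assumptions, not recommendations, and the helper names are hypothetical.

```python
from typing import Any, Dict


def build_create_table_params(month_name: str, read_units: int = 5,
                              write_units: int = 5) -> Dict[str, Any]:
    """CreateTable parameters for a monthly table plus its unread index."""
    table_name = f"Notifications_{month_name}"
    throughput = {"ReadCapacityUnits": read_units,
                  "WriteCapacityUnits": write_units}
    return {
        "TableName": table_name,
        # Only attributes used in a key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "UserId", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
            {"AttributeName": "Unread", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},
        ],
        "GlobalSecondaryIndexes": [{
            "IndexName": f"{table_name}_Unread",
            "KeySchema": [
                {"AttributeName": "UserId", "KeyType": "HASH"},
                {"AttributeName": "Unread", "KeyType": "RANGE"},
            ],
            # Project keys only, keeping the index small and cheap to read.
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": dict(throughput),
        }],
        "ProvisionedThroughput": dict(throughput),
    }


def build_delete_table_params(month_name: str) -> Dict[str, Any]:
    """DeleteTable parameters for an obsolete monthly table."""
    return {"TableName": f"Notifications_{month_name}"}


create_params = build_create_table_params("April")
delete_params = build_delete_table_params("March")
```

With boto3, these dictionaries could be passed as `client.create_table(**create_params)` and `client.delete_table(**delete_params)`.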