DynamoDB schema for filtering users by attributes

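We are looking to replace our current RDBMS database and have been considering some alternatives. We have a lot of time-series information, and we have a reasonable idea of how to represent it in DynamoDB without problems.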

We currently store the attributes of each person in the following tables:

people (id, name, email, phone)
people_attributes (id, person_id, attribute_name, attribute_value)
people_location (id, person_id, location_id) (links to locations table)
people_devices (id, person_id, device_id)
people_metrics (id, person_id, metric_type, value) -- very very large table

What is the best way to represent these as a DynamoDB schema, for queries such as the following?

Get all people that  ( 
    live in Moscow 
    OR  
    live in Athens 
    OR  
    live in Istanbul 
    OR 
    live in San Francisco ) 
AND 
have an iPhone 
AND ( EITHER
          have at least one metric of type X
          OR
          have at least one metric of type Y )

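Let's first look at the things we should avoid, so that we can narrow down the implementation options:

  • We should avoid table scans as much as possible; querying by primary key is always the best approach.

  • We should avoid uneven access patterns: choosing a hash key whose values are not accessed evenly will make poor use of your provisioned throughput.

  • We should avoid sudden bursts of read activity by issuing queries that each return a small number of records, rather than a single query that returns a large amount of data.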

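Here is my suggestion:

We should start with the most restrictive query, so that the largest number of records is excluded from the data set as early as possible.

To follow some of the guidelines above, we may need to denormalize your data model a little and include some attributes in the following table: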
people (id, name, email, phone, device_id, location_name)

You can create a Global Secondary Index as follows:

Hash Key  : location_name
Range Key : device_id

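*The old hash key (id) is automatically projected into the index, and it should be the only attribute returned by the query.

So the first query will resolve the first two parts of your requirement: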
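
As a rough sketch, the index described above could be expressed in the dictionary shape that the DynamoDB CreateTable API expects for a `GlobalSecondaryIndexes` entry. The index name, the throughput numbers, and the string attribute types are assumptions made for illustration; they are not part of the original schema.

```python
# Hypothetical index name and throughput values; adjust them to your workload.
LOCATION_DEVICE_INDEX = {
    "IndexName": "location-device-index",
    "KeySchema": [
        {"AttributeName": "location_name", "KeyType": "HASH"},
        {"AttributeName": "device_id", "KeyType": "RANGE"},
    ],
    # KEYS_ONLY projects the index keys plus the table's primary key (id).
    "Projection": {"ProjectionType": "KEYS_ONLY"},
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}

# Every key attribute must be declared; all are assumed to be strings ("S").
ATTRIBUTE_DEFINITIONS = [
    {"AttributeName": "id", "AttributeType": "S"},
    {"AttributeName": "location_name", "AttributeType": "S"},
    {"AttributeName": "device_id", "AttributeType": "S"},
]
```

With boto3, these two structures would typically be passed to `create_table` as the `GlobalSecondaryIndexes` (inside a list) and `AttributeDefinitions` arguments.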

Get all people that  ( 
    live in Moscow 
    OR  
    live in Athens 
    OR  
    live in Istanbul 
    OR 
    live in San Francisco ) 
AND 
have an iPhone

By issuing one query per location, you give DynamoDB the opportunity to balance the query execution (this should be favored despite the cost of the additional HTTP round trips):

Get All People living in Moscow   with iPhone
Get All People living in Athens   with iPhone
Get All People living in Istanbul with iPhone
Get All People living in San Francisco with iPhone
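
The per-location queries could be prepared roughly as follows. This is only a sketch: it builds the request parameters as plain dictionaries in the low-level DynamoDB Query format and does not call AWS. The index name `location-device-index` and the device value `"iphone"` are hypothetical, and it assumes that `device_id` identifies the device model.

```python
def build_location_queries(locations, device_id, index_name="location-device-index"):
    """Return one Query request (as keyword arguments) per location."""
    return [
        {
            "TableName": "people",
            "IndexName": index_name,
            # Hash key equality plus range key equality on the index.
            "KeyConditionExpression": "location_name = :loc AND device_id = :dev",
            "ExpressionAttributeValues": {
                ":loc": {"S": location},
                ":dev": {"S": device_id},
            },
            # Only the person id is needed for the next step.
            "ProjectionExpression": "#pid",
            "ExpressionAttributeNames": {"#pid": "id"},
        }
        for location in locations
    ]


queries = build_location_queries(
    ["Moscow", "Athens", "Istanbul", "San Francisco"], device_id="iphone"
)
```

Each dictionary could then be sent with a boto3 client as `client.query(**params)`, following `LastEvaluatedKey` when a result set is paginated, and the returned ids merged in the application.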

Now that you have a subset of IDs, you can query the largest table at a much lower cost. This is the remaining part of the query to execute:

EITHER
          have at least one metric of type X
          OR
          have at least one metric of type Y

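Because the table is very large, avoid Scan operations and query by primary key wherever possible. You should also avoid creating and maintaining extra indexes on it, to minimize the storage and additional write costs they incur.

We already have the person_ids; now we need to keep only those that have the required metrics, which we can do by computing the hash and range keys.

Again, we need to change your table structure: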

people_metrics (id, person_id [HashKey], metric_type_index [RangeKey], metric_type, value)

The range key attribute metric_type_index could use the following format:

metric_type#calculated_number

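*Whatever you use as the range key, make sure that it makes the combined hash + range key unique and that it can be computed (more details below).

Your final query can be a BatchGetItem, as follows:

Get Item 1: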
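
A minimal sketch of such a computable range key is shown below. The helper names are hypothetical, and it assumes that the calculated number is an integer.

```python
def metric_range_key(metric_type: str, number: int) -> str:
    """Build a range key in the metric_type#calculated_number format."""
    return f"{metric_type}#{number}"


def parse_metric_range_key(key: str):
    """Split a range key back into (metric_type, number)."""
    # rpartition splits on the last '#', so a '#' inside metric_type survives.
    metric_type, _, number = key.rpartition("#")
    return metric_type, int(number)
```

Because the key can be derived from values the application already knows, items can be fetched directly by primary key instead of being searched for.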

Table: people_metrics
Hash Key: 123 (person_id from the initial query)
Range Key: x#1

Get Item 2:

Table: people_metrics
Hash Key: 123 (person_id from the initial query)
Range Key: y#1

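BatchGetItem should be very fast, and it will only return records for people who have at least one of the required metrics.

If the first query returns a large number of person_id records, I suggest splitting the second query into several batches rather than sending a single huge BatchGetItem request (BatchGetItem has a limit of 100 items per request anyway).

My suggestion may not be the final answer, but I believe you can take some ideas from it and evolve them into a final, optimal solution.

Below you can find more details about the guidelines used: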
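
The batching could look roughly like the sketch below, which only builds BatchGetItem request payloads in the low-level format and does not call AWS. It assumes that person_id is stored as a string and that the first calculated number is 1, as in the example keys above; the function names are hypothetical.

```python
from itertools import islice

MAX_BATCH_GET_ITEMS = 100  # BatchGetItem accepts at most 100 keys per request


def metric_keys(person_ids, metric_types, number=1):
    """Yield one primary key per (person_id, metric_type) pair."""
    for person_id in person_ids:
        for metric_type in metric_types:
            yield {
                "person_id": {"S": str(person_id)},
                "metric_type_index": {"S": f"{metric_type}#{number}"},
            }


def batch_get_requests(keys, table_name="people_metrics",
                       batch_size=MAX_BATCH_GET_ITEMS):
    """Group keys into BatchGetItem payloads of at most batch_size keys."""
    iterator = iter(keys)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield {"RequestItems": {table_name: {"Keys": chunk}}}


def matching_person_ids(items):
    """Collect the distinct person_ids found in the returned items."""
    return {item["person_id"]["S"] for item in items}
```

Each payload could be sent with boto3 as `client.batch_get_item(**payload)`; any `UnprocessedKeys` in a response should be retried. A person who has both metric types appears only once in the resulting set.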

Design For Uniform Data Access Across Items In Your Tables

"Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput. [...] To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload."

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

There is another interesting point here, which suggests moderating reads within a single partition...
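
The randomized hash key from the quote above can be sketched as follows. The function names are hypothetical; the suffix range of 1 to 200 and the `2014-07-09.N` format are taken from the quoted example.

```python
import random

SHARD_COUNT = 200  # N ranges from 1 to 200 in the quoted example


def sharded_key_for_write(day: str, rng=random) -> str:
    """Pick a random suffix so writes spread across all hash key values."""
    return f"{day}.{rng.randint(1, SHARD_COUNT)}"


def sharded_keys_for_read(day: str):
    """List every hash key that must be queried to read one full day."""
    return [f"{day}.{n}" for n in range(1, SHARD_COUNT + 1)]
```

Reading a day then means running one Query per key returned by `sharded_keys_for_read` and merging the results in the application.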

Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity

"Note that it is not just the burst of capacity units the Scan uses that is a problem. It is also because the scan is likely to consume all of its capacity units from the same partition because the scan requests read items that are next to each other on the partition. This means that the request is hitting the same partition, causing all of its capacity units to be consumed, and throttling other requests to that partition. If the request to read data had been spread across multiple partitions, then the operation would not have throttled a specific partition."

Finally, since you are dealing with time-series data, it may also help to look at some of the best practices suggested by the documentation:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html