DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a Hash and a Range:

Hash = date.random_number
Range = timestamp

How can we fetch the items that fall between timestamps X and Y? Because a random_number is appended to the hash key, multiple queries have to be issued. Is it possible to supply multiple hash values with a single RangeKeyCondition?

What is the most efficient approach in terms of cost and time?

The random number ranges from 1 to 10.

If I understood correctly, you have a table with the following primary key definition:

Hash Key  : date.random_number 
Range Key : timestamp

One thing you have to keep in mind is that, whether you use GetItem or Query, your application must be able to compute the Hash Key in order to successfully retrieve one or more items from your table.

Using a random number as part of the Hash Key makes sense so that your records are evenly distributed across DynamoDB partitions; however, you must generate those numbers in a way that your application can still compute them when it needs to retrieve the records.
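As a sketch of that write/read contract (class and method names here are illustrative, not part of any SDK): at write time the application appends a random suffix 1..10 to the date; at read time it cannot know which suffix a given record received, so it must enumerate every possible hash key.

```java
import java.util.concurrent.ThreadLocalRandom;

public class HashKeyBuilder {
    // Write path: append a random suffix 1..10 so writes for a given
    // date are spread across up to 10 hash keys (partitions).
    public static String writeKey(String date) {
        int n = ThreadLocalRandom.current().nextInt(1, 11); // 1..10 inclusive
        return date + "." + n;
    }

    // Read path: the application cannot know which suffix a record got,
    // so it must enumerate every possible hash key for the date.
    public static String[] readKeys(String date) {
        String[] keys = new String[10];
        for (int n = 1; n <= 10; n++) {
            keys[n - 1] = date + "." + n;
        }
        return keys;
    }
}
```

This is the "computable" property the answer refers to: as long as the suffix range is fixed and known, the reader can always reconstruct the full set of candidate hash keys.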

With that in mind, let's build the query your requirements call for. The native AWS DynamoDB operations available for getting multiple items from a table are:

Query, BatchGetItem and Scan
  • To use BatchGetItem, you would need to know the entire primary key (Hash Key and Range Key) of every item in advance, which is not the case here.

  • The Scan operation literally walks through every record in your table, which I believe is unnecessary for your requirement.

  • Lastly, the Query operation lets you retrieve one or more items from a table by applying the EQ (equality) operator to the Hash Key, together with one of several other operators on the Range Key; those come in handy when you don't have the entire Range Key or want to match more than one item.

The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN

In my opinion, the operator that best fits your requirement is BETWEEN. That said, let's see how you could build the query with the chosen SDK:

Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
    rangeKeyCondition,
    null, // FilterExpression - not used in this example
    null, // ProjectionExpression - not used in this example
    null, // ExpressionAttributeNames - not used in this example
    null); // ExpressionAttributeValues - not used in this example
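Because the suffix ranges from 1 to 10, the query above has to be repeated once per possible hash key and the results merged in the application. A minimal sketch of that merge pattern, with queryFn standing in for the real table.query(...) call (same RangeKeyCondition applied to each hash key) so it can be shown without a live DynamoDB connection:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PartitionMerge {
    // Runs one query per possible hash key ("date.1" .. "date.maxRandom")
    // and concatenates the results, mirroring the merge the application
    // must perform. queryFn is a stand-in for table.query(...).
    public static List<String> queryAllPartitions(String date, int maxRandom,
            Function<String, List<String>> queryFn) {
        List<String> merged = new ArrayList<>();
        for (int n = 1; n <= maxRandom; n++) {
            merged.addAll(queryFn.apply(date + "." + n));
        }
        return merged;
    }
}
```

With maxRandom = 10 this issues exactly ten Query calls per day queried, each scoped by the same timestamp BETWEEN condition.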

You might want to take a look at the following post for more information about DynamoDB primary keys:

Question: My concern is having to run multiple queries because of the appended random_number. Is there a way to combine these queries and hit DynamoDB only once?

Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to fetch. While minimizing the number of HTTP round trips to the server may seem the best solution at first sight, the documentation actually recommends doing exactly what you are doing, in order to avoid hot partitions and uneven use of your provisioned throughput:

Design For Uniform Data Access Across Items In Your Tables

"Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput. [...] To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload."

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
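Since those N per-suffix queries are independent of one another, the application can at least issue them concurrently to reduce wall-clock latency (the cost in consumed read capacity units stays the same). A hedged sketch of that idea, again with queryFn standing in for the real table.query(...) call and all names being illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelQuery {
    // Issues one query per hash key ("date.1" .. "date.maxRandom")
    // concurrently, then merges the results in submission order.
    // queryFn stands in for table.query(...).
    public static List<String> queryInParallel(String date, int maxRandom,
            Function<String, List<String>> queryFn) {
        ExecutorService pool = Executors.newFixedThreadPool(maxRandom);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int n = 1; n <= maxRandom; n++) {
                final String hashKey = date + "." + n;
                futures.add(pool.submit(() -> queryFn.apply(hashKey)));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    merged.addAll(f.get());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

This keeps the read load spread across partitions, as the documentation recommends, while cutting the end-to-end time to roughly that of the slowest single query.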

There is another interesting point here suggesting moderate use of reads on a single partition... if you removed the random number from the hash key in order to fetch all the records at once, you would most likely run into this problem, regardless of whether you use Scan, Query, or BatchGetItem:

Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity

"Note that it is not just the burst of capacity units the Scan uses that is a problem. It is also because the scan is likely to consume all of its capacity units from the same partition because the scan requests read items that are next to each other on the partition. This means that the request is hitting the same partition, causing all of its capacity units to be consumed, and throttling other requests to that partition. If the request to read data had been spread across multiple partitions, then the operation would not have throttled a specific partition."

Lastly, since you are dealing with time-series data, it might also be helpful to look into some of the best practices suggested by the documentation:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html