How can I get the full result when querying AWS Athena through the paws package in R?

Question

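I have been trying to get data from Athena into R through the paws package. So far I have managed to run the query and get some results back, but I only get the default maximum of 1000 rows. I have seen solutions for the Boto3 library in Python, but even though the syntax is similar, I cannot call a pagination function like the one Boto3 has.

Does anyone know how to do pagination, or how to use the function's NextToken argument?

My code looks like this:

SDK for connecting to AWS services: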

install.packages('paws')
install.packages('tidyverse')

Creating an Athena object:

athena <- paws::athena()

Writing the query and specifying the output location:

query <- athena$start_query_execution(QueryString = '
                                                    SELECT *
                                                    FROM db.table
                                                    LIMIT 100000
                                                    ',
                                       ResultConfiguration = list(OutputLocation =
                                                                  "s3://aws-athena..."
                                                                  )
                                      )

Checking the query execution status:

result <- athena$get_query_execution(QueryExecutionId = query$QueryExecutionId)
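Note that start_query_execution() has already submitted the query; get_query_execution() only reports its current status. A minimal sketch of waiting for the query to finish before fetching results might look like the following. The helper name wait_for_query and its poll_seconds and max_polls arguments are hypothetical, not part of paws; the state values are the ones Athena reports in QueryExecution$Status$State.

```r
# Hypothetical helper (not part of paws): poll Athena until the query
# reaches a terminal state. `client` is an object from paws::athena().
wait_for_query <- function(client, query_id, poll_seconds = 1, max_polls = 600) {
  for (i in seq_len(max_polls)) {
    status <- client$get_query_execution(QueryExecutionId = query_id)
    state <- status$QueryExecution$Status$State
    if (identical(state, "SUCCEEDED")) {
      return(invisible(status))
    }
    if (isTRUE(state %in% c("FAILED", "CANCELLED"))) {
      stop("Athena query ended in state ", state, ": ",
           status$QueryExecution$Status$StateChangeReason)
    }
    Sys.sleep(poll_seconds)  # still QUEUED or RUNNING, so wait and retry
  }
  stop("Timed out waiting for Athena query ", query_id)
}
```

With the objects from the question this would be called as wait_for_query(athena, query$QueryExecutionId) before requesting any results.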

Getting the query output:

output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId)

Parsing into a table object:

data <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),nrow = length(unlist(output$ResultSet$Rows[1])))))

colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]
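One caveat with the unlist()/matrix() approach: unlist() silently drops zero-length elements, so if a cell comes back empty (for example for a NULL value, which is an assumption about how the response is shaped), the matrix can end up misaligned. A sketch of a more defensive parser is shown below; rows_to_df is a hypothetical helper name, and it assumes each row has the shape list(Data = list(list(VarCharValue = ...), ...)).

```r
# Hypothetical helper: convert ResultSet$Rows from get_query_results()
# into a data frame of character columns, keeping empty cells as NA.
rows_to_df <- function(rows, header = TRUE) {
  if (length(rows) == 0) {
    return(data.frame())
  }
  # Extract one cell value, using NA when the cell is empty.
  cell_value <- function(cell) {
    value <- unlist(cell, use.names = FALSE)
    if (length(value) == 0) NA_character_ else as.character(value[1])
  }
  # Build a character matrix with one matrix row per result row.
  cells <- lapply(rows, function(row) {
    vapply(row$Data, cell_value, character(1))
  })
  mat <- do.call(rbind, cells)
  if (header) {
    # On the first page the first row holds the column names.
    col_names <- mat[1, ]
    mat <- mat[-1, , drop = FALSE]
  } else {
    col_names <- paste0("V", seq_len(ncol(mat)))
  }
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- col_names
  df
}
```

It could replace the parsing step with data <- dplyr::as_tibble(rows_to_df(output$ResultSet$Rows)). All columns come back as character, so any type conversion is left to the caller.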

Answer 1

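You may want to consider the noctua package. It connects R to Athena using the paws SDK and exposes a DBI interface. It solves the 1000-row limit you are running into. Your query above would then look like: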

library(DBI)

con = dbConnect(noctua::athena(), s3_staging_dir = "s3://aws-athena...")

dbGetQuery(con, "SELECT * FROM db.table LIMIT 100000")

The package also offers integration with dplyr:

library(DBI)
library(dplyr)

con = dbConnect(noctua::athena(), 
                schema_name = "db", 
                s3_staging_dir = "s3://aws-athena...")

tbl(con, "table")

Answer 2

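This is late, but it does answer the original post. You can use the NextToken attribute returned by get_query_results() to fetch all of the results of your query. I have not benchmarked it, but in simple examples I noticed it is faster, because it does not set up a connection to the entire Athena 'database' the way dbConnect() with RAthena::athena() or noctua::athena() would. The following loop puts all of your query results into your tibble: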

# starting with the OP results processing 
output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId)

data <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),
                                  nrow = length(unlist(output$ResultSet$Rows[1])))))

colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]

# adding this loop will get the rest of your query results
while(length(output$NextToken) == 1) {
  output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId,
                                     NextToken = output$NextToken)
  tmp <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),
                                   nrow = length(unlist(output$ResultSet$Rows[1])))))
  colnames(tmp) <- colnames(data)
  data <- rbind(data, tmp)
}
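The same NextToken loop can also be wrapped in a function that collects the raw rows of every page and combines them once at the end, which avoids copying the growing table with rbind() on each iteration. This is only a sketch: fetch_all_rows is a hypothetical helper name, and `client` is assumed to be an object created with paws::athena().

```r
# Hypothetical helper: fetch the rows of every page of an Athena result set.
fetch_all_rows <- function(client, query_id) {
  pages <- list()
  token <- NULL
  repeat {
    # The first request carries no token; later requests pass the last one.
    output <- if (is.null(token)) {
      client$get_query_results(QueryExecutionId = query_id)
    } else {
      client$get_query_results(QueryExecutionId = query_id, NextToken = token)
    }
    pages[[length(pages) + 1]] <- output$ResultSet$Rows
    token <- output$NextToken
    # Same stop condition as the loop above: no NextToken means last page.
    if (length(token) != 1) break
  }
  # Concatenate the per-page lists into one list of rows.
  do.call(c, pages)
}
```

Calling rows <- fetch_all_rows(athena, query$QueryExecutionId) returns one list in which the first element is the header row, so it can be parsed in a single pass with the same matrix approach used above.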
