Read a partitioned parquet directory (all files) into one R data frame with apache arrow

Question

How do I read partitioned parquet files into R with arrow (without any Spark)?

Situation

The parquet files are created with a Spark pipeline and saved on S3
They are read in RStudio/RShiny, with one column as the index, for further analysis

Parquet file structure

The parquet output created by my Spark job consists of several part files

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")

But this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

But I need to load all of the files in order to query the data

What I found in the documentation

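In the apache arrow documentation at https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html I found that the read_parquet() command accepts some properties, but I could not get it to work and did not find any examples.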

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)

How do I set the properties correctly to read the full directory?

# it should be one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)

Any help is much appreciated.

Answer 1

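Reading a directory of files is not something you can achieve by setting an option on the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows the results into a single data.frame. There is probably a purrr function that does this cleanly. While iterating over the files, you can also select/filter on each one if you only need a known subset of the data.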
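The per-file approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the temporary directory, the part file names, and the columns id, component, and value are hypothetical sample data created only so that the example runs on its own.

```r
library(arrow)

# Hypothetical sample data: a temporary directory standing in for the
# Spark output folder, holding three small part files written by arrow.
dir_path <- file.path(tempdir(), "component_mapping.parquet")
dir.create(dir_path, showWarnings = FALSE)
for (i in 0:2) {
  part <- data.frame(
    id = i * 10 + 1:10,
    component = paste0("c", i),
    value = as.numeric(1:10)
  )
  write_parquet(part, file.path(dir_path, sprintf("part-%05d.parquet", i)))
}

# List only the part files, so marker files such as _SUCCESS are skipped.
files <- list.files(dir_path, pattern = "^part-.*\\.parquet$", full.names = TRUE)

# Read each file, keep only the needed columns and rows, then bind once.
parts <- lapply(files, function(f) {
  df <- read_parquet(f, col_select = c("id", "value"))
  df[df$value > 5, ]
})
my_df <- do.call(rbind, parts)
```

Selecting and filtering inside the loop keeps each intermediate data frame small, which matters when the combined data would otherwise not fit comfortably in memory.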

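In the Arrow project, we are actively developing a multi-file dataset API that will let you do what you are trying to do, as well as push row and column selection down to the individual files, and much more. Stay tuned.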

Answer 2

Solution: read partitioned parquet files from the local file system into an R data frame with arrow

Because I want to avoid using any Spark or Python on the RShiny server, I can't use other libraries such as sparklyr, SparkR, or reticulate with dplyr, as described elsewhere.

I have now solved my task using arrow together with lapply and rbindlist

my_df <-data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))

Looking forward to the apache arrow functionality becoming available. Thanks.

Answer 3

Solution: read partitioned parquet files from S3 into an R data frame with arrow

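It took me a long time to figure this out and I could not find anything on the web, so I would like to share this solution for reading partitioned parquet files from S3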

library(arrow)
library(aws.s3)

bucket="mybucket"
prefix="my_prefix"

# using aws.s3 library to get all "part-" files (Key) for one parquet folder from a bucket for a given prefix pattern for a given component
files <- data.table::rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply the aws.s3::s3read_using function to each file using the arrow::read_parquet function to decode the parquet format
data <- lapply(files, function(x) {s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket)})

# concatenate all data together into one data.frame
data <- do.call(rbind, data)

What a mess, but it works.
@neal-richardson is there a way of using arrow directly to read from S3? I couldn't find anything in the documentation for R

Answer 4

As @neal-richardson mentioned in his answer, more work has been done on this, and with the current arrow package (I am currently running 4.0.0) it is possible.

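I noticed your files use snappy compression, which requires a special build flag to be set before installation. (The installation documentation is here: https://arrow.apache.org/docs/r/articles/install.html)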

Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)
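Before rebuilding, it may be worth checking whether the installed arrow build already supports snappy. The sketch below assumes only that the arrow package is installed; the variable name has_snappy is illustrative.

```r
library(arrow)

# TRUE if this arrow build can read and write snappy-compressed data.
has_snappy <- codec_is_available("snappy")
print(has_snappy)

# Summary of the installed build, including its optional capabilities.
arrow_info()
```

If the check returns TRUE, the existing installation should be able to read the .snappy.parquet part files without a rebuild.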

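The Dataset API implements the functionality you are looking for through multi-file datasets. The documentation does not yet include many examples, but it does provide a clear starting point: https://arrow.apache.org/docs/r/reference/Dataset.html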

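Below is a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filter conditions and selecting a subset of columns, though I am still trying to figure out the syntax myself.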

library(arrow)

## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
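For the filtering and column selection mentioned above, one option is to use dplyr verbs on the dataset and call collect() at the end. This is a minimal sketch under stated assumptions: it needs the dplyr package in addition to arrow, and the temporary directory, file names, and columns id, component, and value are hypothetical sample data created so that the example is self-contained.

```r
library(arrow)
library(dplyr)

# Hypothetical sample data: three small part files in a temporary directory.
dir_path <- file.path(tempdir(), "dataset_example")
dir.create(dir_path, showWarnings = FALSE)
for (i in 0:2) {
  part <- data.frame(
    id = i * 10 + 1:10,
    component = paste0("c", i),
    value = as.numeric(1:10)
  )
  write_parquet(part, file.path(dir_path, sprintf("part-%05d.parquet", i)))
}

# Open the whole directory as one dataset; no data is read into R yet.
ds <- open_dataset(dir_path)

# Describe the rows and columns wanted, then collect() into an R data frame.
result <- ds %>%
  filter(value > 5) %>%
  select(id, value) %>%
  collect()
```

Only the selected columns and matching rows are materialized in R, so this can use less memory than loading every file completely and subsetting afterwards.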
