How to read a parquet file in R without using spark packages?

Question

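I can find many answers online that use sparklyr or other Spark packages, but these require spinning up a Spark cluster, which is overhead I would like to avoid. In Python I can do this with "pandas.read_parquet" or with Apache Arrow. I am looking for something similar in R.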

Answer 1

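With reticulate you can use pandas from Python to read Parquet files, which saves you the hassle of running a Spark instance. As the comment above mentions, you may lose some performance to serialization between Python and R until Apache Arrow releases its own R version.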

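library(reticulate)
library(dplyr)

# Import the Python pandas module; the Python environment also needs
# a Parquet engine (pyarrow or fastparquet) installed
pandas <- import("pandas")

read_parquet <- function(path, columns = NULL) {

  # Expand "~" and resolve to an absolute path so Python sees the same file
  path <- path.expand(path)
  path <- normalizePath(path)

  # pandas expects a Python list of column names
  if (!is.null(columns)) columns <- as.list(columns)

  xdf <- pandas$read_parquet(path, columns = columns)

  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)

  # tbl_df() is deprecated in dplyr; as_tibble() is the current equivalent
  dplyr::as_tibble(xdf)

}

read_parquet(PATH_TO_PARQUET_FILE)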

Answer 2

You can simply use the arrow package:

install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
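
If you only need some of the columns, arrow's read_parquet() also accepts a col_select argument. The following is a minimal sketch rather than a definitive recipe: it assumes the arrow package is already installed, and it uses the built-in mtcars data set and a temporary file (both chosen here purely for illustration) so that it runs without an existing Parquet file.

```r
library(arrow)

# Write a small built-in data set to a temporary Parquet file
# (illustrative only; replace `path` with your own file)
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read back only the columns of interest; the result is a tibble
df <- read_parquet(path, col_select = c("mpg", "cyl"))

print(head(df))
```

Selecting columns at read time means the other columns are not loaded into memory, which can matter for wide files.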

r

parquet