Read Parquet file in Databricks using sparklyr
I'm trying to use the code below to read a Parquet file from R into Apache Spark 2.4.3. It works on my local machine running Windows 10, but fails on Databricks 5.5 LTS.
library(sparklyr)
library(arrow)
# Set up Spark connection
sc <- sparklyr::spark_connect(method = "databricks")
# Convert iris R data frame to Parquet and save to disk
arrow::write_parquet(iris, "/dbfs/user/iris.parquet")
# Read Parquet file into a Spark DataFrame: throws the error below
iris_sdf <- sparklyr::spark_read_parquet(sc, "iris_sdf", "user/iris.parquet")
Error in record_batch_stream_reader(stream) : Error in record_batch_stream_reader(stream) : could not find function "record_batch_stream_reader"
What could be wrong here?
sessionInfo()
On my local machine:
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_0.16.0.2 sparklyr_1.1.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.11 magrittr_1.5 bit_1.1-15.2 tidyselect_1.0.0 R6_2.4.1 rlang_0.4.5 httr_1.4.1 dplyr_0.8.5 tools_3.6.3 DBI_1.1.0 dbplyr_1.4.2 ellipsis_0.3.0 htmltools_0.4.0
[15] bit64_0.9-7 assertthat_0.2.1 rprojroot_1.3-2 digest_0.6.25 tibble_2.1.3 forge_0.2.0 crayon_1.3.4 purrr_0.3.3 vctrs_0.2.4 base64enc_0.1-3 htmlwidgets_1.5.1 glue_1.3.1 compiler_3.6.3 pillar_1.4.3
[29] generics_0.0.2 r2d3_0.2.3 backports_1.1.5 jsonlite_1.6.1 pkgconfig_2.0.3
The problem is that Databricks Runtime 5.5 LTS ships with sparklyr 1.0.0 (released 2019-02-25), but version 1.1.0 or later is required. Install a newer version from CRAN or GitHub, and spark_read_parquet() should work.
# CRAN
install.packages("sparklyr")
# GitHub
devtools::install_github("rstudio/sparklyr")
# You also need to install Apache Arrow
install.packages("arrow")
arrow::install_arrow()
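After restarting the R session so the newly installed packages are picked up, you can verify the fix and retry the read that previously failed (a minimal sketch; the path is the same one used in the question):
# Confirm the upgraded version is loaded; it should be 1.1.0 or later
packageVersion("sparklyr")
library(sparklyr)
# Reconnect and re-run the original read
sc <- sparklyr::spark_connect(method = "databricks")
iris_sdf <- sparklyr::spark_read_parquet(sc, "iris_sdf", "user/iris.parquet")
sparklyr::sdf_nrow(iris_sdf)  # iris has 150 rows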