Is there a basePath data option in SparkR?

I have an explicitly pruned partition structure in S3 that causes the following error when I call read.parquet():
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
    s3a://leftout/for/security/dashboard/updateddate=20170217
    s3a://leftout/for/security/dashboard/updateddate=20170218

The (verbose) error goes on to tell me...

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.

However, I can't find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (using SparkR)?

> version

platform       x86_64-redhat-linux-gnu     
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          2.2                         
year           2015                        
month          08                          
day            14                          
svn rev        69053                       
language       R                           
version.string R version 3.2.2 (2015-08-14)
nickname       Fire Safety       

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.6.0   SparkR_2.0.2      DT_0.2            jsonlite_1.2      shinythemes_1.1.1 ggthemes_3.3.0   
 [7] dplyr_0.5.0       ggplot2_2.2.1     leaflet_1.0.1     shiny_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9       magrittr_1.5      munsell_0.4.3     colorspace_1.3-2  xtable_1.8-2      R6_2.2.0         
 [7] stringr_1.1.0     plyr_1.8.4        tools_3.2.2       grid_3.2.2        gtable_0.2.0      DBI_0.5-1        
[13] sourcetools_0.1.5 htmltools_0.3.5   yaml_2.1.14       lazyeval_0.2.0    digest_0.6.12     assertthat_0.1   
[19] tibble_1.2        htmlwidgets_0.8   mime_0.5          stringi_1.1.2     scales_0.4.1      httpuv_1.3.3             

In Spark 2.1 or later you can pass basePath as a named argument:

read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")

Arguments captured by the ellipsis (...) are converted with varargsToStrEnv and used as options.
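The generic SparkR reader should behave the same way, assuming read.df also forwards extra named arguments through varargsToStrEnv as data source options (a sketch; the path reuses the /tmp/data example below):

```r
library(SparkR)

# Sketch: extra named arguments to read.df are forwarded to the
# underlying DataFrameReader as options, so basePath can be passed
# directly alongside the source type.
df <- read.df(path = "/tmp/data", source = "parquet",
              basePath = "/tmp/data")
```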

A complete session example:

  • Write the data (Scala):

    Seq(("a", 1), ("b", 2)).toDF("k", "v")
      .write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
    
  • Read the data (SparkR):

     Welcome to
        ____              __ 
       / __/__  ___ _____/ /__ 
      _\ \/ _ \/ _ `/ __/  '_/ 
     /___/ .__/\_,_/_/ /_/\_\   version  2.1.0 
        /_/ 
    
    
     SparkSession available as 'spark'.
    
    > paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
    > read.parquet(paths, basePath="/tmp/data")
    
    SparkDataFrame[v:int, k:string]
    

    In contrast, without basePath:

    > read.parquet(paths)
    
    SparkDataFrame[v:int]
    

I was getting close. From the source code:

read.parquet.default <- function(path, ...) {
  sparkSession <- getSparkSession()
  options <- varargsToStrEnv(...)
  # Allow the user to have a more flexible definition of the Parquet file path
  paths <- as.list(suppressWarnings(normalizePath(path)))
  read <- callJMethod(sparkSession, "read")
  read <- callJMethod(read, "options", options)
  sdf <- handledCallJMethod(read, "parquet", paths)
  dataFrame(sdf)
}

This method is also available here, but it also throws an unused argument error:

read.parquet(..., options=c(basePath="foo"))
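The signature above explains the failure: options are built from the individual named arguments captured by ..., and there is no separate options parameter. So basePath must be passed as its own named argument (using the same placeholder value "foo"):

```r
# basePath is picked up by ... and converted via varargsToStrEnv;
# passing options=c(...) instead hits a nonexistent parameter, which
# is why R reports an "unused argument" error.
read.parquet(path, basePath = "foo")
```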