Is there a basePath data option in SparkR?

I have an explicitly pruned partition structure in S3 that causes the following error when I call read.parquet():
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
    s3a://leftout/for/security/dashboard/updateddate=20170217
    s3a://leftout/for/security/dashboard/updateddate=20170218

The (verbose) error goes on to tell me...

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.

However, I can't find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (using SparkR)?

> version

platform       x86_64-redhat-linux-gnu     
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          2.2                         
year           2015                        
month          08                          
day            14                          
svn rev        69053                       
language       R                           
version.string R version 3.2.2 (2015-08-14)
nickname       Fire Safety       

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.6.0   SparkR_2.0.2      DT_0.2            jsonlite_1.2      shinythemes_1.1.1 ggthemes_3.3.0   
 [7] dplyr_0.5.0       ggplot2_2.2.1     leaflet_1.0.1     shiny_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9       magrittr_1.5      munsell_0.4.3     colorspace_1.3-2  xtable_1.8-2      R6_2.2.0         
 [7] stringr_1.1.0     plyr_1.8.4        tools_3.2.2       grid_3.2.2        gtable_0.2.0      DBI_0.5-1        
[13] sourcetools_0.1.5 htmltools_0.3.5   yaml_2.1.14       lazyeval_0.2.0    digest_0.6.12     assertthat_0.1   
[19] tibble_1.2        htmlwidgets_0.8   mime_0.5          stringi_1.1.2     scales_0.4.1      httpuv_1.3.3             

In Spark 2.1 or later you can pass basePath as a named argument:

read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")

Arguments captured by the ellipsis (...) are converted with varargsToStrEnv and used as options.
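The generic SparkR reader should behave the same way, assuming read.df also forwards extra named arguments through varargsToStrEnv as data source options (a sketch; the path reuses the /tmp/data example below):

```r
library(SparkR)

# Sketch: extra named arguments to read.df are forwarded to the
# underlying DataFrameReader as options, so basePath can be passed
# directly alongside the source type.
df <- read.df(path = "/tmp/data", source = "parquet",
              basePath = "/tmp/data")
```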

A complete session example:

  • Write the data (Scala):

    Seq(("a", 1), ("b", 2)).toDF("k", "v")
      .write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
    
  • Read the data (SparkR):

     Welcome to
        ____              __ 
       / __/__  ___ _____/ /__ 
      _\ \/ _ \/ _ `/ __/  '_/ 
     /___/ .__/\_,_/_/ /_/\_\   version  2.1.0 
        /_/ 
    
    
     SparkSession available as 'spark'.
    
    > paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
    > read.parquet(paths, basePath="/tmp/data")
    
    SparkDataFrame[v:int, k:string]
    

    In contrast, without basePath:

    > read.parquet(paths)
    
    SparkDataFrame[v:int]
    

I was getting close. From the source code:

read.parquet.default <- function(path, ...) {
  sparkSession <- getSparkSession()
  options <- varargsToStrEnv(...)
  # Allow the user to have a more flexible definition of the Parquet file path
  paths <- as.list(suppressWarnings(normalizePath(path)))
  read <- callJMethod(sparkSession, "read")
  read <- callJMethod(read, "options", options)
  sdf <- handledCallJMethod(read, "parquet", paths)
  dataFrame(sdf)
}

This method is also available here, but it also throws an unused argument error:

read.parquet(..., options=c(basePath="foo"))
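The signature above explains the failure: options are built from the individual named arguments captured by ..., and there is no separate options parameter. So basePath must be passed as its own named argument (using the same placeholder value "foo"):

```r
# basePath is picked up by ... and converted via varargsToStrEnv;
# passing options=c(...) instead hits a nonexistent parameter, which
# is why R reports an "unused argument" error.
read.parquet(path, basePath = "foo")
```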