Is there a basePath data option in SparkR?
I have a deliberately pruned schema structure in S3 that causes the following error when I call read.parquet():
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
s3a://leftout/for/security/dashboard/updateddate=20170217
s3a://leftout/for/security/dashboard/updateddate=20170218
The (verbose) error goes on to tell me:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.
However, I can't find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (using SparkR)?
> version
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.6.0 SparkR_2.0.2 DT_0.2 jsonlite_1.2 shinythemes_1.1.1 ggthemes_3.3.0
[7] dplyr_0.5.0 ggplot2_2.2.1 leaflet_1.0.1 shiny_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3 colorspace_1.3-2 xtable_1.8-2 R6_2.2.0
[7] stringr_1.1.0 plyr_1.8.4 tools_3.2.2 grid_3.2.2 gtable_0.2.0 DBI_0.5-1
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14 lazyeval_0.2.0 digest_0.6.12 assertthat_0.1
[19] tibble_1.2 htmlwidgets_0.8 mime_0.5 stringi_1.1.2 scales_0.4.1 httpuv_1.3.3
In Spark 2.1 or later you can pass basePath as a named argument:
read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")
Arguments captured by the ellipsis (...) are converted with varargsToStrEnv and used as options.
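Because every extra named argument is forwarded to the DataFrameReader as an option, other reader options can be passed the same way. A minimal sketch (assumes Spark 2.1+ and a running SparkSession; the S3 path is illustrative, and mergeSchema is a standard Parquet data source option):

```r
library(SparkR)

# Any named argument after the path is forwarded as a data source option.
df <- read.parquet(
  "s3a://bucket/table/updateddate=20170217",  # illustrative partition path
  basePath = "s3a://bucket/table/",           # root of the partitioned table
  mergeSchema = "true"                        # another standard Parquet option
)
```

With basePath set, the partition column (here updateddate) is recovered into the resulting SparkDataFrame's schema.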
Full session example:
Writing the data (Scala):
Seq(("a", 1), ("b", 2)).toDF("k", "v")
.write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
Reading the data (SparkR):
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
SparkSession available as 'spark'.
> paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
> read.parquet(paths, basePath="/tmp/data")
SparkDataFrame[v:int, k:string]
By contrast, without basePath:
> read.parquet(paths)
SparkDataFrame[v:int]
I was so close. From the source code:
read.parquet.default <- function(path, ...) {
sparkSession <- getSparkSession()
options <- varargsToStrEnv(...)
# Allow the user to have a more flexible definiton of the Parquet file path
paths <- as.list(suppressWarnings(normalizePath(path)))
read <- callJMethod(sparkSession, "read")
read <- callJMethod(read, "options", options)
sdf <- handledCallJMethod(read, "parquet", paths)
dataFrame(sdf)
}
This method is also available here, but it too throws an unused argument error:
read.parquet(..., options=c(basePath="foo"))
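For SparkR 2.0.x (the version in the sessionInfo() above), where read.parquet() rejects extra arguments, a possible workaround is the generic read.df(), whose extra named arguments are also forwarded as data source options. A sketch only, assuming a running SparkSession against the same bucket:

```r
library(SparkR)

# read.df() accepts a source name plus arbitrary named options,
# so basePath can be supplied even when read.parquet() rejects it.
df <- read.df(
  path = "s3a://leftout/for/security/dashboard/updateddate=20170217",
  source = "parquet",
  basePath = "s3a://leftout/for/security/dashboard/"
)
```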