How to read all files in S3 folder/bucket using sparklyr in R?
I have tried the following code, and combinations of it, to read all the files in a given S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each 6.5 GB.
library(sparklyr)

# Spark connection
sc <- spark_connect(master = "local", config = config)

# Attempt to read every file under the folder with a wildcard path
rd_1 <- spark_read_csv(sc, name = "Retail_1", path = "s3a://mybucket/xyzabc/Retail_Industry/*/*", header = FALSE, delimiter = "|")

# This is the S3 bucket/folder holding the files (one of the file names is Industry_Raw_Data_000)
s3://mybucket/xyzabc/Retail_Industry/Industry_Raw_Data_000
This is the error I get:
Error: org.apache.spark.sql.AnalysisException: Path does not exist: s3a://mybucket/xyzabc/Retail_Industry/*/*;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:710)
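(Editorial aside, not part of the original question: the files sit directly under Retail_Industry, so a glob of */* looks one directory level deeper than the files themselves and matches nothing, which is consistent with the "Path does not exist" message. A single-level wildcard, or simply the folder path as used in the solution below, is the pattern that would match. A hypothetical sketch:)

path_glob   <- "s3a://mybucket/xyzabc/Retail_Industry/*"  # one wildcard level deep
path_folder <- "s3a://mybucket/xyzabc/Retail_Industry"    # or just the folder itself
rd_1 <- spark_read_csv(sc, name = "Retail_1", path = path_folder, header = FALSE, delimiter = "|")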
After a few weeks of googling the issue, it is solved. Here is the solution:
library(sparklyr)
library(dplyr)

# AWS credentials (dummy values shown here)
Sys.setenv(AWS_ACCESS_KEY_ID = "abc")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "xyz")

# Packages Spark needs to read CSVs from S3 via s3a
config <- spark_config()
config$sparklyr.defaultPackages <- c(
  "com.databricks:spark-csv_2.10:1.5.0",
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.3")

# Spark connection
sc <- spark_connect(master = "local", config = config)

# Hadoop configuration: enable V4 signing and fast upload for s3a
ctx <- spark_context(sc)
jsc <- invoke_static(sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx)
hconf <- jsc %>% invoke("hadoopConfiguration")
hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")
hconf %>% invoke("set", "fs.s3a.fast.upload", "true")

# Point spark_read_csv() at the folder itself; Spark reads every file inside it
folder_files <- "s3a://mybucket/abc/xyz"
rd_11 <- spark_read_csv(sc, name = "Retail", path = folder_files, infer_schema = TRUE, header = FALSE, delimiter = "|")

spark_disconnect(sc)
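As a quick sanity check (not in the original post), the following standard sparklyr calls, run before spark_disconnect(sc), would confirm that every file in the folder was picked up:

sdf_nrow(rd_11)   # total row count across all files read from the folder
head(rd_11)       # peek at the first rows, collected back into R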