Reading csv data into SparkR after writing it out from a DataFrame
Following the example in , I wrote a DataFrame out as csv to an AWS S3 bucket. Instead of a single file, the result is a folder containing many .csv files. I am now unable to read this folder back into SparkR as a DataFrame. Below is what I have tried, but the resulting DataFrames are not the same as the one I wrote out.
write.df(df, 's3a://bucket/df', source="csv") #Creates a folder named df in S3 bucket
df_in1 <- read.df("s3a://bucket/df", source="csv")
df_in2 <- read.df("s3a://bucket/df/*.csv", source="csv")
# Neither df_in1 nor df_in2 results in a DataFrame that is the same as df
# Spark 1.4 is used in this example
#
# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
# Print the first few rows
head(flights)
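The same CSV source can also be pointed at the folder that `write.df` produced: Spark treats the folder of part files as a single dataset and unions them into one DataFrame. A minimal sketch, assuming the `spark-csv` package is on the classpath (launched as above) and using the bucket path from the question; note that files written by `write.df` with this source have no header row by default, so no `header` option is passed:

```r
# Read the whole folder of part files back as one DataFrame.
# "s3a://bucket/df" is the folder created by write.df in the question;
# Spark reads every part file under it and unions the rows.
df_in <- read.df(sqlContext, "s3a://bucket/df", source = "com.databricks.spark.csv")

# Compare schema and row count against the original df
printSchema(df_in)
count(df_in)
```

Because no header was written, the columns come back with generated names (C0, C1, ...); the row contents should match the original DataFrame even though the column names differ.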
I hope this example helps.