How to read csv into sparkR ver 1.4?
With the new release of Spark (1.4) there seems to be a nice frontend interface to Spark from the R package SparkR. On the documentation page of R for Spark there is a command that reads a JSON file in as a DataFrame:
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
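For reference, the resulting people object is a SparkR DataFrame, so (assuming the example file from the Spark distribution is in place) it can be inspected with the usual SparkR helpers:

# show the inferred schema and the first few rows of the DataFrame
printSchema(people)
head(people)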
I am trying to read data in from a .csv file instead, as described in this Revolution Analytics blog post:
# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
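If the load succeeded, I would expect to be able to sanity-check the result along these lines:

# first rows and row count of the Spark DataFrame
head(flights)
count(flights)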
The comments say that I need the spark-csv package to enable this, so I downloaded the package from the GitHub repo with this command:
$ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
But when I try to read a .csv file I get an error like this:
> flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
15/07/03 12:52:41 ERROR RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
... 25 more
Error: returnStatus == 0 is not TRUE
Any idea what this error means and how to solve the problem?
Of course, I could read the .csv file in the standard way, e.g.:

flights <- read.csv("data.csv")

and then convert the R data.frame into a Spark DataFrame like this:

flightsDF <- createDataFrame(sqlContext, flights)

but that is not the way I would like to do it, and it is really time-consuming.
You have to launch the sparkR console like this every time:
sparkR --packages com.databricks:spark-csv_2.10:1.0.3
Or, if you are using RStudio:
library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
# initialize the SparkContext after setting SPARKR_SUBMIT_ARGS so the flag is picked up
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
That worked for me. Make sure the spark-csv version you specify matches the one you downloaded.
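With the package on the classpath, the read.df call from the question should then work, for example:

# the spark-csv data source is now resolvable
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header = "true")
head(flights)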
Make sure to install SparkR from your Spark distribution with:

install.packages("C:/spark/R/lib/sparkr.zip", repos = NULL)

rather than from GitHub; that solved it for me.
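To double-check which build got loaded afterwards, you can look at the package version:

library(SparkR)
packageVersion("SparkR")  # should report the version bundled with your Spark install (1.4.x here)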