Reading Text file in SparkR 1.4.0

Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages available for that?

See this link: http://ampcamp.berkeley.edu/5/exercises/sparkr.html

We can simply use:

 textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")

Looking at the SparkR source, context.R has a textFile method, so ideally a SparkContext should expose a textFile API to create an RDD, but that method is missing from the documentation.

# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
#  value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
#  sc <- sparkR.init()
#  lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")

  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}

See https://github.com/apache/spark/blob/master/R/pkg/R/context.R

For test cases, see https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R
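
Putting those pieces together, here is a minimal sketch using the standalone SparkR-pkg build from the AMP Camp exercise (it assumes the package is installed and the README.md path from above exists; this is a sketch, not code taken verbatim from the exercise):

library(SparkR)                      # pre-merge SparkR-pkg
sc <- sparkR.init(master = "local")  # local Spark context
lines <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
take(lines, 3)                       # first three lines of the file
count(lines)                         # total number of lines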

Spark 1.6+

You can read text files into a DataFrame using the text input format:

read.df(sqlContext=sqlContext, source="text", path="README.md")
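
A minimal sketch of inspecting the result (assuming an initialized sqlContext and a local README.md; "value" is the default column name produced by the text source):

df <- read.df(sqlContext, source = "text", path = "README.md")
printSchema(df)   # a single string column named "value"
head(df, 3)       # first three lines as a local R data.frame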

Spark <= 1.5

The short answer is that you don't. SparkR 1.4 has been almost completely stripped of the low-level API, leaving only a limited subset of DataFrame operations. As you can read on the old SparkR webpage:

As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R will be focussed on high level operations instead of low level ETL.

The closest you can get is probably to use spark-csv:

Load a text file:
> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
|                  C0|
+--------------------+
|      # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+

Since typical RDD operations like map, flatMap, reduce, or filter are gone as well, it is probably what you want anyway.
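
For simple transformations the DataFrame API may already be enough. A rough sketch reusing the df loaded above (C0 is the default column name assigned by spark-csv):

withSpark <- filter(df, "C0 LIKE '%Spark%'")   # keep lines mentioning Spark
count(withSpark)                               # how many such lines
showDF(limit(withSpark, 5))                    # peek at the first matches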

Now, the low-level API is still underneath, so you can always do something like the example below, but I doubt it is a good idea. The SparkR developers most likely had a good reason to make it private. To quote the ::: man page:

It is typically a design mistake to use ‘:::’ in your code since the corresponding object has probably been kept internal for a good reason. Consider contacting the package maintainer if you feel the need to access the object for anything but mere inspection.

Even if you are willing to ignore good coding practice, it is most likely not worth the time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without the benefits of the Catalyst optimizer the same most likely holds for the internal 1.4 API:

> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)

[[1]]
[1] 14

[[2]]
[1] 0

[[3]]
[1] 78

Note that spark-csv, unlike textFile, ignores empty lines.

In fact, you can also use the databricks/spark-csv package to handle TSV files.

For example,

data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")

There are plenty of options available here - databricks-spark-csv#features
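
For illustration, a hedged sketch combining a few of those options (the path is a placeholder; header and inferSchema are standard spark-csv options, passed as strings):

data <- read.df(sqlContext, "<path_to_tsv_file>",
                source = "com.databricks.spark.csv",
                delimiter = "\t",
                header = "true",        # first row holds the column names
                inferSchema = "true")   # let spark-csv guess column types
printSchema(data)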