Spark CSV package not able to handle \n within fields
I have a CSV file that I am trying to load using the Spark CSV package, but it does not load the data correctly because a few of the fields contain \n. For example, the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the code below; it is straightforward. I set parserLib to univocity because, from what I read on the internet, it solves the multiple-newline problem, but that does not seem to be the case for me.
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("parserLib","univocity")
.load("data.csv");
How can I replace the newlines inside fields that begin with quotes? Is there an easier way?
Upgrade to Spark 2.x. A newline is really a CRLF, represented by ASCII 13 and 10. A backslash followed by an 'n' is different ASCII, interpreted and written programmatically. Spark 2.x will read it correctly; I tried it. Something like the following:
import org.apache.spark.sql.SparkSession

// In Spark 2.x, SparkSession replaces SQLContext/SparkConf as the entry point
val spark = SparkSession.builder().appName("HelloSpark").master("local[2]").getOrCreate()
val df = spark.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
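To inspect what was parsed, a quick check (a minimal sketch, reusing the df above):

df.printSchema()  // columns are inferred as strings when no schema is given
df.show(5, false) // print a few rows without truncating long field values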
If you cannot upgrade, clean the \n out of the RDD. That will not remove line endings, since those would be $ in a regex. Something like the following:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
// Strip the literal two-character sequence \n embedded in field values.
// String.replace is literal, not regex: "\\n" is backslash-n, while "\n" would be a real newline.
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))
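Note that textFile splits records on real newline characters, so this approach only helps when the break appears as the literal two characters \n inside a field; records that genuinely span multiple physical lines will still arrive as separate RDD elements.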
According to SPARK-14194 (resolved as a duplicate), fields with newline characters are not supported and never will be:
I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That was for Spark 2.0, and you are using the spark-csv module.
In the referenced SPARK-19610 it was fixed with a pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use the wholeFile option in Spark 2.x (as you can see in CSVDataSource).
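A minimal sketch of what that looks like (note that this option later shipped under the name multiLine in Spark 2.2, as described below):

val df = spark.read
  .option("header", "true")
  .option("wholeFile", true) // renamed to multiLine before the 2.2 release
  .csv("data.csv")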
As for spark-csv, this comment may be of some help (emphasis mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:
quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.
quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
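As an illustration of those write options (a sketch, assuming a DataFrame df and a hypothetical output path; quoting every field keeps embedded newlines safely inside quotes):

df.write
  .format("com.databricks.spark.csv")
  .option("quote", "\"")       // the default quote character
  .option("quoteMode", "ALL")  // quote every field, not just the ones that need it
  .save("out.csv")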
Users of Spark 2.2 can use an option to handle newlines in CSV files. It was originally called wholeFile, but was renamed to multiLine before the release.
Here is an example of loading a CSV file into a DataFrame using that option:
val webtrends_data = (sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", true)
.option("delimiter", ",")
.format("csv")
.load("hdfs://hadoop-master:9000/datasource/myfile.csv"))