Reading a CSV file with multi-line strings in Scala

Question

I have a CSV file that I want to read line by line. The problem is that some cell values are quoted and contain line breaks.

Here is an example CSV:

Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200

The standard getLines() function does not handle this.

Source.fromFile(inputFile).getLines()  // splits at every line break, whether it is inside quotes or not

getLines (followed by a split on commas) gives something like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better")
Array("than Product A\"", "200")

But it should look like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better\nthan Product A\"", "200")

I tried reading the file in full and splitting it with a regex similar to the one in this post:

file.mkString.split("""\n(?=(?:[^"]*"[^"]*")*[^"]*$)""")

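The regex works, but I get a stack overflow exception because the file is too large to be processed in memory in one go. I tried it with a smaller version of the file and it worked.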

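As mentioned in that post, foldLeft() could help with larger files. But I am not sure how it should work when iterating over every Char of the string, since each step needs all of the following at once:

the Char of the current iteration
the line being built
the list of lines created so far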

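It might be possible to write my own tail-recursive version of getLines, but I suspect there is a more practical solution than handling the input char by char.

Do you see any other functional-style solution to this problem?

Thanks and regards, Felix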

Answer 1

You can use a third-party library for this, for example opencsv.

Maven repository -> https://mvnrepository.com/artifact/au.com.bytecode/opencsv/2.4

Code examples -> https://www.programcreek.com/java-api-examples/au.com.bytecode.opencsv.CSVReader

Answer 2

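The simplest answer is to find an external library to do it!

If that is not an option for you, I think foldLeft is the best functional-style solution. Here is a simple version: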

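  import scala.io.Source

  val lines = Source.fromFile(inputFile).getLines()

  // State: (completed records, text of a record whose quote is still open)
  val records = lines.foldLeft[(Seq[String], String)]((Vector.empty, "")) {
    case ((accumulatedLines, accumulatedString), newLine) =>
      val isInAnOpenString = accumulatedString.nonEmpty
      val lineHasOddQuotes = newLine.count(_ == '"') % 2 == 1
      (isInAnOpenString, lineHasOddQuotes) match {
        // the open quote is closed here: restore the line break and emit the record
        case (true, true)   => (accumulatedLines :+ (accumulatedString + "\n" + newLine)) -> ""
        // still inside the quoted value: keep accumulating
        case (true, false)  => accumulatedLines -> (accumulatedString + "\n" + newLine)
        // a quote opens on this line and is not closed yet
        case (false, true)  => accumulatedLines -> newLine
        // ordinary single-line record
        case (false, false) => (accumulatedLines :+ newLine) -> ""
      }
  }._1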

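Note that this version does not handle many special cases, such as several multi-line values on a single line, but it should give you a good start.

The main idea is to foldLeft over pretty much everything you need to keep in memory, and to update your state step by step.

As you can see, a foldLeft can hold as much logic as you need. In this case I added the extra booleans and a nested match to improve readability.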
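To illustrate the same idea at the character level, here is a minimal sketch that folds over the three pieces of state listed in the question (current Char, line being built, lines created so far), plus a flag for "inside quotes". The names CharFold, State and splitRecords are made up for this example, and it assumes "\n" line endings and quotes escaped by doubling. Like the version above, it still collects every record in memory.

```scala
object CharFold {
  // Hypothetical state type: finished records, the record being built,
  // and whether the fold is currently inside a quoted value.
  final case class State(records: Vector[String], current: String, inQuotes: Boolean)

  def splitRecords(chars: Iterator[Char]): Vector[String] = {
    val end = chars.foldLeft(State(Vector.empty, "", inQuotes = false)) {
      // every quote toggles the flag; a doubled quote toggles it twice
      case (s, '"') => s.copy(current = s.current + '"', inQuotes = !s.inQuotes)
      // a line break outside quotes ends the current record
      case (s, '\n') if !s.inQuotes => State(s.records :+ s.current, "", inQuotes = false)
      // anything else (including a line break inside quotes) is kept
      case (s, c) => s.copy(current = s.current + c)
    }
    // keep a last record that is not terminated by a line break
    if (end.current.isEmpty) end.records else end.records :+ end.current
  }
}
```

Since scala.io.Source is itself an Iterator[Char], something like CharFold.splitRecords(Source.fromFile(inputFile)) should work for a file.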

So my advice is: foldLeft, and don't panic!

Answer 3

I wonder whether the new (Scala 2.13) unfold() could be put to good use here.

                        // "file" has been opened
val lines = Iterator.unfold(file.getLines()){ itr =>
              Option.when(itr.hasNext) {
                val sb = new StringBuilder(itr.next())
                // keep appending physical lines while the quote count is odd
                while (itr.hasNext && sb.count(_ == '"') % 2 > 0)
                  sb.append("\n" + itr.next())
                (sb.toString, itr)
              }
            }

Now you can iterate over the content as needed.

lines.foreach(println)
//Product,Description,Price
//Product A,This is Product A,20
//Product B,"This is much better\nthan Product A",200
//Product C,a "third rate" product,5

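Note that this is quite simplistic: it just counts all quotation marks and looks for an even number. It does not treat an escaped quote, \", any differently, but it should not be too hard to use a regex so that only non-escaped quotes are counted.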
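As a rough sketch of that regex idea, assuming the file uses backslash escapes (\" for a literal quote and \\ for a literal backslash), one option is to remove every backslash-escaped character first and count what is left. The names UnescapedQuotes, countUnescapedQuotes and hasOpenQuote are hypothetical. Quotes escaped by doubling ("") need no special handling, because each pair adds two to the count and leaves its parity unchanged.

```scala
object UnescapedQuotes {
  // Hypothetical helper: drop each backslash together with the character
  // it escapes, then count the quotation marks that remain.
  def countUnescapedQuotes(line: String): Int =
    line.replaceAll("""\\.""", "").count(_ == '"')

  // True when the text ends inside a quoted value (odd number of unescaped quotes).
  def hasOpenQuote(text: String): Boolean =
    countUnescapedQuotes(text) % 2 == 1
}
```

In the unfold version, the loop condition could then test UnescapedQuotes.hasOpenQuote(sb.toString) instead of counting every quote.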

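Since we are using iterators, this should be memory efficient and handle files of any size, as long as there is no stray unmatched quote that causes the rest of the file to be read in as a single line of text.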
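One possible guard against that failure mode, sketched here as an assumption rather than as part of the answer above, is to cap how many physical lines a single record may span and fail fast once the cap is hit. GuardedRecords, records and maxLines are hypothetical names, and the default of 1000 is arbitrary. The sketch needs Scala 2.13 or later for Iterator.unfold and Option.when.

```scala
object GuardedRecords {
  // maxLines is a hypothetical safety limit on physical lines per record.
  def records(lines: Iterator[String], maxLines: Int = 1000): Iterator[String] =
    Iterator.unfold(lines) { itr =>
      Option.when(itr.hasNext) {
        val sb = new StringBuilder(itr.next())
        var quotes = sb.count(_ == '"') // running quote count for this record
        var physical = 1                // physical lines consumed so far
        while (itr.hasNext && quotes % 2 == 1) {
          if (physical >= maxLines)
            throw new IllegalStateException(
              s"Quoted value still open after $maxLines lines")
          val next = itr.next()
          quotes += next.count(_ == '"')
          physical += 1
          sb.append('\n').append(next)
        }
        (sb.toString, itr)
      }
    }
}
```

Keeping a running count also avoids rescanning the whole buffer on every appended line.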
