Idiomatic way to parse a file in Scala

Question

I am parsing a file in Scala, and there are two kinds of files I need to read:

A set of training sentences, in this form:

String\tString\tInt
String\tString\tInt
// ...
String\tString\tInt

And a set of test sentences, in this form:

String\tString\tString\tInt
String\tString\tString\tInt
// ...
String\tString\tString\tInt

So far, I have used Either to distinguish between the two formats:

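  def readDataSet(file: String): Option[Vector[LabeledSentence]] = {

    // Classify a split line by its field count: 3 fields is a training line,
    // 4 fields is a test line, anything else is treated as an end-of-sentence marker.
    def getSentenceType(s: Array[String]) = s.length match {
      case 3 => Left((s(0), s(1), s(2).toInt))
      case 4 => Right((s(0), s(1), s(2), s(3).toInt))
      case _ => Right(("EOS", "EOS", "EOS", -1))
    }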

    val filePath = getClass.getResource(file).getPath

    Manage(Source.fromFile(filePath)) { source =>

      val parsedTuples = source getLines() map (s => s.split("\t"))

      // ..........

      // Go through each token in the file and construct a sentence
      for (s <- parsedTuples) {
        getSentenceType(s) match {
          // When reaching the end of the sentence, save it
          case Right(("EOS", "EOS", "EOS", -1)) =>
            sentences += new LabeledSentence(lex.result(), po.result(), dep.result())
            lex.clear()
            po.clear()
            dep.clear()
          //            if (isTrain) gold.clear()
          case Left(x) =>
            lex += x._1
            po += x._2
            dep += x._3
          case Right(x) =>
            lex += x._1
            po += x._2
            gold += x._3
            dep += x._4
        }
      }
      Some(sentences.result())
    }
  }

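Is there a better, more idiomatic way to simplify this code?

I removed some unimportant code; if you want to see the complete code, check my GitHub page.

Update: following Dima's suggestion, I simplified my code using a Monoid. Here is the result: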

val parsedTuples = source
    .getLines()
    .map(s => s.split("\t"))
    .map {
      case Array(a, b, c, d) => Tokens(a, b, c, d.toInt)
      case Array(a, b, d) => Tokens(a, b, "", d.toInt)
      case _ => Tokens() // Read end of sentence
    }.foldLeft((Tokens(), Vector.empty[LabeledSentence])) {
    // When reading an end of sentence, create a new Labeled sentence with tokens
    case ((z, l), t) if t.isEmpty => (Tokens(), l :+ LabeledSentence(z))
    // Accumulate tokens of the sentence
    case ((z, l), t) => (z append(z, t), l)
  }._2
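
The `Tokens` class itself is not shown in the update. As a rough illustration only, a monoid-like `Tokens` could look something like the sketch below. Everything in it is an assumption: the field names, the `combine` method (standing in for the `append` used above), and the `empty` and `one` helpers (standing in for `Tokens()` and `Tokens(a, b, c, d)`) are hypothetical and may not match the real code.

```scala
// Hypothetical reconstruction: the real Tokens class is not shown in the question.
final case class Tokens(
    lex: Vector[String],
    pos: Vector[String],
    gold: Vector[String],
    dep: Vector[Int]
) {
  // A Tokens value with no entries doubles as the end-of-sentence marker.
  def isEmpty: Boolean = lex.isEmpty

  // Associative combine: concatenates each column in order.
  def combine(other: Tokens): Tokens =
    Tokens(lex ++ other.lex, pos ++ other.pos, gold ++ other.gold, dep ++ other.dep)
}

object Tokens {
  // Identity element: combining with it leaves the other operand unchanged.
  val empty: Tokens = Tokens(Vector.empty, Vector.empty, Vector.empty, Vector.empty)

  // Wrap the fields of a single parsed line.
  def one(lex: String, pos: String, gold: String, dep: Int): Tokens =
    Tokens(Vector(lex), Vector(pos), Vector(gold), Vector(dep))
}
```

With a shape like this, the fold only needs two operations: `combine` to grow the current sentence and `isEmpty` to detect a sentence boundary.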

Answer 1

You don't need Either. Just always use a 4-tuple:

  source
    .getLines()
    .map(_.split("\t"))
    .map {
      case Array(a, b, c, d) => Some((a, b, c, d.toInt))
      case Array(a, b, d) => Some((a, b, "", d.toInt))
      case _ => None // any other shape marks the end of a sentence
    }.foldLeft((List.empty[LabeledSentence], List.empty[String], List.empty[String], List.empty[String], List.empty[Int])) {
      case ((l, lex, po, gold, dep), None) =>
        // The lists were built by prepending, so reverse them before building the sentence
        (new LabeledSentence(lex.reverse, po.reverse, gold.reverse, dep.reverse) :: l, List(), List(), List(), List())
      case ((l, lex, po, gold, dep), Some((a, b, c, d))) =>
        (l, a :: lex, b :: po, c :: gold, d :: dep)
    }._1.reverse

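You could probably make this look a lot cleaner if you reconsider how you handle lex, po, gold, and dep (make them a case class and/or merge them with LabeledSentence, perhaps?).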
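
To illustrate the case-class idea, here is a minimal sketch. The `Token` and `Sentence` types and the `SentenceParser` object are hypothetical names invented for this example; they are not from the asker's project, and `Sentence` only stands in for whatever `LabeledSentence` really is. Unlike the fold above, this version also keeps a trailing sentence that has no end marker and skips repeated blank lines, which is a design choice rather than something the question asks for.

```scala
// Hypothetical types: names and fields are illustrative, not from the original project.
final case class Token(lex: String, pos: String, gold: String, dep: Int)
final case class Sentence(tokens: Vector[Token])

object SentenceParser {
  // Parse one tab-separated line; None marks a sentence boundary.
  // Note: toInt throws NumberFormatException if the last field is not a number.
  def parseLine(line: String): Option[Token] = line.split("\t") match {
    case Array(a, b, c, d) => Some(Token(a, b, c, d.toInt)) // test format
    case Array(a, b, d)    => Some(Token(a, b, "", d.toInt)) // training format, no gold field
    case _                 => None
  }

  // Group tokens into sentences without any mutable builders.
  def parse(lines: Iterator[String]): Vector[Sentence] = {
    val (done, pending) =
      lines.map(parseLine).foldLeft((Vector.empty[Sentence], Vector.empty[Token])) {
        case ((acc, cur), Some(tok))            => (acc, cur :+ tok)
        case ((acc, cur), None) if cur.nonEmpty => (acc :+ Sentence(cur), Vector.empty)
        case ((acc, _), None)                   => (acc, Vector.empty)
      }
    // Keep the last sentence even if the input does not end with a boundary line.
    if (pending.nonEmpty) done :+ Sentence(pending) else done
  }
}
```

It could be called as `SentenceParser.parse(source.getLines())` inside whatever resource-management block already wraps the `Source`.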

Also, you should cut down on the use of mutable containers; they make it much harder to understand what is going on. This isn't Java ...

scala

idioms