Idiomatic way to parse a File in Scala
I am parsing a file in Scala, and I have two kinds of files to read:
A set of training sentences, in this form:
String\tString\tInt
String\tString\tInt
// ...
String\tString\tInt
And a set of test sentences, in this form:
String\tString\tString\tInt
String\tString\tString\tInt
// ...
String\tString\tString\tInt
So far, I have used Either to distinguish between the two formats:
def readDataSet(file: String): Option[Vector[LabeledSentence]] = {
  def getSentenceType(s: Array[String]) = s.length match {
    case 3 => Left((s(0), s(1), s(2).toInt))
    case 4 => Right((s(0), s(1), s(2), s(3).toInt))
    case _ => Right(("EOS", "EOS", "EOS", -1))
  }

  val filePath = getClass.getResource(file).getPath
  Manage(Source.fromFile(filePath)) { source =>
    val parsedTuples = source.getLines().map(s => s.split("\t"))
    // ..........
    // Go through each token in the file and construct a sentence
    for (s <- parsedTuples) {
      getSentenceType(s) match {
        // When reaching the end of the sentence, save it
        case Right(("EOS", "EOS", "EOS", -1)) =>
          sentences += new LabeledSentence(lex.result(), po.result(), dep.result())
          lex.clear()
          po.clear()
          dep.clear()
          // if (isTrain) gold.clear()
        case Left(x) =>
          lex += x._1
          po += x._2
          dep += x._3
        case Right(x) =>
          lex += x._1
          po += x._2
          gold += x._3
          dep += x._4
      }
    }
    Some(sentences.result())
  }
}
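Is there a better, more idiomatic way to simplify this code?
I removed some unimportant pieces of code; if you want to see the full code, check my GitHub page.
Update: following Dima's suggestion, I simplified my code using a Monoid. Here is the result: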
val parsedTuples = source
  .getLines()
  .map(s => s.split("\t"))
  .map {
    case Array(a, b, c, d) => Tokens(a, b, c, d.toInt)
    case Array(a, b, d)    => Tokens(a, b, "", d.toInt)
    case _                 => Tokens() // Read end of sentence
  }
  .foldLeft((Tokens(), Vector.empty[LabeledSentence])) {
    // When reading an end of sentence, create a new labeled sentence from the tokens
    case ((z, l), t) if t.isEmpty => (Tokens(), l :+ LabeledSentence(z))
    // Accumulate the tokens of the sentence
    case ((z, l), t) => (z.append(z, t), l)
  }
  ._2
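The update does not show how Tokens is defined, so here is a minimal sketch of one way the monoid could look, run against in-memory lines instead of a file. Everything in it is an assumption made for illustration: the Monoid trait, the field names, and the helpers Tokens.empty, Tokens.single and MonoidParseDemo.parse are hypothetical and are not the asker's actual code.

```scala
// Minimal monoid type class: an identity element plus an associative combine.
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

// Hypothetical column-wise accumulator for one sentence (field names are assumptions).
final case class Tokens(
    lex: Vector[String],
    pos: Vector[String],
    gold: Vector[String],
    dep: Vector[Int]
) {
  // The empty value doubles as the end-of-sentence marker.
  def isEmpty: Boolean = lex.isEmpty
}

object Tokens {
  val empty: Tokens = Tokens(Vector.empty, Vector.empty, Vector.empty, Vector.empty)

  // Wrap the fields of a single parsed line.
  def single(lex: String, pos: String, gold: String, dep: Int): Tokens =
    Tokens(Vector(lex), Vector(pos), Vector(gold), Vector(dep))

  val tokensMonoid: Monoid[Tokens] = new Monoid[Tokens] {
    def empty: Tokens = Tokens.empty
    // Concatenate column by column; Tokens.empty is the identity.
    def combine(x: Tokens, y: Tokens): Tokens =
      Tokens(x.lex ++ y.lex, x.pos ++ y.pos, x.gold ++ y.gold, x.dep ++ y.dep)
  }
}

object MonoidParseDemo {
  // Returns one Tokens value per sentence. `toInt` throws on a malformed last column.
  def parse(lines: Iterator[String]): Vector[Tokens] = {
    val m = Tokens.tokensMonoid
    val (pending, done) = lines
      .map(_.split("\t"))
      .map {
        case Array(a, b, c, d) => Tokens.single(a, b, c, d.toInt)
        case Array(a, b, d)    => Tokens.single(a, b, "", d.toInt)
        case _                 => Tokens.empty // blank line: end of sentence
      }
      .foldLeft((Tokens.empty, Vector.empty[Tokens])) {
        // End of sentence: flush the accumulator unless it is empty
        case ((acc, out), t) if t.isEmpty =>
          (Tokens.empty, if (acc.isEmpty) out else out :+ acc)
        // Regular token: merge it into the current sentence
        case ((acc, out), t) => (m.combine(acc, t), out)
      }
    // Keep a final sentence that is not followed by a blank line
    if (pending.isEmpty) done else done :+ pending
  }
}
```

Unlike the fold in the update, this sketch skips repeated blank lines and keeps a trailing sentence that has no terminating blank line. Whether that is wanted depends on the actual file format.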
You don't need Either. Just always use a 4-tuple:
source
  .getLines()
  .map(_.split("\t"))
  .map {
    case Array(a, b, c, d) => Some((a, b, c, d.toInt))
    case Array(a, b, d)    => Some((a, b, "", d.toInt))
    case _                 => None // anything else marks the end of a sentence
  }
  .foldLeft((List.empty[LabeledSentence], List.empty[String], List.empty[String], List.empty[String], List.empty[Int])) {
    // End of sentence: emit it (the lists were built in reverse) and reset the accumulators
    case ((l, lex, po, gold, dep), None) =>
      (new LabeledSentence(lex.reverse, po.reverse, gold.reverse, dep.reverse) :: l, Nil, Nil, Nil, Nil)
    // Regular token: prepend each field to its accumulator
    case ((l, lex, po, gold, dep), Some((a, b, c, d))) =>
      (l, a :: lex, b :: po, c :: gold, d :: dep)
  }
  ._1
  .reverse
This could probably be simplified further if you rethink how you handle lex, po, gold, dep (make it a case class and/or combine it with LabeledSentence, maybe?).
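As a rough illustration of the case class idea, the sketch below models one token per line and builds a sentence as a vector of tokens, so the fold carries two accumulators instead of five. Token, Sentence, Token.parse and Sentences.read are hypothetical names, and representing the missing third column as an Option is an assumption; the real LabeledSentence may look different.

```scala
// Hypothetical model: one Token per input line, one Sentence per group of lines.
final case class Token(lex: String, pos: String, gold: Option[String], dep: Int)

final case class Sentence(tokens: Vector[Token]) {
  // Column views can be derived on demand instead of being stored separately.
  def lex: Vector[String] = tokens.map(_.lex)
  def dep: Vector[Int]    = tokens.map(_.dep)
}

object Token {
  // None means "not a token line", which is treated as a sentence boundary.
  // `toInt` throws on a malformed last column.
  def parse(line: String): Option[Token] = line.split("\t") match {
    case Array(a, b, c, d) => Some(Token(a, b, Some(c), d.toInt))
    case Array(a, b, d)    => Some(Token(a, b, None, d.toInt))
    case _                 => None
  }
}

object Sentences {
  def read(lines: Iterator[String]): Vector[Sentence] = {
    val (pending, done) = lines
      .map(Token.parse)
      .foldLeft((Vector.empty[Token], Vector.empty[Sentence])) {
        // Regular token: append it to the sentence being built
        case ((cur, out), Some(tok)) => (cur :+ tok, out)
        // Boundary after at least one token: close the sentence
        case ((cur, out), None) if cur.nonEmpty =>
          (Vector.empty[Token], out :+ Sentence(cur))
        // Boundary with nothing pending: ignore it
        case ((cur, out), None) => (cur, out)
      }
    // Keep a final sentence that is not followed by a boundary line
    if (pending.isEmpty) done else done :+ Sentence(pending)
  }
}
```

With this shape, something like Sentences.read(source.getLines()) would replace the whole loop, and both file formats go through the same code path.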
Also, you should cut down on mutable containers; they make it much harder to understand what is going on. This is not Java ...
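One builder-free way to group lines into sentences is a small tail-recursive helper based on span, sketched here. The name splitOn and the choice of List are assumptions made for the example, not something taken from the code above.

```scala
import scala.annotation.tailrec

object SplitDemo {
  // Split a list into groups separated by elements matching `isSep`.
  // Separators are discarded and empty groups are never produced.
  def splitOn[A](xs: List[A])(isSep: A => Boolean): List[List[A]] = {
    @tailrec
    def loop(rest: List[A], acc: List[List[A]]): List[List[A]] =
      rest.dropWhile(isSep) match {
        case Nil => acc.reverse
        case nonEmpty =>
          // Take everything up to the next separator as one group
          val (group, tail) = nonEmpty.span(x => !isSep(x))
          loop(tail, group :: acc)
      }
    loop(xs, Nil)
  }
}
```

For instance, SplitDemo.splitOn(lines)(_.isEmpty) yields one list of raw lines per sentence, and each group can then be mapped to a sentence value without any shared mutable state.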