Nested Scala case classes to/from CSV
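There are plenty of good libraries for writing/reading Scala case classes to/from CSV files. I'm looking for something that goes beyond that and can handle nested case classes. For example, here a Match has two Players: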
case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)
val matches = List(
  Match("London", Player("Jane", 7), Player("Fred", 23)),
  Match("Rome", Player("Marco", 19), Player("Giulia", 3)),
  Match("Paris", Player("Isabelle", 2), Player("Julien", 5))
)
I'd like to effortlessly (no boilerplate!) write/read matches to/from this CSV:
place,winner.name,winner.ranking,loser.name,loser.ranking
London,Jane,7,Fred,23
Rome,Marco,19,Giulia,3
Paris,Isabelle,2,Julien,5
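Note the automatic header row, which uses the dot "." to form the column names of nested fields, e.g. winner.ranking. I'd be very happy if someone could demonstrate a simple way to do this (say, using reflection or Shapeless).
[Motivation. During data analysis it's convenient to have a flat CSV for sorting, filtering, etc., even when the case classes are nested. And it would be nice to be able to load the nested case classes back from such files.]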
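Since a case class is a Product, getting the values of its fields is relatively easy. Getting the names of the fields/columns does require Java reflection.
The following function takes a list of case-class instances and returns a list of rows, each of which is a list of strings. It uses recursion to get the values and headers of child case-class instances.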
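def toCsv(p: List[Product]): List[List[String]] = {
  // Column names: recurse into fields whose type is a Product, prefixing with "parent."
  def header(c: Class[_], prefix: String = ""): List[String] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      if (classOf[Product].isAssignableFrom(field.getType)) header(field.getType, name + ".")
      else List(name)
    }
  }
  // Values: recurse into nested Products, stringify everything else
  def flatten(p: Product): List[String] =
    p.productIterator.flatMap {
      case p: Product => flatten(p)
      case v: Any => List(v.toString)
    }.toList
  // Take the header from the class of the first element instead of hard-coding Match;
  // an empty input yields no rows at all
  p match {
    case Nil => Nil
    case first :: _ => header(first.getClass) :: p.map(flatten)
  }
}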
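As a side note on the header step: on Scala 2.13 or later the field names are also exposed through Product.productElementNames, so the dotted column names can be derived from an instance without Java reflection. The following is only a sketch under that assumption; headerOf is a hypothetical helper name and is not part of the function above.

```scala
// Sketch only: derives dotted column names from an instance, not from a Class.
// Assumes Scala 2.13+ (Product.productElementNames does not exist before that).
def headerOf(p: Product, prefix: String = ""): List[String] =
  p.productElementNames.zip(p.productIterator).flatMap {
    case (name, child: Product) => headerOf(child, prefix + name + ".") // nested case class
    case (name, _)              => List(prefix + name)                  // leaf value
  }.toList

case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)

val cols = headerOf(Match("London", Player("Jane", 7), Player("Fred", 23)))
// cols == List("place", "winner.name", "winner.ranking", "loser.name", "loser.ranking")
```

Like the flatten step above, this treats every Product as a nested record, so fields of type Option, List or a tuple would need special handling.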
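However, constructing case classes from CSV is considerably more involved: it requires reflection to get the types of the various fields, to create values from the CSV strings, and to construct the case-class instances.
For simplicity (not that the code is simple, just so that it doesn't get even more complicated), I assume that the columns in the CSV are in the same order as if the file had been produced by the toCsv(...) function above.
The following function first creates a list of "instructions for how to process a single CSV row" (these instructions are also used to verify that the column headers in the CSV match the case-class properties). The instructions are then applied recursively to each CSV row, one row at a time, to build the case-class instances.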
import scala.reflect.ClassTag

def fromCsv[T <: Product](csv: List[List[String]])(implicit tag: ClassTag[T]): List[T] = {
  // One step of the per-row recipe; `header` marks the steps that consume a CSV column
  trait Instruction {
    val name: String
    val header = true
  }
  case class BeginCaseClassField(name: String, clazz: Class[_]) extends Instruction {
    override val header = false
  }
  case class EndCaseClassField(name: String) extends Instruction {
    override val header = false
  }
  case class IntField(name: String) extends Instruction
  case class StringField(name: String) extends Instruction
  case class DoubleField(name: String) extends Instruction

  // Turn a class into instructions. Relies on getDeclaredFields returning the fields
  // in constructor order, which the Java API does not formally guarantee.
  def scan(c: Class[_], prefix: String = ""): List[Instruction] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      val fType = field.getType
      if (fType == classOf[Int]) List(IntField(name))
      else if (fType == classOf[Double]) List(DoubleField(name))
      else if (fType == classOf[String]) List(StringField(name))
      else if (classOf[Product].isAssignableFrom(fType)) BeginCaseClassField(name, fType) :: scan(fType, name + ".")
      else throw new IllegalArgumentException(s"Unsupported field type: $fType")
    } :+ EndCaseClassField(prefix)
  }

  // Consume instructions and row values together, collecting constructor arguments;
  // returns the remaining instructions, the remaining row values and the arguments so far
  def produce(instructions: List[Instruction], row: List[String], argAccumulator: List[Any]): (List[Instruction], List[String], List[Any]) = instructions match {
    case IntField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toInt)
    case StringField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head)
    case DoubleField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toDouble)
    case BeginCaseClassField(_, clazz) :: tail =>
      // Collect the nested class's arguments up to its EndCaseClassField, then instantiate it
      val (instructionRemaining, rowRemaining, constructorArgs) = produce(tail, row, List.empty)
      val newCaseClass = clazz.getConstructors.head.newInstance(constructorArgs.map(_.asInstanceOf[AnyRef]): _*)
      produce(instructionRemaining, rowRemaining, argAccumulator :+ newCaseClass)
    case EndCaseClassField(_) :: tail => (tail, row, argAccumulator)
    case Nil if row.isEmpty => (Nil, Nil, argAccumulator)
    case Nil => throw new IllegalArgumentException("Not all values from CSV row were used")
  }

  val instructions = BeginCaseClassField(".", tag.runtimeClass) :: scan(tag.runtimeClass)
  require(csv.head == instructions.filter(_.header).map(_.name), "CSV header doesn't match target case-class fields")
  csv.drop(1).map(row => produce(instructions, row, List.empty)._3.head.asInstanceOf[T])
}
I've tested this using the following:
case class Player(name: String, ranking: Int, price: Double)
case class Match(place: String, winner: Player, loser: Player)
val matches = List(
  Match("London", Player("Jane", 7, 12.5), Player("Fred", 23, 11.1)),
  Match("Rome", Player("Marco", 19, 13.54), Player("Giulia", 3, 41.8)),
  Match("Paris", Player("Isabelle", 2, 31.7), Player("Julien", 5, 16.8))
)
val csv = toCsv(matches)
val matchesFromCsv = fromCsv[Match](csv)
assert(matches == matchesFromCsv)
Obviously, this should be optimized and hardened if you ever want to use it in production...
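Both functions above work on rows of strings (List[List[String]]) rather than on CSV text. To get to and from the text shown in the question, something like the following could be placed in between. This is a deliberately naive sketch: render and parse are hypothetical helper names, and it assumes that no value contains a comma, a double quote or a line break, since no quoting or escaping is done.

```scala
// Sketch only: naive conversion between rows of strings and CSV text.
// Assumes no field contains a comma, a double quote or a line break.
def render(rows: List[List[String]]): String =
  rows.map(_.mkString(",")).mkString("\n")

def parse(text: String): List[List[String]] =
  text.split("\n").toList
    .filter(_.nonEmpty)            // skip blank lines
    .map(_.split(",", -1).toList)  // limit -1 keeps trailing empty fields

val rows = List(
  List("place", "winner.name", "winner.ranking", "loser.name", "loser.ranking"),
  List("London", "Jane", "7", "Fred", "23")
)
val text = render(rows)
val roundTripped = parse(text) // equal to rows under the assumptions above
```

For real data, a proper CSV library that handles quoting is the safer choice for this step.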