如何在 scala-spark 中减少多个案例

Question

新手问题，你怎么optimize/reduce这样的表达：

when(x1._1,x1._2).when(x2._1,x2._2).when(x3._1,x3._2).when(x4._1,x4._2).when(x5._1,x5._2)....
.when(xX._1,xX._2).otherwise(z)

x1、x2、x3、xX 是映射，其中 x1._1 是条件，x._2 是“然后”。

我试图将地图保存在列表中，然后使用 map-reduce 但它生成了：

when(x1._1,x1._2).otherwise(z) && when(x2._1,x2._2).otherwise(z)...

这是错误的。我有 10 行 pure when case 并且想减少它以便我的代码更清晰。

Answer 1

您可以在地图列表中使用foldLeft：

val maplist = List(x1, x2)  // add more x if needed

val new_col = maplist.tail.foldLeft(when(maplist.head._1, maplist.head._2))((x,y) => x.when(y._1, y._2)).otherwise(z)

另一种方法是使用 coalesce。如果不满足条件，则when语句返回null，并会计算下一个when语句，直到得到一个非空的结果。

val new_col = coalesce((maplist.map(x => when(x._1, x._2)) :+ z):_*)

Answer 2

另一种方法是将 otherwise 作为初始值传递给 foldLeft:

val maplist = Seq(Map(col("c1") -> "value1"), Map(col("c2") -> "value2"))

val newCol = maplist.flatMap(_.toSeq).foldLeft(lit("z")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}

// gives:
// newCol: org.apache.spark.sql.Column = CASE WHEN c2 THEN value2 ELSE CASE WHEN c1 THEN value1 ELSE z END END

Answer 3

您可以创建一个简单的递归方法来 assemble 嵌套-when/otherwise 条件：

import org.apache.spark.sql.Column

def nestedCond(cols: Array[String], default: String): Column = {
  def loop(ls: List[String]): Column = ls match {
    case Nil => col(default)
    case c :: tail => when(col(s"$c._1"), col(s"$c._2")).otherwise(loop(tail))
  }
  loop(cols.toList).as("nested-cond")
}

测试方法：

val df = Seq(
  ((false, 1), (false, 2), (true, 3), 88),
  ((false, 4), (true, 5), (true, 6), 99)
).toDF("x1", "x2", "x3", "z")

val cols = df.columns.filter(_.startsWith("x"))
// cols: Array[String] = Array(x1, x2, x3)

df.select(nestedCond(cols, "z")).show
// +-----------+
// |nested-cond|
// +-----------+
// |          3|
// |          5|
// +-----------+

或者，使用 foldRight 到 assemble 嵌套条件：

def nestedCond(cols: Array[String], default: String): Column =
  cols.foldRight(col(default)){ (c, acc) =>
      when(col(s"$c._1"), col(s"$c._2")).otherwise(acc)
    }.as("nested-cond")

如何在 scala-spark 中减少多个案例

How to reduce multiple case when in scala-spark

scala

case-when

apache-spark