spark - stack multiple when conditions from an Array of column expressions
I have the following Spark dataframe:
val df = Seq(("US",10),("IND",20),("NZ",30),("CAN",40)).toDF("a","b")
df.show(false)
+---+---+
|a |b |
+---+---+
|US |10 |
|IND|20 |
|NZ |30 |
|CAN|40 |
+---+---+
I am applying when() conditions as follows:
df.withColumn("x", when(col("a").isin(us_list:_*),"u").when(col("a").isin(i_list:_*),"i").when(col("a").isin(n_list:_*),"n").otherwise("-")).show(false)
+---+---+---+
|a |b |x |
+---+---+---+
|US |10 |u |
|IND|20 |i |
|NZ |30 |n |
|CAN|40 |- |
+---+---+---+
Now, to minimize the code, I am trying the following:
val us_list = Array("U","US")
val i_list = Array("I","IND")
val n_list = Array("N","NZ")
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
This results in:
ap: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (a IN (U, US)) THEN u END, CASE WHEN (a IN (I, IND)) THEN i END, CASE WHEN (a IN (N, NZ)) THEN n END)
But when I try
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) ).reduce( (x,y) => x.y )
I get an error. How can I fix this?
You can use foldLeft on the ar1 list:
val x = ar1.foldLeft(lit("-")) { case (acc, (list, value)) =>
when(col("a").isin(list: _*), value).otherwise(acc)
}
// x: org.apache.spark.sql.Column = CASE WHEN (a IN (N, NZ)) THEN n ELSE CASE WHEN (a IN (I, IND)) THEN i ELSE CASE WHEN (a IN (U, US)) THEN u ELSE - END END END
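To use the folded column, pass it to withColumn as in the question (a sketch, assuming the df and ar1 definitions from the question are in scope):

```scala
// Apply the folded CASE expression to the question's DataFrame.
// This reproduces the same output table as the chained when() version.
df.withColumn("x", x).show(false)
```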
You usually don't need to combine when statements with reduce/fold at all. coalesce is sufficient, because when statements are evaluated in sequence and yield null when the condition is false. It also saves you from specifying otherwise, since you can simply append one more column to coalesce's argument list:
val ar1 = Array((us_list,"u"),(i_list,"i"),(n_list,"n"))
val ap = ar1.map( x => when(col("a").isInCollection(x._1),x._2) )
val combined = coalesce(ap :+ lit("-"): _*)
df.withColumn("x", combined).show
+---+---+---+
| a| b| x|
+---+---+---+
| US| 10| u|
|IND| 20| i|
| NZ| 30| n|
|CAN| 40| -|
+---+---+---+
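As for the error in the question: `reduce( (x,y) => x.y )` is not valid Scala, since Column has no member named y. If you do want to combine the mapped when columns pairwise, one option (a sketch, untested here) is to chain them with otherwise, reducing from the right so that each incomplete when chain receives exactly one else branch:

```scala
// Sketch: combine the mapped when() columns via otherwise, right to left.
// Column.otherwise throws if applied twice to the same when chain, so a
// left fold of x.otherwise(y) would fail; reduceRight avoids that.
val combinedViaReduce = (ap :+ lit("-")).reduceRight((x, y) => x.otherwise(y))
df.withColumn("x", combinedViaReduce).show(false)
```

This yields the same nested CASE WHEN ... ELSE ... structure as the foldLeft answer above.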