How to count a field with a condition in Spark

Question

I have a DataFrame with an enum-like field named A (its value is 0 or 1) and another field B. I want to implement the following scenario:

if `B` is null:
   count(when `A` is 0) and set a column name `xx` 
   count(when `A` is 1) and set a column name `yy` 
if `B` is not null:
   count(when `A` is 0) and set a column name `zz` 
   count(when `A` is 1) and set a column name `mm`

How can I implement this in Spark with Scala?

Answer 1

Columns can be populated conditionally in this way, but the final output DataFrame needs to have a schema that is known in advance.

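Assuming all of the scenarios you describe can live in a single DataFrame, I would suggest creating each of the four columns ("xx", "yy", "zz" and "mm") and populating them conditionally.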

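In the example below I have filled the values with "found" or "", mainly to make it easy to see where the values are populated. Using true and false, or some other enumeration, would probably make more sense in the real world.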

Starting with a DataFrame (since you did not specify the type of "B", I have chosen Option[String], which is nullable, for this example):

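// toDF needs the session implicits; spark-shell imports them automatically
import spark.implicits._

val df = List(
    (0, None),
    (1, None),
    (0, Some("hello")),
    (1, Some("world"))
).toDF("A", "B")

df.show(false)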

This gives:

+---+-----+
|A  |B    |
+---+-----+
|0  |null |
|1  |null |
|0  |hello|
|1  |world|
+---+-----+

Then create the columns:

// col and when come from functions; spark-shell imports them automatically
import org.apache.spark.sql.functions.{col, when}

df
    .withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
    .withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
    .withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
    .withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
    .show(false)

This gives:

+---+-----+-----+-----+-----+-----+
|A  |B    |xx   |yy   |zz   |mm   |
+---+-----+-----+-----+-----+-----+
|0  |null |found|     |     |     |
|1  |null |     |found|     |     |
|0  |hello|     |     |found|     |
|1  |world|     |     |     |found|
+---+-----+-----+-----+-----+-----+
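
The columns above flag matching rows rather than count them. If actual counts are what you need, one possible approach is conditional aggregation with count and when. The following is only a sketch under assumptions: it builds its own local SparkSession (the master and appName values are placeholders of mine, and in spark-shell the existing spark session can be used instead), it reuses the same sample data, and the name counts is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

// Assumption: a local session, created only so the sketch is self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditional-count")
  .getOrCreate()
import spark.implicits._

// Same sample data as above: B is a nullable string.
val df = List(
  (0, None),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")

// when(...) without otherwise(...) yields null for non-matching rows,
// and count(...) skips nulls, so each expression counts only matching rows.
val counts = df.agg(
  count(when(col("B").isNull && col("A") === 0, 1)).as("xx"),
  count(when(col("B").isNull && col("A") === 1, 1)).as("yy"),
  count(when(col("B").isNotNull && col("A") === 0, 1)).as("zz"),
  count(when(col("B").isNotNull && col("A") === 1, 1)).as("mm")
)

counts.show(false)
```

The result is a single row with one count per column. With the four sample rows, each of xx, yy, zz and mm is 1, because every combination of A and B nullness occurs exactly once.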

scala

apache-spark