How to return multiple values for multiple columns in a single conditional command - Pyspark or HQL

I currently have a lot of checks like this in my process and I want to reduce this:

CASE WHEN {A > B} THEN 1 ELSE 0 END AS COL1
CASE WHEN {A = B} THEN 1 ELSE 0 END AS COL2
CASE WHEN {A < B} THEN 1 ELSE 0 END AS COL3

to something like this:

CASE WHEN {A > B} THEN 1 AS COL1, 0 AS COL2, 0 AS COL3 
ELSE CASE WHEN {A = B} THEN 0 AS COL1, 1 AS COL2, 0 AS COL3 
ELSE 0 AS COL1, 0 AS COL2, 1 AS COL3

In my case it is necessary to do it this way, because the three columns already exist and I need to reduce the processing.
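
For concreteness, a runnable sketch of the current setup (illustrative data; assuming Spark SQL, so the same CASE expressions also apply on the HQL side):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; in the real job the columns A and B already exist
spark.createDataFrame([(-1, 0), (0, 0), (1, 0)], ["A", "B"]).createOrReplaceTempView("t")

# Each row is compared against B three times, once per CASE expression
spark.sql("""
    SELECT A, B,
           CASE WHEN A > B THEN 1 ELSE 0 END AS COL1,
           CASE WHEN A = B THEN 1 ELSE 0 END AS COL2,
           CASE WHEN A < B THEN 1 ELSE 0 END AS COL3
    FROM t
""").show()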

In pyspark, you can do this with several complex types.

Here is an example using an array:

from pyspark.sql import functions as F

# Assuming your dataframe is called df
df.show()
+---+---+
|  A|  B|
+---+---+
| -1|  0|
|  0|  0|
|  1|  0|
+---+---+

# Evaluate the A/B comparison once per row and store [COL1, COL2, COL3] as an array
df = df.withColumn(
    "test",
    F.when(F.col("A") == F.col("B"), F.array(*map(F.lit, [0, 1, 0])))
    .when(F.col("A") < F.col("B"), F.array(*map(F.lit, [0, 0, 1])))
    .when(F.col("A") > F.col("B"), F.array(*map(F.lit, [1, 0, 0]))),
)

df.show()
+---+---+---------+
|  A|  B|     test|
+---+---+---------+
| -1|  0|[0, 0, 1]|
|  0|  0|[0, 1, 0]|
|  1|  0|[1, 0, 0]|
+---+---+---------+

Once the test column has been created, you can use getItem to assign its values to the other columns:

# getItem(i) pulls element i out of the array; alias it as col1..col3
df = df.select(
    "A", "B", *(F.col("test").getItem(i).alias(f"col{i+1}") for i in range(3))
)

df.show()
+---+---+----+----+----+
|  A|  B|col1|col2|col3|
+---+---+----+----+----+
| -1|  0|   0|   0|   1|
|  0|  0|   0|   1|   0|
|  1|  0|   1|   0|   0|
+---+---+----+----+----+
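
A struct works as well and keeps named fields instead of array positions. A minimal sketch of that variant, starting again from the original dataframe with only A and B (the field names col1/col2/col3 are just the target column names):

from pyspark.sql import functions as F

# Same single evaluation per row, but the three flags are stored as named struct fields
df = df.withColumn(
    "test",
    F.when(
        F.col("A") == F.col("B"),
        F.struct(F.lit(0).alias("col1"), F.lit(1).alias("col2"), F.lit(0).alias("col3")),
    )
    .when(
        F.col("A") < F.col("B"),
        F.struct(F.lit(0).alias("col1"), F.lit(0).alias("col2"), F.lit(1).alias("col3")),
    )
    .when(
        F.col("A") > F.col("B"),
        F.struct(F.lit(1).alias("col1"), F.lit(0).alias("col2"), F.lit(0).alias("col3")),
    ),
)

# "test.*" expands the struct into top-level col1, col2, col3 columns
df = df.select("A", "B", "test.*")

The same pattern should carry over to Spark SQL / Hive: a single CASE that returns named_struct('col1', ..., 'col2', ..., 'col3', ...), wrapped in a subquery so the fields can then be selected as test.col1, test.col2, test.col3.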