How to return multiple values for multiple columns in a single conditional command - Pyspark or HQL
I currently have a lot of checks in my process, and I would like to reduce this:
CASE WHEN {A > B} THEN 1 ELSE 0 END AS COL1
CASE WHEN {A = B} THEN 1 ELSE 0 END AS COL2
CASE WHEN {A < B} THEN 1 ELSE 0 END AS COL3
to something like this:
CASE WHEN {A > B} THEN 1 AS COL1, 0 AS COL2, 0 AS COL3
ELSE CASE WHEN {A = B} THEN 0 AS COL1, 1 AS COL2, 0 AS COL3
ELSE 0 AS COL1, 0 AS COL2, 1 AS COL3
In my case this is necessary because these three columns already exist and I need to reduce the amount of processing.
In PySpark, you can do this with one of several complex types. Here is an example using an array:
from pyspark.sql import functions as F
# Assuming your dataframe is called df
df.show()
+---+---+
| A| B|
+---+---+
| -1| 0|
| 0| 0|
| 1| 0|
+---+---+
df = df.withColumn(
"test",
F.when(F.col("A") == F.col("B"), F.array(*map(F.lit, [0, 1, 0])))
.when(F.col("A") < F.col("B"), F.array(*map(F.lit, [0, 0, 1])))
.when(F.col("A") > F.col("B"), F.array(*map(F.lit, [1, 0, 0]))),
)
df.show()
+---+---+---------+
| A| B| test|
+---+---+---------+
| -1| 0|[0, 0, 1]|
| 0| 0|[0, 1, 0]|
| 1| 0|[1, 0, 0]|
+---+---+---------+
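To make the row-level behavior of the when chain explicit, here is a plain-Python model of it (not Spark code; the function name row_flags is made up for illustration). It assumes the usual Spark semantics: branches are tried in order, a comparison involving NULL never matches, and without an otherwise clause an unmatched row gets null.

```python
from typing import List, Optional


def row_flags(a: Optional[int], b: Optional[int]) -> Optional[List[int]]:
    """Plain-Python model of the when chain above, for a single row."""
    # A comparison with NULL is never true in Spark, so no branch matches.
    if a is None or b is None:
        return None
    if a == b:
        return [0, 1, 0]
    if a < b:
        return [0, 0, 1]
    if a > b:
        return [1, 0, 0]
    return None  # same as the missing otherwise(): unmatched rows give null


# Same three rows as in the example dataframe
for a, b in [(-1, 0), (0, 0), (1, 0)]:
    print(a, b, row_flags(a, b))
```

If rows with a null in A or B should get a default instead, the chain in the real code would need an otherwise clause with its own array.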
Once the test column has been created, you can use getItem to assign its elements to the other columns:
df = df.select(
"A", "B", *(F.col("test").getItem(i).alias(f"col{i+1}") for i in range(3))
)
df.show()
+---+---+----+----+----+
| A| B|col1|col2|col3|
+---+---+----+----+----+
| -1| 0| 0| 0| 1|
| 0| 0| 0| 1| 0|
| 1| 0| 1| 0| 0|
+---+---+----+----+----+
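Since the question also mentions HQL, the same idea can probably be written in SQL: evaluate a single CASE that returns an array in a subquery, then index into it. The sketch below only builds the query text in Python. The helper build_flag_query, the table name my_table and the alias flags are all hypothetical, and I have not run the query against a particular Hive or Spark version.

```python
def build_flag_query(table: str, left: str = "A", right: str = "B") -> str:
    """Build a query that evaluates one CASE per row and unpacks it into COL1..COL3."""
    inner = (
        f"SELECT {left}, {right},\n"
        f"       CASE WHEN {left} > {right} THEN array(1, 0, 0)\n"
        f"            WHEN {left} = {right} THEN array(0, 1, 0)\n"
        f"            WHEN {left} < {right} THEN array(0, 0, 1)\n"
        f"       END AS flags\n"
        f"FROM {table}"
    )
    # Array indexes are 0-based, so flags[0] is the "greater than" flag (COL1).
    outer_cols = ", ".join(f"flags[{i}] AS COL{i + 1}" for i in range(3))
    return f"SELECT {left}, {right}, {outer_cols}\nFROM (\n{inner}\n) t"


query = build_flag_query("my_table")  # "my_table" is a hypothetical table name
print(query)
```

Assuming a SparkSession named spark and a registered table, the resulting string could be passed to spark.sql(query). Whether this actually reduces processing compared with three separate CASE expressions depends on the engine's optimizer, so it is worth checking the query plan.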