Using the OR operator for each element of an array in a single "when" function of a PySpark DataFrame

Question

I have an array of column names:

DiversityTypes = ["ABC","EFG","LMN","XYZ"]

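I am working with a PySpark DataFrame and want to create a new column named "Is_Diversified" whose value is Yes or No, computed by applying the OR operator across the values of every column listed in DiversityTypes, all inside a single when function, like this: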

    p_df = p_df.withColumn('Is_Diversified', f.when((f.col("ABC") == 'Y') |
                                                    (f.col("EFG") == 'Y') |
                                                    (f.col("LMN") == 'Y') |
                                                    (f.col("XYZ") == 'Y'), f.lit("Yes")).otherwise(f.lit("No")))

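I want to turn it into something like this, where we iterate over each element of the array and apply the OR operator across all of them at once: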

for diversity in DiversityTypes:
    p_df = p_df.withColumn('Is_Diversified', f.when(f.col(diversity) == 'Y', f.lit("Yes")).otherwise(f.lit("No")))

I can't get the logic to work here. Please help, thanks :)

Answer 1

How about this? Treat the column values as a list and check whether any of them is Y.

from pyspark.sql.functions import col, lit, when

DiversityTypes = ["ABC","EFG","LMN","XYZ"]

df.withColumn('Is_Diversified', when(lit('Y').isin(*map(col, DiversityTypes)), "Yes").otherwise("No")).show()

+---+---+---+---+--------------+
|ABC|EFG|LMN|XYZ|Is_Diversified|
+---+---+---+---+--------------+
|  Y|  N|  N|  N|           Yes|
|  N|  N|  N|  N|            No|
|  Y|  Y|  Y|  Y|           Yes|
+---+---+---+---+--------------+
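
To make the row-level logic of the isin check concrete, here is a rough plain-Python model of it. This is not Spark code: the rows list and the is_diversified helper are hypothetical names used only for illustration, and the rows are the three sample rows shown in the table above.

```python
DiversityTypes = ["ABC", "EFG", "LMN", "XYZ"]

# The three sample rows from the table, written as plain dicts (illustrative only).
rows = [
    {"ABC": "Y", "EFG": "N", "LMN": "N", "XYZ": "N"},
    {"ABC": "N", "EFG": "N", "LMN": "N", "XYZ": "N"},
    {"ABC": "Y", "EFG": "Y", "LMN": "Y", "XYZ": "Y"},
]


def is_diversified(row):
    """Hypothetical helper: return "Yes" if any listed column of the row holds "Y"."""
    values = [row[c] for c in DiversityTypes]
    return "Yes" if "Y" in values else "No"


labels = [is_diversified(row) for row in rows]
print(labels)  # ['Yes', 'No', 'Yes']
```

The Spark expression does the same membership test per row, but it is evaluated by Spark on the columns rather than in a Python loop.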

Answer 2

I would use functools.reduce with the bitwise OR operator:

import pyspark.sql.functions as f
from functools import reduce
from operator import or_

p_df = p_df.withColumn(
    'Is_Diversified', 
    f.when(
        reduce(
            or_, 
            [f.col(c)=="Y" for c in DiversityTypes]
        ), 
        f.lit("Yes")
    ).otherwise(f.lit("No"))
)
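
To see what the reduce call builds, here is a small sketch that runs without Spark. The Expr class is a hypothetical stand-in for a Spark Column, written only for this illustration: it records the expression as text instead of evaluating anything.

```python
from functools import reduce
from operator import or_


class Expr:
    """Hypothetical stand-in for a Spark Column; it only records expression text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Like Column == literal, this returns a new expression, not a bool.
        return Expr(f"({self.text} = {other!r})")

    def __or__(self, other):
        # Like Column | Column, this combines two expressions with OR.
        return Expr(f"({self.text} OR {other.text})")


DiversityTypes = ["ABC", "EFG", "LMN", "XYZ"]

# One comparison per column name, then fold them together left to right with |.
conditions = [Expr(c) == "Y" for c in DiversityTypes]
combined = reduce(or_, conditions)
print(combined.text)
# ((((ABC = 'Y') OR (EFG = 'Y')) OR (LMN = 'Y')) OR (XYZ = 'Y'))
```

Because each comparison is built as its own list element before being combined, there is no precedence trap from `|` binding more tightly than `==`, which is what makes a hand-written chain need parentheses around every comparison.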
