Replace elements in an array with its corresponding number in pyspark

I have a dataframe as shown below:

   +----------+--------------------------------+
   | Index    |           flagArray            |
   +----------+--------------------------------+
   |    1     | ['A','S','A','E','Z','S','S']  | 
   +----------+--------------------------------+
   |    2     | ['A','Z','Z','E','Z','S','S']  |
   +----------+--------------------------------+

I want to replace each array element with its corresponding numeric value:

     A - 0
     F - 1
     S - 2
     E - 3
     Z - 4

So my output dataframe should look like:

   +----------+--------------------------------+--------------------------------+
   | Index    |           flagArray            |           finalArray           |
   +----------+--------------------------------+--------------------------------+
   |    1     | ['A','S','A','E','Z','S','S']  | [0, 2, 0, 3, 4, 2, 2]          | 
   +----------+--------------------------------+--------------------------------+
   |    2     | ['A','Z','Z','E','Z','S','S']  | [0, 4, 4, 3, 4, 2, 2]          |
   +----------+--------------------------------+--------------------------------+

I have written a udf in pyspark that does this with a series of if/else statements. Is there a better way to handle this?

There doesn't seem to be a built-in function for mapping array elements, so here is an alternative udf; it differs from yours in that it uses a list comprehension:

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType

dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}
# Declaring the return type makes finalArray a real array<int> column
# rather than the default string representation
map_array = f.udf(lambda a: [dic[k] for k in a], ArrayType(IntegerType()))
df.withColumn('finalArray', map_array(df['flagArray'])).show(truncate=False)

Output:

+------+---------------------+---------------------+
|Index |flagArray            |finalArray           |
+------+---------------------+---------------------+
|1     |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2     |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+------+---------------------+---------------------+
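The udf's core logic can be checked without a Spark session; the plain-Python snippet below (the `rows` data is just the example from the question) mirrors what the udf applies per row:

```python
dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}

# Each inner list stands in for one flagArray value
rows = [['A', 'S', 'A', 'E', 'Z', 'S', 'S'],
        ['A', 'Z', 'Z', 'E', 'Z', 'S', 'S']]

# The same comprehension the udf runs on every row
result = [[dic[k] for k in row] for row in rows]
print(result)  # [[0, 2, 0, 3, 4, 2, 2], [0, 4, 4, 3, 4, 2, 2]]
```

Note that `dic[k]` raises a `KeyError` for any flag not in the dictionary; use `dic.get(k)` if unmapped flags should become null instead.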

For Spark 2.4+, you can simply use the transform function to loop through each element of the flagArray array and look up its mapped value with element_at, from a map column built from that mapping:

from pyspark.sql.functions import array, expr, lit, map_from_entries, struct

mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = map_from_entries(array(*[struct(lit(k), lit(v)) for k, v in mappings.items()]))

df = df.withColumn("mappings", mapping_col) \
       .withColumn("finalArray", expr(""" transform(flagArray, x -> element_at(mappings, x))""")) \
       .drop("mappings")

df.show(truncate=False)
#+-----+---------------------+---------------------+
#|Index|flagArray            |finalArray           |
#+-----+---------------------+---------------------+
#|1    |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
#|2    |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
#+-----+---------------------+---------------------+
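To see what the `transform`/`element_at` combination computes, here is a plain-Python model of the same lookup (no Spark required; the function names only mirror the Spark SQL semantics for illustration):

```python
mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}

def element_at(mapping, key):
    # Spark's element_at on a map returns NULL (None here) for a missing key
    return mapping.get(key)

def transform(arr, fn):
    # Spark's transform applies fn to every element of the array column
    return [fn(x) for x in arr]

flag_array = ['A', 'S', 'A', 'E', 'Z', 'S', 'S']
final_array = transform(flag_array, lambda x: element_at(mappings, x))
print(final_array)  # [0, 2, 0, 3, 4, 2, 2]
```

This null-for-missing-key behavior is also why the map-based approach is more forgiving than the udf above, which raises on unmapped flags.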

For Spark 3.1+, you can call pyspark.sql.functions.transform and pyspark.sql.functions.element_at directly to do the same job:

import pyspark.sql.functions as F

mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = F.map_from_entries(F.array(*[F.struct(F.lit(k), F.lit(v)) for k, v in mappings.items()]))

df = df.withColumn("mappings", mapping_col) \
       .withColumn("finalArray", F.transform("flagArray", lambda x: F.element_at("mappings", x))) \
       .drop("mappings")