Replace elements in an array with its corresponding number in pyspark
I have a dataframe as shown below:
+----------+--------------------------------+
| Index | flagArray |
+----------+--------------------------------+
| 1 | ['A','S','A','E','Z','S','S'] |
+----------+--------------------------------+
| 2 | ['A','Z','Z','E','Z','S','S'] |
+----------+--------------------------------+
I want to replace each array element with its corresponding numeric value:
A - 0
F - 1
S - 2
E - 3
Z - 4
So my output dataframe should look like:
+----------+--------------------------------+--------------------------------+
| Index | flagArray | finalArray |
+----------+--------------------------------+--------------------------------+
| 1 | ['A','S','A','E','Z','S','S'] | [0, 2, 0, 3, 4, 2, 2] |
+----------+--------------------------------+--------------------------------+
| 2 | ['A','Z','Z','E','Z','S','S'] | [0, 4, 4, 3, 4, 2, 2] |
+----------+--------------------------------+--------------------------------+
I have written a udf in pyspark that does this with a series of if-else statements. Is there a better way to handle it?
There doesn't seem to be a built-in function for mapping array elements, so here is perhaps an alternative udf, differing from yours in that it uses a list comprehension:
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType
dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}
# Declare the return type so finalArray is a real array<int>, not its string form
map_array = f.udf(lambda a: [dic[k] for k in a], ArrayType(IntegerType()))
df.withColumn('finalArray', map_array(df['flagArray'])).show(truncate=False)
Output:
+------+---------------------+---------------------+
|Index |flagArray |finalArray |
+------+---------------------+---------------------+
|1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+------+---------------------+---------------------+
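For reference, the per-row logic of that udf is just a per-element dictionary lookup, which you can sanity-check locally without a Spark session (sample rows copied from the question; `map_flags` is a hypothetical helper name, not part of the original answer):

```python
# Same lookup the udf's lambda performs on each row
dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}

def map_flags(flags):
    """Replace each flag letter with its numeric code."""
    return [dic[k] for k in flags]

print(map_flags(['A', 'S', 'A', 'E', 'Z', 'S', 'S']))  # [0, 2, 0, 3, 4, 2, 2]
print(map_flags(['A', 'Z', 'Z', 'E', 'Z', 'S', 'S']))  # [0, 4, 4, 3, 4, 2, 2]
```

Note that an unknown flag letter would raise a `KeyError` here, exactly as it would inside the udf.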
For Spark 2.4+, you can simply use the transform function to loop through each element of the flagArray array and, via element_at, look up its mapped value in a map column that you build from that mapping:
from pyspark.sql.functions import array, expr, lit, map_from_entries, struct

mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = map_from_entries(array(*[struct(lit(k), lit(v)) for k, v in mappings.items()]))
df = df.withColumn("mappings", mapping_col) \
       .withColumn("finalArray", expr("transform(flagArray, x -> element_at(mappings, x))")) \
       .drop("mappings")
df.show(truncate=False)
#+-----+---------------------+---------------------+
#|Index|flagArray |finalArray |
#+-----+---------------------+---------------------+
#|1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
#|2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
#+-----+---------------------+---------------------+
For Spark 3.1+, you can call pyspark.sql.functions.transform and pyspark.sql.functions.element_at to do the job:
import pyspark.sql.functions as F
mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = F.map_from_entries(F.array(*[F.struct(F.lit(k), F.lit(v)) for k, v in mappings.items()]))
df = df.withColumn("mappings", mapping_col) \
       .withColumn("finalArray", F.transform("flagArray", lambda x: F.element_at("mappings", x))) \
.drop("mappings")