N conditions in a dynamic 'when' clause
I have the following code:
from pyspark.sql.functions import col, count, when
from functools import reduce
df = spark.createDataFrame([ (1,""), (2,None),(3,"c"),(4,"d") ], ['id','name'])
filter1 = col("name").isNull()
filter2 = col("name") == ""
dfresult = df.filter(filter1 | filter2).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
dfresult.show()
+---+-------------+
| id|      new_col|
+---+-------------+
|  1|name is empty|
|  2| name is null|
+---+-------------+
For a scenario with N filters, I am considering:
filters = []
filters.append({ "item": filter1, "msg":"name is null"})
filters.append({ "item": filter2, "msg":"name is empty"})
dynamic_filter = reduce(
    lambda x, y: x | y,
    [s['item'] for s in filters]
)
df2 = df.filter(dynamic_filter).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
df2.show()
How can I do something better for the new_col column, using a dynamic when?
Just use functools.reduce, as you did for the filter expression:
from functools import reduce
from pyspark.sql import functions as F
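new_col = reduce(
    # Append one WHEN branch per filter, in list order.
    lambda acc, x: acc.when(x["item"], F.lit(x["msg"])),
    filters,
    F  # seed: the first step calls F.when, later steps call Column.when
)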
df2 = df.filter(dynamic_filter).select(col("id"), new_col.alias("new_col"))
df2.show()
#+---+-------------+
#| id|      new_col|
#+---+-------------+
#|  1|name is empty|
#|  2| name is null|
#+---+-------------+
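If it is not obvious why seeding reduce with the module F works, here is a minimal Spark-free sketch of the same fold. FakeColumn and FakeFunctions are hypothetical stand-ins I made up for illustration (they are not part of PySpark), and the filter items are plain strings rather than real Column expressions. It only shows the order in which the when calls are chained.

```python
from functools import reduce


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column; records WHEN branches in order."""

    def __init__(self, branches):
        self.branches = list(branches)

    def when(self, condition, value):
        # Like Column.when: return a new object with one more branch appended.
        return FakeColumn(self.branches + [(condition, value)])


class FakeFunctions:
    """Hypothetical stand-in for the pyspark.sql.functions module (F)."""

    @staticmethod
    def when(condition, value):
        # Like F.when: start a new chain with its first branch.
        return FakeColumn([(condition, value)])


# Plain strings stand in for the Column conditions used in the question.
filters = [
    {"item": "name IS NULL", "msg": "name is null"},
    {"item": "name = ''", "msg": "name is empty"},
]


def step(acc, x):
    # Same shape as the lambda in the answer, minus F.lit.
    return acc.when(x["item"], x["msg"])


# The seed is the "module", so the first step calls FakeFunctions.when
# and every later step calls FakeColumn.when on the growing chain.
chain = reduce(step, filters, FakeFunctions)
print(chain.branches)

# With an empty list, reduce returns the seed unchanged, not a column.
empty = reduce(step, [], FakeFunctions)
print(empty is FakeFunctions)
```

One consequence worth noting: if filters can be empty, the real reduce call hands back F itself rather than a Column, so you may want to guard against that case before calling alias on the result.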