Does PySpark support short-circuit evaluation of conditional statements?

Question

I want to create a new boolean column in my DataFrame whose value comes from evaluating two conditions on other columns of the same DataFrame:

columns = ["id", "color_one", "color_two"]
data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)]).toDF(*columns)

data = data.withColumn('is_red', data.color_one.contains("red") | data.color_two.contains("red"))

This works fine unless either color_one or color_two is NULL in a given row. In that case, is_red is also set to NULL for that row instead of true or false:

+-------+----------+------------+-------+
|id     |color_one |color_two   |is_red |
+-------+----------+------------+-------+
|      1|      blue|         red|   true|
|      2|       red|        NULL|   NULL|
+-------+----------+------------+-------+

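This suggests that PySpark evaluates every clause of the conditional statement rather than exiting early (via short-circuit evaluation) when the first condition happens to be true, as in row 2 of my example above.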

Does PySpark support short-circuit evaluation of conditional statements?

In the meantime, here is the workaround I came up with, which null-checks each column:

from pyspark.sql import functions as F

color_one_is_null = data.color_one.isNull()
color_two_is_null = data.color_two.isNull()
# Check the both-NULL case first; otherwise the single-NULL branches shadow it
# and the row ends up NULL instead of False.
data = data.withColumn(
    'is_red',
    F.when(color_one_is_null & color_two_is_null, F.lit(False))
     .when(color_two_is_null, data.color_one.contains("red"))
     .when(color_one_is_null, data.color_two.contains("red"))
     .otherwise(data.color_one.contains("red") | data.color_two.contains("red"))
)

Answer 1

I don't think Spark supports short-circuit evaluation of conditional statements, as stated here: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#:~:text=Spark%20SQL%20(including,short-circuiting%E2%80%9D%20semantics.

Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.

An alternative is to build an array from color_one and color_two, then check whether the array contains 'red' using SQL EXISTS:

data = data.withColumn('is_red', F.expr("EXISTS(array(color_one, color_two), x -> x = 'red')"))
data.show()
+---+---------+---------+------+
| id|color_one|color_two|is_red|
+---+---------+---------+------+
|  1|     blue|      red|  true|
|  2|      red|     null|  true|
|  3|     null|    green| false|
|  4|   yellow|     null| false|
|  5|     null|      red|  true|
|  6|     null|     null| false|
+---+---------+---------+------+

