When to use a UDF versus a function in PySpark?

Question

I'm using Spark with Databricks and have the following code:

from pyspark.sql.functions import col, udf, when

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of the following statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and, using a UDF:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

From the documentation it is not clear to me when I should use one rather than the other, and why.

Answer 1

You can see the difference in Spark SQL (as described in the documentation). For example, you will find that if you write:

spark.sql("select replaceBlanksWithNulls(column_name) from dataframe")

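this will not work unless you have registered the function replaceBlanksWithNulls as a udf. In Spark SQL, the return type of a function must be known before it can be executed. Hence, a custom function has to be registered as a user-defined function (udf) before it can be used in Spark SQL.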

Answer 2

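A UDF can essentially be any kind of function (with some exceptions, of course); there is no need to use Spark constructs such as when, col, etc. With a UDF, the replaceBlanksWithNulls function can be written as normal Python code: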

def replaceBlanksWithNulls(s):
    # Keep non-empty values; map the empty string to None (null in Spark)
    return s if s != "" else None

After registering it, the function can be used on a DataFrame column:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))
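Because the body of such a UDF is ordinary Python, its logic can be checked without Spark at all. A small sketch, assuming the intent is to keep non-empty strings and turn empty ones into None; the sample values are made up for illustration.

```python
def replaceBlanksWithNulls(s):
    # Keep non-empty values; map the empty string to None (null in Spark).
    return s if s != "" else None

# Hypothetical sample values, not from the original post.
samples = ["a", "", "b c"]
cleaned = [replaceBlanksWithNulls(s) for s in samples]
```

Testing the plain function first makes mistakes in the per-value logic much easier to spot than debugging them through a Spark job.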

Note: the default return type of a UDF is string. If another type is needed, it must be specified when registering, e.g.

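from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
# squared: assumed to be a plain Python function that returns an integer
squared_udf = udf(squared, LongType())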

In this case the column operation is not complex, and there are Spark functions that achieve the same thing (i.e. replaceBlanksWithNulls as in the question):

x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))

This is always preferred whenever possible, since it allows Spark to optimize the query.