在大数据上优化 Pyspark UDF

Question

我正在尝试优化此代码，当列的值（pyspark 数据框的）在 [categories] 中时创建一个虚拟对象。

当运行在100K行时，到运行大约需要30秒。在我的例子中，我有大约 2000 万行，这将花费很多时间。

def create_dummy(dframe,col_name,top_name,categories,**options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe  = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col+ [str(top_name)])
    return dframe

换句话说，我如何优化这个功能并减少我的数据量的总时间？以及如何确保此函数不会迭代我的数据？

感谢您的建议。谢谢

Answer 1

您不需要 UDF 来对类别进行编码。您可以使用 isin:

import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe

在大数据上优化 Pyspark UDF

Optimizing Pyspark UDF on large data

user-defined-functions

apache-spark

apache-spark-sql

pyspark