Spark: Join within UDF or map function

Question

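I have to write a complex UDF, in which I have to do a join with a different table and return the number of matches. The actual use case is much more complex, but I have simplified it here to a minimal reproducible example. This is the UDF code: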

def predict_id(date,zip):
    filtered_ids = contest_savm.where((F.col('postal_code')==zip)  & (F.col('start_date')>=date))
    return filtered_ids.count()

When I define the UDF with the following code, I get a long list of console errors:

predict_id_udf = F.udf(predict_id,types.IntegerType())

The last line of the error is:

py4j.Py4JException: Method __getnewargs__([]) does not exist

I would like to know the best way to do this. I also tried map, like this:

result_rdd = df.select("party_id").rdd\
  .map(lambda x: predict_id(x[0],x[1]))\
  .distinct()

It also ended with a similar error. Is there any way to do a join for each row of the original dataframe from within a UDF or map function?

Answer 1

I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.

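This is not possible by design. A DataFrame such as contest_savm is a driver-side handle and cannot be serialized into a function that runs on the executors, which is what the __getnewargs__ error is reporting. If you want to achieve this effect, you have to use high-level DF / RDD operators: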

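# Join each row of df to its matching rows in contest_savm, then count the matches per original row.
df.join(contest_savm,
    (F.col('postal_code') == df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()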

user-defined-functions

apache-spark

apache-spark-sql

pyspark

spark-dataframe