Using a dictionary of functions in a UDF

Question

In PySpark, I have a DataFrame df as follows:

Site    A   B
1       3   83
1       16  26
1       98  46
1       80  14
1       83  54
2       0   83
2       75  67
2       72  24
2       60  13
6       40  50
6       34  60
6       36  39
6       68  6
6       91  51
6       81  82

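On the other hand, I have a dictionary of functions, myDict = {1: f1, 2: f2, 3: f3, 6: f6}

I want to use the dictionary to generate a new column, picking the function by the value of Site. Something like: df.withColumn("MyCol", myDict[df.Site](df.A, df.B))

But when I do this, I get the error message: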

unhashable type: 'Column'

Traceback (most recent call last):

TypeError: unhashable type: 'Column'

How should I write this?

Answer 1

You want to use currying.

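The withColumn function only accepts a Column expression as its argument: existing columns of the same DataFrame, or literals wrapped with lit() (lit itself returns a Column).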
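The TypeError in the question is raised by the dictionary lookup itself, before Spark evaluates anything: myDict[df.Site] asks Python to hash a Column object, which is an unevaluated expression rather than a per-row value. The sketch below imitates this in plain Python. FakeColumn is a hypothetical stand-in written for illustration and is not part of PySpark, and f1 is a made-up function. Like Column, it overloads == to build an expression, and a Python class that defines __eq__ without __hash__ is unhashable.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Like Column, == builds a new expression instead of returning a bool.
        # Defining __eq__ without __hash__ makes instances unhashable.
        return FakeColumn("({} = {})".format(self.name, other))


def f1(a, b):  # hypothetical function; the question does not define f1
    return a + b


my_dict = {1: f1}
site = FakeColumn("Site")

try:
    my_dict[site]  # same shape as myDict[df.Site] in the question
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The message mentions "unhashable type", as in the question's traceback.
print(error_message)
```

The lookup therefore has to happen where Site is an ordinary Python value, which is inside the function that the UDF runs for each row.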

To pass extra arguments, you have to use a higher-order function that returns a udf:

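from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

class MyUDFs:
    @staticmethod
    def trans(myDict):
        # cb runs once per row, so Site, A and B are plain Python values here
        def cb(Site, A, B):
            return myDict[Site](A, B)
        # StringType() is the declared return type; adjust it to what f1..f6 return
        return udf(cb, StringType())

df = df.withColumn("MyCol", MyUDFs.trans(myDict)(df["Site"], df["A"], df["B"]))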
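
The closure inside trans can be checked without a Spark session. Below is a minimal sketch in plain Python. The bodies of f1, f2, f3 and f6 are assumptions (the question does not show them), and make_callback is an illustrative name. The outer function captures the dictionary, and the inner function receives only per-row values, which is the shape that udf() expects.

```python
# Hypothetical stand-ins: the question does not define f1, f2, f3, f6.
def f1(a, b):
    return a + b


def f2(a, b):
    return a - b


def f3(a, b):
    return a * b


def f6(a, b):
    return max(a, b)


my_dict = {1: f1, 2: f2, 3: f3, 6: f6}


def make_callback(funcs):
    """Return a row-level function with `funcs` baked in (currying)."""
    def cb(site, a, b):
        # `site` is a plain Python value here, so the dict lookup works.
        return funcs[site](a, b)
    return cb


cb = make_callback(my_dict)

# Three rows taken from the question's df, as (Site, A, B)
rows = [(1, 3, 83), (2, 75, 67), (6, 40, 50)]
results = [cb(site, a, b) for site, a, b in rows]
print(results)  # [86, 8, 50]
```

Wrapping cb with udf(cb, <return type>) gives the Spark version shown above. Note that a Site value missing from the dictionary would raise KeyError inside the callback, so funcs.get(site) with a fallback may be worth considering if the data is not guaranteed to match the keys.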

