Passing two dates to a function using a UDF gives a df.show() error in PySpark (similar to the apply function in pandas)

from pyspark.sql.functions import *
data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df=spark.createDataFrame(data=data,schema=["id","date1",'date2'])
df.show()  

Expected output

I tried the following code:

    from pyspark.sql.functions import udf
    import pyspark.sql.functions as sf
    def get_datediff(vec):
        d1=vec[0];d2=vec[1]
        rt=datediff(d1,d2)
        return(rt)
    df = df.withColumn('date_diff1', sf.udf(get_datediff)(array('date1','date2')))
    df.show()

But I get the following error and cannot compute the date difference.

If you are using Spark SQL functions, there is no need to define a UDF — just call the function directly. (Your code fails because `datediff` is a Spark SQL function that expects `Column` arguments, but inside a UDF it receives plain Python strings.) For example:

import pyspark.sql.functions as F

data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df = spark.createDataFrame(data=data,schema=["id","date1",'date2'])
df2 = df.withColumn('date_diff1', F.datediff('date1','date2'))

df2.show()
+---+----------+----------+----------+
| id|     date1|     date2|date_diff1|
+---+----------+----------+----------+
|  1|2019-07-01|2019-02-03|       148|
|  2|2019-06-24|2019-03-21|        95|
|  3|2019-08-24|2020-08-24|      -366|
+---+----------+----------+----------+
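Note that `F.datediff('date1', 'date2')` computes `date1 - date2` in days, so it goes negative when `date2` is later (as in the third row). As a quick sanity check of that row using only the standard library (2020 is a leap year, hence 366 days):

```python
from datetime import date

# Spark's datediff(end, start) returns (end - start) in days;
# it is negative when start is the later date.
diff = (date(2019, 8, 24) - date(2020, 8, 24)).days
print(diff)  # -366
```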

If you insist on using a UDF, you can do it like this:

import pyspark.sql.functions as F
from datetime import datetime

data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df = spark.createDataFrame(data=data,schema=["id","date1",'date2'])

@F.udf('int')
def datediff_udf(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return (d1 - d2).days

df2 = df.withColumn('date_diff1', datediff_udf('date1', 'date2'))
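The per-row logic inside the UDF is plain Python date arithmetic, which you can sketch and check outside Spark (the helper name `date_diff_days` is just illustrative):

```python
from datetime import datetime

def date_diff_days(d1: str, d2: str) -> int:
    # Same per-row logic as the UDF: parse both strings, then subtract.
    a = datetime.strptime(d1, "%Y-%m-%d")
    b = datetime.strptime(d2, "%Y-%m-%d")
    return (a - b).days

print(date_diff_days("2019-07-01", "2019-02-03"))  # 148
```

The results match the `F.datediff` output above, which is why declaring the UDF's return type as `'int'` gives a column identical to the built-in version — just much slower, since every row round-trips through Python.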