Passing two dates to a function via a UDF raises an error on df.show() in PySpark (similar to apply in pandas)
from pyspark.sql.functions import *
data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df=spark.createDataFrame(data=data,schema=["id","date1",'date2'])
df.show()
Expected output
I tried the following code:
from pyspark.sql.functions import udf
import pyspark.sql.functions as sf
def get_datediff(vec):
    d1 = vec[0]
    d2 = vec[1]
    rt = datediff(d1, d2)
    return rt
df = df.withColumn('date_diff1', sf.udf(get_datediff)(array('date1','date2')))
df.show()
But I got the following error and could not compute the date difference.
If you are using Spark SQL functions, there is no need to define a UDF. Just call the function directly, e.g.
import pyspark.sql.functions as F
data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df = spark.createDataFrame(data=data,schema=["id","date1",'date2'])
df2 = df.withColumn('date_diff1', F.datediff('date1','date2'))
df2.show()
+---+----------+----------+----------+
| id| date1| date2|date_diff1|
+---+----------+----------+----------+
| 1|2019-07-01|2019-02-03| 148|
| 2|2019-06-24|2019-03-21| 95|
| 3|2019-08-24|2020-08-24| -366|
+---+----------+----------+----------+
If you insist on using a UDF, you can do it like this:
import pyspark.sql.functions as F
from datetime import datetime
data = [("1","2019-07-01","2019-02-03"),("2","2019-06-24","2019-03-21"),("3","2019-08-24","2020-08-24")]
df = spark.createDataFrame(data=data,schema=["id","date1",'date2'])
@F.udf('int')
def datediff_udf(d1, d2):
    # Parse the ISO-format date strings and return the difference in days
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return (d1 - d2).days

df2 = df.withColumn('date_diff1', datediff_udf('date1', 'date2'))
df2.show()
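The plain-Python body of the UDF can be sanity-checked without a Spark session. Here `py_datediff` is just a local helper mirroring the UDF's logic, not part of the Spark code above:

```python
from datetime import datetime

def py_datediff(d1, d2):
    # Same logic as the UDF body: parse ISO dates, subtract, take .days
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return (d1 - d2).days

print(py_datediff("2019-07-01", "2019-02-03"))  # 148
print(py_datediff("2019-08-24", "2020-08-24"))  # -366 (2020 is a leap year)
```

The results match the built-in `F.datediff` column shown above, which likewise computes `date1 - date2` in days.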