How to get datediff() in seconds in pyspark?
I tried the code below, but I can't get the date difference in seconds. I'm simply using datediff() between the 'Attributes_Timestamp_fix' and 'lagged_date' columns.
Any hints?
Below are my code and output.
eg = eg.withColumn("lagged_date", lag(eg.Attributes_Timestamp_fix, 1)
                   .over(Window.partitionBy("id")
                         .orderBy("Attributes_Timestamp_fix")))
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))
   id            Attributes_Timestamp_fix  time_diff
0  3.531611e+14  2018-04-01 00:01:02       NaN
1  3.531611e+14  2018-04-01 00:01:02       0.0
2  3.531611e+14  2018-04-01 00:03:13       0.0
3  3.531611e+14  2018-04-01 00:03:13       0.0
4  3.531611e+14  2018-04-01 00:03:13       0.0
5  3.531611e+14  2018-04-01 00:03:13       0.0
In pyspark.sql.functions there is a function datediff, but unfortunately it only computes the difference in days. To overcome this, you can convert both dates to unix timestamps (in seconds) and compute the difference.
Let's create some sample data, compute the lag, and then the difference in seconds.
from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime
# Note: months/days must be written without leading zeros --
# 01 is an invalid (octal-style) literal in Python 3.
d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]
df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id| t|
+---+-------------------+
| 1|2018-01-01 00:00:00|
| 1|2018-01-02 00:00:00|
| 1|2018-01-04 00:00:00|
| 1|2018-01-07 00:00:00|
+---+-------------------+
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff'))\
  .show()
+-------------------+------+
| t| diff|
+-------------------+------+
|2018-01-01 00:00:00| null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+
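As a plain-Python sanity check (no Spark required), the subtraction of epoch seconds shown above can be reproduced with the standard datetime module. This only illustrates the arithmetic behind the unix-timestamp approach, not the Spark API itself:

```python
import datetime

# The four sample timestamps from the DataFrame above.
ts = [datetime.datetime(2018, 1, 1),
      datetime.datetime(2018, 1, 2),
      datetime.datetime(2018, 1, 4),
      datetime.datetime(2018, 1, 7)]

# Pair each timestamp with its predecessor (the "lag"), then take the
# difference in whole seconds -- the same quantity that
# unix_timestamp(t) - unix_timestamp(previous_t) produces per row.
diffs = [None] + [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
print(diffs)  # [None, 86400, 172800, 259200]
```

The first entry is None because the first row of each window partition has no predecessor, matching the null in the `diff` column.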