pyspark: hours diff between two date columns

I want to calculate the number of hours between two date columns in pyspark. I could only find how to calculate the number of days between dates.

dfs_4.show()


+--------------------+--------------------+
|        request_time|            max_time|
+--------------------+--------------------+
|2017-11-17 00:18:...|2017-11-20 23:59:...|
|2017-11-17 00:07:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
|2017-11-17 00:10:...|2017-11-20 23:59:...|
|2017-11-17 00:03:...|2017-11-20 23:59:...|
|2017-11-17 00:45:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
|2017-11-17 00:59:...|2017-11-20 23:59:...|
|2017-11-17 00:28:...|2017-11-20 23:59:...|
|2017-11-17 00:11:...|2017-11-20 23:59:...|
|2017-11-17 00:13:...|2017-11-20 23:59:...|
|2017-11-17 00:42:...|2017-11-20 23:59:...|
|2017-11-17 00:07:...|2017-11-20 23:59:...|
|2017-11-17 00:40:...|2017-11-20 23:59:...|
|2017-11-17 00:15:...|2017-11-20 23:59:...|
|2017-11-17 00:05:...|2017-11-20 23:59:...|
|2017-11-17 00:50:...|2017-11-20 23:59:...|
|2017-11-17 00:40:...|2017-11-20 23:59:...|
|2017-11-17 00:25:...|2017-11-20 23:59:...|
|2017-11-17 00:35:...|2017-11-20 23:59:...|
+--------------------+--------------------+

Calculating the days:

from pyspark.sql import functions as F
dfs_5 = dfs_4.withColumn('date_diff', F.datediff(F.to_date(dfs_4.max_time), F.to_date(dfs_4.request_time)))

dfs_5.show()

+--------------------+--------------------+---------+
|        request_time|            max_time|date_diff|
+--------------------+--------------------+---------+
|2017-11-17 00:18:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:07:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:35:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:10:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:03:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:45:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:35:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:59:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:28:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:11:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:13:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:42:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:07:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:40:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:15:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:05:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:50:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:40:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:25:...|2017-11-20 23:59:...|        3|
|2017-11-17 00:35:...|2017-11-20 23:59:...|        3|
+--------------------+--------------------+---------+

How can I do the same thing in hours? Thanks for your help.

You can use `hour` to extract the hour from the datetime field and then simply subtract them into a new column. There is also the case where the time difference spans more than one day, so you need to add the whole days in between. I would therefore create the `date_diff` column as you did, and then try this:

from pyspark.sql import functions as F

dfs_6 = dfs_5.withColumn('hours_diff', (dfs_5.date_diff * 24) +
                         F.hour(dfs_5.max_time) - F.hour(dfs_5.request_time))
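As a sanity check (not part of the original answer), the same arithmetic can be reproduced in plain Python with the standard `datetime` module, using one of the timestamps from the sample output above (seconds are truncated in the display, so the exact values are assumed):

```python
from datetime import datetime

# Assumed sample values from the dataframe shown above
request_time = datetime(2017, 11, 17, 0, 18, 0)
max_time = datetime(2017, 11, 20, 23, 59, 0)

# date_diff counts calendar days, like F.datediff on the date parts
date_diff = (max_time.date() - request_time.date()).days

# Whole days converted to hours, plus the hour-of-day difference
hours_diff = date_diff * 24 + max_time.hour - request_time.hour

print(hours_diff)  # 3 * 24 + 23 - 0 = 95
```

Note that this counts hour-of-day boundaries and ignores minutes and seconds; for exact elapsed time, the unix-timestamp approach in the other answer is more precise.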

You can use unix timestamps and compute the difference in seconds, then convert it to whatever unit you need. Note that `unix_timestamp` should be applied to the full timestamp column, not to `to_date`, which would strip the time of day:

dfs_5 = dfs_4.withColumn(
    'diff_in_seconds',
    F.unix_timestamp(dfs_4.max_time) - F.unix_timestamp(dfs_4.request_time)
)

dfs_6 = dfs_4.withColumn(
    'diff_in_minutes',
    F.round(
        (F.unix_timestamp(dfs_4.max_time) - F.unix_timestamp(dfs_4.request_time)) / 60
    )
)
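Again as an illustration only (the timestamps below are assumed from the truncated sample output, not given in the original post), the unix-timestamp arithmetic in plain Python:

```python
from datetime import datetime, timezone

# Assumed sample values; tzinfo pinned to UTC so .timestamp() is deterministic
request_time = datetime(2017, 11, 17, 0, 18, 0, tzinfo=timezone.utc)
max_time = datetime(2017, 11, 20, 23, 59, 0, tzinfo=timezone.utc)

# Equivalent of F.unix_timestamp: whole seconds since the epoch
diff_in_seconds = int(max_time.timestamp()) - int(request_time.timestamp())

# Convert to the desired unit, rounding like F.round
diff_in_minutes = round(diff_in_seconds / 60)
diff_in_hours = round(diff_in_seconds / 3600)

print(diff_in_seconds, diff_in_minutes, diff_in_hours)  # 344460 5741 96
```

Dividing by 3600 gives the exact elapsed hours (95.68 here, rounding to 96), whereas the hour-extraction approach above yields 95 because it ignores minutes.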