(pyspark) How to divide time intervals into time periods

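I have a DataFrame created with Spark SQL in which each ID has a corresponding checkin_datetime and checkout_datetime, as shown in the picture.

I want to split each of these intervals into one-hour periods, as shown in the picture.

Code to create the Spark DataFrame: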

import pandas as pd

data = {
    'ID': [4, 4, 4, 4, 22, 22, 25, 29],
    'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                         '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
    'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                          '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15'],
}
df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
# `spark` is an existing SparkSession; both datetime columns are still strings here.
df1 = spark.createDataFrame(df)

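To compute the hourly intervals:

  1. First, expand the span between checkin_datetime and checkout_datetime into hourly intervals. We do this by computing the number of hours between checkin_datetime and checkout_datetime and iterating over that range to generate the intervals.
  2. Once the intervals are exploded to obtain next_hour, we can use it to work out the gap between checkin_datetime and next_hour, or between the start of the last hour and checkout_datetime.
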
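
To make the two steps above concrete, here is a minimal plain-Python sketch that traces them for a single interval, without Spark. The function name `split_into_hours` and the constant `FMT` are hypothetical and used only for illustration; the sketch assumes the same MM-dd-yyyy HH:mm input layout as the sample data and is not the PySpark solution itself.

```python
from datetime import datetime, timedelta

FMT = "%m-%d-%Y %H:%M"  # assumed layout, matching the sample data


def split_into_hours(checkin: str, checkout: str):
    """Return (day, hour, minutes) for every clock hour the interval touches."""
    start = datetime.strptime(checkin, FMT)
    end = datetime.strptime(checkout, FMT)
    rows = []
    # Step 1: walk hour by hour, starting at the clock hour containing check-in.
    hour_start = start.replace(minute=0, second=0, microsecond=0)
    while hour_start < end:
        next_hour = hour_start + timedelta(hours=1)
        # Step 2: the time spent in this hour is bounded by the later of the
        # two starts and the earlier of the two ends.
        overlap = min(end, next_hour) - max(start, hour_start)
        rows.append((hour_start.date().isoformat(),
                     hour_start.hour,
                     int(overlap.total_seconds() // 60)))
        hour_start = next_hour
    return rows


print(split_into_hours("04-01-2019 14:55", "04-01-2019 15:06"))
# [('2019-04-01', 14, 5), ('2019-04-01', 15, 6)]
```

The loop stops as soon as hour_start reaches check-out, so an interval that ends exactly on the hour produces no zero-minute row in this sketch.
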
from pyspark.sql import functions as F
import pandas as pd

data = {
    'ID': [4, 4, 4, 4, 22, 22, 25, 29],
    'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                         '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
    'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                          '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15'],
}
df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])

# Parse the string columns into timestamps (input layout is MM-dd-yyyy HH:mm).
df1 = (spark.createDataFrame(df)
       .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm"))
       .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm")))

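# Work in epoch seconds so that differences are plain arithmetic.
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")

# Start of the clock hour containing check-in, and end of the clock hour containing check-out.
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")

# Whole hours from the start of the check-in hour up to check-out.
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)

# One row per clock hour; next_hour marks the END of that hour.
# F.transform with a Python lambda requires Spark 3.1 or later.
next_hour = F.explode(F.transform(F.sequence(F.lit(0), diff_hours), lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))

# Minutes spent inside each hour:
#   1. check-in and check-out fall in the same hour: the whole interval
#   2. last hour of a longer stay: from the top of the hour to check-out
#   3. otherwise: from check-in to the end of the hour, capped at 60
minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
           .when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
           .otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
         ).cast("int")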

(df1.withColumn("next_hour", next_hour)
    .withColumn("minutes", minute)
    .withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
    .withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
    .select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID|   checkin_datetime|  checkout_datetime|       day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
|  4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13|      2|
|  4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13|      3|
|  4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14|      1|
|  4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14|      5|
|  4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15|      6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20|     37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21|     32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21|      4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23|     38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02|  0|     23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  1|     60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  2|     60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  3|     60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  4|     60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  5|     60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  6|     15|
+---+-------------------+-------------------+----------+---+-------+
"""