(pyspark)如何将时间间隔划分为时间段
(pyspark)How to Divide Time Intervals into Time Periods
我有一个由 sparksql 创建的数据框,其 ID 对应于图片显示的 checkin_datetime 和 checkout_datetime.As。
我想把这个时间间隔分成一个小时的时间段。如图所示。
创建 sparkdataframe 的代码:
import pandas as pd
data={'ID':[4,4,4,4,22,22,25,29],
'checkin_datetime':['04-01-2019 13:07','04-01-2019 13:09','04-01-2019 14:06','04-01-2019 14:55','04-01-2019 20:23'
,'04-01-2019 21:38','04-01-2019 23:22','04-02-2019 01:00'],
'checkout_datetime':['04-01-2019 13:09','04-01-2019 13:12','04-01-2019 14:07','04-01-2019 15:06','04-01-2019 21:32'
,'04-01-2019 21:42','04-02-2019 00:23'
,'04-02-2019 06:15']
}
df = pd.DataFrame(data,columns= ['ID', 'checkin_datetime','checkout_datetime'])
df1=spark.createDataFrame(df)
要计算每小时间隔,
- 首先在
checkin_datetime
和 checkout_datetime
之间展开每小时间隔。我们通过计算 checkin_datetime
和 checkout_datetime
之间的小时数并迭代该范围以生成间隔来实现这一点。
- 一旦我们分解区间找到
next_hour
,我们就可以用它来识别 checkin_datetime
和 next_hour
或 checkout_datetime
和 [=15 之间的差距=].
from pyspark.sql import functions as F
import pandas as pd
data={'ID':[4,4,4,4,22,22,25,29],
'checkin_datetime':['04-01-2019 13:07','04-01-2019 13:09','04-01-2019 14:06','04-01-2019 14:55','04-01-2019 20:23'
,'04-01-2019 21:38','04-01-2019 23:22','04-02-2019 01:00'],
'checkout_datetime':['04-01-2019 13:09','04-01-2019 13:12','04-01-2019 14:07','04-01-2019 15:06','04-01-2019 21:32'
,'04-01-2019 21:42','04-02-2019 00:23'
,'04-02-2019 06:15']
}
df = pd.DataFrame(data,columns= ['ID', 'checkin_datetime','checkout_datetime'])
df1=spark.createDataFrame(df).withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm")).withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm"))
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)
next_hour = F.explode(F.transform(F.sequence(F.lit(0), diff_hours), lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))
minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
.when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
.otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
).cast("int")
(df1.withColumn("next_hour", next_hour)
.withColumn("minutes", minute)
.withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
.withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
.select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID| checkin_datetime| checkout_datetime| day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
| 4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13| 2|
| 4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13| 3|
| 4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14| 1|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14| 5|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15| 6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20| 37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21| 32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21| 4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23| 38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02| 0| 23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 1| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 2| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 3| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 4| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 5| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 6| 15|
+---+-------------------+-------------------+----------+---+-------+
"""
我有一个由 sparksql 创建的数据框,其 ID 对应于图片显示的 checkin_datetime 和 checkout_datetime.As。
我想把这个时间间隔分成一个小时的时间段。如图所示。
创建 sparkdataframe 的代码:
import pandas as pd
data={'ID':[4,4,4,4,22,22,25,29],
'checkin_datetime':['04-01-2019 13:07','04-01-2019 13:09','04-01-2019 14:06','04-01-2019 14:55','04-01-2019 20:23'
,'04-01-2019 21:38','04-01-2019 23:22','04-02-2019 01:00'],
'checkout_datetime':['04-01-2019 13:09','04-01-2019 13:12','04-01-2019 14:07','04-01-2019 15:06','04-01-2019 21:32'
,'04-01-2019 21:42','04-02-2019 00:23'
,'04-02-2019 06:15']
}
df = pd.DataFrame(data,columns= ['ID', 'checkin_datetime','checkout_datetime'])
df1=spark.createDataFrame(df)
要计算每小时间隔,
- 首先在
checkin_datetime
和checkout_datetime
之间展开每小时间隔。我们通过计算checkin_datetime
和checkout_datetime
之间的小时数并迭代该范围以生成间隔来实现这一点。 - 一旦我们分解区间找到
next_hour
,我们就可以用它来识别checkin_datetime
和next_hour
或checkout_datetime
和 [=15 之间的差距=].
from pyspark.sql import functions as F
import pandas as pd
data={'ID':[4,4,4,4,22,22,25,29],
'checkin_datetime':['04-01-2019 13:07','04-01-2019 13:09','04-01-2019 14:06','04-01-2019 14:55','04-01-2019 20:23'
,'04-01-2019 21:38','04-01-2019 23:22','04-02-2019 01:00'],
'checkout_datetime':['04-01-2019 13:09','04-01-2019 13:12','04-01-2019 14:07','04-01-2019 15:06','04-01-2019 21:32'
,'04-01-2019 21:42','04-02-2019 00:23'
,'04-02-2019 06:15']
}
df = pd.DataFrame(data,columns= ['ID', 'checkin_datetime','checkout_datetime'])
df1=spark.createDataFrame(df).withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm")).withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm"))
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)
next_hour = F.explode(F.transform(F.sequence(F.lit(0), diff_hours), lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))
minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
.when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
.otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
).cast("int")
(df1.withColumn("next_hour", next_hour)
.withColumn("minutes", minute)
.withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
.withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
.select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID| checkin_datetime| checkout_datetime| day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
| 4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13| 2|
| 4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13| 3|
| 4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14| 1|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14| 5|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15| 6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20| 37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21| 32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21| 4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23| 38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02| 0| 23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 1| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 2| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 3| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 4| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 5| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 6| 15|
+---+-------------------+-------------------+----------+---+-------+
"""