How to segment and get the time between two dates?
I have the following table:
id | number_of_trip | start_date          | end_date            | seconds
1  | 637hui         | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 | 720
2  | 384nfj         | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 | 1440
3  | 102fiu         | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 | 780
4  | 948pvc         | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 | 2400
5  | 473mds         | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 | 780
6  | 103fkd         | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 | 1380
7  | 905783         | 2022-03-10 03:12:00 | null                | 0
8  | 498wsq         | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 | 1080
I want to get the driving time for each hour, but if a trip spans more than one hour slot, its time must be split across those hours.
If a trip has not finished yet, its end_date field is null, but the time spent in each corresponding hour, counted from start_date, must still be included.
I have the following query:
SELECT time_bucket(bucket_width := INTERVAL '1 hour',ts := start_date, "offset" := '0 minutes') AS init_date,
sum(seconds) as seconds
FROM trips
WHERE start_date >= '2022-03-10 01:00:00' AND start_date <= '2022-03-10 06:00:00'
GROUP BY init_date
ORDER BY init_date;
The result is:
init_date           | seconds
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 5400
2022-03-10 03:00:00 | 1380
2022-03-10 05:00:00 | 1080
But I expect to receive a result like this:
init_date           | seconds | (id breakdown, just as a visual aid)
2022-03-10 01:00:00 | 720     | id(1: 720)
2022-03-10 02:00:00 | 4200    | id(2: 1440, 3: 780, 4: 1200, 5: 780)
2022-03-10 03:00:00 | 5460    | id(4: 1200, 6: 1380, 7: 2880)
2022-03-10 05:00:00 | 1080    | id(8: 1080)
EDIT
If I replace the null values, the result is still not what I want:
init_date           | seconds
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 5400
2022-03-10 03:00:00 | 1380
2022-03-10 05:00:00 | 1080
I have been thinking about fetching all the data and solving the problem with pandas. If I come up with an answer I will try to post it.
EDIT
My previous result was not entirely correct, because there are hours during which a trip had not yet finished. The correct result should be:
start_date seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 5460
3 2022-03-10 04:00:00 3600
4 2022-03-10 05:00:00 4680
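To make the splitting rule concrete, here is a minimal sanity check of the table above in plain pandas (a sketch; treating the open-ended trip, id 7, as still running and capping it at 2022-03-10 06:00:00 is an assumption consistent with the buckets shown):

import pandas as pd

# sample data from the table above
trips = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",
    ]),
})
cutoff = pd.Timestamp("2022-03-10 06:00")          # assumed "current time" for trip 7
trips["end_date"] = trips["end_date"].fillna(cutoff)

for bucket_start in pd.date_range("2022-03-10 01:00", "2022-03-10 05:00", freq="H"):
    bucket_end = bucket_start + pd.Timedelta(hours=1)
    # overlap of every trip with [bucket_start, bucket_end); negative overlaps count as 0
    overlap = (trips["end_date"].clip(upper=bucket_end)
               - trips["start_date"].clip(lower=bucket_start))
    print(bucket_start, int(overlap.dt.total_seconds().clip(lower=0).sum()))
# prints 720, 4200, 5460, 3600, 4680 -- matching the table above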
New code
def bucket_count(bucket, data):
    # bucket: dataframe of hourly intervals (start_date, end_date)
    # data:   dataframe of trips
    list_r = []
    for row_bucket in bucket.to_dict('records'):
        inicio = row_bucket['start_date']
        fin = row_bucket['end_date']
        # keep only the trips that overlap the current bucket
        df = data[
            (inicio <= data['end_date']) & (inicio <= fin)
            & (data['start_date'] <= fin) & (data['start_date'] <= data['end_date'])
        ]
        df_dict = df.to_dict('records')
        for row in df_dict:
            seconds = 0
            if row['start_date'] >= inicio and fin >= row['end_date']:
                # trip lies entirely inside the bucket
                seconds = (row['end_date'] - row['start_date']).total_seconds()
            elif row['start_date'] <= inicio <= row['end_date'] <= fin:
                # trip started before the bucket and ends inside it
                seconds = (row['end_date'] - inicio).total_seconds()
            elif inicio <= row['start_date'] <= fin <= row['end_date']:
                # trip starts inside the bucket and ends after it
                seconds = (fin - row['start_date']).total_seconds()
            elif row['start_date'] < inicio and fin < row['end_date']:
                # trip spans the whole bucket
                seconds = (fin - inicio).total_seconds()
            row['start_date'] = inicio
            row['end_date'] = fin
            row['seconds'] = seconds
            list_r.append(row)
    result = pd.DataFrame(list_r)
    return result.groupby(['start_date'])["seconds"].apply(lambda x: x.astype(int).sum()).reset_index()
Here is how it works in SQLite (so it can be tested):
CREATE TABLE trips(
id INT PRIMARY KEY NOT NULL,
start_date TIMESTAMP,
end_date TIMESTAMP,
seconds INT
);
INSERT INTO trips(id, start_date, end_date, seconds) VALUES
(1, '2022-03-10 01:20:00', '2022-03-10 01:32:00', 720),
(2, '2022-03-10 02:18:00', '2022-03-10 02:42:00', 1440),
(3, '2022-03-10 02:10:00', '2022-03-10 02:23:00', 780),
(4, '2022-03-10 02:40:00', '2022-03-10 03:20:00', 2400),
(5, '2022-03-10 02:45:00', '2022-03-10 02:58:00', 780),
(6, '2022-03-10 03:05:00', '2022-03-10 03:28:00', 1380),
(7, '2022-03-10 03:12:00', NULL, 0),
(8, '2022-03-10 05:30:00', '2022-03-10 05:48:00', 1080);
WITH
checked AS (SELECT '2022-03-10 03:00:00' AS start, '2022-03-10 04:00:00' AS end)
SELECT
SUM(
IIF(end_date IS NULL, ROUND(MAX(0, (JULIANDAY(checked.end) - JULIANDAY(start_date)) * 24 * 60 * 60)),
MAX(
0,
(JULIANDAY(MIN(checked.end, end_date)) - JULIANDAY(MAX(checked.start, start_date))) /
(JULIANDAY(end_date) - JULIANDAY(start_date)) * seconds
)
)
)
FROM trips, checked;
DROP TABLE trips;
The code is simplified and SQLite lacks a few features, but I think it will be easy to adapt :)
In short, the algorithm is:
- if end_date is NULL:
  - compute the seconds from the trip start to the end of the interval
  - discard negative values
- otherwise:
  - compute the portion of the trip (in seconds) that falls inside the interval
  - discard negative values
- sum the values
This can be done for any interval that has a start and an end (see the Python sketch below).
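For reference, the same algorithm written out in plain Python (a sketch; the helper name and the [interval_start, interval_end) convention are my own):

from datetime import datetime

def seconds_in_interval(start_date, end_date, interval_start, interval_end):
    # if end_date is NULL: count from the trip start to the end of the interval
    if end_date is None:
        seconds = (interval_end - start_date).total_seconds()
    # otherwise: the portion of the trip that overlaps the interval
    else:
        seconds = (min(interval_end, end_date)
                   - max(interval_start, start_date)).total_seconds()
    return max(0.0, seconds)  # discard negative values

# trip 4 (02:40-03:20) inside the 03:00-04:00 bucket -> 1200.0
print(seconds_in_interval(
    datetime(2022, 3, 10, 2, 40), datetime(2022, 3, 10, 3, 20),
    datetime(2022, 3, 10, 3, 0), datetime(2022, 3, 10, 4, 0),
))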
I have been thinking about getting all the data and solving the problem with pandas.
TL;DR: generate the range of minutes for each trip, explode those minutes into rows, and resample those rows into hours to count the minutes per hour:
import pandas as pd
df = pd.read_sql(...)
# convert to datetime dtype if not already
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# fill missing end dates
current_time = pd.Timestamp('2022-03-10 04:00:00') # or pd.Timestamp.now()
df['end_date'] = df['end_date'].fillna(current_time)
# generate range of minutes per trip
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
(df[['id', 'init_date']].explode('init_date') # explode minutes into rows
.set_index('init_date')['id'].resample('H').count() # count rows (minutes) per hour
.mul(60).reset_index(name='seconds')) # convert minutes to seconds
Output:
init_date seconds
2022-03-10 01:00:00 720
2022-03-10 02:00:00 4200
2022-03-10 03:00:00 5460
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
Step-by-step breakdown
Generate a date_range of minutes for each trip, from start_date to end_date:
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
# id number_of_trip ... init_date
# 1 637hui ... DatetimeIndex(['2022-03-10 01:20:00', '2022-03-10 01:21:00', ..., '2022-03-10 01:31:00'])
# 2 384nfj ... DatetimeIndex(['2022-03-10 02:18:00', '2022-03-10 02:19:00', ..., '2022-03-10 02:41:00'])
# 3 102fiu ... DatetimeIndex(['2022-03-10 02:10:00', '2022-03-10 02:11:00', ..., '2022-03-10 02:22:00'])
# 4 948pvc ... DatetimeIndex(['2022-03-10 02:40:00', '2022-03-10 02:41:00', ..., '2022-03-10 03:19:00'])
# 5 473mds ... DatetimeIndex(['2022-03-10 02:45:00', '2022-03-10 02:46:00', ..., '2022-03-10 02:57:00'])
# 6 103fkd ... DatetimeIndex(['2022-03-10 03:05:00', '2022-03-10 03:06:00', ..., '2022-03-10 03:27:00'])
# 7 905783 ... DatetimeIndex(['2022-03-10 03:12:00', '2022-03-10 03:13:00', ..., '2022-03-10 03:59:00'])
# 8 498wsq ... DatetimeIndex(['2022-03-10 05:30:00', '2022-03-10 05:31:00', ..., '2022-03-10 05:47:00'])
explode the minutes into rows:
exploded = df[['init_date', 'id']].explode('init_date').set_index('init_date')['id']
# init_date
# 2022-03-10 01:20:00 1
# 2022-03-10 01:21:00 1
# 2022-03-10 01:22:00 1
# ..
# 2022-03-10 05:45:00 8
# 2022-03-10 05:46:00 8
# 2022-03-10 05:47:00 8
# Name: id, Length: 191, dtype: int64
resample the rows into hours to count the minutes per hour (multiplied by 60 to convert to seconds):
out = exploded.resample('H').count().mul(60).reset_index(name='seconds')
# init_date seconds
# 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
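One caveat worth noting (my addition, not part of the original approach): the count is exact here because every timestamp falls on a whole minute; if your data has second-level precision, the same idea works with freq='s' at the cost of a much larger intermediate frame:

# Sketch: the same pipeline at 1-second granularity (one row per second of driving).
df['init_date'] = df.apply(
    lambda x: pd.date_range(x['start_date'], x['end_date'], freq='s', inclusive='left'),
    axis=1,
)
out_s = (df[['id', 'init_date']].explode('init_date')
         .set_index('init_date')['id'].resample('H').count()  # rows == seconds per hour
         .reset_index(name='seconds'))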
Driver ID
If I have a column with the driver id, how do I get a segmentation by hours and by driver id without reprocessing?
In that case, simply change resample to groupby.resample: select driver_id before exploding, and group by driver_id before resampling.
As a minimal example, I duplicated the sample data to create two driver_id groups a and b:
# after preprocessing and creating init_date ...
(df[['driver_id', 'init_date']] # now include driver_id
.explode('init_date').set_index('init_date') # explode minutes into rows
.groupby('driver_id').resample('H').count() # count rows (minutes) per hour per driver_id
.mul(60).rename(columns={'driver_id': 'seconds'})) # convert minutes to seconds
# seconds
# driver_id init_date
# a 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
# b 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
This answer uses staircase, which is built on pandas and numpy and operates as part of the pandas ecosystem.
Your data describes intervals, which can be thought of as step functions that take the value 1 during an interval and 0 otherwise. With staircase we will add the step functions for the individual trips together, slice the combined step function into hourly buckets, and then integrate to get the total time in each bucket.
Setup
The dataframe below uses pandas.Timestamp values. The trip number is irrelevant to this solution.
import pandas as pd

df = pd.DataFrame({
"start_date": [
pd.Timestamp("2022-03-10 1:20"),
pd.Timestamp("2022-03-10 2:18"),
pd.Timestamp("2022-03-10 2:10"),
pd.Timestamp("2022-03-10 2:40"),
pd.Timestamp("2022-03-10 2:45"),
pd.Timestamp("2022-03-10 3:05"),
pd.Timestamp("2022-03-10 3:12"),
pd.Timestamp("2022-03-10 5:30"),
],
"end_date": [
pd.Timestamp("2022-03-10 1:32"),
pd.Timestamp("2022-03-10 2:42"),
pd.Timestamp("2022-03-10 2:23"),
pd.Timestamp("2022-03-10 3:20"),
pd.Timestamp("2022-03-10 2:58"),
pd.Timestamp("2022-03-10 3:28"),
pd.NaT,
pd.Timestamp("2022-03-10 5:48"),
],
})
Solution
import staircase as sc
# create step function
# the Stairs class represents a step function. It is to staircase as DataFrame is to pandas.
sf = sc.Stairs(df, start="start_date", end="end_date")
# you could visually inspect it if you want
sf.plot(style="hlines")
By inspection you can see that at most 3 trips run concurrently. Also note that the step function continues to infinity with a value of 1 - this is because we do not know the end date for one of the records.
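As an aside (not required for the rest of this answer, which keeps the unbounded step function): if you would rather cap the open-ended trip at a known point in time, you can fill the missing end date before building the Stairs object. The 06:00 cutoff below is an assumption.

# Sketch: cap the trip with no end_date at an assumed cutoff before building the step function.
cutoff = pd.Timestamp("2022-03-10 06:00:00")
sf_capped = sc.Stairs(
    df.assign(end_date=df["end_date"].fillna(cutoff)),
    start="start_date",
    end="end_date",
)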
# define hourly buckets as pandas PeriodIndex
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
# integrate the step function over the hourly buckets
total_per_hour = sf.slice(hour_buckets).integral()
total_per_hour
是 pandas.Series
个 pandas.Timedelta
值,由 pandas.IntervalIndex
索引。看起来像这样
[2022-03-10 01:00:00, 2022-03-10 02:00:00) 0 days 00:12:00
[2022-03-10 02:00:00, 2022-03-10 03:00:00) 0 days 01:10:00
[2022-03-10 03:00:00, 2022-03-10 04:00:00) 0 days 01:31:00
[2022-03-10 04:00:00, 2022-03-10 05:00:00) 0 days 01:00:00
[2022-03-10 05:00:00, 2022-03-10 06:00:00) 0 days 01:18:00
dtype: timedelta64[ns]
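As a quick hand-check of the 02:00-03:00 bucket above:

# trip 2: 02:18-02:42 -> 24 min, trip 3: 02:10-02:23 -> 13 min,
# trip 4: 02:40-03:20 -> 20 min inside the bucket, trip 5: 02:45-02:58 -> 13 min
print(pd.Timedelta(minutes=24 + 13 + 20 + 13))  # 0 days 01:10:00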
If you want a dataframe format that references only the left side of each interval, with the time expressed in seconds, then the following will work:
pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
Summary
The solution is
import staircase as sc
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
total_per_hour = sc.Stairs(df, start="start_date", end="end_date").slice(hour_buckets).integral()
# optional
total_per_hour = pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
Note 1
In your expected answer you do not have a value for 2022-03-10 04:00:00. This seems inconsistent with the fact that the trip time for 905783 (which has no end date) is included for 2022-03-10 03:00:00 but not for the subsequent hours.
The solution proposed here includes 3600 for both 2022-03-10 04:00:00 and 2022-03-10 05:00:00, which is why it differs from the expected solution in the original question.
Note 2
If your dataframe has a "driver" column and you want to calculate the time per driver, then the following will work:
def make_total_by_hour(df_):
    return sc.Stairs(df_, "start_date", "end_date").slice(hour_buckets).integral()

total_per_hour = (
    df.groupby("driver")
    .apply(make_total_by_hour)
    .melt(ignore_index=False)
    .reset_index()
)
This can be done in plain SQL (apart from the time_bucket function), with nested queries:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
) subq
group by interval_start
order by interval_start;
This gives me the following result:
interval_start | seconds
---------------------+---------
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 4200
2022-03-10 03:00:00 | 5460
2022-03-10 04:00:00 | 3600
2022-03-10 05:00:00 | 4680
2022-03-10 06:00:00 | 0
(6 rows)
Explanation
Let's break the query down.
In the innermost query:
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
we generate the series of interval starts - from the minimum start_date value to the maximum end_date value, both truncated to full hours, with a 1-hour step. Either boundary can obviously be replaced with an arbitrary datetime. The direct result of this query alone is:
interval_start
---------------------
2022-03-10 01:00:00
2022-03-10 02:00:00
2022-03-10 03:00:00
2022-03-10 04:00:00
2022-03-10 05:00:00
2022-03-10 06:00:00
(6 rows)
The middle-level query then joins this series with the trips table, joining a row if and only if any part of the trip happened during the hour-long interval that starts at the time given in the interval_start column:
select interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
-- innermost query
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
-- innermost query end
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour' and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
The two computed values represent:
- seconds_before_trip_started - the number of seconds elapsed between the interval start and the trip start (or 0 if the trip started before the interval did). This is time during which the trip was not happening, so we will subtract it in the next step.
- seconds_before_trip_ended - the number of seconds elapsed between the interval start and the trip end (or 3600 if the trip did not end within the interval in question); see the worked example below.
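A quick worked example of these two fields, for trip 4 (02:40:00 - 03:20:00) and the interval starting at 02:00:00:

from datetime import datetime

interval_start = datetime(2022, 3, 10, 2, 0)
trip_start = datetime(2022, 3, 10, 2, 40)
trip_end = datetime(2022, 3, 10, 3, 20)

seconds_before_trip_started = max(0, (trip_start - interval_start).total_seconds())  # 2400.0
seconds_before_trip_ended = min(3600, (trip_end - interval_start).total_seconds())   # 3600.0
print(seconds_before_trip_ended - seconds_before_trip_started)  # 1200.0 s of trip 4 in this interval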
The outermost query subtracts the two aforementioned fields, effectively calculating the time each trip spent in each interval, and sums it over all trips, grouped by interval:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
-- middle-level query
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
-- middle-level query end
) subq
group by interval_start
order by interval_start;
Additional grouping
If we have another column in the table and what we actually need is the above result broken down by that column, we simply add the column to the appropriate select and group by clauses (and optionally the order by clause).
Assuming there is an additional driver_id column in the trips table:
id | number_of_trip | start_date | end_date | seconds | driver_id
----+----------------+---------------------+---------------------+---------+-----------
1 | 637hui | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 | 720 | 0
2 | 384nfj | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 | 1440 | 0
3 | 102fiu | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 | 780 | 1
4 | 948pvc | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 | 2400 | 1
5 | 473mds | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 | 780 | 1
6 | 103fkd | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 | 1380 | 2
7 | 905783 | 2022-03-10 03:12:00 | | 0 | 2
8 | 498wsq | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 | 1080 | 2
The modified query looks like this:
select
interval_start,
driver_id,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
driver_id,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
) subq
group by interval_start, driver_id
order by interval_start, driver_id;
and gives the following result:
interval_start | driver_id | seconds
---------------------+-----------+---------
2022-03-10 01:00:00 | 0 | 720
2022-03-10 02:00:00 | 0 | 1440
2022-03-10 02:00:00 | 1 | 2760
2022-03-10 03:00:00 | 1 | 1200
2022-03-10 03:00:00 | 2 | 4260
2022-03-10 04:00:00 | 2 | 3600
2022-03-10 05:00:00 | 2 | 4680
2022-03-10 06:00:00 | 2 | 0