Pandas upsample rows with a start and end time
I have a DataFrame of the following form:
In [5]: df = pd.DataFrame({
...: 'start_time': ['2022-01-01 01:15', '2022-01-01 13:00'],
...: 'end_time': ['2022-01-01 03:45', '2022-01-01 15:00'],
...: 'values': [1000, 750]})
In [6]: df
Out[6]:
start_time end_time values
0 2022-01-01 01:15 2022-01-01 03:45 1000
1 2022-01-01 13:00 2022-01-01 15:00 750
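I want to convert this into 24 hourly values, splitting each value proportionally across the hours covered by its start_time/end_time range. For the example above, this should produce: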
In [10]: result
Out[10]:
value
2022-01-01 00:00:00 0
2022-01-01 01:00:00 300
2022-01-01 02:00:00 400
2022-01-01 03:00:00 300
2022-01-01 04:00:00 0
2022-01-01 05:00:00 0
2022-01-01 06:00:00 0
2022-01-01 07:00:00 0
2022-01-01 08:00:00 0
2022-01-01 09:00:00 0
2022-01-01 10:00:00 0
2022-01-01 11:00:00 0
2022-01-01 12:00:00 0
2022-01-01 13:00:00 375
2022-01-01 14:00:00 375
2022-01-01 15:00:00 0
2022-01-01 16:00:00 0
2022-01-01 17:00:00 0
2022-01-01 18:00:00 0
2022-01-01 19:00:00 0
2022-01-01 20:00:00 0
2022-01-01 21:00:00 0
2022-01-01 22:00:00 0
2022-01-01 23:00:00 0
The start_time/end_time ranges do not overlap. Any suggestions on how to achieve this?
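As a quick check of the expected numbers, the proportional split for the first row can be worked out by hand. This is only an illustrative sketch: the per-hour minute counts are read off the 01:15 to 03:45 range, and the variable names are made up for the example.

```python
# Row 0 spans 01:15-03:45 (150 minutes) and carries a value of 1000.
# Minutes of that range falling into each clock hour:
minutes_per_hour = {"01:00": 45, "02:00": 60, "03:00": 45}
total_minutes = sum(minutes_per_hour.values())  # 150

# Each hour receives value * (minutes in that hour / total minutes).
shares = {hour: 1000 * m / total_minutes for hour, m in minutes_per_hour.items()}
print(shares)  # {'01:00': 300.0, '02:00': 400.0, '03:00': 300.0}
```

The second row covers exactly two full hours, so its 750 is split evenly into 375 and 375.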
Use:
# duration of each interval in minutes
df['diff'] = pd.to_datetime(df['end_time']).sub(pd.to_datetime(df['start_time'])).dt.total_seconds().div(60)
# one timestamp per minute of each interval, keyed by the original row label
# (inclusive='left' needs pandas >= 1.4; older versions spell it closed='left')
s = pd.concat([pd.Series(r.Index, pd.date_range(r.start_time, r.end_time, freq='Min', inclusive='left')) for r in df.itertuples()])
s = pd.Series(s.index, s.to_numpy(), name='new')
df = df.join(s)
# resample to hours, counting the minutes that fall into each hour
# (pandas >= 2.2 prefers the lowercase alias 'h' over 'H')
df = df.resample('H', on='new').agg({'values':'first', 'diff':'first', 'new':'size'})
# multiply each value by the share of its interval that lies inside the hour
df['value'] = df['values'].mul(df['new'].div(df['diff'])).fillna(0)
# add the missing hours so that every day has 24 rows
r = pd.date_range(df.index.min().normalize(), df.index.max().normalize() + pd.Timedelta(hours=23), freq='H')
df = df[['value']].reindex(r, fill_value=0)
print(df)
value
2022-01-01 00:00:00 0.0
2022-01-01 01:00:00 300.0
2022-01-01 02:00:00 400.0
2022-01-01 03:00:00 300.0
2022-01-01 04:00:00 0.0
2022-01-01 05:00:00 0.0
2022-01-01 06:00:00 0.0
2022-01-01 07:00:00 0.0
2022-01-01 08:00:00 0.0
2022-01-01 09:00:00 0.0
2022-01-01 10:00:00 0.0
2022-01-01 11:00:00 0.0
2022-01-01 12:00:00 0.0
2022-01-01 13:00:00 375.0
2022-01-01 14:00:00 375.0
2022-01-01 15:00:00 0.0
2022-01-01 16:00:00 0.0
2022-01-01 17:00:00 0.0
2022-01-01 18:00:00 0.0
2022-01-01 19:00:00 0.0
2022-01-01 20:00:00 0.0
2022-01-01 21:00:00 0.0
2022-01-01 22:00:00 0.0
2022-01-01 23:00:00 0.0
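If expanding every interval into one row per minute is too expensive, the overlap between each interval and each hourly bin can also be computed directly. The following is a minimal sketch of that alternative, not the approach shown above. `split_by_hour` is a hypothetical helper name, and the sketch assumes timezone-naive timestamps, the question's column names, and that every row has an end_time later than its start_time.

```python
import numpy as np
import pandas as pd


def split_by_hour(df):
    """Distribute each row's value over hourly bins by overlap duration."""
    start = pd.to_datetime(df["start_time"]).to_numpy()
    end = pd.to_datetime(df["end_time"]).to_numpy()
    one_hour = pd.Timedelta(hours=1)

    # Hourly bins covering every calendar day touched by the data.
    first = pd.Timestamp(start.min()).normalize()
    last = pd.Timestamp(end.max()).normalize() + pd.Timedelta(hours=23)
    bins = pd.date_range(first, last, freq=one_hour)

    # Broadcast bins (rows) against intervals (columns).
    lo = bins.to_numpy()[:, None]
    hi = (bins + one_hour).to_numpy()[:, None]
    overlap = np.minimum(hi, end) - np.maximum(lo, start)
    # Negative overlaps mean "no intersection", so clip them to zero seconds.
    overlap_s = np.clip(overlap / np.timedelta64(1, "s"), 0, None)

    # Share of each interval inside each bin (assumes end > start).
    duration_s = (end - start) / np.timedelta64(1, "s")
    share = overlap_s / duration_s

    # Sum the contributions of all intervals per bin.
    value = share @ df["values"].to_numpy()
    return pd.DataFrame({"value": value}, index=bins)


df = pd.DataFrame({
    "start_time": ["2022-01-01 01:15", "2022-01-01 13:00"],
    "end_time": ["2022-01-01 03:45", "2022-01-01 15:00"],
    "values": [1000, 750],
})
result = split_by_hour(df)
print(result)
```

Because the shares of each interval add up to 1, the column total of `result` equals the sum of the input `values`, which is a convenient sanity check. The intermediate array has one entry per (hour, row) pair, so for very long time spans with many rows a per-row loop may be preferable.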