Apply cumxxx (sum, min...) to a window of varying size in a DataFrame
I would like to apply cumxxx operations to windows of varying size in a DataFrame.
Consider the following input:
import pandas as pd
from random import seed, randint
from collections import OrderedDict
p5h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='5h', name='p5h')
p1h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='1h', name='p1h')
seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)
p5h_st_as_series = p5h.start_time.to_series()
df['OpeneningPeriod'] = df.apply(
    lambda x: p5h.to_series().loc[p5h_st_as_series.index <=
                                  x.name.start_time].index[-1],
    axis=1)
Result:
df.head(20)
Values OpeneningPeriod
p1h
2020-02-01 00:00 2 2020-02-01 00:00
2020-02-01 01:00 9 2020-02-01 00:00
2020-02-01 02:00 1 2020-02-01 00:00
2020-02-01 03:00 4 2020-02-01 00:00
2020-02-01 04:00 1 2020-02-01 00:00
2020-02-01 05:00 7 2020-02-01 05:00
2020-02-01 06:00 7 2020-02-01 05:00
2020-02-01 07:00 7 2020-02-01 05:00
2020-02-01 08:00 10 2020-02-01 05:00
2020-02-01 09:00 6 2020-02-01 05:00
2020-02-01 10:00 3 2020-02-01 10:00
2020-02-01 11:00 1 2020-02-01 10:00
2020-02-01 12:00 7 2020-02-01 10:00
2020-02-01 13:00 0 2020-02-01 10:00
2020-02-01 14:00 6 2020-02-01 10:00
2020-02-01 15:00 6 2020-02-01 15:00
2020-02-01 16:00 9 2020-02-01 15:00
2020-02-01 17:00 0 2020-02-01 15:00
2020-02-01 18:00 7 2020-02-01 15:00
2020-02-01 19:00 4 2020-02-01 15:00
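Here, cumxxx is to be applied over the defined 5-hour periods. The windows can vary in length: a window can be a day (some days are affected by daylight saving time) or a month (the number of hours in a month is not fixed).
The result I am looking for is: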
df_result.head(11)
Values OpeneningPeriod Cumsum
p1h
2020-02-01 00:00 2 2020-02-01 00:00 2 <- cumsum starts with a new period
2020-02-01 01:00 9 2020-02-01 00:00 11
2020-02-01 02:00 1 2020-02-01 00:00 12
2020-02-01 03:00 4 2020-02-01 00:00 16
2020-02-01 04:00 1 2020-02-01 00:00 17
2020-02-01 05:00 7 2020-02-01 05:00 7 <- cumsum starts with a new period
2020-02-01 06:00 7 2020-02-01 05:00 14
2020-02-01 07:00 7 2020-02-01 05:00 21
2020-02-01 08:00 10 2020-02-01 05:00 31
2020-02-01 09:00 6 2020-02-01 05:00 37
2020-02-01 10:00 3 2020-02-01 10:00 3 <- cumsum starts with a new period
The same goes for cummin and cummax.
Does anyone have an idea?
Thanks for your help!
Best,
If you need to group by 5H windows on the DatetimeIndex, use DataFrame.resample with cumsum:
df['Cumsum'] = df.resample('5H')['Values'].cumsum()
Or Grouper:
df['Cumsum'] = df.groupby(pd.Grouper(freq='5H'))['Values'].cumsum()
print (df.head(11))
Values OpeneningPeriod Cumsum
p1h
2020-02-01 00:00 2 2020-02-01 00:00 2
2020-02-01 01:00 9 2020-02-01 00:00 11
2020-02-01 02:00 1 2020-02-01 00:00 12
2020-02-01 03:00 4 2020-02-01 00:00 16
2020-02-01 04:00 1 2020-02-01 00:00 17
2020-02-01 05:00 7 2020-02-01 05:00 7
2020-02-01 06:00 7 2020-02-01 05:00 14
2020-02-01 07:00 7 2020-02-01 05:00 21
2020-02-01 08:00 10 2020-02-01 05:00 31
2020-02-01 09:00 6 2020-02-01 05:00 37
2020-02-01 10:00 3 2020-02-01 10:00 3
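The question also mentions windows whose length is not fixed, such as calendar months. A minimal sketch of the same Grouper idea for that case, assuming the data sits on a DatetimeIndex; the `daily` frame and its values are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical daily data spanning a month boundary (not from the question).
idx = pd.date_range("2020-01-30", periods=5, freq="D")
daily = pd.DataFrame({"Values": [1, 2, 3, 4, 5]}, index=idx)

# "MS" (month start) makes each calendar month one group, whatever its length,
# so the cumulative sum restarts on the first row of every month.
daily["Cumsum"] = daily.groupby(pd.Grouper(freq="MS"))["Values"].cumsum()
print(daily)
```

Here the running total restarts on 2020-02-01, giving 1, 3 for January and 3, 7, 12 for February. Swapping `cumsum()` for `cummin()` or `cummax()` should work the same way.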
groupby should be a good starting point:
df['Cumsum'] = df.groupby('OpeneningPeriod')['Values'].cumsum()
It gives:
Values OpeneningPeriod Cumsum
p1h
2020-02-01 00:00 2 2020-02-01 00:00 2
2020-02-01 01:00 9 2020-02-01 00:00 11
2020-02-01 02:00 1 2020-02-01 00:00 12
2020-02-01 03:00 4 2020-02-01 00:00 16
2020-02-01 04:00 1 2020-02-01 00:00 17
2020-02-01 05:00 7 2020-02-01 05:00 7
2020-02-01 06:00 7 2020-02-01 05:00 14
2020-02-01 07:00 7 2020-02-01 05:00 21
2020-02-01 08:00 10 2020-02-01 05:00 31
2020-02-01 09:00 6 2020-02-01 05:00 37
2020-02-01 10:00 3 2020-02-01 10:00 3
2020-02-01 11:00 1 2020-02-01 10:00 4
2020-02-01 12:00 7 2020-02-01 10:00 11
2020-02-01 13:00 0 2020-02-01 10:00 11
2020-02-01 14:00 6 2020-02-01 10:00 17
2020-02-01 15:00 6 2020-02-01 15:00 6
...
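Because the grouping key is just a column, the windows do not have to be the same size, and cummin and cummax follow the same pattern. A small sketch under stated assumptions: the values are the first 11 from the question, while `window_starts`, `WindowStart` and `df_demo` are hypothetical names, and the irregular window boundaries are invented for illustration. The key is built with `numpy.searchsorted` instead of a row-wise `apply`:

```python
import numpy as np
import pandas as pd

# First 11 values from the question, on an hourly DatetimeIndex.
idx = pd.date_range("2020-02-01 00:00", periods=11, freq="h")
df_demo = pd.DataFrame({"Values": [2, 9, 1, 4, 1, 7, 7, 7, 10, 6, 3]}, index=idx)

# Hypothetical window starts of unequal length (5 h, 2 h, then 4 h); must be sorted.
window_starts = pd.DatetimeIndex(
    ["2020-02-01 00:00", "2020-02-01 05:00", "2020-02-01 07:00"]
)

# For each row, find the last window start that is <= the row's timestamp.
pos = np.searchsorted(window_starts.values, df_demo.index.values, side="right") - 1
df_demo["WindowStart"] = window_starts.values[pos]

# Every cumulative operation restarts at the first row of each window.
grouped = df_demo.groupby("WindowStart")["Values"]
df_demo = df_demo.assign(
    Cumsum=grouped.cumsum(),
    Cummin=grouped.cummin(),
    Cummax=grouped.cummax(),
)
print(df_demo)
```

The lookup assumes every row falls at or after the first window start; a row before it would get position -1 and silently pick up the last window.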