How to do a rolling window count on a time series with a multi-index?
I have this dataframe:
ID Date Received
000 2018-01-01 00:00:00+00:00 True
2018-01-01 06:24:44+00:00 True
2018-01-03 16:24:45+00:00 False
2018-01-13 20:00:00+00:00 True
2018-01-13 23:00:00+00:00 True
2018-01-25 22:30:55+00:00 True
2018-01-26 00:30:55+00:00 False
111 2018-01-01 12:00:00+00:00 True
2018-01-02 15:00:45+00:00 True
2018-01-04 00:00:00+00:00 True
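Is there a way to do a rolling window count of the True values in the "Received" column over 7 days, grouped by ID? I tried df.rolling('7D').count(), but it returns an error.
I am looking for something like this: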
ID Date range Count
000 2018-01-01 00:00:00+00:00 - 2018-01-07 00:00:00+00:00 2
2018-01-08 20:00:00+00:00 - 2018-01-14 00:00:00+00:00 2
2018-01-15 22:30:55+00:00 - 2018-01-21 20:00:00+00:00 0
2018-01-22 22:30:55+00:00 - 2018-01-28 20:00:00+00:00 1
111 2018-01-01 00:00:00+00:00 - 2018-01-07 00:00:00+00:00 3
You can try pd.Grouper and specify the frequency:
df.groupby(["ID", pd.Grouper(key='Date', freq='1W')])["Received"].sum()
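To make the one-liner above reproducible, here is a minimal sketch that rebuilds the question's data. It assumes ID and Date are ordinary columns; if they are index levels, calling df.reset_index() first (or using pd.Grouper(level='Date')) should give the same grouping. The variable name weekly is only for illustration.

```python
import pandas as pd

# Rebuild the sample data from the question as a flat DataFrame
df = pd.DataFrame({
    "ID": ["000"] * 7 + ["111"] * 3,
    "Date": pd.to_datetime([
        "2018-01-01 00:00:00", "2018-01-01 06:24:44", "2018-01-03 16:24:45",
        "2018-01-13 20:00:00", "2018-01-13 23:00:00", "2018-01-25 22:30:55",
        "2018-01-26 00:30:55", "2018-01-01 12:00:00", "2018-01-02 15:00:45",
        "2018-01-04 00:00:00",
    ], utc=True),
    "Received": [True, True, False, True, True, True, False, True, True, True],
})

# Sum the booleans per ID and per calendar week (weeks end on Sunday)
weekly = df.groupby(["ID", pd.Grouper(key="Date", freq="W")])["Received"].sum()
print(weekly)
```

Summing a boolean column counts its True values, which is why sum() is used here instead of count().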
Full answer:
import pandas as pd

# Count the number of True values per week per ID
out = df.groupby(["ID", pd.Grouper(key='Date', freq='1W')])["Received"] \
.sum() \
.to_frame() \
.reset_index() \
.rename(columns={"Received": "Count"})
print(out)
# ID Date Count
# 0 000 2018-01-07 00:00:00+00:00 2.0
# 1 000 2018-01-14 00:00:00+00:00 2.0
# 2 000 2018-01-28 00:00:00+00:00 1.0
# 3 111 2018-01-07 00:00:00+00:00 3.0
# Helper: reindex one ID's rows onto a complete weekly range, filling gaps with 0
def fill_date_range(df):
    dates = pd.date_range(df.Date.min(), df.Date.max(), freq="1W")
    return df.set_index("Date") \
        .reindex(dates)[['Count']] \
        .fillna(0)

# Fill missing date ranges for each ID
out = out.groupby(by="ID").apply(fill_date_range) \
    .reset_index() \
    .rename(columns={"level_1": "Date"})
print(out)
# ID Date Count
# 0 000 2018-01-07 00:00:00+00:00 2.0
# 1 000 2018-01-14 00:00:00+00:00 2.0
# 2 000 2018-01-21 00:00:00+00:00 0.0
# 3 000 2018-01-28 00:00:00+00:00 1.0
# 4 111 2018-01-07 00:00:00+00:00 3.0
# Add date range interval as string (week start - week end)
fmt = '%Y-%m-%d %H:%M:%S'
out["Date_expected"] = (out.Date - pd.Timedelta(weeks=1)).dt.strftime(fmt) + " - " + out.Date.dt.strftime(fmt)
print(out)
# ID Date Count Date_expected
# 0 000 2018-01-07 00:00:00+00:00 2.0 2017-12-31 00:00:00 - 2018-01-07 00:00:00
# 1 000 2018-01-14 00:00:00+00:00 2.0 2018-01-07 00:00:00 - 2018-01-14 00:00:00
# 2 000 2018-01-21 00:00:00+00:00 0.0 2018-01-14 00:00:00 - 2018-01-21 00:00:00
# 3 000 2018-01-28 00:00:00+00:00 1.0 2018-01-21 00:00:00 - 2018-01-28 00:00:00
# 4 111 2018-01-07 00:00:00+00:00 3.0 2017-12-31 00:00:00 - 2018-01-07 00:00:00
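Note that the weekly bins above are calendar-aligned buckets, not a true rolling window. If what is wanted is a trailing 7-day count evaluated at every row, a groupby followed by a time-based rolling sum may be closer to the original question. The following is a minimal sketch, not part of the answer above; it assumes a reasonably recent pandas version, that Date is sorted within each ID, and the names Received_int and rolling_count are purely illustrative.

```python
import pandas as pd

# Same sample data as in the question, as a flat DataFrame
df = pd.DataFrame({
    "ID": ["000"] * 7 + ["111"] * 3,
    "Date": pd.to_datetime([
        "2018-01-01 00:00:00", "2018-01-01 06:24:44", "2018-01-03 16:24:45",
        "2018-01-13 20:00:00", "2018-01-13 23:00:00", "2018-01-25 22:30:55",
        "2018-01-26 00:30:55", "2018-01-01 12:00:00", "2018-01-02 15:00:45",
        "2018-01-04 00:00:00",
    ], utc=True),
    "Received": [True, True, False, True, True, True, False, True, True, True],
})

# Cast booleans to integers so the rolling sum counts True values
df["Received_int"] = df["Received"].astype(int)

# Time-based windows need a datetime index; Date must be sorted within each ID
rolling_count = (
    df.set_index("Date")
      .groupby("ID")["Received_int"]
      .rolling("7D")
      .sum()
)
print(rolling_count)
```

Each value is the number of True rows for that ID in the 7 days up to and including the row's own timestamp, so the result has one entry per input row rather than one per week.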