Pandas 滚动window 带过滤条件去掉最新的一些数据

Pandas Rolling window with filtering condition to remove the some latest data

这是 的后续问题。我想执行最近 n 天的滚动 window,但我想从每个 window 中过滤掉最近的 x 天(x 小于 n)

这是一个例子:

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}

df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90

我愿意

预期结果:(不需要Window_FromWindow_To两列,我只是用它们来演示模拟数据)

   Name       Date  Earning Window_From  Window_To   Sum
0  Jack 2021-01-08       40  2020-12-09 2020-12-19   0.0
1   Jim 2021-01-27       10  2020-12-28 2021-01-07   0.0
2  Jack 2021-02-05       20  2021-01-06 2021-01-16  40.0
3   Jim 2021-02-10       20  2021-01-11 2021-01-21   0.0
4  Jack 2021-02-17       40  2021-01-18 2021-01-28   0.0
5  Jack 2021-02-18       50  2021-01-19 2021-01-29   0.0
6   Jim 2021-02-20      100  2021-01-21 2021-01-31  10.0
7  Jack 2021-02-21       70  2021-01-22 2021-02-01   0.0
8  Jane 2021-02-22       80  2021-01-23 2021-02-02   0.0
9  Jane 2021-03-29       90  2021-02-27 2021-03-09   0.0

简单的解决方案

计算 30 天和 20 天 rolling sum 然后用 20 天的总和减去 30 天的总和得到前 10 天的有效 rolling sum

s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()

df.merge(s1.sub(s2).reset_index(name='sum'), how='left')

   Name       Date  Earning   sum
0  Jack 2021-01-08       40   0.0
1   Jim 2021-01-27       10   0.0
2  Jack 2021-02-05       20  40.0
3   Jim 2021-02-10       20   0.0
4  Jack 2021-02-17       40   0.0
5  Jack 2021-02-18       50   0.0
6   Jim 2021-02-20      100  10.0
7  Jack 2021-02-21       70   0.0
8  Jane 2021-02-22       80   0.0
9  Jane 2021-03-29       90   0.0

滚动的替代方法(可能更快):

编辑:使用 OP 的数据集实际上速度较慢。

df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10) 
df = df.set_index(['Name', 'Date'])
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum() 
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]

print(df.reset_index().drop(columns=['start', 'end']))
   Name       Date  Earning  Sum
0  Jack 2021-01-08       40    0
1   Jim 2021-01-27       10    0
2  Jack 2021-02-05       20   40
3   Jim 2021-02-10       20    0
4  Jack 2021-02-17       40    0
5  Jack 2021-02-18       50    0
6   Jim 2021-02-20      100   10
7  Jack 2021-02-21       70    0
8  Jane 2021-02-22       80    0
9  Jane 2021-03-29       90    0