Pandas rolling sum with groupby and conditions
I have a dataframe with a sales time series for different items, broken down by customer. For each item and a given day, I want to compute:
- the share of my best customer in total sales over the last 2 days
- the share of my top customers (from a list) in total sales over the last 2 days
I have already tried the solutions given here:
- for the rolling sum: Pandas sum over a date range for each category separately
- for conditional grouping:
An example dataframe can be generated with:
import pandas as pd
from datetime import timedelta
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
# Create datetimes and data
df_1['item'] = [1, 1, 1, 2, 2, 2, 2]
df_1['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_1['customer'] = ['a', 'b', 'c', 'a', 'b', 'b', 'c']
df_1['sales'] = [2, 4, 1, 5, 7, 2, 3]
df_2['item'] = [1, 1, 1, 2, 2, 2, 2]
df_2['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_2['customer'] = ['b', 'b', 'c', 'a', 'a', 'c', 'a']
df_2['sales'] = [2, 3, 4, 2, 3, 5, 6]
df_3['item'] = [1, 1, 1, 2, 2, 2, 2]
df_3['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_3['customer'] = ['b', 'c', 'a', 'c', 'b', 'a', 'b']
df_3['sales'] = [6, 5, 2, 3, 4, 5, 6]
df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item', 'date'])
df = df.reset_index(drop=True)  # assign back; reset_index returns a new frame
and looks like this:
item | date | customer | sales |
---|---|---|---|
1 | 2018-01-01 | a | 2 |
1 | 2018-01-01 | b | 2 |
1 | 2018-01-01 | b | 6 |
1 | 2018-01-02 | b | 4 |
1 | 2018-01-02 | b | 3 |
1 | 2018-01-02 | c | 5 |
1 | 2018-01-03 | c | 1 |
1 | 2018-01-03 | c | 4 |
1 | 2018-01-03 | a | 2 |
2 | 2018-01-04 | a | 5 |
2 | 2018-01-04 | a | 2 |
2 | 2018-01-04 | c | 3 |
2 | 2018-01-05 | b | 7 |
2 | 2018-01-05 | a | 3 |
2 | 2018-01-05 | b | 4 |
2 | 2018-01-06 | b | 2 |
2 | 2018-01-06 | c | 5 |
2 | 2018-01-06 | a | 5 |
2 | 2018-01-07 | c | 3 |
2 | 2018-01-07 | a | 6 |
2 | 2018-01-07 | b | 6 |
I would like to get the following result:
item | date | sales_at_day | sales_last_2_days | a_share | top_share |
---|---|---|---|---|---|
1 | 2018-01-01 | 10 | NaN | NaN | NaN |
1 | 2018-01-02 | 12 | 10 | 0.20 | 0.20 |
1 | 2018-01-03 | 7 | 22 | 0.09 | 0.09 |
2 | 2018-01-04 | 10 | NaN | NaN | NaN |
2 | 2018-01-05 | 14 | 10 | 0.70 | 1.00 |
2 | 2018-01-06 | 12 | 24 | 0.29 | 0.42 |
2 | 2018-01-07 | 15 | 26 | 0.31 | 0.50 |
where a_share is the share of customer 'a''s sales in the total sales of the last 2 days (excluding the current day), and top_share is the share of sales by the customers in the list top_cust = ['a', 'c'] in the total sales of the last 2 days (excluding the current day).
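For example, for item 1 on 2018-01-02, total sales over the previous 2 days are 10 (only 2018-01-01 falls in the window), of which customer 'a' contributed 2, so a_share = 2 / 10 = 0.20; customers 'a' and 'c' together also contributed 2, so top_share = 0.20 as well.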
Any ideas? Thanks a lot :)
Andy
Use:
# custom rolling sum shifted by one row, so the current day is excluded
f = lambda x: x.rolling(2, min_periods=1).sum().shift()
# aggregate sum per item and date
df1 = df.groupby(['item','date'], as_index=False)['sales'].sum()
# apply custom rolling per group
df1['sales_last_2_days'] = df1.groupby('item')['sales'].apply(f).reset_index(drop=True, level=0)
# filter customer a and aggregate sum
a = df[df['customer'].eq('a')].groupby(['item','date'])['sales'].sum().rename('a_share')
# add new column to original
df1 = df1.join(a, on=['item','date'])
# apply custom rolling per group and divide
df1['a_share'] = df1.groupby('item')['a_share'].apply(f).reset_index(drop=True, level=0) / df1['sales_last_2_days']
# very similar to the above, only test membership with isin
top_cust = ['a', 'c']
a = df[df['customer'].isin(top_cust)].groupby(['item','date'])['sales'].sum().rename('top_share')
df1 = df1.join(a, on=['item','date'])
df1['top_share'] = df1.groupby('item')['top_share'].apply(f).reset_index(drop=True, level=0) / df1['sales_last_2_days']
print(df1)
item date sales sales_last_2_days a_share top_share
0 1 2018-01-01 10 NaN NaN NaN
1 1 2018-01-02 12 10.0 0.200000 0.200000
2 1 2018-01-03 7 22.0 0.090909 0.318182
3 2 2018-01-04 10 NaN NaN NaN
4 2 2018-01-05 14 10.0 0.700000 1.000000
5 2 2018-01-06 12 24.0 0.416667 0.541667
6 2 2018-01-07 15 26.0 0.307692 0.500000
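Both share columns follow the same filter → aggregate → join → roll-and-divide pattern, so the two repeated blocks can be folded into a small helper. This is a minimal sketch, not part of the original answer: add_share is a hypothetical name, and it assumes df, df1 (with sales_last_2_days already computed) and f from the snippet above. Run it in place of the two repeated blocks:
def add_share(df, df1, mask, name, f):
    # aggregate the filtered sales per item/date and join onto df1
    s = df[mask].groupby(['item', 'date'])['sales'].sum().rename(name)
    df1 = df1.join(s, on=['item', 'date'])
    # shifted rolling sum per item, divided by the 2-day total
    rolled = df1.groupby('item')[name].apply(f).reset_index(drop=True, level=0)
    df1[name] = rolled / df1['sales_last_2_days']
    return df1

df1 = add_share(df, df1, df['customer'].eq('a'), 'a_share', f)
df1 = add_share(df, df1, df['customer'].isin(['a', 'c']), 'top_share', f)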
If you want to use rolling with days instead, it is more complicated:
df1 = df.groupby(['item','date'], as_index=False)['sales'].sum()
# 2-day rolling sum per item, shifted so the current day is excluded
sales1 = (df1.set_index('date')
             .groupby('item')['sales']
             .rolling('2D', min_periods=1)
             .sum()
             .groupby('item')
             .shift()
             .rename('sales_last_2_days')
             )
df1 = df1.join(sales1, on=['item','date'])
df2 = df[df['customer'].eq('a')].groupby(['item','date'], as_index=False)['sales'].sum()
# asfreq('D') inserts the missing dates, so the one-row shift
# below really is a one-day shift
a = (df2.set_index('date')
        .groupby('item')[['sales']]
        .apply(lambda x: x.asfreq('D'))
        .reset_index(level=0)
        .groupby('item')['sales']
        .rolling('2D', min_periods=1)
        .sum()
        .groupby('item')
        .shift()
        .rename('a_share')
        )
print(a)
df1 = df1.join(a, on=['item','date'])
df1['a_share'] /= df1['sales_last_2_days']
top_cust = ['a', 'c']
df3 = df[df['customer'].isin(top_cust)].groupby(['item','date'], as_index=False)['sales'].sum()
# same approach for the top customers, filtered with isin
b = (df3.set_index('date')
        .groupby('item')[['sales']]
        .apply(lambda x: x.asfreq('D'))
        .reset_index(level=0)
        .groupby('item')['sales']
        .rolling('2D', min_periods=1)
        .sum()
        .groupby('item')
        .shift()
        .rename('top_share')
        )
df1 = df1.join(b, on=['item','date'])
df1['top_share'] /= df1['sales_last_2_days']
print(df1)
item date sales sales_last_2_days a_share top_share
0 1 2018-01-01 10 NaN NaN NaN
1 1 2018-01-02 12 10.0 0.200000 0.200000
2 1 2018-01-03 7 22.0 0.090909 0.318182
3 2 2018-01-04 10 NaN NaN NaN
4 2 2018-01-05 14 10.0 0.700000 1.000000
5 2 2018-01-06 12 24.0 0.416667 0.541667
6 2 2018-01-07 15 26.0 0.307692 0.500000
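To see why the asfreq('D') step is needed (a minimal check against the df built above, not part of the original answer): for item 1, customer 'a' only buys on 2018-01-01 and 2018-01-03, so without reindexing to a daily frequency the one-row .shift() would move the Jan 1 window down to Jan 3, i.e. by two days:
s = df[df['customer'].eq('a')].groupby(['item', 'date'])['sales'].sum()
# item 1 has no row for 2018-01-02, so row position != day offset
print(s.loc[1])
# date
# 2018-01-01    2
# 2018-01-03    2
# Name: sales, dtype: int64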