Excel SUMIF equivalent in Pandas
import pandas as pd
import numpy as np

df = pd.DataFrame([['A', 201901, 10, 201801, 201801],
                   ['B', 201902, 11, 201801, 201802],
                   ['B', 201903, 13, 201801, 201803],
                   ['B', 201905, 18, 201801, 201805],
                   ['A', 201906, 80, 201801, 201806],
                   ['A', 202001, 10, 201901, 201901],
                   ['A', 202002, 11, 201901, 201902],
                   ['A', 202003, 13, 201901, 201903],
                   ['A', 202004, 18, 201901, 201904],
                   ['B', 202005, 80, 201901, 201905],
                   ['A', 202006, 80, 201901, 201906],
                   ['B', 201901, 10, 201801, 201801],
                   ['A', 201902, 11, 201801, 201802],
                   ['A', 201903, 13, 201801, 201803],
                   ['A', 201905, 18, 201801, 201805],
                   ['B', 201906, 80, 201801, 201806],
                   ['B', 202001, 10, 201901, 201901],
                   ['B', 202002, 11, 201901, 201902],
                   ['B', 202003, 13, 201901, 201903],
                   ['B', 202004, 18, 201901, 201904],
                   ['A', 202005, 80, 201901, 201905],
                   ['B', 202006, 80, 201901, 201906]],
                  columns=['Store', 'yearweek', 'sales', 'Start_PY', 'PY'])
df
Starting from the df above (note that week 201904 is missing), I want to add a column 'Sales_PY' containing, for each row, the sum of that store's sales over the corresponding weeks of the previous year. Like this:
Store | yearweek | sales | Start_PY | PY | sales_PY |
---|---|---|---|---|---|
A | 201901 | 100 | 201801 | 201801 | NaN |
B | 201902 | 11 | 201801 | 201802 | NaN |
B | 201903 | 13 | 201801 | 201803 | NaN |
B | 201905 | 18 | 201801 | 201805 | NaN |
A | 201906 | 800 | 201801 | 201806 | NaN |
A | 202001 | 100 | 201901 | 201901 | 100.0 |
A | 202002 | 110 | 201901 | 201902 | 210.0 |
A | 202003 | 130 | 201901 | 201903 | 340.0 |
A | 202004 | 180 | 201901 | 201904 | 340.0 |
B | 202005 | 80 | 201901 | 201905 | 52.0 |
A | 202006 | 800 | 201901 | 201906 | 1320.0 |
B | 201901 | 10 | 201801 | 201801 | NaN |
A | 201902 | 110 | 201801 | 201802 | NaN |
A | 201903 | 130 | 201801 | 201803 | NaN |
A | 201905 | 180 | 201801 | 201805 | NaN |
B | 201906 | 80 | 201801 | 201806 | NaN |
B | 202001 | 10 | 201901 | 201901 | 10.0 |
B | 202002 | 11 | 201901 | 201902 | 21.0 |
B | 202003 | 13 | 201901 | 201903 | 34.0 |
B | 202004 | 18 | 201901 | 201904 | 34.0 |
A | 202005 | 800 | 201901 | 201905 | 520.0 |
B | 202006 | 80 | 201901 | 201906 | 132.0 |
And I figure pandas must have an equivalent of Excel's SUMIF. I.e. the sales_PY for the last row would be the sum of sales WHERE Store == 'B' AND yearweek >= 201901 AND yearweek <= 201906, which equals 132.
Because I can't guarantee that my df will be ordered by store/week, and my df sometimes has missing weeks, I'd rather not use the shift() and/or cumsum() functions.
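For reference, the single condition in that last row translates directly into a boolean-mask sum, which is the closest one-off SUMIF equivalent in pandas (a minimal sketch on a hypothetical toy subset of the data):

```python
import pandas as pd

# Hypothetical toy subset; only the 'B' rows inside the week range should count
df = pd.DataFrame({'Store':    ['B', 'A', 'B', 'B'],
                   'yearweek': [201901, 201903, 201906, 201807],
                   'sales':    [10, 13, 80, 5]})

# SUMIF: sales WHERE Store == 'B' AND 201901 <= yearweek <= 201906
total = df.loc[(df['Store'] == 'B')
               & df['yearweek'].between(201901, 201906), 'sales'].sum()
print(total)  # 90
```

The question, of course, is how to apply this per row with row-dependent bounds, which is what the answers below address.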
You could group by store, shift one row forward, then group again and take the cumulative sum.
import pandas as pd
import numpy as np

df = pd.DataFrame([['A', 4, 10, 3, 1],
                   ['A', 5, 11, 4, 2],
                   ['A', 6, 13, 5, 3],
                   ['A', 7, 18, 6, 4],
                   ['B', 4, 80, 3, 1],
                   ['B', 5, 78, 4, 2],
                   ['B', 6, 71, 5, 3],
                   ['B', 7, 80, 6, 4]],
                  columns=['Store', 'week', 'sales', 'week_min_1', 'week_min_3'])

# Shift sales forward one row within each store, then take the cumulative sum per store
df['sales_last_3_weeks'] = df.groupby('Store')['sales'].shift()
df['sales_last_3_weeks'] = df.groupby('Store')['sales_last_3_weeks'].cumsum()
Completely replaced the answer based on OP's clarification.
Note that the df you coded is not consistent with the df printed in your table; I went with the one from the table.
The following is not the most elegant, but given the missing weeks etc. I can't think of a more vectorized operation. We essentially implement a row-by-row calculation that follows the SUMIF logic very closely: the function inside apply is applied to each row r; for each row r, it selects the relevant subset of the original dataframe df and computes the sum.
df['Sales_PY'] = df.apply(lambda r: df.loc[(df['yearweek'] >= r['Start_PY'])
                                           & (df['yearweek'] <= r['PY'])
                                           & (df['Store'] == r['Store']), 'sales'].sum(),
                          axis=1)
Output:
Store yearweek sales Start_PY PY Sales_PY
-- ------- ---------- ------- ---------- ------ ----------
0 A 201901 100 201801 201801 0
1 B 201902 11 201801 201802 0
2 B 201903 13 201801 201803 0
3 B 201905 18 201801 201805 0
4 A 201906 800 201801 201806 0
5 A 202001 100 201901 201901 100
6 A 202002 110 201901 201902 210
7 A 202003 130 201901 201903 340
8 A 202004 180 201901 201904 340
9 B 202005 80 201901 201905 52
10 A 202006 800 201901 201906 1320
11 B 201901 10 201801 201801 0
12 A 201902 110 201801 201802 0
13 A 201903 130 201801 201803 0
14 A 201905 180 201801 201805 0
15 B 201906 80 201801 201806 0
16 B 202001 10 201901 201901 10
17 B 202002 11 201901 201902 21
18 B 202003 13 201901 201903 34
19 B 202004 18 201901 201904 34
20 A 202005 800 201901 201905 520
21 B 202006 80 201901 201906 132
If you want NaNs instead of 0s where there is no sales data, you can pass the min_count=1 argument to the sum above: .sum(min_count=1)
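To illustrate that behavior in isolation (using a throwaway empty Series):

```python
import pandas as pd

empty = pd.Series([], dtype=float)

# An empty sum defaults to 0.0
print(empty.sum())             # 0.0
# With min_count=1, a sum over fewer than 1 valid value yields NaN instead
print(empty.sum(min_count=1))  # nan
```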
The dates for stores A and B appear to be aligned; we can use an inequality join to grab the relevant rows, then sum the values with a groupby before merging back to the original dataframe. conditional_join from pyjanitor is helpful here for the non-equi merge; it uses a binary search instead of iterating over every row, so depending on the data size there may be a performance benefit:
# pip install pyjanitor
import janitor
import pandas as pd
dates = df.filter(like='PY').drop_duplicates()
left = df.loc[:, :"sales"]
outcome = (
left.conditional_join(
dates,
("yearweek", "Start_PY", ">="),
("yearweek", "PY", "<="),
how="right",
)
.groupby(["Store", "Start_PY", "PY"])
.sales.sum()
)
# join back to the original dataframe
df.merge(
outcome.rename("Sales_PY"),
left_on=["Store", "Start_PY", "PY"],
right_index=True,
how="left",
)
Store yearweek sales Start_PY PY Sales_PY
0 A 201901 100 201801 201801 NaN
1 B 201902 11 201801 201802 NaN
2 B 201903 13 201801 201803 NaN
3 B 201905 18 201801 201805 NaN
4 A 201906 800 201801 201806 NaN
5 A 202001 100 201901 201901 100.0
6 A 202002 110 201901 201902 210.0
7 A 202003 130 201901 201903 340.0
8 A 202004 180 201901 201904 340.0
9 B 202005 80 201901 201905 52.0
10 A 202006 800 201901 201906 1320.0
11 B 201901 10 201801 201801 NaN
12 A 201902 110 201801 201802 NaN
13 A 201903 130 201801 201803 NaN
14 A 201905 180 201801 201805 NaN
15 B 201906 80 201801 201806 NaN
16 B 202001 10 201901 201901 10.0
17 B 202002 11 201901 201902 21.0
18 B 202003 13 201901 201903 34.0
19 B 202004 18 201901 201904 34.0
20 A 202005 800 201901 201905 520.0
21 B 202006 80 201901 201906 132.0
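If installing pyjanitor isn't an option, the same non-equi aggregation can be sketched in plain pandas with a self-merge on Store plus a range filter. Note that this materializes every row pair within a store before filtering, so unlike the binary-search approach above it only suits modest data sizes (the toy data below is a hypothetical subset):

```python
import pandas as pd

# Hypothetical subset of the question's data
df = pd.DataFrame({'Store':    ['A', 'A', 'A'],
                   'yearweek': [201901, 202001, 202002],
                   'sales':    [100, 100, 110],
                   'Start_PY': [201801, 201901, 201901],
                   'PY':       [201801, 201901, 201902]})

# Pair every row with every candidate source row of the same store
pairs = df.merge(df[['Store', 'yearweek', 'sales']],
                 on='Store', suffixes=('', '_src'))

# Keep only source weeks inside each row's [Start_PY, PY] window
mask = pairs['yearweek_src'].between(pairs['Start_PY'], pairs['PY'])
sums = (pairs[mask]
        .groupby(['Store', 'yearweek'])['sales_src']
        .sum())

# Merge the window sums back; rows with no matching weeks get NaN
out = df.merge(sums.rename('Sales_PY'),
               left_on=['Store', 'yearweek'], right_index=True, how='left')
```

On this subset, row 202001 gets 100.0 (week 201901 only) and row 201901 gets NaN, matching the min_count-style behavior of the other answers.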