Get value from previous column for each group in groupby
This is my df:
Site | Product | Period | Inflow | Outflow | Production | Opening Inventory | New Opening Inventory | Closing Inventory | Production Needed |
---|---|---|---|---|---|---|---|---|---|
California | Apples | 1 | 0 | 3226 | 4300 | 1213 | 1213 | 0 | 0 |
California | Apples | 2 | 0 | 3279 | 3876 | 0 | 0 | 0 | 0 |
California | Apples | 3 | 0 | 4390 | 4530 | 0 | 0 | 0 | 0 |
California | Apples | 4 | 0 | 4281 | 3870 | 0 | 0 | 0 | 0 |
California | Apples | 5 | 0 | 4421 | 4393 | 0 | 0 | 0 | 0 |
California | Oranges | 1 | 0 | 505 | 400 | 0 | 0 | 0 | 0 |
California | Oranges | 2 | 0 | 278 | 505 | 0 | 0 | 0 | 0 |
California | Oranges | 3 | 0 | 167 | 278 | 0 | 0 | 0 | 0 |
California | Oranges | 4 | 0 | 124 | 167 | 0 | 0 | 0 | 0 |
California | Oranges | 5 | 0 | 106 | 124 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 1 | 0 | 445 | 465 | 293 | 293 | 0 | 0 |
Montreal | Maple Syrup | 2 | 0 | 82 | 398 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 3 | 0 | 745 | 346 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 4 | 0 | 241 | 363 | 0 | 0 | 0 | 0 |
Montreal | Maple Syrup | 5 | 0 | 189 | 254 | 0 | 0 | 0 | 0 |
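For anyone who wants to reproduce this, the sample data above can be rebuilt roughly as follows. This is only a sketch: the `groups` dictionary and the loop variables are names made up here for illustration, and the values are copied from the data shown above.

```python
import pandas as pd

# Per (Site, Product) group: outflow per period, production per period,
# and the opening inventory that appears in period 1.
groups = {
    ("California", "Apples"): ([3226, 3279, 4390, 4281, 4421], [4300, 3876, 4530, 3870, 4393], 1213),
    ("California", "Oranges"): ([505, 278, 167, 124, 106], [400, 505, 278, 167, 124], 0),
    ("Montreal", "Maple Syrup"): ([445, 82, 745, 241, 189], [465, 398, 346, 363, 254], 293),
}

rows = []
for (site, product), (outflow, production, opening) in groups.items():
    for period, (out, prod) in enumerate(zip(outflow, production), start=1):
        # Only period 1 carries an opening inventory in the sample data.
        first = opening if period == 1 else 0
        rows.append({
            "Site": site, "Product": product, "Period": period,
            "Inflow": 0, "Outflow": out, "Production": prod,
            "Opening Inventory": first, "New Opening Inventory": first,
            "Closing Inventory": 0, "Production Needed": 0,
        })

df = pd.DataFrame(rows)
print(df.head())
```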
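As you can see, grouping by Site and Product gives three groups. For each of the three groups, I want to do the following (for periods 2 to 5):
- Set New Opening Inventory to the previous period's Closing Inventory
- Calculate Closing Inventory for the next period using the formula Closing Inventory = Production + Inflow + New Opening Inventory - Outflow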
I am trying to solve this by combining groupby with a for loop.
This is what I have so far.
If df were a single group, I could simply do:
# calculate closing inventory of period 1
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)
for i in range(1, len(df)):
    # carry the previous row's closing inventory forward
    df.loc[i, 'New Opening Inventory'] = df.loc[i-1, 'Closing Inventory']
    df.loc[i, 'Closing Inventory'] = df.loc[i, 'Production'] + df.loc[i, 'Inflow'] + df.loc[i, 'New Opening Inventory'] - df.loc[i, 'Outflow']
When I try to nest this for loop inside a loop over the groups:
# calculate closing inventory of period 1 for all groups
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)
g = df.groupby(['Site', 'Product'])
alist = []
for k in g.groups.keys():
    # work on each group with a fresh 0..n-1 index so that i-1 is the previous period
    temp = g.get_group(k).reset_index(drop=True)
    for i in range(1, len(temp)):
        temp.loc[i, 'New Opening Inventory'] = temp.loc[i-1, 'Closing Inventory']
        temp.loc[i, 'Closing Inventory'] = temp.loc[i, 'Production'] + temp.loc[i, 'Inflow'] + temp.loc[i, 'New Opening Inventory'] - temp.loc[i, 'Outflow']
    alist.append(temp)
df2 = pd.concat(alist, ignore_index=True)
df2
This solution works, but the nested loop seems very inefficient. Is there a better way to do this?
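Your New Opening Inventory is always the previous Closing Inventory.
So I can rewrite this formula
Closing Inventory = Production + Inflow + New Opening Inventory - Outflow
to
Closing Inventory = Production + Inflow + Previous Closing Inventory - Outflow
For the first row, there is no previous closing inventory. From row 2 onward, you calculate the closing inventory and carry it forward to the next row.
Before getting the closing inventory, first compute Production + Inflow - Outflow with a list comprehension. A list comprehension generally performs better than a for loop that assigns row by row through .loc.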
df['Closing Inventory'] = [x + y - z if p > 1 else 0 for p, x, y, z in zip(df['Period'], df['Production'], df['Inflow'], df['Outflow'])]
# df[['Site', 'Product', 'Closing Inventory']]
# Site Product Closing Inventory
# 0 California Apples 0
# 1 California Apples 597
# 2 California Apples 140
# 3 California Apples -411
# 4 California Apples -28
# 5 California Oranges 0
# 6 California Oranges 227
# 7 California Oranges 111
# ...
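The same per-row quantity can also be computed without a Python-level loop at all. The following is a minimal sketch, not part of the original answer, that uses `Series.where` on one group of the sample data; the name `net_flow` is introduced here only for illustration.

```python
import pandas as pd

# One group from the sample data (California / Apples).
df = pd.DataFrame({
    "Period": [1, 2, 3, 4, 5],
    "Production": [4300, 3876, 4530, 3870, 4393],
    "Inflow": [0, 0, 0, 0, 0],
    "Outflow": [3226, 3279, 4390, 4281, 4421],
})

# Net flow per row; rows where Period > 1 keep their value, period 1 becomes 0,
# which is the same rule the list comprehension applies.
net_flow = (df["Production"] + df["Inflow"] - df["Outflow"]).where(df["Period"] > 1, 0)
print(net_flow.tolist())
```

On data this small the difference is unlikely to matter; the column arithmetic mainly pays off on larger frames.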
Then the rest of the formula is just adding the previously calculated closing inventory, which means you can cumsum this result.
For row 1, Previous Closing (0) + calculated part (597) = 597
For row 2, Previous Closing (597) + calculated part (140) = 737
...
df['Closing Inventory'] = df.groupby(['Site', 'Product'])['Closing Inventory'].cumsum()
# df[['Site', 'Product', 'Closing Inventory']]
# Site Product Closing Inventory
# 0 California Apples 0
# 1 California Apples 597
# 2 California Apples 737
# 3 California Apples 326
# 4 California Apples 298
# 5 California Oranges 0
# 6 California Oranges 227
# 7 California Oranges 338
# ...
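Why the cumulative sum reproduces the recurrence can be written out explicitly. The symbols are introduced here for illustration only: within one group, $C_t$ is the closing inventory of period $t$, and $P_t$, $I_t$, $O_t$ are production, inflow and outflow. Taking $C_1 = 0$, as the code above does, the recurrence unrolls into a running sum:

```latex
\begin{aligned}
C_t &= P_t + I_t - O_t + C_{t-1}, \qquad t \ge 2, \quad C_1 = 0 \\
    &= \sum_{s=2}^{t} \left( P_s + I_s - O_s \right)
\end{aligned}
```

For California / Apples this gives $C_3 = 597 + 140 = 737$, which matches the output above.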
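Likewise, New Opening Inventory is always the previous Closing Inventory, except when the period is 1. So first shift Closing Inventory within each group, then keep the existing New Opening Inventory where the period is 1.
I use combine_first to pick the value from either New Opening Inventory or the shifted Closing Inventory.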
df['New Opening Inventory'] = (df['New Opening Inventory'].replace(0, np.nan)
.combine_first(
df.groupby(['Site', 'Product'])['Closing Inventory']
.shift()
.fillna(0)
).astype(int))
Result
Site Product Period New Opening Inventory Closing Inventory
0 California Apples 1 1213 0
1 California Apples 2 0 597
2 California Apples 3 597 737
3 California Apples 4 737 326
4 California Apples 5 326 298
5 California Oranges 1 0 0
6 California Oranges 2 0 227
7 California Oranges 3 227 338
...
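One thing to be aware of: `replace(0, np.nan)` treats every zero in New Opening Inventory as "missing", which is fine for this sample data. If a period 1 opening inventory of zero should be handled by rule and not by value, a possible alternative is to select on Period directly. This is a sketch under that assumption, not the answer's original method; `previous_closing` is a name made up here.

```python
import pandas as pd

# One group (Montreal / Maple Syrup) with Closing Inventory already cumulated.
df = pd.DataFrame({
    "Site": ["Montreal"] * 3,
    "Product": ["Maple Syrup"] * 3,
    "Period": [1, 2, 3],
    "New Opening Inventory": [293, 0, 0],
    "Closing Inventory": [0, 316, -83],
})

# Previous period's closing inventory within each group (NaN in period 1).
previous_closing = df.groupby(["Site", "Product"])["Closing Inventory"].shift()

# Keep the existing value in period 1, otherwise take the previous closing inventory.
df["New Opening Inventory"] = (
    df["New Opening Inventory"]
    .where(df["Period"] == 1, previous_closing)
    .astype(int)
)
print(df[["Period", "New Opening Inventory", "Closing Inventory"]])
```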
Using the sample data on my laptop:
Original solution: 8.44 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution: 2.95 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think that with a list comprehension and vectorized functions, this solution can run faster.
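To check that both approaches agree, the two can be run side by side on a small slice of the sample data. This is a sketch only: the function names `vectorized` and `looped` are made up here, and the loop assumes that the closing inventory of period 1 is left at 0, which is what the cumsum approach implies (the question leaves `<formula>` unspecified).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Site": ["California"] * 6,
    "Product": ["Apples"] * 3 + ["Oranges"] * 3,
    "Period": [1, 2, 3, 1, 2, 3],
    "Inflow": [0] * 6,
    "Outflow": [3226, 3279, 4390, 505, 278, 167],
    "Production": [4300, 3876, 4530, 400, 505, 278],
    "New Opening Inventory": [1213, 0, 0, 0, 0, 0],
    "Closing Inventory": [0] * 6,
})
keys = ["Site", "Product"]

def vectorized(frame):
    """The list comprehension + cumsum + shift steps, applied to a copy."""
    out = frame.copy()
    out["Closing Inventory"] = [
        x + y - z if p > 1 else 0
        for p, x, y, z in zip(out["Period"], out["Production"], out["Inflow"], out["Outflow"])
    ]
    out["Closing Inventory"] = out.groupby(keys)["Closing Inventory"].cumsum()
    previous = out.groupby(keys)["Closing Inventory"].shift().fillna(0)
    out["New Opening Inventory"] = (
        out["New Opening Inventory"].replace(0, np.nan).combine_first(previous).astype(int)
    )
    return out

def looped(frame):
    """Row-by-row reference, with period 1 closing inventory left at 0."""
    parts = []
    for _, group in frame.groupby(keys):
        temp = group.reset_index(drop=True)
        for i in range(1, len(temp)):
            temp.loc[i, "New Opening Inventory"] = temp.loc[i - 1, "Closing Inventory"]
            temp.loc[i, "Closing Inventory"] = (
                temp.loc[i, "Production"] + temp.loc[i, "Inflow"]
                + temp.loc[i, "New Opening Inventory"] - temp.loc[i, "Outflow"]
            )
        parts.append(temp)
    return pd.concat(parts, ignore_index=True)

fast, slow = vectorized(df), looped(df)
cols = ["New Opening Inventory", "Closing Inventory"]
print((fast[cols].to_numpy() == slow[cols].to_numpy()).all())
```

Note that the comparison relies on the rows already being sorted by group and period, as they are in the sample data.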