Get value from previous column for each group in groupby

Question

Here is my df:

Site	Product	Period	Outflow	Production	Opening Inventory	New Opening Inventory
California	Apples	1	3226	4300	1213	1213
California	Apples	2	3279	3876	0	0
California	Apples	3	4390	4530	0	0
California	Apples	4	4281	3870	0	0
California	Apples	5	4421	4393	0	0
California	Oranges	1	505	400	0	0
California	Oranges	2	278	505	0	0
California	Oranges	3	167	278	0	0
California	Oranges	4	124	167	0	0
California	Oranges	5	106	124	0	0
Montreal	Maple Syrup	1	445	465	293	293
Montreal	Maple Syrup	2	82	398	0	0
Montreal	Maple Syrup	3	745	346	0	0
Montreal	Maple Syrup	4	241	363	0	0
Montreal	Maple Syrup	5	189	254	0	0

As you can see, grouping by Site and Product gives three groups. For each of the three groups, I want to do the following (for periods 2 through 5):

Set New Opening Inventory to the Closing Inventory of the previous period
Calculate the Closing Inventory for the next period using the formula Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

I am trying to solve this with a combination of groupby and a for loop.

Here is what I have so far:

If df were a single group, I could simply do

# calculate closing inventory of period 1
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)

for i in range(1, len(df)):
    df.loc[i, 'New Opening Inventory'] = df.loc[i-1, 'Closing Inventory']
    df.loc[i, 'Closing Inventory'] = df.loc[i, 'Production'] + df.loc[i, 'Inflow'] + df.loc[i, 'New Opening Inventory'] - df.loc[i, 'Outflow']

When I nest this for loop inside a loop over the groups, I get

# calculate closing inventory of period 1 for all groups
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)

g = df.groupby(['Site', 'Product'])

alist = []

for k in g.groups.keys():
    temp = g.get_group(k).reset_index(drop=True)
    for i in range(1, len(temp)):
        temp.loc[i, 'New Opening Inventory'] = temp.loc[i-1, 'Closing Inventory']
        temp.loc[i, 'Closing Inventory'] = temp.loc[i, 'Production'] + temp.loc[i, 'Inflow'] + temp.loc[i, 'New Opening Inventory'] - temp.loc[i, 'Outflow']
    alist.append(temp)

df2 = pd.concat(alist, ignore_index=True)
df2

This solution works, but the nested loop seems very inefficient. Is there a better way?

Answer 1

Your New Opening Inventory is always the previous Closing Inventory.

So the formula can be changed from

Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

to

Closing Inventory = Production + Inflow + Previous Closing Inventory - Outflow

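For the first row of each group, you do not have a Closing Inventory. From row 2 onward, you calculate the Closing Inventory and carry it forward to the next row.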
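To see why carrying the closing value forward is the same as keeping a running total, here is a minimal sketch on plain Python lists. The numbers are the net changes (Production + Inflow - Outflow) for the California/Apples group, assuming Inflow is 0 because the table in the question does not show that column; the variable names are only illustrative.

```python
from itertools import accumulate

# Net change per period for California/Apples, with Inflow assumed to be 0
# and period 1 set to 0 (it has no previous closing inventory).
net_change = [0, 597, 140, -411, -28]

# Row-by-row recurrence: closing[i] = closing[i-1] + net_change[i]
closing_loop = []
previous_closing = 0
for change in net_change:
    previous_closing = previous_closing + change
    closing_loop.append(previous_closing)

# The same values expressed as a running total
closing_running = list(accumulate(net_change))

print(closing_loop)     # [0, 597, 737, 326, 298]
print(closing_running)  # [0, 597, 737, 326, 298]
```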

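Before getting the Closing Inventory, first calculate 'Production' + 'Inflow' - 'Outflow' using a list comprehension. A list comprehension performs better than an explicit for loop.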

df['Closing Inventory'] = [x + y - z if p > 1 else 0 for p, x, y, z in zip(df['Period'], df['Production'], df['Inflow'], df['Outflow'])]

# df[['Site', 'Product', 'Closing Inventory']]
#         Site  Product Closing Inventory
# 0 California  Apples                  0
# 1 California  Apples                597
# 2 California  Apples                140
# 3 California  Apples               -411
# 4 California  Apples                -28
# 5 California  Oranges                 0
# 6 California  Oranges               227
# 7 California  Oranges               111
# ...
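As an alternative sketch, the same per-row quantity can be produced with column arithmetic and `Series.where` instead of a list comprehension. The sample frame below is a cut-down copy of the question's data, and the all-zero `Inflow` column is an assumption, since the question's table does not include it.

```python
import pandas as pd

# Small sample; the Inflow column is an assumption (all zeros), because the
# table in the question does not show it.
df = pd.DataFrame({
    "Site": ["California"] * 6,
    "Product": ["Apples"] * 3 + ["Oranges"] * 3,
    "Period": [1, 2, 3, 1, 2, 3],
    "Outflow": [3226, 3279, 4390, 505, 278, 167],
    "Production": [4300, 3876, 4530, 400, 505, 278],
    "Inflow": [0] * 6,
})

# List-comprehension version, as in the answer
net_listcomp = [x + y - z if p > 1 else 0
                for p, x, y, z in zip(df["Period"], df["Production"],
                                      df["Inflow"], df["Outflow"])]

# Column arithmetic; keep the value where Period > 1 and use 0 elsewhere
net_vectorized = (df["Production"] + df["Inflow"] - df["Outflow"]).where(df["Period"] > 1, 0)

print(net_vectorized.tolist())  # [0, 597, 140, 0, 227, 111]
```

Both versions give the same values here; which one is faster depends on the size of the frame, so it is worth timing on your own data.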

The rest of the formula just adds the previously calculated Closing Inventory, which means you can take the cumsum of this result.

For row 1, Previous Closing (0) + calculated part (597) = 597
For row 2, Previous Closing (597) + calculated part (140) = 737
...

df['Closing Inventory'] = df.groupby(['Site', 'Product'])['Closing Inventory'].cumsum()

# df[['Site', 'Product', 'Closing Inventory']]
#         Site  Product Closing Inventory
# 0 California  Apples                  0
# 1 California  Apples                597
# 2 California  Apples                737
# 3 California  Apples                326
# 4 California  Apples                298
# 5 California  Oranges                 0
# 6 California  Oranges               227
# 7 California  Oranges               338
# ...
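A self-contained sketch of the grouped cumsum step, showing that the running total restarts for each (Site, Product) group. The `Net Change` column name is hypothetical, and its values are the per-period net changes from the sample data with Inflow assumed to be 0.

```python
import pandas as pd

# Per-period net change (period 1 already set to 0), derived from the
# sample data with Inflow assumed to be 0. "Net Change" is an illustrative name.
df = pd.DataFrame({
    "Site": ["California"] * 10,
    "Product": ["Apples"] * 5 + ["Oranges"] * 5,
    "Net Change": [0, 597, 140, -411, -28, 0, 227, 111, 43, 18],
})

# cumsum runs independently inside each (Site, Product) group
df["Closing Inventory"] = df.groupby(["Site", "Product"])["Net Change"].cumsum()

print(df["Closing Inventory"].tolist())
# [0, 597, 737, 326, 298, 0, 227, 338, 381, 399]
```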

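Likewise, New Opening Inventory is always the previous Closing Inventory, except when Period is 1. So first shift the Closing Inventory within each group, then keep the existing New Opening Inventory where Period is 1.

I use combine_first to pick the value from either the New Opening Inventory or the shifted Closing Inventory.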

df['New Opening Inventory'] = (df['New Opening Inventory'].replace(0, np.nan)
                               .combine_first(
                                   df.groupby(['Site', 'Product'])['Closing Inventory']
                                   .shift()
                                   .fillna(0)
                               ).astype(int))
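The same step, broken into named intermediate values so each part can be inspected. This is only a sketch on a reduced frame: the Closing Inventory values are the ones computed earlier (with Inflow assumed to be 0), and `previous_closing` and `known_opening` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Site": ["California"] * 10,
    "Product": ["Apples"] * 5 + ["Oranges"] * 5,
    "New Opening Inventory": [1213, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Closing Inventory": [0, 597, 737, 326, 298, 0, 227, 338, 381, 399],
})

# Previous period's closing inventory within each group; the first row of
# each group has no predecessor, so shift() gives NaN, which is filled with 0.
previous_closing = df.groupby(["Site", "Product"])["Closing Inventory"].shift().fillna(0)

# Treat 0 as "missing" so that combine_first falls back to previous_closing there.
known_opening = df["New Opening Inventory"].replace(0, np.nan)

df["New Opening Inventory"] = known_opening.combine_first(previous_closing).astype(int)

print(df["New Opening Inventory"].tolist())
# [1213, 0, 597, 737, 326, 0, 0, 227, 338, 381]
```

Note that a period-1 opening inventory that really is 0 is also treated as missing, but it falls back to the filled value 0, so the result is unchanged.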

Result

          Site  Product Period  New Opening Inventory Closing Inventory
0   California  Apples       1                   1213                 0
1   California  Apples       2                      0               597
2   California  Apples       3                    597               737
3   California  Apples       4                    737               326
4   California  Apples       5                    326               298
5   California  Oranges      1                      0                 0
6   California  Oranges      2                      0               227
7   California  Oranges      3                    227               338
...

On my laptop, with the sample data:

Original solution: 8.44 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution: 2.95 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I think this solution runs faster thanks to the list comprehension and the vectorized functions.
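The timings above are in the format printed by IPython's `%timeit`. Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The function name `closing_inventory` and the reduced sample frame (with an assumed all-zero `Inflow` column) are illustrative, and the numbers you get will depend on your machine and data size.

```python
import timeit

import pandas as pd

# Reduced sample; the Inflow column is an assumption (all zeros).
df = pd.DataFrame({
    "Site": ["California"] * 6,
    "Product": ["Apples"] * 3 + ["Oranges"] * 3,
    "Period": [1, 2, 3, 1, 2, 3],
    "Outflow": [3226, 3279, 4390, 505, 278, 167],
    "Production": [4300, 3876, 4530, 400, 505, 278],
    "Inflow": [0] * 6,
})

def closing_inventory(frame):
    """List comprehension followed by a grouped cumsum, on a copy of the frame."""
    out = frame.copy()
    out["Closing Inventory"] = [
        x + y - z if p > 1 else 0
        for p, x, y, z in zip(out["Period"], out["Production"],
                              out["Inflow"], out["Outflow"])
    ]
    out["Closing Inventory"] = out.groupby(["Site", "Product"])["Closing Inventory"].cumsum()
    return out

# 7 runs of 100 loops each, mirroring the %timeit settings quoted above
timings = timeit.repeat(lambda: closing_inventory(df), repeat=7, number=100)
per_loop_ms = [t / 100 * 1000 for t in timings]
print(f"best of 7 runs: {min(per_loop_ms):.3f} ms per loop")
```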

Tags: python, dataframe, pandas, pandas-groupby