Python: Cumulative Sum on Multiple IDs with Missing Rows

Question

I have a large dataset with 104 unique dates and 200,000 SKUs. For this explanation I use 3 SKUs and 4 dates.

The data looks like this:

 Date      SKU        Demand      Supply
 20160501   1            10          10
 20160508   1            35          20
 20160501   2            20          15
 20160508   2            15          20
 20160522   2            5           0
 20160522   3            55          45

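A row is present only when demand or supply is non-zero. I want to compute cumulative demand and supply while giving every ID a continuous date range, adding rows with 0 for the missing dates.

My desired output looks like this: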

Date       SKU        Demand      Supply    Cum_Demand    Cum_Supply
20160501     1         10         10         10            10
20160508     1         35         20         45            30
20160515     1         0          0          45            30
20160522     1         0          0          45            30
20160501     2         20         15         20            15
20160508     2         15         20         35            35
20160515     2         0          0          35            35
20160522     2         5          0          40            35
20160501     3         0          0          0             0
20160508     3         0          0          0             0
20160515     3         0          0          0             0
20160522     3         55         45         55            45

Code for the DataFrame:

import pandas as pd

data = pd.DataFrame({'Date':[20160501,20160508,20160501,20160508,20160522,20160522],
                 'SKU':[1,1,2,2,2,3],
                 'Demand':[10,35,20,15,5,55],
                 'Supply':[10,20,15,20,0,45]}
                ,columns=['Date', 'SKU', 'Demand', 'Supply'])

Answer 1

You need to reindex first, then use groupby + cumsum, and concatenate the result back onto the original columns:

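import pandas as pd

# Every (Date, SKU) combination; the dates are listed explicitly in chronological order.
idx = pd.MultiIndex.from_product([[20160501, 20160508, 20160515, 20160522],
                                  data.SKU.unique()], names=['Date', 'SKU'])
# If every needed date already appears in the Date column, then (with numpy imported as np):
# idx = pd.MultiIndex.from_product([np.unique(data.Date), data.SKU.unique()], names=['Date', 'SKU'])

# Add the missing rows and fill them with 0.
data2 = data.set_index(['Date', 'SKU']).reindex(idx).fillna(0)
# Per-SKU cumulative sums, prefixed with 'Cum_', placed next to the original columns.
# axis=1 is passed by keyword; newer pandas versions reject it as a positional argument.
data2 = (pd.concat([data2, data2.groupby(level=1).cumsum().add_prefix('Cum_')], axis=1)
           .sort_index(level=1).reset_index())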

Output `data2`:

        Date  SKU  Demand  Supply  Cum_Demand  Cum_Supply
0   20160501    1    10.0    10.0        10.0        10.0
1   20160508    1    35.0    20.0        45.0        30.0
2   20160515    1     0.0     0.0        45.0        30.0
3   20160522    1     0.0     0.0        45.0        30.0
4   20160501    2    20.0    15.0        20.0        15.0
5   20160508    2    15.0    20.0        35.0        35.0
6   20160515    2     0.0     0.0        35.0        35.0
7   20160522    2     5.0     0.0        40.0        35.0
8   20160501    3     0.0     0.0         0.0         0.0
9   20160508    3     0.0     0.0         0.0         0.0
10  20160515    3     0.0     0.0         0.0         0.0
11  20160522    3    55.0    45.0        55.0        45.0

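You need to be careful with the dates. In this case I listed them explicitly in order, so the earlier dates come first. If they are numeric, you can use np.unique, which sorts the values and so guarantees the dates are ordered. But that relies on every date appearing at least once in the DataFrame. Otherwise you will need to build your ordered list of dates some other way.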
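One way to build that ordered list when some dates never appear in the data is to generate it from the first and last dates. The following is only a sketch: it assumes the dates are spaced exactly 7 days apart, as in the sample, and the names `all_dates` and `date_list` are illustrative rather than part of the original answer.

```python
import pandas as pd

# Same sample data as in the question.
data = pd.DataFrame({'Date': [20160501, 20160508, 20160501, 20160508, 20160522, 20160522],
                     'SKU': [1, 1, 2, 2, 2, 3],
                     'Demand': [10, 35, 20, 15, 5, 55],
                     'Supply': [10, 20, 15, 20, 0, 45]})

# Parse the integer dates so pandas can step from one week to the next.
dates = pd.to_datetime(data['Date'].astype(str), format='%Y%m%d')

# Assumption: one date every 7 days, starting at the earliest date in the data.
all_dates = pd.date_range(dates.min(), dates.max(), freq='7D')

# Convert back to the integer YYYYMMDD form used in the Date column.
date_list = [int(d) for d in all_dates.strftime('%Y%m%d')]

# Full (Date, SKU) grid, including 20160515, which is absent from the data.
idx = pd.MultiIndex.from_product([date_list, data['SKU'].unique()],
                                 names=['Date', 'SKU'])
```

`idx` can then be passed to `reindex` exactly as in the code above.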

Answer 2

First convert the Date column to datetime (df here is the DataFrame from the question):

df.Date = pd.to_datetime(df.Date, format='%Y%m%d')

You can then build a weekly pd.date_range from the existing dates:

ix = pd.date_range(df.Date.min(), df.Date.max() + pd.DateOffset(1), freq="W")

The next step is to group by SKU and reindex each group to the date range just created, choosing the fill method per column: ffill and bfill for the NaNs in SKU, and 0 for Demand and Supply.

df1 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['SKU']])
                          .ffill().bfill().reset_index(0, drop=True))
df2 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['Demand','Supply']])
                          .fillna(0).reset_index(0, drop=True))

The last step is to concatenate the two DataFrames and take the cumsum of Demand and Supply:

df_final = pd.concat([df2,df1],axis=1)

(df_final.assign(**df_final.groupby('SKU')
    .agg({'Demand':'cumsum','Supply':'cumsum'})
    .add_prefix('cum_')))

            SKU   Demand  Supply    cum_Demand  cum_Supply
2016-05-01  1.0    10.0    10.0        10.0        10.0
2016-05-08  1.0    35.0    20.0        45.0        30.0
2016-05-15  1.0     0.0     0.0        45.0        30.0
2016-05-22  1.0     0.0     0.0        45.0        30.0
2016-05-01  2.0    20.0    15.0        20.0        15.0
2016-05-08  2.0    15.0    20.0        35.0        35.0
2016-05-15  2.0     0.0     0.0        35.0        35.0
2016-05-22  2.0     5.0     0.0        40.0        35.0
2016-05-01  3.0     0.0     0.0         0.0         0.0
2016-05-08  3.0     0.0     0.0         0.0         0.0
2016-05-15  3.0     0.0     0.0         0.0         0.0
2016-05-22  3.0    55.0    45.0        55.0        45.0
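The same steps can also be sketched without `groupby().apply`, by reindexing once against a full (SKU, Date) grid. This is an alternative arrangement rather than the answer's own code; it assumes a 7-day spacing between dates and a unique (SKU, Date) pair per row, and the names `full`, `out`, `cum` and `result` are illustrative.

```python
import pandas as pd

# Same sample data as in the question, with Date parsed as datetime.
df = pd.DataFrame({'Date': [20160501, 20160508, 20160501, 20160508, 20160522, 20160522],
                   'SKU': [1, 1, 2, 2, 2, 3],
                   'Demand': [10, 35, 20, 15, 5, 55],
                   'Supply': [10, 20, 15, 20, 0, 45]})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')

# Assumption: one date every 7 days between the first and last observed dates.
ix = pd.date_range(df['Date'].min(), df['Date'].max(), freq='7D')

# Every (SKU, Date) pair, ordered by SKU and then by date.
full = pd.MultiIndex.from_product([df['SKU'].unique(), ix], names=['SKU', 'Date'])

# Missing pairs become rows filled with 0.
out = df.set_index(['SKU', 'Date']).reindex(full, fill_value=0)

# Running totals within each SKU, joined next to the original columns.
cum = out.groupby(level='SKU').cumsum().add_prefix('cum_')
result = out.join(cum).reset_index()
```

Because SKU lives in the index here, it never needs the ffill/bfill step, and the grid for the full dataset (104 dates by 200,000 SKUs) is built in a single reindex.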
