Function to generate cumulative sum rows based on date (Pandas)

I have a dataset that looks like this:

market  product date     value
germany a       2020-01  4
germany a       2020-02  1
germany a       2020-03  6
germany a       2020-04  3
germany b       2020-01  15
germany b       2020-02  19
germany b       2020-03  11
france  a       2020-02  31
france  a       2020-03  25
france  a       2020-04  24
france  a       2020-05  29 

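Grouping by market and product, I want to generate every combination of cumulative values by date. The boundaries of each cumsum are given by the columns date_start and date_end, where date_end >= date_start.

The output should look like this: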

market  product date_start date_end cumulative_value
germany a       2020-01    2020-01  4
germany a       2020-01    2020-02  5
germany a       2020-01    2020-03  11
germany a       2020-01    2020-04  14
germany a       2020-02    2020-02  1
germany a       2020-02    2020-03  7
germany a       2020-02    2020-04  10
germany a       2020-03    2020-03  6
germany a       2020-03    2020-04  9
germany a       2020-04    2020-04  3

germany b       2020-01    2020-01  15
germany b       2020-01    2020-02  34
germany b       2020-01    2020-03  45
germany b       2020-02    2020-02  19
germany b       2020-02    2020-03  30
germany b       2020-03    2020-03  11

france  a       2020-02    2020-02  31
france  a       2020-02    2020-03  56
france  a       2020-02    2020-04  80
france  a       2020-02    2020-05  109
france  a       2020-03    2020-03  25
france  a       2020-03    2020-04  49
france  a       2020-03    2020-05  78
france  a       2020-04    2020-04  24
france  a       2020-04    2020-05  53
france  a       2020-05    2020-05  29

Any suggestions are much appreciated.

You can do it like this:

df['cumulative_value'] = df.groupby(['market', 'product', 'date_start'])['value'].cumsum()

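For the demonstration I changed the data a bit (a single date_start with consecutive date_end values, already sorted by date_end), but this is what you get: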

     market product date_start date_end  value  cumulative_value
0   germany       a    2020-01  2020-01      4                 4
1   germany       a    2020-01  2020-02      1                 5
2   germany       a    2020-01  2020-03      6                11
3   germany       a    2020-01  2020-04      3                14
4   germany       a    2020-01  2020-05     15                29
5   germany       a    2020-01  2020-06     19                48
6   germany       a    2020-01  2020-07     11                59
7   germany       a    2020-01  2020-08     31                90
8   germany       a    2020-01  2020-09     25               115
9   germany       a    2020-01  2020-10     24               139
10  germany       a    2020-01  2020-11     29               168
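The one-liner above assumes the frame already has date_start and date_end columns. As a minimal sketch of one way to get there from the question's data, the block below pairs each row with every date of the same market and product through a self-merge, then applies the same grouped cumsum. The self-merge and the name `pairs` are assumptions made for illustration, not part of the answer's own code.

```python
import pandas as pd

df = pd.DataFrame({
    'market': ['germany'] * 7 + ['france'] * 4,
    'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
    'date': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-01', '2020-02',
             '2020-03', '2020-02', '2020-03', '2020-04', '2020-05'],
    'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29],
})

# Pair every row with every date of the same market/product.
# The row's own date becomes date_end, the paired date becomes date_start.
pairs = df.merge(df[['market', 'product', 'date']],
                 on=['market', 'product'], suffixes=('_end', '_start'))

# Drop windows that would end before they start.
# 'YYYY-MM' strings compare correctly as plain text.
pairs = pairs[pairs['date_start'] <= pairs['date_end']]

# Within each window start, accumulate values in date_end order.
pairs = pairs.sort_values(['market', 'product', 'date_start', 'date_end'])
pairs['cumulative_value'] = (
    pairs.groupby(['market', 'product', 'date_start'])['value'].cumsum()
)

print(pairs[['market', 'product', 'date_start', 'date_end', 'cumulative_value']])
```

The sort matters here: cumsum follows row order, so each group has to be in ascending date_end order before it is applied.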

Functions:

import numpy as np
import pandas as pd


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))


def remove_negative_horizons(df, date_start, date_end):
    df = df[date_start <= date_end]
    return df


def generate_dates(df):
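    # Columns after the product: 0 = value, 1 = date of that value (date_end), 2 = paired date (date_start)
    df = cartesian_product_multi(df[['value', 'date']], pd.DataFrame(df['date']))
    df = remove_negative_horizons(df, df[2], df[1])  # Keep rows where date_start (2) <= date_end (1).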
    return df


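def compute_cumulative_sum(df):
    df = df.sort_values(by=['market', 'product', 'date_start', 'date_end'], ascending=True)
    # np.column_stack leaves 'value' as object dtype, so convert it back to numbers first.
    df['value'] = pd.to_numeric(df['value'])
    # transform keeps the original index, so the result lines up with df row by row.
    df['cumsum'] = df.groupby(['market', 'product', 'date_start'])['value'].transform(cumulative_sum)
    return df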


def cumulative_sum(series):
    return series.cumsum()
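To see what `cartesian_product` contributes, here is a small sketch that runs the same function on two tiny index arrays. The inputs `np.arange(2)` and `np.arange(3)` are chosen only for illustration; in the answer the inputs are the row positions of each frame.

```python
import numpy as np


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    # np.ix_ reshapes each input so that broadcasting fills every combination.
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


# Every (i, j) pair with i in {0, 1} and j in {0, 1, 2}; the first index varies slowest.
combos = cartesian_product(np.arange(2), np.arange(3))
print(combos.tolist())
# [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]
```

Each row of the result is a pair of row positions, which `cartesian_product_multi` then uses to pick rows from the two frames side by side.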

Transforming the df:

df = pd.DataFrame({'market': ['germany', 'germany', 'germany', 'germany', 'germany', 'germany', 'germany',
                              'france', 'france', 'france', 'france'],
                  'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
                  'date': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-01', '2020-02', '2020-03', 
                           '2020-02', '2020-03', '2020-04', '2020-05'],
                  'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29]})

df = df.groupby(['market', 'product']).apply(generate_dates).reset_index()
df = df.drop(columns = ['level_2']).rename(columns = {0: 'value', 1: 'date_end', 2: 'date_start'})
df = df.reindex(columns=['market', 'product', 'value', 'date_start', 'date_end', 'cumsum'])

df = compute_cumulative_sum(df=df)
print(df)

Result:


market  product value date_start date_end cumsum
france        a    31    2020-02  2020-02     31
france        a    25    2020-02  2020-03     56
france        a    24    2020-02  2020-04     80
france        a    29    2020-02  2020-05    109
france        a    25    2020-03  2020-03     25
france        a    24    2020-03  2020-04     49
france        a    29    2020-03  2020-05     78
france        a    24    2020-04  2020-04     24
france        a    29    2020-04  2020-05     53
france        a    29    2020-05  2020-05     29
germany       a     4    2020-01  2020-01      4
germany       a     1    2020-01  2020-02      5
germany       a     6    2020-01  2020-03     11
germany       a     3    2020-01  2020-04     14
germany       a     1    2020-02  2020-02      1
germany       a     6    2020-02  2020-03      7
germany       a     3    2020-02  2020-04     10
germany       a     6    2020-03  2020-03      6
germany       a     3    2020-03  2020-04      9
germany       a     3    2020-04  2020-04      3
germany       b    15    2020-01  2020-01     15
germany       b    19    2020-01  2020-02     34
germany       b    11    2020-01  2020-03     45
germany       b    19    2020-02  2020-02     19
germany       b    11    2020-02  2020-03     30
germany       b    11    2020-03  2020-03     11
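Any row of this result can be checked against the definition in the question: the cumulative value for a window is the sum of value over all dates between date_start and date_end, inclusive. A minimal sketch of that check follows; the helper name `window_sum` is hypothetical and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'market': ['germany'] * 7 + ['france'] * 4,
    'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
    'date': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-01', '2020-02',
             '2020-03', '2020-02', '2020-03', '2020-04', '2020-05'],
    'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29],
})


def window_sum(df, market, product, date_start, date_end):
    """Sum value for one market/product with date_start <= date <= date_end."""
    mask = (
        (df['market'] == market)
        & (df['product'] == product)
        & df['date'].between(date_start, date_end)  # inclusive on both ends
    )
    return df.loc[mask, 'value'].sum()


print(window_sum(df, 'germany', 'a', '2020-02', '2020-04'))  # 10
print(window_sum(df, 'france', 'a', '2020-03', '2020-05'))   # 78
```

Both values match the corresponding rows in the result table above.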