Function to generate cumulative sum rows based on date (Pandas)
I have a dataset that looks like this:
market product date value
germany a 2020-01 4
germany a 2020-02 1
germany a 2020-03 6
germany a 2020-04 3
germany b 2020-01 15
germany b 2020-02 19
germany b 2020-03 11
france a 2020-02 31
france a 2020-03 25
france a 2020-04 24
france a 2020-05 29
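Grouping by market and product, I want to generate the cumulative value for every combination of dates. The boundaries of each cumsum are given by the columns date_start and date_end, where date_end >= date_start.
The output should look like this: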
market product date_start date_end cumulative_value
germany a 2020-01 2020-01 4
germany a 2020-01 2020-02 5
germany a 2020-01 2020-03 11
germany a 2020-01 2020-04 14
germany a 2020-02 2020-02 1
germany a 2020-02 2020-03 7
germany a 2020-02 2020-04 10
germany a 2020-03 2020-03 6
germany a 2020-03 2020-04 9
germany a 2020-04 2020-04 3
germany b 2020-01 2020-01 15
germany b 2020-01 2020-02 34
germany b 2020-01 2020-03 45
germany b 2020-02 2020-02 19
germany b 2020-02 2020-03 30
germany b 2020-03 2020-03 11
france a 2020-02 2020-02 31
france a 2020-02 2020-03 56
france a 2020-02 2020-04 80
france a 2020-02 2020-05 109
france a 2020-03 2020-03 25
france a 2020-03 2020-04 49
france a 2020-03 2020-05 78
france a 2020-04 2020-04 24
france a 2020-04 2020-05 53
france a 2020-05 2020-05 29
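Any suggestions are greatly appreciated.
You can do it like this:
df['cumulative_value'] = df.groupby(['market', 'product', 'date_start'])['value'].cumsum()
I changed the data a bit for the demonstration, but this is what you get: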
market product date_start date_end value cumulative_value
0 germany a 2020-01 2020-01 4 4
1 germany a 2020-01 2020-02 1 5
2 germany a 2020-01 2020-03 6 11
3 germany a 2020-01 2020-04 3 14
4 germany a 2020-01 2020-05 15 29
5 germany a 2020-01 2020-06 19 48
6 germany a 2020-01 2020-07 11 59
7 germany a 2020-01 2020-08 31 90
8 germany a 2020-01 2020-09 25 115
9 germany a 2020-01 2020-10 24 139
10 germany a 2020-01 2020-11 29 168
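The one-liner above assumes the frame already has date_start and date_end columns, which the original data does not. A minimal sketch of one way to build them first, using a self-merge within each (market, product) group; the names keys, starts, ends, pairs and result are only illustrative, and the sketch assumes dates are unique within a group:

```python
import pandas as pd

df = pd.DataFrame({
    'market': ['germany'] * 7 + ['france'] * 4,
    'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
    'date': ['2020-01', '2020-02', '2020-03', '2020-04',
             '2020-01', '2020-02', '2020-03',
             '2020-02', '2020-03', '2020-04', '2020-05'],
    'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29],
})

keys = ['market', 'product']

# Pair every date of a group with every row of the same group.
starts = df[keys + ['date']].rename(columns={'date': 'date_start'})
ends = df.rename(columns={'date': 'date_end'})
pairs = starts.merge(ends, on=keys)

# Drop windows that run backwards; 'YYYY-MM' strings sort chronologically.
pairs = pairs[pairs['date_end'] >= pairs['date_start']]

# Within each window start, accumulate the values in date_end order.
pairs = pairs.sort_values(keys + ['date_start', 'date_end'])
pairs['cumulative_value'] = pairs.groupby(keys + ['date_start'])['value'].cumsum()

result = pairs[keys + ['date_start', 'date_end', 'cumulative_value']].reset_index(drop=True)
print(result)
```

The merge grows quadratically with the number of dates per group, so this is only a reasonable choice when groups are small.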
Functions:
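import numpy as np
import pandas as pd


def cartesian_product(*arrays):
    # Every combination of the input arrays, one combination per row.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_multi(*dfs):
    # Cross join of the frames, built from the cartesian product of their row positions.
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))


def remove_negative_horizons(df, date_start, date_end):
    # Keep only the rows whose window does not run backwards.
    df = df[date_start <= date_end]
    return df


def generate_dates(df):
    # Resulting columns: 0 = value, 1 = date (becomes date_end), 2 = date (becomes date_start).
    df = cartesian_product_multi(df[['value', 'date']], pd.DataFrame(df['date']))
    df = remove_negative_horizons(df, df[2], df[1])  # This is the right order.
    return df


def cumulative_sum(series):
    return series.cumsum()


def compute_cumulative_sum(df):
    df = df.sort_values(by=['market', 'product', 'date_start', 'date_end'], ascending=True)
    # np.column_stack leaves 'value' as object dtype, so convert it back to numbers.
    df['value'] = pd.to_numeric(df['value'])
    # transform keeps the original index, so the result aligns with df on assignment.
    df['cumsum'] = df.groupby(['market', 'product', 'date_start'])['value'].transform(cumulative_sum)
    return df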
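To make the index trick in cartesian_product_multi easier to follow, here is a small sketch of what cartesian_product returns for two short position arrays. The function is copied from the answer so the snippet runs on its own; the inputs np.arange(2) and np.arange(3) are just illustrative:

```python
import numpy as np


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    # np.ix_ reshapes each input so that assignment broadcasts it along its own axis.
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


# Row positions of a 2-row frame paired with row positions of a 3-row frame.
idx = cartesian_product(np.arange(2), np.arange(3))
print(idx.tolist())
# [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]
```

Each row of idx holds one (row of first frame, row of second frame) pair, which is what cartesian_product_multi uses to pick rows from df.values.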
Transforming df:
df = pd.DataFrame({'market': ['germany', 'germany', 'germany', 'germany', 'germany', 'germany', 'germany',
                              'france', 'france', 'france', 'france'],
                   'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
                   'date': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-01', '2020-02', '2020-03',
                            '2020-02', '2020-03', '2020-04', '2020-05'],
                   'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29]})
df = df.groupby(['market', 'product']).apply(generate_dates).reset_index()
df = df.drop(columns = ['level_2']).rename(columns = {0: 'value', 1: 'date_end', 2: 'date_start'})
df = df.reindex(columns=['market', 'product', 'value', 'date_start', 'date_end', 'cumsum'])
df = compute_cumulative_sum(df=df)
print(df)
Result:
market product value date_start date_end cumsum
france a 31 2020-02 2020-02 31
france a 25 2020-02 2020-03 56
france a 24 2020-02 2020-04 80
france a 29 2020-02 2020-05 109
france a 25 2020-03 2020-03 25
france a 24 2020-03 2020-04 49
france a 29 2020-03 2020-05 78
france a 24 2020-04 2020-04 24
france a 29 2020-04 2020-05 53
france a 29 2020-05 2020-05 29
germany a 4 2020-01 2020-01 4
germany a 1 2020-01 2020-02 5
germany a 6 2020-01 2020-03 11
germany a 3 2020-01 2020-04 14
germany a 1 2020-02 2020-02 1
germany a 6 2020-02 2020-03 7
germany a 3 2020-02 2020-04 10
germany a 6 2020-03 2020-03 6
germany a 3 2020-03 2020-04 9
germany a 3 2020-04 2020-04 3
germany b 15 2020-01 2020-01 15
germany b 19 2020-01 2020-02 34
germany b 11 2020-01 2020-03 45
germany b 19 2020-02 2020-02 19
germany b 11 2020-02 2020-03 30
germany b 11 2020-03 2020-03 11
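As a sanity check on the table above, the same numbers can be reproduced with a plain nested loop that keeps a running total for each start date. This is only a brute-force sketch for verification, not a replacement for the vectorised code; the names rows, running and expected are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'market': ['germany'] * 7 + ['france'] * 4,
    'product': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
    'date': ['2020-01', '2020-02', '2020-03', '2020-04',
             '2020-01', '2020-02', '2020-03',
             '2020-02', '2020-03', '2020-04', '2020-05'],
    'value': [4, 1, 6, 3, 15, 19, 11, 31, 25, 24, 29],
})

rows = []
# sort=False keeps the groups in their order of first appearance.
for (market, product), group in df.groupby(['market', 'product'], sort=False):
    group = group.sort_values('date')
    dates = group['date'].tolist()
    values = group['value'].tolist()
    for i in range(len(dates)):
        running = 0
        # Extend the window one date at a time, starting at dates[i].
        for j in range(i, len(dates)):
            running += values[j]
            rows.append((market, product, dates[i], dates[j], running))

expected = pd.DataFrame(
    rows, columns=['market', 'product', 'date_start', 'date_end', 'cumulative_value'])
print(expected)
```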