Aggregating an ordered DataFrame based on conditional accumulated previous values within a group (pandas)

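I have an ordered dataframe that I am trying to aggregate by some grouping columns, based on the accumulated previous values of other columns.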

import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1'],
                   'Value1': [0,1,1,1,1,1,0,0,0],
                   'Value2': [1,2,3,4,5,4,3,2,2]})


df
        ID  Group   Value1  Value2
    0   ID1 Group1    0       1
    1   ID1 Group2    1       2
    2   ID1 Group2    1       3
    3   ID1 Group2    1       4
    4   ID1 Group1    1       5
    5   ID2 Group2    1       4
    6   ID2 Group2    0       3
    7   ID2 Group2    0       2
    8   ID2 Group1    0       2

I want to aggregate Value1 and Value2 in three different ways, grouped by ID and Group. df is already sorted (by date, ID and Group).

Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)

Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)

Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)

Here is my desired output:

    ID  Group   Value1  Value2  Output1 Output2 Output3
0   ID1 Group1    0       1        0      1       NaN
1   ID1 Group2    1       2        0      2       NaN
2   ID1 Group2    1       3        1      5        2
3   ID1 Group2    1       4        2      9        5
4   ID1 Group1    1       5        0      6       NaN 
5   ID2 Group2    1       4        0      4       NaN
6   ID2 Group2    0       3        1      7        4
7   ID2 Group2    0       2        1      9        4
8   ID2 Group1    0       2        0      2       NaN

To make sure it is clear what I am trying to do, let's look at the output at index 3 (the fourth row):

3   ID1 Group2    1       4        2      9        5

Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.

Output2 = 9 because the sum of Value2 over all rows above it in ID1/Group2, including the row itself, is 2 + 3 + 4 = 9.

Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 is 2 + 3 = 5.
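The three definitions can also be spelled out as a deliberately naive loop. This is only a sketch for checking faster solutions against the desired output on small data; `naive_outputs` is a hypothetical helper name, and its O(n^2) row scanning is not suitable for a large dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1'] * 5 + ['ID2'] * 4,
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

def naive_outputs(frame):
    """Hypothetical reference helper: slow, for verification only."""
    out1, out2, out3 = [], [], []
    for pos in range(len(frame)):
        row = frame.iloc[pos]
        # Earlier rows that share this row's ID and Group
        prev = frame.iloc[:pos]
        prev = prev[(prev['ID'] == row['ID']) & (prev['Group'] == row['Group'])]
        hits = prev[prev['Value1'] == 1]
        out1.append(len(hits))                            # previous 1s, row excluded
        out2.append(prev['Value2'].sum() + row['Value2']) # running sum, row included
        # NaN when no previous row in the group has Value1 == 1
        out3.append(hits['Value2'].sum() if len(hits) else np.nan)
    return pd.DataFrame({'Output1': out1, 'Output2': out2, 'Output3': out3},
                        index=frame.index)

expected = naive_outputs(df)
print(expected.loc[3])
```

For index 3 this should reproduce the values walked through above (2, 9 and 5).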

I would like to add that I am working with a large dataset, so I am looking for an efficient, high-performance solution.

You can add a masked column for the third output and compute a grouped, shifted cumulative sum:

import numpy as np

# dictionary of shift values
d_shift = {'Value1': 1, 'Value3': 1}
# dictionary of fill values
d_fill  = {'Value1': 0}

df[['Output1', 'Output2', 'Output3']] = (df
 .assign(Value3=df['Value2'].where(df['Value1'].eq(1)))
 .groupby(['ID', 'Group'])
 .transform(lambda x: x.shift(d_shift.get(x.name, 0),
                              fill_value=d_fill.get(x.name, np.nan)).cumsum())
)

Or, in a more linear form:

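# Value3 keeps Value2 only where Value1 == 1 (NaN elsewhere)
g = (df.assign(Value3=df['Value2'].mask(df['Value1'].ne(1)))
       .groupby(['ID', 'Group'])
     )
# transform keeps the original index, so the results align with df on assignment
df['Output1'] = g['Value1'].transform(lambda s: s.shift(fill_value=0).cumsum())
df['Output2'] = g['Value2'].cumsum()
df['Output3'] = g['Value3'].transform(lambda s: s.shift().cumsum())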

Output:

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      NaN
8  ID2  Group1       0       2        0        2      NaN
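Note that index 7 shows NaN for Output3 here, while the desired output in the question shows 4. The sketch below traces the ID2/Group2 rows step by step to show where that NaN comes from, and one possible adjustment (a forward fill after the cumulative sum). The variable names are illustrative only, and in the grouped code the forward fill would have to happen inside each group.

```python
import pandas as pd

# Value1 / Value2 of the ID2/Group2 rows (indices 5, 6, 7) from the sample data
value1 = pd.Series([1, 0, 0], index=[5, 6, 7])
value2 = pd.Series([4, 3, 2], index=[5, 6, 7])

masked = value2.where(value1.eq(1))  # keep Value2 only where Value1 == 1
shifted = masked.shift()             # exclude the current row
running = shifted.cumsum()           # cumsum skips NaN but leaves it in place
filled = running.ffill()             # carry the last running total forward

print(pd.DataFrame({'masked': masked, 'shifted': shifted,
                    'running': running, 'filled': filled}))
```

The `running` column is NaN at index 7 because the shifted value there is NaN, whereas `filled` carries the previous total of 4 forward.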

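Solution

  • For Output1 and Output2: we can use groupby + cumsum.
  • For Output3: this calculation is a bit trickier. First mask the values in column Value2 where the corresponding value in column Value1 is 0, then group the masked column and compute its cumulative sum with cumsum. To exclude the current row, subtract the masked column from the cumulative sum.
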
g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']
df['Output2'] = g['Value2'].cumsum()

s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
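One thing to be aware of: the subtraction approach yields 0 rather than NaN for rows that have no earlier Value1 == 1 row in their group. If NaN is wanted there, a possible fully vectorized variant is sketched below. It assumes Value1 only ever contains 0 or 1, and the names `keys`, `prev_ones`, `prev_sum` and `output3` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1'] * 5 + ['ID2'] * 4,
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

keys = [df['ID'], df['Group']]

# Value2 counted only where Value1 == 1 (assumes Value1 is 0 or 1)
s = df['Value2'] * df['Value1']

# Number of previous rows in the group with Value1 == 1 (this is Output1)
prev_ones = df['Value1'].groupby(keys).cumsum() - df['Value1']

# Sum of qualifying Value2 over previous rows; 0 when there are none
prev_sum = s.groupby(keys).cumsum() - s

# Replace the 0 with NaN wherever no previous row qualified
output3 = prev_sum.where(prev_ones > 0)
print(output3)
```

This avoids a Python-level function call per group, which may matter on a large dataset, though it is worth timing on the real data.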

Update based on the new requirement in the comments:

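def transform(g):
    g = g.copy()  # work on a copy so the group passed in by apply is not mutated
    g['Output1'] = g['Value1'].cumsum() - g['Value1']
    g['Output2'] = g['Value2'].cumsum()

    cond = g['Value1'].eq(1)
    # masked running sum, shifted to exclude the current row, then carried forward
    g['Output3'] = g['Value2'].mask(~cond).cumsum().shift().ffill()
    return g


# apply returns a new frame, so assign it back; group_keys=False keeps the original index
df = df.groupby(['ID', 'Group'], group_keys=False).apply(transform)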

Result

print(df)

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      4.0
8  ID2  Group1       0       2        0        2      NaN