Aggregation of an ordered dataframe based on conditional accumulated previous values in a group (pandas)
I have an ordered dataframe that I'm trying to aggregate by some grouping columns, based on the accumulated previous values of other columns.
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID2', 'ID2'],
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1', 'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})
df
    ID   Group  Value1  Value2
0  ID1  Group1       0       1
1  ID1  Group2       1       2
2  ID1  Group2       1       3
3  ID1  Group2       1       4
4  ID1  Group1       1       5
5  ID2  Group2       1       4
6  ID2  Group2       0       3
7  ID2  Group2       0       2
8  ID2  Group1       0       2
I would like to aggregate in three different ways using Value1 and Value2, grouped by ID and Group.
df is already ordered (by date, ID and Group).
Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)
Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)
Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)
Here's my desired output:
    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5        2
3  ID1  Group2       1       4        2        9        5
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7        4
7  ID2  Group2       0       2        1        9        4
8  ID2  Group1       0       2        0        2      NaN
To make sure it's clear what I'm trying to do, let's look at the output at index 3 (the fourth row):
3  ID1  Group2       1       4        2        9        5
Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.
Output2 = 9 because the sum of Value2 of all rows above it in ID1/Group2, including the row itself, is 2 + 3 + 4 = 9.
Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 is 2 + 3 = 5.
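To pin down these three definitions, here is a brute-force sketch that computes them row by row. It is only meant as a reference for the semantics (it is quadratic and far too slow for large data), and the helper name brute_force is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID2', 'ID2'],
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

def brute_force(frame):
    """Hypothetical reference helper: O(n^2), for checking semantics only."""
    rows = []
    for i in range(len(frame)):
        cur = frame.iloc[i]
        upto = frame.iloc[: i + 1]  # rows 0..i, current row included
        upto = upto[(upto['ID'] == cur['ID']) & (upto['Group'] == cur['Group'])]
        prev = upto.iloc[:-1]       # same group, current row excluded
        prev_ones = prev[prev['Value1'] == 1]
        rows.append({
            'Output1': len(prev_ones),
            'Output2': upto['Value2'].sum(),
            # NaN when no previous row in the group has Value1 == 1
            'Output3': prev_ones['Value2'].sum() if len(prev_ones) else float('nan'),
        })
    return pd.DataFrame(rows, index=frame.index)

expected = brute_force(df)
print(df.join(expected))
```

Any faster solution can be compared against expected on a small sample before running it on the full dataset.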
I'd add that I'm dealing with a large dataset, so I'm looking for an efficient, high-performance solution.
You can add a masked column for the third output and compute a grouped, shifted cumulative sum:
import numpy as np
# dictionary of shift values
d_shift = {'Value1': 1, 'Value3': 1}
# dictionary of fill values
d_fill = {'Value1': 0}
df[['Output1', 'Output2', 'Output3']] = (df
    .assign(Value3=df['Value2'].where(df['Value1'].eq(1)))
    .groupby(['ID', 'Group'])
    .transform(lambda x: x.shift(d_shift.get(x.name, 0),
                                 fill_value=d_fill.get(x.name, np.nan)).cumsum())
)
Or, in a more linear form:
g = (df.assign(Value3=df['Value2'].mask(df['Value1'].ne(1)))
       .groupby(['ID', 'Group'])
)
# transform (rather than apply) keeps the original index, so the assignment aligns
df['Output1'] = g['Value1'].transform(lambda s: s.shift(fill_value=0).cumsum())
df['Output2'] = g['Value2'].cumsum()
df['Output3'] = g['Value3'].transform(lambda s: s.shift().cumsum())
Output:
    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      NaN
8  ID2  Group1       0       2        0        2      NaN
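Row 7 comes out as NaN here, whereas the desired output in the question has 4. The sketch below walks through the mask, shift and cumsum steps on the ID2/Group2 rows alone to show where that NaN comes from. The standalone Series and the final ffill step are illustrative additions, not part of the code above.

```python
import pandas as pd

# Rows 5-7 of the example (ID2/Group2), rebuilt as standalone Series
value1 = pd.Series([1, 0, 0], index=[5, 6, 7])
value2 = pd.Series([4, 3, 2], index=[5, 6, 7])

masked = value2.where(value1.eq(1))  # keep Value2 only where Value1 == 1
shifted = masked.shift()             # move down one row to exclude the current row
running = shifted.cumsum()           # cumsum leaves NaN at positions that were NaN
filled = running.ffill()             # carry the last known total forward

print(running.tolist())  # [nan, 4.0, nan]
print(filled.tolist())   # [nan, 4.0, 4.0]
```

So a forward fill within each group appears to be what is needed to get the 4 on row 7.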
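Solution
- For Output1 and Output2: we can use groupby + cumsum.
- For Output3: this is a slightly tricky calculation, because you first have to mask the values in column Value2 where the corresponding value in column Value1 is 0, then group the masked column and compute the cumulative sum with cumsum. To exclude the current row, you can subtract the masked column from the cumulative sum.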
g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']
df['Output2'] = g['Value2'].cumsum()
s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
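Update based on the new requirement in the comments:
def transform(g):
    g = g.copy()  # work on a copy instead of mutating the group pandas passes in
    g['Output1'] = g['Value1'].cumsum() - g['Value1']
    g['Output2'] = g['Value2'].cumsum()
    cond = g['Value1'].eq(1)
    g['Output3'] = g['Value2'].mask(~cond).cumsum().shift().ffill()
    return g

# assign the result back so that print(df) shows the new columns;
# group_keys=False keeps the original row index
df = df.groupby(['ID', 'Group'], group_keys=False).apply(transform)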
Result
print(df)
    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      4.0
8  ID2  Group1       0       2        0        2      NaN
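Since the question asks for performance on a large dataset, it may be worth avoiding groupby.apply with a Python function altogether. Below is a minimal sketch that uses only grouped cumsum calls, assuming Value1 only ever holds 0 or 1. The intermediate names (keys, ones, out1, out2, out3, result) are illustrative, and no timings are claimed here.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID2', 'ID2', 'ID2', 'ID2'],
    'Group': ['Group1', 'Group2', 'Group2', 'Group2', 'Group1',
              'Group2', 'Group2', 'Group2', 'Group1'],
    'Value1': [0, 1, 1, 1, 1, 1, 0, 0, 0],
    'Value2': [1, 2, 3, 4, 5, 4, 3, 2, 2],
})

keys = [df['ID'], df['Group']]
ones = df['Value1'].eq(1).astype(int)

# Output1: 1s seen so far in the group, minus the current row's own contribution
out1 = ones.groupby(keys).cumsum() - ones

# Output2: running total of Value2 in the group, current row included
out2 = df['Value2'].groupby(keys).cumsum()

# Output3: running total of Value2 over rows with Value1 == 1, current row
# excluded, and NaN while no previous row in the group had Value1 == 1
s = df['Value2'].where(ones.eq(1), 0)
out3 = (s.groupby(keys).cumsum() - s).where(out1 > 0)

result = df.assign(Output1=out1, Output2=out2, Output3=out3)
print(result)
```

Using out1 > 0 as the mask is what turns the leading zeros into NaN, so no ffill is needed.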