Pandas: number rows within group cumulatively and across another group
Given the following data frame:
    col_1  col_2  col_3
0       1      A      1
1       1      B      1
2       2      A      3
3       2      A      3
4       2      A      3
5       2      B      3
6       2      B      3
7       2      B      3
8       3      A      2
9       3      A      2
10      3      C      2
11      3      C      2
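I need to create a new column in which rows are numbered cumulatively within each group formed by 'col_1' and 'col_2', with the numbering also carrying on cumulatively after each 'col_1' group, like this: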
    col_1  col_2  col_3    new
0       1      A      1      1
1       1      B      1      1
2       2      A      3      2
3       2      A      3      3
4       2      A      3      4
5       2      B      3      2
6       2      B      3      3
7       2      B      3      4
8       3      A      2      5
9       3      A      2      6
10      3      C      2      5
11      3      C      2      6
I tried:
df['new'] = df.groupby(['col_1', 'col_2']).cumcount() + 1
But this does not carry the count over from the previous group as expected.
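For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame and runs the attempt. The literal values are copied from the table above; the variable name `attempt` is only illustrative. It shows that the numbering restarts at 1 in every ('col_1', 'col_2') group instead of continuing from the previous 'col_1' group.

```python
import pandas as pd

# Sample frame, values copied from the question's table
df = pd.DataFrame({
    "col_1": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
    "col_3": [1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
})

# The attempt: 1-based position inside each (col_1, col_2) group
attempt = df.groupby(["col_1", "col_2"]).cumcount() + 1
print(attempt.tolist())
# [1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2]
```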
You can use two successive groupby operations: one on both columns, and a second on the first grouping column, col_1:
# classical cumcount per group
count1 = df.groupby(['col_1', 'col_2']).cumcount().add(1)
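# offset per col_1 group: cumulative max count of the preceding groups
g = count1.groupby(df['col_1']) # (*) read below
# fillna(..., downcast=...) is deprecated in recent pandas, so cast explicitly
count2 = g.ngroup().map(g.max().cumsum()).fillna(0).astype(int)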
# add the two
df['new'] = count1+count2
# Note (*): if df['col_1'] does not hold the consecutive integers 1, 2, 3, ...,
# group on consecutive runs of col_1 instead:
# group = df['col_1'].ne(df['col_1'].shift()).cumsum()
# g = count1.groupby(group)
Output:
    col_1  col_2  col_3    new
0       1      A      1      1
1       1      B      1      1
2       2      A      3      2
3       2      A      3      3
4       2      A      3      4
5       2      B      3      2
6       2      B      3      3
7       2      B      3      4
8       3      A      2      5
9       3      A      2      6
10      3      C      2      5
11      3      C      2      6
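The two-groupby idea can also be wrapped in a small helper that does not depend on the labels being 1/2/3. This is only a sketch: the function name `number_rows` and its parameters are hypothetical, and it assumes that all rows of one 'col_1' group are contiguous. It labels consecutive runs of the outer column, so the labels themselves can be arbitrary.

```python
import pandas as pd

def number_rows(df, outer="col_1", inner="col_2"):
    """Hypothetical helper: cumulative numbering within (outer, inner) groups,
    continuing after each outer group. Assumes outer groups are contiguous."""
    # Label consecutive runs of `outer` as 1, 2, 3, ... whatever the values are
    run = df[outer].ne(df[outer].shift()).cumsum()
    # 1-based position of each row inside its (run, inner) group
    within = df.groupby([run, df[inner]]).cumcount() + 1
    # Largest within-group count per run, accumulated over the runs
    cum_max = within.groupby(run).max().cumsum()
    # Offset of a run = total accumulated by the runs before it (0 for the first)
    offset = (run - 1).map(cum_max).fillna(0).astype(int)
    return within + offset

# Same shape as the question's data, but with non-numeric outer labels
demo = pd.DataFrame({
    "col_1": ["x", "x", "k", "k", "k", "k", "k", "k", "b", "b", "b", "b"],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
})
demo["new"] = number_rows(demo)
print(demo["new"].tolist())
# [1, 1, 2, 3, 4, 2, 3, 4, 5, 6, 5, 6]
```

Grouping on the run labels rather than on the raw values keeps the lookup `(run - 1).map(cum_max)` valid, because the runs are always numbered 1, 2, 3, ... in order of appearance.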
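This is a tricky problem. You want the cumcount within each group, but for every subsequent group you also need to keep track of how far the count has already advanced, so that you know which offset to apply. This can be done with a max + cumsum of this cumcount over the previous groups. The only complication is that you need to determine the relationship between each group label and the label of the group before it, in case the labels of successive groups are not related by a simple + 1 increment.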
# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()
# Determine how many cumcounts were within all previous groups of `col_1`
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')
# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)
# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')
# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1.
# fillna(..., downcast=...) is deprecated in recent pandas, so cast explicitly.
df['new'] = df['new'].fillna(1).astype(int) + s
# Clean up merging column
df = df.drop(columns='col_1_shift')
    col_1  col_2  col_3    new
0       1      A      1      1
1       1      B      1      1
2       2      A      3      2
3       2      A      3      3
4       2      A      3      4
5       2      B      3      2
6       2      B      3      3
7       2      B      3      4
8       3      A      2      5
9       3      A      2      6
10      3      C      2      5
11      3      C      2      6
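As a possible simplification of the same max + cumsum idea, the merge can be avoided by shifting the cumulative sum by one group and mapping it back onto 'col_1'. This is a sketch rather than the approach used above; the names `group_max` and `start` are illustrative, and it assumes the rows are already ordered so that the 'col_1' groups appear in the order in which the numbering should continue.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
    "col_3": [1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
})

# 0-based position inside each (col_1, col_2) group
s = df.groupby(["col_1", "col_2"]).cumcount()
# Largest 1-based count per col_1 group, kept in order of first appearance
group_max = s.add(1).groupby(df["col_1"], sort=False).max()
# Start value of each group: 1 plus everything accumulated by earlier groups
start = group_max.cumsum().shift(fill_value=0).add(1)
# Look up each row's start value by its col_1 label, then add its position
df["new"] = df["col_1"].map(start) + s
print(df["new"].tolist())
# [1, 1, 2, 3, 4, 2, 3, 4, 5, 6, 5, 6]
```

Because `start` is indexed by the 'col_1' labels themselves, the lookup does not rely on any arithmetic relationship between successive labels.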