Find top N values within each group

I have a dataset similar to the example below:

| id | size   | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6  | small  | 3     | 0     | 21    | 0     |
| 6  | small  | 9     | 0     | 23    | 0     |
| 13 | medium | 3     | 0     | 12    | 0     |
| 13 | medium | 37    | 0     | 20    | 1     |
| 20 | medium | 30    | 0     | 5     | 6     |
| 20 | medium | 12    | 2     | 3     | 0     |
| 12 | small  | 7     | 0     | 2     | 0     |
| 10 | small  | 8     | 0     | 12    | 0     |
| 15 | small  | 19    | 0     | 3     | 0     |
| 15 | small  | 54    | 0     | 8     | 0     |
| 87 | medium | 6     | 0     | 9     | 0     |
| 90 | medium | 11    | 1     | 16    | 0     |
| 90 | medium | 25    | 0     | 4     | 0     |
| 90 | medium | 10    | 0     | 5     | 0     |
| 9  | large  | 8     | 1     | 23    | 0     |
| 9  | large  | 19    | 0     | 2     | 0     |
| 1  | large  | 1     | 0     | 0     | 0     |
| 50 | large  | 34    | 0     | 7     | 0     |

Here is the code that builds the table above:

import pandas as pd

rows = [[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data = pd.DataFrame(rows, columns=['id','size','old_a','old_b','new_a','new_b'])

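I want an output that groups the dataset by size and lists the top 2 IDs in each size group, ranked by the value of the 'new_a' column. Because some IDs appear more than once, I want to sum the new_a values for each ID first and then find the top 2. My final table should look like this: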

| size   | id | new_a |
|--------|----|-------|
| large  | 9  | 25    |
| large  | 50 | 7     |
| medium | 13 | 32    |
| medium | 90 | 25    |
| small  | 6  | 44    |
| small  | 10 | 12    |

I tried the code below, but it does not return the top 2 new_a values for each group in the 'size' column.

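nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()

Group by size first, then within each group sum new_a per id and keep the two largest sums:

print(
    data.groupby('size').apply(
        # Select only new_a so the string column 'size' is not summed as well
        lambda x: x.groupby('id')[['new_a']].sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)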

Prints:

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12
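
As for why the attempt in the question falls short: calling nlargest(2) directly on the summed Series ranks every (size, id) pair together, so it returns the two largest sums overall instead of two per size. The sketch below illustrates this on a reduced subset of the question's data; the names sums, overall and per_size are made up for the illustration.

```python
import pandas as pd

# Reduced subset of the question's data (only the columns that matter here).
data = pd.DataFrame(
    [[6, "small", 21], [6, "small", 23], [10, "small", 12],
     [13, "medium", 12], [13, "medium", 20], [90, "medium", 16],
     [9, "large", 23], [9, "large", 2], [50, "large", 7]],
    columns=["id", "size", "new_a"],
)

# One summed value per (size, id) pair.
sums = data.groupby(["size", "id"])["new_a"].sum()

# nlargest on the whole Series ranks all pairs together: 2 rows in total.
overall = sums.nlargest(2)

# Grouping by the 'size' index level first ranks within each size:
# up to 2 rows per size.
per_size = sums.groupby(level="size").nlargest(2)

print(len(overall), len(per_size))
```

With three sizes in the subset, the first call yields 2 rows and the second yields 6, which is the shape the question asks for.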

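You can also set size and id as the index and aggregate by index level. On pandas versions before 2.0 the inner step could be written as x.sum(level=1), which avoids the second groupby. The level argument of Series.sum was deprecated in 1.3 and removed in 2.0, so groupby(level=1).sum() is used here:

data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    # x is the new_a Series for one size; level 1 of its index is id
    lambda x: x.groupby(level=1).sum().nlargest(2)
).reset_index()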

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12

You can chain two groupby calls:

data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()

Output:

   id    size  new_a
0   9   large     25
1  50   large      7
2  13  medium     32
3  90  medium     25
4   6   small     44
5  10   small     12
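
Since the title asks about the top N rather than the top 2, here is a minimal sketch that wraps the same two steps (sum per id, then rank within each size) in a function with N as a parameter. The function name top_n_per_group and its parameter names are hypothetical, and it uses sort_values followed by groupby(...).head(n) instead of nlargest, which is a different technique from the answers above. How ties are ordered is not something this sketch tries to control.

```python
import pandas as pd


def top_n_per_group(frame, group_col, id_col, value_col, n=2):
    """Sum value_col per (group_col, id_col), then keep the n largest sums per group."""
    # One row per (group, id) pair with the summed value.
    sums = frame.groupby([group_col, id_col], as_index=False)[value_col].sum()
    # Largest sums first within each group.
    ranked = sums.sort_values([group_col, value_col], ascending=[True, False])
    # head(n) keeps the first n rows of every group in the sorted order.
    return ranked.groupby(group_col).head(n).reset_index(drop=True)


# The question's data, reduced to the three columns used here.
data = pd.DataFrame(
    [[6, "small", 21], [6, "small", 23], [13, "medium", 12], [13, "medium", 20],
     [20, "medium", 5], [20, "medium", 3], [12, "small", 2], [10, "small", 12],
     [15, "small", 3], [15, "small", 8], [87, "medium", 9], [90, "medium", 16],
     [90, "medium", 4], [90, "medium", 5], [9, "large", 23], [9, "large", 2],
     [1, "large", 0], [50, "large", 7]],
    columns=["id", "size", "new_a"],
)

result = top_n_per_group(data, "size", "id", "new_a", n=2)
print(result)
```

Passing n=1 or n=3 changes how many ids are kept per size without touching the rest of the pipeline.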