聚合为最多两个元素的列表

aggregate as list with max two elements

给定用户 table 如下:

   user       query
0    a1      orange
1    a1  strawberry
2    a1        pear
3    a2      orange
4    a2  strawberry
5    a2       lemon
6    a3      orange
7    a3      banana
8    a6        meat
9    a7        beer
10   a8       juice

我想按 user 分组并汇总为 query 的列表,如果超过两项,则选择前两项,预期结果是

  user                        query
0   a1         [orange, strawberry]
1   a2         [orange, strawberry]
2   a3             [orange, banana]
3   a6                       [meat]
4   a7                       [beer]
5   a8                      [juice]

使用下面的代码

df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 
                                  4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 
                                  8: 'a6', 9: 'a7', 10: 'a8'}, 
                         'query': {0: 'orange', 1: 'strawberry', 
                                   2: 'pear', 3: 'orange', 4: 'strawberry', 
                                   5: 'lemon', 6: 'orange', 7: 'banana', 
                                   8: 'meat', 9: 'beer', 10: 'juice'}} )

print(df_user.groupby(['user'], as_index=False).agg(list))

我成功了

  user                        query
0   a1   [orange, strawberry, pear]
1   a2  [orange, strawberry, lemon]
2   a3             [orange, banana]
3   a6                       [meat]
4   a7                       [beer]
5   a8                      [juice]

获得预期结果的好方法是什么?

这是一种方法:

out = df[df.groupby('user').cumcount()<2].groupby('user', as_index=False).agg(list)

输出:

  user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]
​

您可以使用 iloc 分割最多 2 个项目:

df_user.groupby(['user'], as_index=False).agg(lambda s: s.iloc[:2].to_list())

输出:

  user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]

您可以使用 groupby + nth() 到 select 每个组中的元素(如果存在):

new_df = df.groupby('user').nth([0, 1]).groupby(level=0).agg(list)

输出:

>>> new_df
                     query
user                      
a1    [orange, strawberry]
a2    [orange, strawberry]
a3        [orange, banana]
a6                  [meat]
a7                  [beer]
a8                 [juice]

请注意,如果您不想输入所有这些数字,list(range(2)) 会比 [0, 1] 更动态:)