Aggregate as a list with at most two elements
Given a user table as follows:
   user       query
0    a1      orange
1    a1  strawberry
2    a1        pear
3    a2      orange
4    a2  strawberry
5    a2       lemon
6    a3      orange
7    a3      banana
8    a6        meat
9    a7        beer
10   a8       juice
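I want to group by user and aggregate query into a list, keeping only the first two items when a group has more than two. The expected result is:
  user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]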
Using the code below:
import pandas as pd

df_user = pd.DataFrame({'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2',
                                 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3',
                                 8: 'a6', 9: 'a7', 10: 'a8'},
                        'query': {0: 'orange', 1: 'strawberry',
                                  2: 'pear', 3: 'orange', 4: 'strawberry',
                                  5: 'lemon', 6: 'orange', 7: 'banana',
                                  8: 'meat', 9: 'beer', 10: 'juice'}})
print(df_user.groupby(['user'], as_index=False).agg(list))
I get:
  user                       query
0   a1  [orange, strawberry, pear]
1   a2 [orange, strawberry, lemon]
2   a3            [orange, banana]
3   a6                      [meat]
4   a7                      [beer]
5   a8                     [juice]
What is a good way to get the expected result?
Here is one way:
out = df_user[df_user.groupby('user').cumcount() < 2].groupby('user', as_index=False).agg(list)
Output:
  user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]
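To make the cumcount() filter above easier to follow, here is a minimal sketch that splits the one-liner into steps. The names position and mask are introduced only for illustration, and the frame is rebuilt from plain lists rather than the dicts used in the question.

```python
import pandas as pd

df_user = pd.DataFrame({
    "user": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a6", "a7", "a8"],
    "query": ["orange", "strawberry", "pear", "orange", "strawberry", "lemon",
              "orange", "banana", "meat", "beer", "juice"],
})

# cumcount() numbers the rows inside each user group, starting at 0:
# 0, 1, 2, 0, 1, 2, 0, 1, 0, 0, 0
position = df_user.groupby("user").cumcount()

# Keep only the rows at positions 0 and 1 of their group
mask = position < 2

# Aggregate what is left into one list per user
out = df_user[mask].groupby("user", as_index=False).agg(list)
print(out)
```

Because the rows are dropped before aggregating, groups with fewer than two rows pass through unchanged.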
You can use iloc to slice each group down to at most 2 items:
df_user.groupby(['user'], as_index=False).agg(lambda s: s.iloc[:2].to_list())
Output:
  user                 query
0   a1  [orange, strawberry]
1   a2  [orange, strawberry]
2   a3      [orange, banana]
3   a6                [meat]
4   a7                [beer]
5   a8               [juice]
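A closely related variant, not taken from the answers here, is to trim each group with GroupBy.head(2) before aggregating, which avoids the Python-level lambda. This is only a sketch under the same sample data; the name trimmed is an assumption made for illustration.

```python
import pandas as pd

df_user = pd.DataFrame({
    "user": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a6", "a7", "a8"],
    "query": ["orange", "strawberry", "pear", "orange", "strawberry", "lemon",
              "orange", "banana", "meat", "beer", "juice"],
})

# head(2) keeps the first two rows of every group, in original row order
trimmed = df_user.groupby("user").head(2)

# The trimmed frame still has the 'user' column, so it can be grouped again
out = trimmed.groupby("user", as_index=False).agg(list)
print(out)
```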
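You can use groupby + nth() to select the elements at the given positions in each group (if they exist):
new_df = df_user.groupby('user').nth([0, 1]).groupby('user').agg(list)
(Regrouping by 'user' rather than by level=0 keeps this working on pandas 2.0 and later, where nth() returns the selected rows with their original index.)
Output:
>>> new_df
                     query
user
a1    [orange, strawberry]
a2    [orange, strawberry]
a3        [orange, banana]
a6                  [meat]
a7                  [beer]
a8                 [juice]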
Note that if you don't want to type out all those positions, list(range(2)) is more dynamic than [0, 1] :)
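The range-based remark above could be wrapped in a small helper so that the limit becomes a parameter. This is a sketch only: first_n_as_list and its arguments are hypothetical names, not part of pandas.

```python
import pandas as pd

def first_n_as_list(df, key, n):
    """Hypothetical helper: keep the first n rows per group, then collect lists."""
    # nth() accepts a list of positions; positions missing from a group are skipped
    picked = df.groupby(key).nth(list(range(n)))
    # Regroup by the key name (a column or an index level, depending on the
    # pandas version) and aggregate the remaining columns into lists
    return picked.groupby(key).agg(list)

df_user = pd.DataFrame({
    "user": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a6", "a7", "a8"],
    "query": ["orange", "strawberry", "pear", "orange", "strawberry", "lemon",
              "orange", "banana", "meat", "beer", "juice"],
})

top_two = first_n_as_list(df_user, "user", 2)
top_one = first_n_as_list(df_user, "user", 1)
print(top_two)
```

The result is indexed by user, as in the nth() answer; call reset_index() on it if a flat frame is preferred.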