How to select row with max value in column from pandas groupby() groups?
I have a table like this:
import pandas as pd
df = pd.DataFrame(
[
['john', 'rdgsdr', 2, 'A'],
['ann', 'dsdfds', 3, 'A'],
['john', 'jkfgdj', 1, 'B'],
['bob', 'xcxfcd', 5, 'A'],
['john', 'uityuu', 3, 'C'],
['ann', 'werwwe', 2, 'C'],
],
columns=['name', 'stuff', 'orders', 'store']
)
# df
# name stuff orders store
# 0 john rdgsdr 2 A
# 1 ann dsdfds 3 A
# 2 john jkfgdj 1 B
# 3 bob xcxfcd 5 A
# 4 john uityuu 3 C
# 5 ann werwwe 2 C
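For each name, I need to extract the row with the maximum number of orders, and also compute the list of all stores for that name. Like this: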
grouped = df.groupby('name')
for name, group in grouped:
    print('-'*5, name, '-'*5)
    print(group)
# ----- ann -----
# name stuff orders store
# 1 ann dsdfds 3 A <- max(orders) for ann
# 5 ann werwwe 2 C
# ----- bob -----
# name stuff orders store
# 3 bob xcxfcd 5 A <- max(orders) for bob
# ----- john -----
# name stuff orders store
# 0 john rdgsdr 2 A
# 2 john jkfgdj 1 B
# 4 john uityuu 3 C <- max(orders) for john
# ##########################
# This is what I want to get
# ##########################
>>> result
name stuff max orders all stores
1 ann dsdfds 3 A,C
3 bob xcxfcd 5 A
4 john uityuu 3 A,B,C
I tried this:
result = grouped.agg(
**{
# 'stuff': 'stuff',
'max orders': pd.NamedAgg('orders', max),
'all stores': pd.NamedAgg('store', lambda s: s.str.join(',')),
}
)
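But I don't know how to include the 'stuff' column in the result (in my real application I have many such additional columns, possibly dozens). Also, the join gives me lists instead of strings: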
>>> result
name max orders all stores
0 ann 3 [A, C]
1 bob 5 A
2 john 3 [A, B, C]
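One likely contributor to the list-like output is the difference between `Series.str.join` and the built-in `str.join`. The sketch below (with a small stand-in Series, `stores`, that is not part of the original post) shows that `s.str.join(',')` works element by element and returns a Series, while `','.join(s)` collapses the whole group into one string.

```python
import pandas as pd

# Stand-in for the 'store' values of a single group (e.g. ann).
stores = pd.Series(['A', 'C'])

# Series.str.join joins the characters *inside each element*,
# so the result is still a Series with one entry per row.
elementwise = stores.str.join(',')

# The built-in str.join iterates over the Series and returns one string.
joined = ','.join(stores)

print(elementwise.tolist())
print(joined)
```

Passing `','.join` directly as the aggregation function therefore gives one string per group.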
Try this first:
out = df.set_index('stuff').groupby('name').agg(
    stuff=('orders', 'idxmax'),
    max_orders=('orders', 'max'),
    all_stores=('store', ','.join),
)  # .reset_index()
Out[200]:
stuff max_orders all_stores
name
ann dsdfds 3 A,C
bob xcxfcd 5 A
john uityuu 3 A,B,C
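The `set_index('stuff')` trick carries over only one extra column. For the case of many additional columns mentioned in the question, a possible variation is to use `idxmax` on the original index and pull the whole rows with `df.loc`. This is a minimal sketch, not the answerer's code; the names `best_idx`, `best_rows`, and `all_stores` are made up for illustration, and it assumes the DataFrame index is unique.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'rdgsdr', 2, 'A'],
        ['ann', 'dsdfds', 3, 'A'],
        ['john', 'jkfgdj', 1, 'B'],
        ['bob', 'xcxfcd', 5, 'A'],
        ['john', 'uityuu', 3, 'C'],
        ['ann', 'werwwe', 2, 'C'],
    ],
    columns=['name', 'stuff', 'orders', 'store'],
)

# Index label of the row with the largest 'orders' within each name
# (on ties, idxmax returns the first occurrence).
best_idx = df.groupby('name')['orders'].idxmax()

# Select those rows whole, so every additional column comes along.
best_rows = (
    df.loc[best_idx]
    .drop(columns='store')
    .rename(columns={'orders': 'max_orders'})
)

# Comma-separated stores per name, as a Series indexed by name.
all_stores = df.groupby('name')['store'].agg(','.join).rename('all_stores')

# Attach the store list to each selected row by matching on 'name'.
result = best_rows.join(all_stores, on='name')
print(result)
```

The result keeps the original row labels (1, 3, 4 for this data), which matches the layout asked for in the question.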
You can get the list of stores each person has worked at by combining this answer with groupby.
# Get stores that each person works at
stores_for_each_name = df.groupby('name')['store'].apply(','.join)
# Get row with largest order value for each name
df = df.sort_values('orders', ascending=False).drop_duplicates('name').rename({'orders': 'max_orders'}, axis=1)
# Replace store column with comma-separated list of stores they have worked at
df = df.drop('store', axis=1)
df = df.join(stores_for_each_name, on='name')
Output:
name stuff max_orders store
3 bob xcxfcd 5 A
1 ann dsdfds 3 A,C
4 john uityuu 3 A,B,C