Multiple column groupby with pandas to find maximum value for each group
I have a dataframe like this:
Feature | value | frequency | label |
---|---|---|---|
age_45_and_above | No | 2700 | negative |
age_45_and_above | No | 1707 | positive |
age_45_and_above | No | 83 | other |
age_45_and_above | Yes | 222 | negative |
age_45_and_above | Yes | 15 | positive |
age_45_and_above | Yes | 8 | other |
age_45_and_above | [Null] | 323 | negative |
age_45_and_above | [Null] | 8 | other |
age_45_and_above | [Null] | 5 | positive |
talk | No | 20 | negative |
talk | No | 170 | positive |
talk | No | 500 | other |
talk | Yes | 210 | negative |
talk | Yes | 1500 | positive |
talk | Yes | 809 | other |
talk | [Null] | 234 | negative |
talk | [Null] | 43 | other |
talk | [Null] | 85 | positive |
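And so on.
For each Feature/value group, I want to find the row with the maximum frequency together with all of its other columns. For example, for the feature age_45_and_above and the value No there are 3 rows with different frequencies and labels; I want to report the one with the highest frequency and its related data.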
I tried groupby in different ways:
    result.groupby(['Feature', 'Value'])['Frequency', 'Predict'].max()
or this one, which gives me a multi-index dataframe that is not the result I want:
    result.groupby(['Feature', 'Value', 'Predict'])['Frequency'].max()
as well as several failed attempts with idxmax, transform and ...
The expected output I am looking for is as follows:
Feature | value | frequency | label |
---|---|---|---|
age_45_and_above | No | 2700 | negative |
age_45_and_above | Yes | 222 | negative |
age_45_and_above | [Null] | 323 | negative |
talk | No | 500 | other |
talk | Yes | 1500 | positive |
talk | [Null] | 234 | negative |
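Additionally, I would like to know how to sum the frequency of each <<Feature, value>> group excluding the max row, since I don't know how to locate the max row. For example, for the first feature and value here, <<age_45_and_above - No>>, the max is 2700, so the sum is 1707 + 83.
Thanks for your time.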
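A possible explanation for why the first attempt does not return matching rows: calling max() on several columns takes the maximum of each column independently, so the label it reports need not belong to the row with the largest frequency. The sketch below is only an illustration. It uses a small made-up frame (the three talk / No rows from the table) and the column names from the table, not the asker's `result` frame, which is not shown. Recent pandas versions also expect a list (double brackets) when selecting several columns after groupby.

```python
import pandas as pd

# Hypothetical three-row frame: the talk / No group from the table above.
df = pd.DataFrame({
    "Feature": ["talk"] * 3,
    "value": ["No"] * 3,
    "frequency": [20, 170, 500],
    "label": ["negative", "positive", "other"],
})

# max() is applied to each column separately, so 'label' becomes the
# alphabetically largest string, not the label of the max-frequency row.
per_column = df.groupby(["Feature", "value"])[["frequency", "label"]].max()
print(per_column)

# The row that actually holds the maximum frequency has label 'other'.
true_row = df.loc[df["frequency"].idxmax()]
print(true_row["label"])
```

Here the grouped result pairs 500 with 'positive', while the row with frequency 500 is labelled 'other'.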
I would do this by using merge on the grouped data.
Based on this data:
    import pandas as pd

    df = pd.DataFrame({'Feature': ['age']*9 + ['talk']*9,
                       'value': (['No']*3 + ['Yes']*3 + ['[Null]']*3)*2,
                       'frequency': [2700,1707,83,222,15,8,323,8,5,20,170,500,210,1500,809,234,43,85],
                       'label': ['N','P','O']*6})
Using:
    df_1 = df.groupby(['Feature','value'], as_index=False)['frequency'].max().merge(df, on=['Feature','value','frequency'])
    print(df_1)
Output:
      Feature   value  frequency label
    0     age      No       2700     N
    1     age     Yes        222     N
    2     age  [Null]        323     N
    3    talk      No        500     O
    4    talk     Yes       1500     P
    5    talk  [Null]        234     N
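An extra column can be added by simple assignment:
    # Positional subtraction: assumes one max row per group (no ties) and the same sorted group order on both sides.
    df_1['sum_no_max'] = df.groupby(['Feature','value'])['frequency'].sum().values - df_1['frequency'].values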
Final output:
      Feature   value  frequency label  sum_no_max
    0     age      No       2700     N        1790
    1     age     Yes        222     N          23
    2     age  [Null]        323     N          13
    3    talk      No        500     O         190
    4    talk     Yes       1500     P        1019
    5    talk  [Null]        234     N         128
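If relying on row positions feels fragile, one possible variant (a sketch, not part of the answer above) computes the group max and the group total in a single aggregation and subtracts by column after the merge, so alignment follows the keys. The names `keys`, `stats` and `total` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Feature": ["age"] * 9 + ["talk"] * 9,
    "value": (["No"] * 3 + ["Yes"] * 3 + ["[Null]"] * 3) * 2,
    "frequency": [2700, 1707, 83, 222, 15, 8, 323, 8, 5,
                  20, 170, 500, 210, 1500, 809, 234, 43, 85],
    "label": ["N", "P", "O"] * 6,
})

keys = ["Feature", "value"]

# One row per group holding the max frequency and the group total.
stats = df.groupby(keys, as_index=False).agg(
    frequency=("frequency", "max"),
    total=("frequency", "sum"),
)

# Merge on the keys plus the max frequency to recover the label,
# then subtract column from column so row order does not matter.
df_1 = stats.merge(df, on=keys + ["frequency"])
df_1["sum_no_max"] = df_1["total"] - df_1["frequency"]
df_1 = df_1.drop(columns="total")
print(df_1)
```

If a group has tied maxima, the merge returns one row per tied row, and each of them shows the group total minus its own frequency.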
Use idxmax after the groupby, inside loc.
    print(df.loc[df.groupby(['Feature','value'])['frequency'].idxmax()])
                 Feature   value  frequency     label
    0   age_45_and_above      No       2700  negative
    3   age_45_and_above     Yes        222  negative
    6   age_45_and_above  [Null]        323  negative
    11              talk      No        500     other
    13              talk     Yes       1500  positive
    15              talk  [Null]        234  negative
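To make it clearer what idxmax hands to loc, here is a small sketch on a made-up four-row frame (the values are illustrative, with a deliberate tie). idxmax returns one index label per group, and when several rows share the maximum it returns the first of them.

```python
import pandas as pd

# Hypothetical frame; the 'Yes' group contains a deliberate tie at 1500.
df = pd.DataFrame({
    "Feature": ["talk"] * 4,
    "value": ["No", "No", "Yes", "Yes"],
    "frequency": [170, 500, 1500, 1500],
    "label": ["positive", "other", "positive", "other"],
})

# One index label per (Feature, value) group, pointing at the max row.
idx = df.groupby(["Feature", "value"])["frequency"].idxmax()
print(idx.tolist())

# loc then pulls the complete rows, so 'label' stays with its frequency.
winners = df.loc[idx]
print(winners)
```

So this approach always yields exactly one row per group, whereas a merge on the max value would keep every tied row.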
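And for the sum without the max, compute each group's total, subtract each row's own frequency, then select the max rows:
    gr = df.groupby(['Feature','value'])['frequency']
    res = (
        # group total minus the row's own frequency = sum of the other rows
        df.assign(total=gr.transform('sum') - df['frequency'])
          .loc[gr.idxmax()]
    )
    print(res)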
                 Feature   value  frequency     label  total
    0   age_45_and_above      No       2700  negative   1790
    3   age_45_and_above     Yes        222  negative     23
    6   age_45_and_above  [Null]        323  negative     13
    11              talk      No        500     other    190
    13              talk     Yes       1500  positive   1019
    15              talk  [Null]        234  negative    128
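Since the question also asks how to locate the max row, another way to get the same totals (a sketch on a reduced, two-group subset of the data) is to drop the rows found by idxmax and sum whatever remains in each group. The names `max_idx` and `rest` are illustrative.

```python
import pandas as pd

# Two groups taken from the question's table.
df = pd.DataFrame({
    "Feature": ["age_45_and_above"] * 3 + ["talk"] * 3,
    "value": ["No"] * 3 + ["Yes"] * 3,
    "frequency": [2700, 1707, 83, 210, 1500, 809],
    "label": ["negative", "positive", "other"] * 2,
})

keys = ["Feature", "value"]

# Index labels of the max-frequency row in each group.
max_idx = df.groupby(keys)["frequency"].idxmax()

# Remove those rows, then sum the remaining frequencies per group.
rest = df.drop(index=max_idx.to_numpy()).groupby(keys)["frequency"].sum()
print(rest)
```

One caveat: a group consisting of a single row disappears from `rest` entirely, whereas the transform-based version above reports 0 for it.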