Pandas

Question

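I have a pandas DataFrame containing several groups: A, B and C. Each group has multiple counts associated with it, and I want to create a new column in which each count is normalised by the maximum value within its group.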

That is, this:

index, group, year, count
0, A, 2015, 1
1, A, 2016, 2
2, A, 2017, 3
3, B, 2012, 10
4, B, 2013, 14
5, B, 2014, 18
6, C, 2014, 55
7, C, 2015, 59
8, C, 2016, 58

...becomes

index, group, year, count, normalised
0, A, 2015, 1,  0.333
1, A, 2016, 2,  0.667
2, A, 2017, 3,  1.000
3, B, 2012, 10, 0.556
4, B, 2013, 14, 0.778
5, B, 2014, 18, 1.000
6, C, 2014, 55, 0.932
7, C, 2015, 59, 1.000
8, C, 2016, 58, 0.983

If I try something like...

df.assign(normalised=lambda x: x['count']/df[df['group'] == x['group']]['count'].max())

...then max returns 59, the maximum of the whole column, rather than the largest number within each group.

Answer 1

You can use groupby + transform to compute the ratio of each value to its group's maximum:

df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())

df
   index group  year  count  normalised
0      0     A  2015      1    0.333333
1      1     A  2016      2    0.666667
2      2     A  2017      3    1.000000
3      3     B  2012     10    0.555556
4      4     B  2013     14    0.777778
5      5     B  2014     18    1.000000
6      6     C  2014     55    0.932203
7      7     C  2015     59    1.000000
8      8     C  2016     58    0.983051
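
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the question's sample data (the `index` column is left out and the default RangeIndex is used instead), and the rounding is only there to make the printed output easier to compare with the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    "count": [1, 2, 3, 10, 14, 18, 55, 59, 58],
})

# transform returns a result aligned with the original rows, so each
# count is divided by the maximum of its own group only.
df["normalised"] = df["count"].groupby(df["group"]).transform(lambda x: x / x.max())

print(df.round(3))
```

Because transform keeps the original index, the result can be assigned straight back as a new column without any merge or reindexing.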

Answer 2

Similar to Psidom's answer, but it avoids the lambda and is therefore faster:

df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')

Timings

>>> %timeit df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
1.16 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> %timeit df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
1.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
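
As a quick sanity check, the sketch below confirms that the lambda-based and the string-based variants give the same values on the question's data. The variable names `via_lambda` and `via_builtin` are made up for this illustration. The timings above come from a nine-row frame, so the difference on your own data may be larger or smaller; it is worth re-measuring there.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "count": [1, 2, 3, 10, 14, 18, 55, 59, 58],
})

# Variant 1: a Python lambda applied to each group.
via_lambda = df["count"].groupby(df["group"]).transform(lambda x: x / x.max())

# Variant 2: broadcast each group's max back to the rows using the
# built-in 'max' aggregation, then do a single vectorised division.
via_builtin = df["count"] / df.groupby("group")["count"].transform("max")

# Raises AssertionError if the two results differ.
pd.testing.assert_series_equal(via_lambda, via_builtin, check_names=False)
print("both variants agree")
```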

Pandas - new column based on `max` of grouped values

python

assign