Pandas - new column based on `max` of grouped values
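I have a Pandas DataFrame with several groups: A, B, and C. Each group has several counts associated with it, and I want to create a new column in which each count is normalised by the maximum of its group.
i.e.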
index, group, year, count
0, A, 2015, 1
1, A, 2016, 2
2, A, 2017, 3
3, B, 2012, 10
4, B, 2013, 14
5, B, 2014, 18
6, C, 2014, 55
7, C, 2015, 59
8, C, 2016, 58
...becomes
index, group, year, count, normalised
0, A, 2015, 1, 0.333
1, A, 2016, 2, 0.667
2, A, 2017, 3, 1.000
3, B, 2012, 10, 0.556
4, B, 2013, 14, 0.778
5, B, 2014, 18, 1.000
6, C, 2014, 55, 0.932
7, C, 2015, 59, 1.000
8, C, 2016, 58, 0.983
If I try something like...
df.assign(normalised=lambda x: x['count'] / df[df['group'] == x['group']]['count'].max())
then `max` returns 59 rather than the largest number within each group.
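A minimal sketch of why the attempt above yields 59 for every row. The lambda passed to `assign` is called once with the whole DataFrame, so `x['group']` is the full column and the comparison is true for every row. The frame below is rebuilt from the sample data in the question; the names `mask`, `all_rows_selected`, and `global_max` are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    "count": [1, 2, 3, 10, 14, 18, 55, 59, 58],
})

# Inside assign, x is the whole frame, so this is the comparison
# that the original expression actually performs.
mask = df["group"] == df["group"]

# Every row passes the filter, so nothing is restricted to one group.
all_rows_selected = bool(mask.all())

# The maximum is therefore taken over the entire column.
global_max = int(df.loc[mask, "count"].max())

print(all_rows_selected, global_max)
```

Because the filter keeps all rows, the denominator is a single scalar for the whole frame instead of one value per group.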
You can use `groupby` + `transform` to compute the ratio of each value to its group's maximum:
df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
df
index group year count normalised
0 0 A 2015 1 0.333333
1 1 A 2016 2 0.666667
2 2 A 2017 3 1.000000
3 3 B 2012 10 0.555556
4 4 B 2013 14 0.777778
5 5 B 2014 18 1.000000
6 6 C 2014 55 0.932203
7 7 C 2015 59 1.000000
8 8 C 2016 58 0.983051
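For anyone who wants to reproduce the result above, here is a self-contained sketch. It rebuilds the frame from the question's sample data (without the extra `index` column) and rounds to three decimals for comparison with the desired output; the name `rounded` is illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    "count": [1, 2, 3, 10, 14, 18, 55, 59, 58],
})

# transform calls the lambda once per group and returns a result
# aligned with the original index, so it can be assigned directly.
df["normalised"] = df["count"].groupby(df["group"]).transform(
    lambda x: x / x.max()
)

# Rounded copy for comparison with the table in the question.
rounded = df["normalised"].round(3).tolist()
print(df)
```

The key property is that `transform` keeps the shape of its input, unlike an aggregation, which would return one row per group.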
Similar to Psidom's answer, but it avoids the `lambda` and is therefore faster:
df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
Timings
>>> %timeit df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
1.16 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> %timeit df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
1.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
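As a sketch of what the lambda-free version does step by step: `transform('max')` broadcasts each group's maximum back onto every row, and the division is then an ordinary column-wise operation. The names `group_max`, `via_lambda`, and `same_result` are illustrative, and the frame is rebuilt from the question's sample data.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    "count": [1, 2, 3, 10, 14, 18, 55, 59, 58],
})

# Per-row copy of each group's maximum, using the built-in 'max' reducer.
group_max = df.groupby("group")["count"].transform("max")

# Plain element-wise division; no Python function is called per group.
df["normalised"] = df["count"] / group_max

# Cross-check against the lambda-based approach from the other answer.
via_lambda = df["count"].groupby(df["group"]).transform(lambda x: x / x.max())
same_result = bool((df["normalised"] - via_lambda).abs().max() < 1e-12)

print(group_max.tolist(), same_result)
```

Both approaches give the same column. The timings above were taken on this nine-row frame, so the gap on larger data may differ.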