Pandas df: group, bin and average in different columns?
My data qualitatively looks like this dummy table:
speed_observation, car_brand, traction_force
10, ford, 2
20, ford, 4
35, seat, 8
50, ford, 16
10, audi, 2
20, audi, 5
43, audi, 2
12, seat, 2.5
10, ford, 0.5
30, audi, 6
23, ford, 4
17, seat, 5.5
10, seat, 10
38, audi, 2
40, ford, 9
19, ford, 6.6
49, seat, 18
18, ford, 4
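I want to group the data frame by car brand and, for each brand, bin the speed observations into ranges (e.g. [0,25] and [25,50]), then compute the mean measured traction force for each brand and bin, getting something like: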
speed_bin_upper_lim, car_brand, avrg_traction_force_in_speed_bin
25, audi, X1
50, audi, X2
25, ford, X3
50, ford, X4
25, seat, X5
50, seat, X6
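How can I do that? It should work for any number of unique car_brand classes, and the user should only have to provide the number of speed bins or the bin edges (e.g. n=3 or [0,25,50]). I think pd.groupby and pd.cut will do the trick, but I have not found the exact way.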
Quang Hoang's answer works great. If you want to extend it to group by one more column, say wheel_kind, so that your data frame looks like:
speed_observation,car_brand,wheel_kind,traction_force
10, ford, winter, 2
20, ford, summer, 4
35, seat, summer, 8
50, ford, winter, 16
10, audi, summer, 2
20, audi, summer, 5
43, audi, summer, 2
12, seat, summer, 2.5
10, ford, summer, 0.5
30, audi, summer, 6
23, ford, summer, 4
17, seat, summer, 5.5
10, seat, summer, 10
38, audi, summer, 2
40, ford, summer, 9
19, ford, summer, 6.6
49, seat, summer, 18
18, ford, summer, 4
then simply add the wheel_kind column to the previous solution, more precisely:
df_grp = (df.groupby(['car_brand', 'wheel_kind', cuts])
    .traction_force.mean()
    .reset_index(name='avg_traction_force')
)
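Don't forget to drop the NaNs afterwards, because not every combination of brand, wheel kind and speed bin occurs in the data (in this table, seat and audi have no winter wheels):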
df_grp.dropna(inplace=True)
df_grp.reset_index(drop=True, inplace=True) #just to reset the index
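As a possible alternative to dropping the NaN rows afterwards, the sketch below passes observed=True to groupby so that only combinations that actually occur are returned. It rebuilds the example table and the cuts series itself so it runs on its own; the variable names are illustrative, not part of the original answer.

```python
import pandas as pd

# Example table from the question, with the extra wheel_kind column
df = pd.DataFrame({
    "speed_observation": [10, 20, 35, 50, 10, 20, 43, 12, 10,
                          30, 23, 17, 10, 38, 40, 19, 49, 18],
    "car_brand": ["ford", "ford", "seat", "ford", "audi", "audi", "audi",
                  "seat", "ford", "audi", "ford", "seat", "seat", "audi",
                  "ford", "ford", "seat", "ford"],
    "wheel_kind": ["winter", "summer", "summer", "winter"] + ["summer"] * 14,
    "traction_force": [2, 4, 8, 16, 2, 5, 2, 2.5, 0.5,
                       6, 4, 5.5, 10, 2, 9, 6.6, 18, 4],
})

# Same bins as in the pd.cut answer
cuts = pd.cut(df["speed_observation"], [0, 25, 50])

# observed=True keeps only the bin categories that occur within each
# (car_brand, wheel_kind) group, so no all-NaN rows are produced
df_grp = (df.groupby(["car_brand", "wheel_kind", cuts], observed=True)
            ["traction_force"].mean()
            .reset_index(name="avg_traction_force"))
print(df_grp)
```

Whether this or dropna is preferable depends on whether you want to see the empty combinations first; the result should contain the same non-empty groups either way.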
You can cut speed_observation with the desired bins and group by it:
cuts = pd.cut(df['speed_observation'], [0,25,50])
(df.groupby(['car_brand', cuts])
.traction_force.mean()
.reset_index(name='avg_traction_force')
)
Output:
car_brand speed_observation avg_traction_force
0 audi (0, 25] 3.500000
1 audi (25, 50] 3.333333
2 ford (0, 25] 3.516667
3 ford (25, 50] 12.500000
4 seat (0, 25] 6.000000
5 seat (25, 50] 13.000000
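The question also asks that the user can pass either a number of bins or explicit bin edges, and that the bin is reported by its upper limit. A minimal sketch of one way to wrap the approach above follows. The helper name mean_force_per_bin is hypothetical, and for an integer bins the equal-width edges are computed here with np.linspace between the observed minimum and maximum, which is an assumption of this sketch rather than something stated in the answer.

```python
import numpy as np
import pandas as pd

# Example table from the question
df = pd.DataFrame({
    "speed_observation": [10, 20, 35, 50, 10, 20, 43, 12, 10,
                          30, 23, 17, 10, 38, 40, 19, 49, 18],
    "car_brand": ["ford", "ford", "seat", "ford", "audi", "audi", "audi",
                  "seat", "ford", "audi", "ford", "seat", "seat", "audi",
                  "ford", "ford", "seat", "ford"],
    "traction_force": [2, 4, 8, 16, 2, 5, 2, 2.5, 0.5,
                       6, 4, 5.5, 10, 2, 9, 6.6, 18, 4],
})


def mean_force_per_bin(df, bins):
    """Hypothetical helper: bins is an int (number of equal-width bins)
    or a list of bin edges such as [0, 25, 50]."""
    speed = df["speed_observation"]
    if isinstance(bins, int):
        # Assumption: equal-width bins spanning the observed speed range
        edges = np.linspace(speed.min(), speed.max(), bins + 1)
    else:
        edges = np.asarray(bins, dtype=float)
    # Label every bin with its upper edge; include_lowest closes the first bin
    upper = pd.cut(speed, edges, labels=list(edges[1:]), include_lowest=True)
    # Convert the categorical labels to plain floats before grouping
    upper = upper.astype(float).rename("speed_bin_upper_lim")
    return (df.groupby([upper, "car_brand"])["traction_force"].mean()
              .reset_index(name="avrg_traction_force_in_speed_bin"))


print(mean_force_per_bin(df, [0, 25, 50]))  # explicit edges
print(mean_force_per_bin(df, 2))            # two equal-width bins
```

Converting the labels to floats means the grouping key is no longer categorical, so speeds that fall outside the given edges become NaN and are simply left out of the result.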
We can create a series to group by manually, as an alternative to pd.cut:
n = 25
blocks = (df.speed_observation.sub(1) // n).add(1).mul(n)
blocks = blocks.rename('speed_bin_upper_lim')
(df.groupby([blocks, 'car_brand'])
.traction_force.mean()
.reset_index(name='avrg_traction_force_in_speed_bin'))
speed_bin_upper_lim car_brand avrg_traction_force_in_speed_bin
0 25 audi 3.500000
1 25 ford 3.516667
2 25 seat 6.000000
3 50 audi 3.333333
4 50 ford 12.500000
5 50 seat 13.000000
Details:
print(blocks)
0 25
1 25
2 50
3 50
4 25
5 25
6 50
7 25
8 25
9 50
10 25
11 25
12 25
13 50
14 50
15 25
16 50
17 25
Name: speed_bin_upper_lim, dtype: int64
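The expression for blocks rounds each speed up to the next multiple of n, which matches the right-closed bins (0, 25] and (25, 50] only for integer speeds. The small sketch below, on made-up values, compares it with a ceiling-based variant that also handles fractional speeds; the np.ceil variant is a suggested generalization, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Made-up speeds, including one non-integer value just above a bin edge
speed = pd.Series([10, 25, 26, 50, 25.5])
n = 25

# Formula from the answer: exact for integer speeds only
int_trick = (speed.sub(1) // n).add(1).mul(n)

# Ceiling-based variant: rounds any positive speed up to a multiple of n
ceil_based = np.ceil(speed / n) * n

print(pd.DataFrame({"speed": speed,
                    "int_trick": int_trick,
                    "ceil_based": ceil_based}))
```

The two columns agree for the integer speeds and differ for 25.5, which the integer formula places in the 25 bin while the ceiling variant places it in the 50 bin.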