Pandas Correlation Error Using Sklearn Metrics
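I am trying to calculate r2, or r squared, for a large dataset with pandas, grouping the data in the dataframe by plant_name and month, as in "data1" shown below. The problem is that when I use sklearn metrics with a function I defined, the results do not match what I get in Excel with the same data from "data1". Here is the data in "data1":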
plant_name month year wind_speed_obs wind_speed_ms
0 BIG HORN I 1 2018 5.143830 6.012436
1 BIG HORN I 1 2019 4.556545 5.231855
2 BIG HORN I 1 2020 6.582890 7.866532
3 BIG HORN I 2 2018 7.904438 9.248810
4 BIG HORN I 2 2019 4.353567 5.115625
5 BIG HORN I 2 2020 7.376739 8.408046
6 BIG HORN I 3 2018 6.138197 6.922043
7 BIG HORN I 3 2019 3.881804 4.484274
8 BIG HORN I 3 2020 7.071029 7.347177
9 BIG HORN I 4 2018 7.106936 7.699861
10 BIG HORN I 4 2019 6.874942 7.575278
11 BIG HORN I 4 2020 6.855979 7.106250
12 BIG HORN I 5 2018 5.366054 6.510753
13 BIG HORN I 5 2019 5.657342 6.597581
14 BIG HORN I 5 2020 7.010745 7.247043
15 BIG HORN I 6 2018 6.399417 7.076528
16 BIG HORN I 6 2019 6.578241 7.556111
17 BIG HORN I 6 2020 7.120105 7.548194
18 BIG HORN I 7 2018 5.615110 6.123925
19 BIG HORN I 7 2019 6.212104 6.963441
20 BIG HORN I 7 2020 6.663250 6.972312
21 BIG HORN I 8 2018 5.303967 5.947312
22 BIG HORN I 8 2019 5.176691 6.209274
23 BIG HORN I 8 2020 6.093748 6.337634
24 BIG HORN I 9 2018 5.375531 5.878472
25 BIG HORN I 9 2019 6.126961 6.792500
26 BIG HORN I 9 2020 5.608530 6.028056
27 BIG HORN I 10 2018 4.466079 5.054973
28 BIG HORN I 10 2019 5.492795 6.326075
29 BIG HORN I 10 2020 7.103278 7.492070
30 BIG HORN I 11 2018 5.341987 5.889028
31 BIG HORN I 11 2019 4.887397 5.144028
32 BIG HORN I 11 2020 6.718649 7.150000
33 BIG HORN I 12 2018 5.099386 5.866935
34 BIG HORN I 12 2019 3.925717 4.234140
35 BIG HORN I 12 2020 5.589325 5.943145
Here is the code I am using:
import pandas as pd
from sklearn.metrics import r2_score

def r2_rmse2(g):
    # Coefficient of determination of wind_speed_ms against wind_speed_obs
    r2 = r2_score(g['wind_speed_obs'], g['wind_speed_ms'])
    # rmse = np.sqrt(mean_squared_error(g['wind_speed_obs'], g['wind_speed_ms']))
    return pd.Series(dict(r2=r2))

data1.groupby(['plant_name', 'month']).apply(r2_rmse2).reset_index()
This is the result I get when applying the r2_rmse2 function above:
plant_name month r2
0 BIG HORN I 1 -0.314771
1 BIG HORN I 2 0.529890
2 BIG HORN I 3 0.804066
3 BIG HORN I 4 -22.164720
4 BIG HORN I 5 -0.460690
5 BIG HORN I 6 -4.673359
6 BIG HORN I 7 -0.662166
7 BIG HORN I 8 -2.118815
8 BIG HORN I 9 -1.946566
9 BIG HORN I 10 0.662636
10 BIG HORN I 11 0.696896
11 BIG HORN I 12 0.446235
When I test the function in Excel, the correct results I should get are:
plant_name month r2
BIG HORN I 1 0.999975202
BIG HORN I 2 0.998459857
BIG HORN I 3 0.988712352
BIG HORN I 4 0.711649414
BIG HORN I 5 0.998282523
BIG HORN I 6 0.681460011
BIG HORN I 7 0.907152074
BIG HORN I 8 0.66212225
BIG HORN I 9 0.98807953
BIG HORN I 10 0.988469127
BIG HORN I 11 0.990836283
BIG HORN I 12 0.968629237
I can't understand why the applied function gives incorrect results. Thanks for your help.
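Here is the calculation of R squared, RMSE, and the Pearson correlation coefficient for your data. Your Excel figures match the correl column (the Pearson correlation coefficient), not the r2 column that r2_score produces:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse2(g):
    obs, model = g['wind_speed_obs'], g['wind_speed_ms']
    # Coefficient of determination: 1 - SS_res / SS_tot, which can be negative
    r2 = r2_score(obs, model)
    # np.sqrt(MSE) rather than squared=False, which is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(obs, model))
    # Pearson correlation coefficient, the quantity the Excel figures match
    correl = pearsonr(obs, model)[0]
    return pd.Series(dict(r2=r2, rmse=rmse, correl=correl))

data1.groupby(['plant_name', 'month']).apply(r2_rmse2).reset_index()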
plant_name month r2 rmse correl
0 BIG HORN I 1 -0.314771 0.976090 0.999975
1 BIG HORN I 2 0.529890 1.072639 0.998460
2 BIG HORN I 3 0.804066 0.592633 0.988712
3 BIG HORN I 4 -22.164844 0.549141 0.711649
4 BIG HORN I 5 -0.460691 0.866068 0.998283
5 BIG HORN I 6 -4.673359 0.729833 0.681460
6 BIG HORN I 7 -0.662167 0.553450 0.907152
7 BIG HORN I 8 -2.118817 0.716380 0.662122
8 BIG HORN I 9 -1.946562 0.539102 0.988080
9 BIG HORN I 10 0.662637 0.630426 0.988469
10 BIG HORN I 11 0.696896 0.428632 0.990836
11 BIG HORN I 12 0.446234 0.519437 0.968629
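To see why the two columns differ so much, it may help to compute both quantities by hand for a single group. The sketch below uses only the three January rows from data1 and plain Python. The helper names coefficient_of_determination and pearson_r are made up for this illustration and are not part of sklearn or scipy. It assumes r2_score follows the usual definition 1 - SS_res / SS_tot, which penalizes the consistent offset between wind_speed_ms and wind_speed_obs, whereas the Pearson coefficient ignores any constant offset or scaling.

```python
import math

# January rows for BIG HORN I, copied from data1 above
obs = [5.143830, 4.556545, 6.582890]    # wind_speed_obs
model = [6.012436, 5.231855, 7.866532]  # wind_speed_ms

def coefficient_of_determination(y_true, y_pred):
    """1 - SS_res / SS_tot (hypothetical helper mirroring what r2_score reports)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Pearson correlation coefficient (hypothetical helper)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r2 = coefficient_of_determination(obs, model)
r = pearson_r(obs, model)
print(f"r2_score-style R^2: {r2:.6f}")
print(f"Pearson r:          {r:.6f}")
print(f"Pearson r squared:  {r * r:.6f}")
```

The first two printed values should come out close to the month 1 entries in the table above (r2 near -0.3148, correl near 0.999975). The model values track the observations almost perfectly but sit consistently higher, so the residuals are larger than the spread of the observations around their own mean, and R squared goes negative. If what you want is the squared correlation, squaring correl inside r2_rmse2 is one option.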