Pandas 使用 Sklearn 指标的相关误差

Question

我正在尝试使用 pandas 计算大型数据集的 r2 或 r 平方，并按 plant_name 和月份在数据框中对数据进行分组，如下所示的“data1”。问题是，当我使用 sklearn 指标和定义的函数时，我得到的结果与我在 Excel 中使用“data1”中的相同数据获得的结果不一致。这是“data1”中的数据

    plant_name  month  year  wind_speed_obs  wind_speed_ms
0   BIG HORN I      1  2018        5.143830       6.012436
1   BIG HORN I      1  2019        4.556545       5.231855
2   BIG HORN I      1  2020        6.582890       7.866532
3   BIG HORN I      2  2018        7.904438       9.248810
4   BIG HORN I      2  2019        4.353567       5.115625
5   BIG HORN I      2  2020        7.376739       8.408046
6   BIG HORN I      3  2018        6.138197       6.922043
7   BIG HORN I      3  2019        3.881804       4.484274
8   BIG HORN I      3  2020        7.071029       7.347177
9   BIG HORN I      4  2018        7.106936       7.699861
10  BIG HORN I      4  2019        6.874942       7.575278
11  BIG HORN I      4  2020        6.855979       7.106250
12  BIG HORN I      5  2018        5.366054       6.510753
13  BIG HORN I      5  2019        5.657342       6.597581
14  BIG HORN I      5  2020        7.010745       7.247043
15  BIG HORN I      6  2018        6.399417       7.076528
16  BIG HORN I      6  2019        6.578241       7.556111
17  BIG HORN I      6  2020        7.120105       7.548194
18  BIG HORN I      7  2018        5.615110       6.123925
19  BIG HORN I      7  2019        6.212104       6.963441
20  BIG HORN I      7  2020        6.663250       6.972312
21  BIG HORN I      8  2018        5.303967       5.947312
22  BIG HORN I      8  2019        5.176691       6.209274
23  BIG HORN I      8  2020        6.093748       6.337634
24  BIG HORN I      9  2018        5.375531       5.878472
25  BIG HORN I      9  2019        6.126961       6.792500
26  BIG HORN I      9  2020        5.608530       6.028056
27  BIG HORN I     10  2018        4.466079       5.054973
28  BIG HORN I     10  2019        5.492795       6.326075
29  BIG HORN I     10  2020        7.103278       7.492070
30  BIG HORN I     11  2018        5.341987       5.889028
31  BIG HORN I     11  2019        4.887397       5.144028
32  BIG HORN I     11  2020        6.718649       7.150000
33  BIG HORN I     12  2018        5.099386       5.866935
34  BIG HORN I     12  2019        3.925717       4.234140
35  BIG HORN I     12  2020        5.589325       5.943145

这是我使用的代码：

from sklearn.metrics import r2_score
def r2_rmse2( g ):
    r2 = r2_score( g['wind_speed_obs'], g['wind_speed_ms'] )
    #rmse = np.sqrt( mean_squared_error( g['wind_speed_obs'], g['wind_speed_ms'] ) )
    return pd.Series( dict(  r2 = r2 ) )
data1.groupby( ['plant_name','month'] ).apply( r2_rmse2 ).reset_index()

我在应用上面的 r2_rmse2 函数时得到了这个结果：

    plant_name  month         r2
0   BIG HORN I      1  -0.314771
1   BIG HORN I      2   0.529890
2   BIG HORN I      3   0.804066
3   BIG HORN I      4 -22.164720
4   BIG HORN I      5  -0.460690
5   BIG HORN I      6  -4.673359
6   BIG HORN I      7  -0.662166
7   BIG HORN I      8  -2.118815
8   BIG HORN I      9  -1.946566
9   BIG HORN I     10   0.662636
10  BIG HORN I     11   0.696896
11  BIG HORN I     12   0.446235

当我测试 Excel 中的函数时，我应该获得应用该函数的正确结果是：

plant_name  month   r2
BIG HORN I  1   0.999975202
BIG HORN I  2   0.998459857
BIG HORN I  3   0.988712352
BIG HORN I  4   0.711649414
BIG HORN I  5   0.998282523
BIG HORN I  6   0.681460011
BIG HORN I  7   0.907152074
BIG HORN I  8   0.66212225
BIG HORN I  9   0.98807953
BIG HORN I  10  0.988469127
BIG HORN I  11  0.990836283
BIG HORN I  12  0.968629237

我无法理解为什么应用函数不正确。感谢您的帮助。

Answer 1

这是对您的数据计算 R 平方、RMSE 和 Pearson correlation coefficient（在 Excel 中使用）：

from sklearn.metrics import r2_score, mean_squared_error
from scipy.stats import pearsonr
def r2_rmse2(g):
    r2 = r2_score(g['wind_speed_obs'], g['wind_speed_ms'])
    rmse = mean_squared_error(g['wind_speed_obs'], g['wind_speed_ms'], squared=False)
    correl = pearsonr(g['wind_speed_obs'], g['wind_speed_ms'])[0]
    return pd.Series( dict(  r2 = r2, rmse=rmse, correl=correl ) )
data1.groupby( ['plant_name','month'] ).apply( r2_rmse2 ).reset_index()

    plant_name  month         r2      rmse    correl
0   BIG HORN I      1  -0.314771  0.976090  0.999975
1   BIG HORN I      2   0.529890  1.072639  0.998460
2   BIG HORN I      3   0.804066  0.592633  0.988712
3   BIG HORN I      4 -22.164844  0.549141  0.711649
4   BIG HORN I      5  -0.460691  0.866068  0.998283
5   BIG HORN I      6  -4.673359  0.729833  0.681460
6   BIG HORN I      7  -0.662167  0.553450  0.907152
7   BIG HORN I      8  -2.118817  0.716380  0.662122
8   BIG HORN I      9  -1.946562  0.539102  0.988080
9   BIG HORN I     10   0.662637  0.630426  0.988469
10  BIG HORN I     11   0.696896  0.428632  0.990836
11  BIG HORN I     12   0.446234  0.519437  0.968629

Pandas 使用 Sklearn 指标的相关误差

Pandas Correlation Error Using Sklearn Metrics

correlation

pandas

sklearn-pandas

pandas-groupby