Pandas Calculate RMSE in Date Range Chunks by Year

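I have data in a df and need to calculate the RMSE of a column, made up of rows of month and year data, against the current month and year rows, in chunked periods. I can't figure out how to set up the grouping for each year. For example, I need to calculate the RMSE from month == 5 through month == 2 for each year, and print all RMSE values for the 'Variation' column labeled by the starting year. My data looks like this: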

                     month  mean_mon_flow  ...  std_anomaly  Variation
date                                       ...                        
1992-04-01 00:00:00      4      12.265100  ...    -1.074586        NaN
1992-05-01 00:00:00      5      12.533220  ...    -1.017388   0.057198
1992-06-01 00:00:00      6      12.491247  ...    -1.117406  -0.100018
1992-07-01 00:00:00      7      12.113165  ...    -1.401221  -0.283815
1992-08-01 00:00:00      8      11.846904  ...    -1.359026   0.042195
1992-09-01 00:00:00      9      11.526178  ...    -0.299250   1.059776
1992-10-01 00:00:00     10      11.555834  ...    -0.628162  -0.328911
1992-11-01 00:00:00     11      11.746104  ...    -1.116374  -0.488213
1992-12-01 00:00:00     12      11.891824  ...    -0.143343   0.973031
1993-01-01 00:00:00      1      11.997252  ...    -0.486450  -0.343107
1993-02-01 00:00:00      2      12.028855  ...    -0.862971  -0.376521
1993-03-01 00:00:00      3      12.063974  ...    -0.596869   0.266102
1993-04-01 00:00:00      4      12.265100  ...    -0.923695  -0.326826
1993-05-01 00:00:00      5      12.533220  ...     0.322987   1.246682
1993-06-01 00:00:00      6      12.491247  ...    -0.478567  -0.801554
1993-07-01 00:00:00      7      12.113165  ...    -0.274119   0.204448
1993-08-01 00:00:00      8      11.846904  ...    -0.707968  -0.433849
1993-09-01 00:00:00      9      11.526178  ...     0.167246   0.875214
1993-10-01 00:00:00     10      11.555834  ...    -0.089410  -0.256656
1993-11-01 00:00:00     11      11.746104  ...    -1.046461  -0.957050
1993-12-01 00:00:00     12      11.891824  ...    -1.293175  -0.246714
1994-01-01 00:00:00      1      11.997252  ...    -1.505133  -0.211959
1994-02-01 00:00:00      2      12.028855  ...    -0.610121   0.895012
1994-03-01 00:00:00      3      12.063974  ...    -0.974184  -0.364063
1994-04-01 00:00:00      4      12.265100  ...    -1.077609  -0.103424

The observed data for this year looks like this:

                     month  mean_mon_flow  ...  std_anomaly  Variation
date                                       ...                        
2021-05-01 00:00:00      5      12.533220  ...    -0.935899   0.206586
2021-06-01 00:00:00      6      12.491247  ...    -0.647261   0.288638
2021-07-01 00:00:00      7      12.113165  ...    -0.711730  -0.064469
2021-08-01 00:00:00      8      11.846904  ...    -0.482306   0.229424
2021-09-01 00:00:00      9      11.526178  ...    -0.116989   0.365317
2021-10-01 00:00:00     10      11.555834  ...     0.319614   0.436603
2021-11-01 00:00:00     11      11.746104  ...     0.880379   0.560765
2021-12-01 00:00:00     12      11.891824  ...     0.630541  -0.249838
2022-01-01 00:00:00      1      11.997252  ...    -0.151507  -0.782048
2022-02-01 00:00:00      2      12.028855  ...    -0.237398  -0.085891

The result should look like the following. I've tried using a groupby statement to calculate the RMSE, but I'm not sure how to give groupby a date range.

year   RMSE Variation
1992   number
1993   number
1994   number 
..     ..
2020   number

Thanks,

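Start with some pre-processing of the previous years' data frame. First, get a year label by subtracting 4 months from each date and taking the year, so that May through the following February share the starting year. Second, drop March and April.

import pandas as pd

# Assumes 'date' is a datetime column; if it is the index, call df_prev.reset_index() first.
# Shift each date back 4 months so that May..Feb of one season get the same year label.
df_prev['year'] = (df_prev['date'] - pd.DateOffset(months=4)).dt.year

# Drop March and April, which fall outside the May-to-February window.
df_prev = df_prev[~df_prev['month'].isin([3,4])]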
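To check that the 4-month shift labels a season the way it is meant to, here is a minimal sketch on made-up dates. The frame `demo` is hypothetical and not part of the original data, and `pd.DateOffset` is used here as one possible way to shift the dates.

```python
import pandas as pd

# Hypothetical monthly dates: April 1992 through April 1993.
demo = pd.DataFrame({"date": pd.date_range("1992-04-01", "1993-04-01", freq="MS")})
demo["month"] = demo["date"].dt.month

# Shift back 4 months, then take the year.
demo["year"] = (demo["date"] - pd.DateOffset(months=4)).dt.year

# Drop March and April, which are outside the May-to-February window.
demo = demo[~demo["month"].isin([3, 4])]

# Months that remain under each season label.
print(demo.groupby("year")["month"].apply(list))
```

In this toy frame, every remaining row from May 1992 through February 1993 ends up under the label 1992.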

Then convert df_prev into a matrix with years as columns and months as the index, and convert this year's table into a series indexed by month.

df_prev_vari = df_prev.set_index(['month', 'year'])[['Variation']].unstack().droplevel(0, axis=1)
df_this_vari = df_this.set_index('month')['Variation']
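The reshaping step may be easier to follow on a tiny example. This sketch uses a hypothetical frame `long_df` with invented values to show what the `set_index(...).unstack().droplevel(...)` chain produces.

```python
import pandas as pd

# Hypothetical long-format data: two seasons, three months each.
long_df = pd.DataFrame({
    "year": [1992, 1992, 1992, 1993, 1993, 1993],
    "month": [5, 6, 7, 5, 6, 7],
    "Variation": [0.1, -0.2, 0.3, 0.4, 0.0, -0.1],
})

# Rows become months and columns become season years; droplevel removes
# the leftover 'Variation' level from the column MultiIndex.
wide = long_df.set_index(["month", "year"])[["Variation"]].unstack().droplevel(0, axis=1)
print(wide)
```

Each column of `wide` then holds one season, indexed by month, which is the shape the subtraction step relies on.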

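Having month as the common index for both lets us subtract one from the other by matching on the index, and then apply the square, mean, and square-root operations.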

(df_prev_vari.sub(df_this_vari, axis=0)**2).mean()**.5
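Since the question asked about groupby, here is a hedged sketch that runs the matrix approach next to a groupby-based alternative on invented numbers. The data values and the names `wide`, `merged`, `this_year`, `rmse_wide`, and `rmse_grouped` are assumptions for illustration only. The groupby variant is not from the answer above, just one way it could be written.

```python
import numpy as np
import pandas as pd

# Hypothetical data: previous seasons in long format, already labeled by season year.
df_prev = pd.DataFrame({
    "year": [1992, 1992, 1993, 1993],
    "month": [5, 6, 5, 6],
    "Variation": [1.0, 2.0, 3.0, 5.0],
})
# Hypothetical current-season values, indexed by month.
df_this_vari = pd.Series([0.0, 0.0], index=pd.Index([5, 6], name="month"), name="Variation")

# Matrix approach: months as rows, years as columns, subtract row-wise by month.
wide = df_prev.set_index(["month", "year"])[["Variation"]].unstack().droplevel(0, axis=1)
rmse_wide = (wide.sub(df_this_vari, axis=0) ** 2).mean() ** 0.5

# groupby alternative: attach this year's value to each row by month,
# then average the squared errors per season year.
merged = df_prev.join(df_this_vari.rename("this_year"), on="month")
sq_err = (merged["Variation"] - merged["this_year"]) ** 2
rmse_grouped = np.sqrt(sq_err.groupby(merged["year"]).mean())

print(rmse_grouped)
```

Both variants give the same per-year values on this toy input. Note that `mean()` skips NaN by default, so a season with missing months is averaged over the months it does have.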