Rolling OLS Regressions and Predictions by Group

Question

I have a Pandas dataframe containing some data about racers. The relevant columns look like this:


|Date    |Name     |Distance   |avg_speed_calc  
|----    |----     |----       |----         
|9/6/20  | Smith   | 8         | 85.6
|9/6/20  | Douglas | 8         | 84.9
|9/6/20  | Mostern | 8         | 84.3

.......

|Date    |Name      |Distance  |avg_speed_calc 
|:----   |:----     |:-----    |:----         
|4/5/21  | Smith    | 6        | 88.7
|4/5/21  | Robinson | 6        | 89.3
|4/5/21  | Thomas   | 6        | 87.5

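In the data above, each race has a different number of participants, and not every racer takes part in every event. Some may even have only a single row.

I am trying to fit a simple OLS model to get an idea of each racer's preferred distance, by regressing his avg_speed_calc on Distance alone. Because I want the calculation to update after every race, I have been trying to use RollingOLS from StatsModels. With the help of a somewhat similar question from a few years ago, my code currently looks like this: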

from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm
import pandas as pd
import numpy as np


dist_pref = df.groupby(["Name"]).apply(lambda x: RollingOLS(endog=x['avg_speed_calc'], exog=sm.add_constant(x['Distance']), min_nobs=2).fit().params)

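However, this throws the message "ValueError: could not broadcast input array from shape (0) into shape (1)", which I have not been able to fix so far.

I have also tried a helper function, based on another old question, to no avail: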

def ols_res(x, y):
    return pd.Series(RollingOLS(y, x).fit().predict)

df_dist = df.groupby(['Name']).apply(lambda x : x['Distance'].apply(ols_res, y=x['avg_speed_calc']))

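Ideally, I would like to predict avg_speed_calc for the current day's race using only the data from previous races, so that I can compare it with the actual avg_speed_calc reported in that row. My idea was to create a separate DataFrame column for each coefficient/intercept, shift them down by one, and then use those numbers together with the current race's Distance to make the prediction, but perhaps there is a more efficient way.

Assuming the resulting dataframe is then sorted by date and racer, my desired final output (requiring two or more prior races for a prediction) is a dataframe containing information like this for each racer: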

|Date      |Name   |Distance |avg_speed_calc |predicted_speed_calc
|----      |----   |----     |----           |----
|9/6/20    | Smith | 8       | 85.6          | nan
|11/15/20  | Smith | 6       | 82.6          | nan
|1/4/21    | Smith | 7       | 83.4          | 84.1
|2/20/21   | Smith | 7       | 82.9          | 83.9
|4/5/21    | Smith | 8.5     | 84.8          | 85.7

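It would also be nice to have another column containing the standard error of predicted_speed_calc, but that is a secondary problem that I will tackle later. Thanks in advance for any guidance and suggestions on the above.

Edit 2:

Thanks to @LarryBird, I now have a function that works for the rolling linear OLS results. However, if possible, I would also like to try a second-degree polynomial fit using the 'from_formula' method, and to return the residuals so that I can compare them with the linear fit.

I am trying to fit the quadratic based on the information here and the 'speed_preference_from_formula' function defined below. However, if I reference the same column twice (i.e. using 'avg_speed_calc ~ Distance + Distance**2'), 'params' returns only two columns instead of the expected three. I created a helper column for the squared values (see below), but the coefficients returned are clearly inaccurate, so I know I am doing something wrong.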

import numpy.polynomial.polynomial as poly
import patsy

df['Distance2']  = df['Distance']**2
grouped2 = df.groupby('Name')
form = "avg_speed_calc ~ Distance + Distance2"
params2 = grouped2.apply(lambda x: speed_preference_from_formula(x, form, 4))

Answer 1

You should be able to achieve what you want with the groupby / apply pattern. The code below should help.

Create the sample data:

from statsmodels.regression.rolling import RollingOLS
from statsmodels.tools.tools import add_constant
import pandas as pd
import numpy as np

# make some toy data
race_dates = pd.to_datetime(['2020-06-09']*3 + ['2020-12-01']*4 + ['2021-01-21']*4 + ['2021-05-04']*5)
distance = [8]*3 + [7]*4 + [4]*4 + [6]*5
avg_speed = 80 + np.random.randn(len(distance))*10
names = ['Smith', 'Douglas', 'Mostern',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson', 'Thomas']

df = pd.DataFrame({'Date': race_dates,
                  'Name': names,
                  'Distance': distance,
                  'avg_speed_calc': avg_speed})

Define the helper functions:

def speed_preference(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a dataframe containing model parameters
    """
    # we should have at least (number_of_parameters) + 1 observations
    min_obs = 3 if intercept else 2

    # if there are fewer than min_obs rows in df_racer, RollingOLS will throw an error
    # Instead, handle this case separately and return an all-NaN frame of floats
    if df_racer.shape[0] < min_obs:
        cols = ['const', 'Distance'] if intercept else ['Distance']
        return pd.DataFrame(index=df_racer.index, columns=cols, dtype=float)
    
    y = df_racer['avg_speed_calc']
    x = add_constant(df_racer['Distance']) if intercept else df_racer['Distance']
    
    model = RollingOLS(y, x, expanding=True, min_nobs=min_obs).fit()
    return model.params


def speed_prediction(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a series containing predicted speed
    """
    params = speed_preference(df_racer, intercept)
    params_shifted = params.shift(1)
    
    if intercept:
        return (params_shifted.mul(add_constant(df_racer['Distance']), axis=0)\
                .sum(axis=1, min_count=1)).to_frame('predicted_speed_calc')
    
    return (params_shifted.mul(df_racer['Distance'], axis=0))\
                .rename({'Distance': 'predicted_speed_calc'}, axis=1)

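The speed_preference function computes the rolling OLS for a single driver and returns the fitted parameters. The speed_prediction function computes the predicted speed using the model from the previous race (note params_shifted), as requested. Putting it all together only takes a simple groupby and join:

Without intercept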

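# group_keys=False keeps the original row index on the result, so the join below lines up (needed on pandas >= 2.0)
grouped = df.groupby('Name', group_keys=False)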
params = grouped.apply(speed_preference)
predictions = grouped.apply(speed_prediction)

df_out_no_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_no_intercept

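With intercept

# group_keys=False keeps the original row index on the result, so the join below lines up (needed on pandas >= 2.0)
grouped = df.groupby('Name', group_keys=False)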
params = grouped.apply(lambda x: speed_preference(x, True))
predictions = grouped.apply(lambda x: speed_prediction(x, True))

df_out_w_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_w_intercept
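
As a sanity check on the "previous races only" logic, the same kind of prediction can be computed by brute force: for each race, fit a least-squares line on that racer's earlier races and evaluate it at the current distance. The following is a minimal sketch, not part of the original answer. The helper name naive_prediction and its min_prior parameter are hypothetical, and the sample rows are Smith's rows from the desired output in the question.

```python
import numpy as np
import pandas as pd


def naive_prediction(df_racer, min_prior=2):
    """Hypothetical helper: predict each row from a line fitted on the earlier rows only."""
    dist = df_racer['Distance'].to_numpy(dtype=float)
    speed = df_racer['avg_speed_calc'].to_numpy(dtype=float)
    preds = np.full(len(df_racer), np.nan)

    for i in range(min_prior, len(df_racer)):
        # Design matrix [1, Distance] built from races 0 .. i-1 only
        X = np.column_stack([np.ones(i), dist[:i]])
        coef, *_ = np.linalg.lstsq(X, speed[:i], rcond=None)
        # Evaluate the fitted line at the distance of the current race
        preds[i] = coef[0] + coef[1] * dist[i]

    return pd.Series(preds, index=df_racer.index, name='predicted_speed_calc')


smith = pd.DataFrame({'Distance': [8, 6, 7, 7, 8.5],
                      'avg_speed_calc': [85.6, 82.6, 83.4, 82.9, 84.8]})
pred = naive_prediction(smith)
print(smith.assign(predicted_speed_calc=pred))
```

With min_prior=2 the first prediction appears at the third race, as in the question's desired output. The speed_prediction helper above uses min_obs = 3 when an intercept is fitted, so its first prediction appears one race later.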

Edit

If you want to fit the model from a formula, you can use:

def speed_preference_from_formula(df_racer, formula, min_nobs):
    """
    Function to operate on the data of a single driver. "formula" should reference column names in df_racer.
    min_nobs should always be >= (# of parameters in the model)+1
    """
    
    # if there are fewer than min_nobs rows in df_racer, RollingOLS will throw an error
    # Instead, handle this case separately
    if df_racer.shape[0] < min_nobs:
        return None

    model = RollingOLS.from_formula(formula, data=df_racer, expanding=True, min_nobs=min_nobs, window=None).fit()
    return model.params

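Then, for a polynomial model, you can compute the parameters as follows:

grouped = df.groupby('Name')
# I(...) makes patsy evaluate Distance**2 as arithmetic instead of as a formula operator
formula = "avg_speed_calc ~ 1 + Distance + I(Distance**2)"
grouped.apply(lambda x: speed_preference_from_formula(x, formula, 4))

Note that you will also need to edit the speed prediction function so that it handles the parameters correctly and generates the predictions.

Note that the formula refers to column names in the dataframe that is passed in. 1 indicates that an intercept should be used (similar to sm.add_constant), Distance uses the values in the Distance column directly, and I(Distance**2) squares the values in the Distance column and uses the result as a feature. The I() wrapper matters: a bare Distance**2 is treated by patsy as term expansion and collapses to Distance, and ^ is Python's bitwise XOR rather than a power. To fit a cubic model, add the term + I(Distance**3).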
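
The observation in the question that 'avg_speed_calc ~ Distance + Distance**2' yields only two parameter columns can be reproduced by inspecting the design matrix that patsy builds. This is a small sketch on a made-up toy frame; it assumes patsy is installed, which is the case wherever the statsmodels formula interface works.

```python
import pandas as pd
from patsy import dmatrix

# Toy data, invented for illustration only
toy = pd.DataFrame({'Distance': [4.0, 6.0, 7.0, 8.0]})

# '**' is a formula operator in patsy (term expansion), not arithmetic
interaction = dmatrix("Distance + Distance**2", toy, return_type='dataframe')

# I(...) protects the expression so it is evaluated as ordinary Python
squared = dmatrix("Distance + I(Distance**2)", toy, return_type='dataframe')

print(list(interaction.columns))  # intercept and Distance only
print(list(squared.columns))      # intercept, Distance, and the squared term
print(squared)
```

np.power(Distance, 2) inside the formula is an alternative to I(Distance**2), since function calls are also evaluated as ordinary Python.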

For a good reference on how to use "R-style" formulas, see here.

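Edit 2: Debugging the groupby

Think of the groupby.apply pattern as simply splitting the dataframe into several smaller dataframes, applying some function to each of them separately, and then joining the results back together. To see each sub-dataframe that the groupby produces, you can use: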

grouped = df.groupby('Name')
for name, group in grouped:
    print(f"The sub-dataframe for: {name}")
    print(group)

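This is useful because you can now see exactly what is passed to the function you use in .apply().

So for the error mentioned in the comments, you can apply the function to each group separately to narrow down where the error occurs.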
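
Applying the function group by group can be automated with a small loop that records which groups raise. The sketch below is illustrative only: find_failing_groups and needs_two_rows are hypothetical names, and the toy frame is invented.

```python
import pandas as pd


def find_failing_groups(df, by, func):
    """Hypothetical helper: run func on each group separately and collect the exceptions."""
    failures = {}
    for name, group in df.groupby(by):
        try:
            func(group)
        except Exception as exc:
            # Remember which group failed and why
            failures[name] = exc
    return failures


def needs_two_rows(group):
    # Stand-in for a model-fitting function that fails on very small groups
    if len(group) < 2:
        raise ValueError(f"only {len(group)} row(s)")
    return group['avg_speed_calc'].mean()


toy = pd.DataFrame({'Name': ['Smith', 'Smith', 'Thomas'],
                    'Distance': [8, 6, 6],
                    'avg_speed_calc': [85.6, 82.6, 87.5]})

failures = find_failing_groups(toy, 'Name', needs_two_rows)
for name, exc in failures.items():
    print(f"{name}: {type(exc).__name__}: {exc}")
```

Passing the real fitting function in place of needs_two_rows would show whether the error comes from racers with too few rows.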
