statsmodels OLS from formula with pandas groupby

I have a DataFrame of the following form:

       date         TICKER        x1       x2  ...       Z        Y  month    x3
0 1999-12-31    A UN Equity  52.1330  51.9645  ...  0.0052      NaN     12   NaN
1 1999-12-31   AA UN Equity  92.9415  92.8715  ...  0.0052      NaN     12   NaN
2 1999-12-31  ABC UN Equity   3.6843   3.6539  ...  0.0052      NaN     12   NaN
3 1999-12-31  ABF UN Equity  22.0625  21.9375  ...  0.0052      NaN     12   NaN
4 1999-12-31  ABM UN Equity  10.2188  10.1250  ...  0.0052      NaN     12   NaN

I want to run an OLS regression with the formula 'Y ~ x1 + x2:x3' for each group defined by ['TICKER','year','month'] (year is a column not shown here), using statsmodels.formula.api imported as smf. So I use:

data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))

However, I get the following error:

IndexError: tuple index out of range

Any idea why?

The full traceback is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "<input>", line 1, in <lambda>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
    super(OLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
    super(WLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
    self.data = self._handle_data(endog, exog, missing, hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
    arrays, nan_idx = self.handle_missing(endog, exog, missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
    if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range

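I see that your Y column has a lot of NaNs, so you need to make sure that each subgroup has enough observations for the regression to work.

So if I use example data: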

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(123)
# Random group keys plus four standard-normal columns for the regression
data = pd.concat([
    pd.DataFrame({'TICKER':np.random.choice(['A','B','C'],30),
                    'year':np.random.choice([2000,2001],30),
                    'month':np.random.choice([1,2],30)}),
    pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
],axis=1)

data.loc[:6,'Y'] = np.nan

If I run your code on the data frame above, I get the same error.
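To see which groups are likely responsible, one option is to count, per group, the rows in which every variable used by the formula is present. The following is only a minimal sketch on hypothetical toy data: the frame `toy`, its values, and the names `model_cols`, `n_complete` and `empty_groups` are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the asker's data): group A has no usable Y values
toy = pd.DataFrame({
    "TICKER": ["A", "A", "A", "B", "B", "B"],
    "year": [2000] * 6,
    "month": [1] * 6,
    "Y": [np.nan, np.nan, np.nan, 0.1, 0.2, 0.3],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "x3": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
})

model_cols = ["Y", "x1", "x2", "x3"]

# True where every variable used by the formula is present in that row
is_complete = toy[model_cols].notna().all(axis=1)

# Number of usable rows in each (TICKER, year, month) group
n_complete = is_complete.groupby([toy["TICKER"], toy["year"], toy["month"]]).sum()

# Groups that would hand the regression no usable rows at all
empty_groups = n_complete[n_complete == 0]

print(n_complete)
print(empty_groups)
```

Any group listed in `empty_groups` has nothing left once the missing values are dropped, so it is a natural candidate for the failure.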

So, if we use only the complete rows (complete with respect to the variables in your regression):

complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))

it works:

TICKER  year  month
A       2000  2        <statsmodels.regression.linear_model.OLS objec...
        2001  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
B       2000  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
        2001  1        <statsmodels.regression.linear_model.OLS objec...
C       2000  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
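Note that `smf.ols(...)` only builds the model, so the Series above holds unfitted model objects; `.fit()` has to be called to get estimates. As a rough sketch of how the coefficients could be collected per group, assuming hypothetical toy data and a hypothetical minimum group size `min_obs` (neither comes from the question):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: two tickers, one year/month, Y depends on x1
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    "TICKER": np.repeat(["A", "B"], n // 2),
    "year": 2000,
    "month": 1,
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
toy["Y"] = 1.0 + 2.0 * toy["x1"] + rng.normal(scale=0.1, size=n)
toy.loc[:4, "Y"] = np.nan  # a few missing responses in group A

formula = "Y ~ x1 + x2:x3"
group_cols = ["TICKER", "year", "month"]
min_obs = 5  # assumed threshold; should exceed the 3 parameters being estimated

# Keep only rows where all model variables are present
complete = toy.dropna(subset=["Y", "x1", "x2", "x3"])

records = []
for key, grp in complete.groupby(group_cols):
    if len(grp) < min_obs:
        continue  # skip groups too small to estimate
    params = smf.ols(formula=formula, data=grp).fit().params
    # One record per group: the group keys plus the fitted coefficients
    records.append({**dict(zip(group_cols, key)), **params.to_dict()})

coefs = pd.DataFrame(records).set_index(group_cols)
print(coefs)
```

An explicit loop is used here instead of `groupby(...).apply` so that small groups can be skipped and the result has one row of coefficients per group.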