Weekday as dummy / factor variable in a linear regression model using statsmodels

Question:

How can I add a dummy / factor variable to a model using sm.OLS()?

Details:

Structure of the data sample:

Date    A   B   weekday
2013-05-04  25.03   88.51   Saturday
2013-05-05  52.98   67.99   Sunday
2013-05-06  39.93   75.19   Monday
2013-05-07  47.31   86.99   Tuesday
2013-05-08  19.61   87.94   Wednesday
2013-05-09  39.51   83.10   Thursday
2013-05-10  21.22   62.16   Friday
2013-05-11  19.04   58.79   Saturday
2013-05-12  18.53   75.27   Sunday
2013-05-13  11.90   75.43   Monday
2013-05-14  47.64   64.76   Tuesday
2013-05-15  27.47   91.65   Wednesday
2013-05-16  11.20   59.83   Thursday
2013-05-17  25.10   67.47   Friday
2013-05-18  19.89   64.70   Saturday
2013-05-19  38.91   76.68   Sunday
2013-05-20  42.11   94.36   Monday
2013-05-21  7.845   73.67   Tuesday
2013-05-22  35.45   76.67   Wednesday
2013-05-23  29.43   79.05   Thursday
2013-05-24  33.51   78.53   Friday
2013-05-25  13.58   59.26   Saturday
2013-05-26  37.38   68.59   Sunday
2013-05-27  37.09   67.79   Monday
2013-05-28  21.70   70.54   Tuesday
2013-05-29  11.85   60.00   Wednesday

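A linear regression model of A on B is set up below using sm.OLS(), with a constant term added via sm.add_constant().

Complete code with the data sample for the regression analysis using statsmodels: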

# imports
import pandas as pd
import statsmodels.api as sm

# same data as described above
data = {'Date': {0: '2013-05-04',
          1: '2013-05-05',
          2: '2013-05-06',
          3: '2013-05-07',
          4: '2013-05-08',
          5: '2013-05-09',
          6: '2013-05-10',
          7: '2013-05-11',
          8: '2013-05-12',
          9: '2013-05-13',
          10: '2013-05-14',
          11: '2013-05-15',
          12: '2013-05-16',
          13: '2013-05-17',
          14: '2013-05-18',
          15: '2013-05-19',
          16: '2013-05-20',
          17: '2013-05-21',
          18: '2013-05-22',
          19: '2013-05-23',
          20: '2013-05-24',
          21: '2013-05-25',
          22: '2013-05-26',
          23: '2013-05-27',
          24: '2013-05-28',
          25: '2013-05-29'},
         'A': {0: 25.03,
          1: 52.98,
          2: 39.93,
          3: 47.31,
          4: 19.61,
          5: 39.51,
          6: 21.22,
          7: 19.04,
          8: 18.53,
          9: 11.9,
          10: 47.64,
          11: 27.47,
          12: 11.2,
          13: 25.1,
          14: 19.89,
          15: 38.91,
          16: 42.11,
          17: 7.845,
          18: 35.45,
          19: 29.43,
          20: 33.51,
          21: 13.58,
          22: 37.38,
          23: 37.09,
          24: 21.7,
          25: 11.85},
         'B': {0: 88.51,
          1: 67.99,
          2: 75.19,
          3: 86.99,
          4: 87.94,
          5: 83.1,
          6: 62.16,
          7: 58.79,
          8: 75.27,
          9: 75.43,
          10: 64.76,
          11: 91.65,
          12: 59.83,
          13: 67.47,
          14: 64.7,
          15: 76.68,
          16: 94.36,
          17: 73.67,
          18: 76.67,
          19: 79.05,
          20: 78.53,
          21: 59.26,
          22: 68.59,
          23: 67.79,
          24: 70.54,
          25: 60.0},
         'weekday': {0: 'Saturday',
          1: 'Sunday',
          2: 'Monday',
          3: 'Tuesday',
          4: 'Wednesday',
          5: 'Thursday',
          6: 'Friday',
          7: 'Saturday',
          8: 'Sunday',
          9: 'Monday',
          10: 'Tuesday',
          11: 'Wednesday',
          12: 'Thursday',
          13: 'Friday',
          14: 'Saturday',
          15: 'Sunday',
          16: 'Monday',
          17: 'Tuesday',
          18: 'Wednesday',
          19: 'Thursday',
          20: 'Friday',
          21: 'Saturday',
          22: 'Sunday',
          23: 'Monday',
          24: 'Tuesday',
          25: 'Wednesday'}}

df = pd.DataFrame(data)
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

Output (shortened):

                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.4328     17.355     -0.083      0.935       -37.252    34.386
B              0.4034      0.233      1.729      0.097        -0.078     0.885
==============================================================================

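Now I would like to add weekday as an explanatory factor variable. I was hoping it would be as easy as changing the data type of the column in the dataframe, but unfortunately that does not seem to work, even though the x = sm.add_constant(independent) part accepts the column.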

import pandas as pd
import statsmodels.api as sm

df = pd.read_clipboard(sep=r'\s+')
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)

independent = df[['B', 'weekday']]
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

When execution reaches the model = sm.OLS(df['A'], x).fit() part, a ValueError is raised:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any other suggestions?

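You can use a pandas categorical to create the dummy variables yourself or, more simply, use the formula interface, where patsy converts all non-numeric columns into dummy variables (or another factor encoding).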

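Using the formula interface in this case (sm.OLS.from_formula is the same as the lower-case ols in statsmodels.formula.api) gives the result below. Patsy sorts the levels of the categorical variable alphabetically. 'Friday' is missing from the list of variables because it has been selected as the reference category.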

>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     1.105
Date:                Thu, 03 May 2018   Prob (F-statistic):              0.401
Time:                        15:26:02   Log-Likelihood:                -97.898
No. Observations:                  26   AIC:                             211.8
Df Residuals:                      18   BIC:                             221.9
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -1.4717     19.343     -0.076      0.940     -42.110      39.167
weekday[T.Monday]        2.5837      9.857      0.262      0.796     -18.124      23.291
weekday[T.Saturday]     -6.5889      9.599     -0.686      0.501     -26.755      13.577
weekday[T.Sunday]        9.2287      9.616      0.960      0.350     -10.975      29.432
weekday[T.Thursday]     -1.7610     10.321     -0.171      0.866     -23.445      19.923
weekday[T.Tuesday]       2.6507      9.664      0.274      0.787     -17.652      22.953
weekday[T.Wednesday]    -6.9320      9.911     -0.699      0.493     -27.754      13.890
B                        0.4047      0.258      1.566      0.135      -0.138       0.948
==============================================================================
Omnibus:                        1.039   Durbin-Watson:                   2.313
Prob(Omnibus):                  0.595   Jarque-Bera (JB):                0.532
Skew:                          -0.350   Prob(JB):                        0.766
Kurtosis:                       3.007   Cond. No.                         638.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
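A small sketch of the same fit through the lower-case ols in statsmodels.formula.api, again on only the first 14 rows of the question's data, so the numbers will differ from the summary above. It also recovers the reference category by checking which weekday did not receive a coefficient.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First 14 rows of the question's data (two full weeks), to keep this short.
days = ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
df = pd.DataFrame({
    'A': [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22,
          19.04, 18.53, 11.90, 47.64, 27.47, 11.20, 25.10],
    'B': [88.51, 67.99, 75.19, 86.99, 87.94, 83.10, 62.16,
          58.79, 75.27, 75.43, 64.76, 91.65, 59.83, 67.47],
    'weekday': days * 2,
})

# The string column is dummy-coded by the formula machinery;
# an intercept is added automatically.
res = smf.ols('A ~ B + weekday', data=df).fit()

# Weekdays that got a coefficient are named like 'weekday[T.Monday]'.
prefix = 'weekday[T.'
with_coef = {name[len(prefix):-1] for name in res.params.index
             if name.startswith(prefix)}

# The level without a coefficient is the reference category.
reference = set(df['weekday']) - with_coef
print(reference)
```

The other weekday coefficients are then read as differences relative to that reference day, holding B fixed.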

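For the available categorical coding options, see the patsy documentation: http://patsy.readthedocs.io/en/latest/categorical-coding.html

For example, the reference level can be specified explicitly in the formula: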

"A ~ B + C(weekday, Treatment('Sunday'))"

http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
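To illustrate, here is a minimal sketch that fits the model with 'Sunday' as the reference level and compares it with the default coding. It again uses only the first 14 rows of the question's data, and the variable names res_default and res_sunday are chosen here for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# First 14 rows of the question's data (two full weeks), to keep this short.
days = ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
df = pd.DataFrame({
    'A': [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22,
          19.04, 18.53, 11.90, 47.64, 27.47, 11.20, 25.10],
    'B': [88.51, 67.99, 75.19, 86.99, 87.94, 83.10, 62.16,
          58.79, 75.27, 75.43, 64.76, 91.65, 59.83, 67.47],
    'weekday': days * 2,
})

# Default treatment coding: the first level in sorted order is the reference.
res_default = smf.ols('A ~ B + weekday', data=df).fit()

# Same model, but with 'Sunday' chosen explicitly as the reference level.
res_sunday = smf.ols("A ~ B + C(weekday, Treatment('Sunday'))", data=df).fit()

names = res_sunday.params.index.tolist()
print(names)

# Changing the reference level only reparameterizes the model:
# the coefficients change, the fitted values do not.
same_fit = np.allclose(res_default.fittedvalues, res_sunday.fittedvalues)
print(same_fit)
```

With this coding there is no coefficient for Sunday, and 'Friday' now appears among the estimated levels.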