Weekday as dummy / factor variable in a linear regression model using statsmodels
Question:
How can I add a dummy / factor variable to a model using sm.OLS()?
Details:
Data sample structure:
Date A B weekday
2013-05-04 25.03 88.51 Saturday
2013-05-05 52.98 67.99 Sunday
2013-05-06 39.93 75.19 Monday
2013-05-07 47.31 86.99 Tuesday
2013-05-08 19.61 87.94 Wednesday
2013-05-09 39.51 83.10 Thursday
2013-05-10 21.22 62.16 Friday
2013-05-11 19.04 58.79 Saturday
2013-05-12 18.53 75.27 Sunday
2013-05-13 11.90 75.43 Monday
2013-05-14 47.64 64.76 Tuesday
2013-05-15 27.47 91.65 Wednesday
2013-05-16 11.20 59.83 Thursday
2013-05-17 25.10 67.47 Friday
2013-05-18 19.89 64.70 Saturday
2013-05-19 38.91 76.68 Sunday
2013-05-20 42.11 94.36 Monday
2013-05-21 7.845 73.67 Tuesday
2013-05-22 35.45 76.67 Wednesday
2013-05-23 29.43 79.05 Thursday
2013-05-24 33.51 78.53 Friday
2013-05-25 13.58 59.26 Saturday
2013-05-26 37.38 68.59 Sunday
2013-05-27 37.09 67.79 Monday
2013-05-28 21.70 70.54 Tuesday
2013-05-29 11.85 60.00 Wednesday
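Below is a linear regression of A on B using sm.OLS() (with a constant term added via sm.add_constant()).
Complete code with the data sample for the regression analysis using statsmodels: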
# imports
import pandas as pd
import statsmodels.api as sm
# same data as described above
data = {'Date': {0: '2013-05-04',
1: '2013-05-05',
2: '2013-05-06',
3: '2013-05-07',
4: '2013-05-08',
5: '2013-05-09',
6: '2013-05-10',
7: '2013-05-11',
8: '2013-05-12',
9: '2013-05-13',
10: '2013-05-14',
11: '2013-05-15',
12: '2013-05-16',
13: '2013-05-17',
14: '2013-05-18',
15: '2013-05-19',
16: '2013-05-20',
17: '2013-05-21',
18: '2013-05-22',
19: '2013-05-23',
20: '2013-05-24',
21: '2013-05-25',
22: '2013-05-26',
23: '2013-05-27',
24: '2013-05-28',
25: '2013-05-29'},
'A': {0: 25.03,
1: 52.98,
2: 39.93,
3: 47.31,
4: 19.61,
5: 39.51,
6: 21.22,
7: 19.04,
8: 18.53,
9: 11.9,
10: 47.64,
11: 27.47,
12: 11.2,
13: 25.1,
14: 19.89,
15: 38.91,
16: 42.11,
17: 7.845,
18: 35.45,
19: 29.43,
20: 33.51,
21: 13.58,
22: 37.38,
23: 37.09,
24: 21.7,
25: 11.85},
'B': {0: 88.51,
1: 67.99,
2: 75.19,
3: 86.99,
4: 87.94,
5: 83.1,
6: 62.16,
7: 58.79,
8: 75.27,
9: 75.43,
10: 64.76,
11: 91.65,
12: 59.83,
13: 67.47,
14: 64.7,
15: 76.68,
16: 94.36,
17: 73.67,
18: 76.67,
19: 79.05,
20: 78.53,
21: 59.26,
22: 68.59,
23: 67.79,
24: 70.54,
25: 60.0},
'weekday': {0: 'Saturday',
1: 'Sunday',
2: 'Monday',
3: 'Tuesday',
4: 'Wednesday',
5: 'Thursday',
6: 'Friday',
7: 'Saturday',
8: 'Sunday',
9: 'Monday',
10: 'Tuesday',
11: 'Wednesday',
12: 'Thursday',
13: 'Friday',
14: 'Saturday',
15: 'Sunday',
16: 'Monday',
17: 'Tuesday',
18: 'Wednesday',
19: 'Thursday',
20: 'Friday',
21: 'Saturday',
22: 'Sunday',
23: 'Monday',
24: 'Tuesday',
25: 'Wednesday'}}
df = pd.DataFrame(data)
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df['B'].to_frame()
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
Output (shortened):
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -1.4328 17.355 -0.083 0.935 -37.252 34.386
B 0.4034 0.233 1.729 0.097 -0.078 0.885
==============================================================================
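Now I would like to add weekday as an explanatory factor variable. I hoped it would be as simple as changing the data type in the dataframe, but unfortunately that does not seem to work, even though the column is accepted by the x = sm.add_constant(independent) step.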
import pandas as pd
import statsmodels.api as sm
df = pd.read_clipboard(sep=r'\s+')
df = df.set_index(['Date'])
df['weekday'] = df['weekday'].astype(object)
independent = df[['B', 'weekday']]
x = sm.add_constant(independent)
model = sm.OLS(df['A'], x).fit()
model.summary()
At the line model = sm.OLS(df['A'], x).fit(), a ValueError is raised:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Any other suggestions?
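The hint in the error message can be followed directly. The sketch below uses a three-row stand-in for the question's `independent` frame (the values are copied from the sample; the variable names `arr` and `numeric_only` are only illustrative) and shows that mixing a numeric column with a string column makes the whole array come out with dtype object, which is what the error message complains about.

```python
import numpy as np
import pandas as pd

# Three rows standing in for df[['B', 'weekday']] from the question
independent = pd.DataFrame({
    "B": [88.51, 67.99, 75.19],
    "weekday": ["Saturday", "Sunday", "Monday"],
})

# Per-column dtypes: B is float64, weekday holds strings
print(independent.dtypes)

# A float column plus a string column can only share the generic object dtype
arr = np.asarray(independent)
print(arr.dtype)

# Keeping only the numeric columns gives a plain float array again
numeric_only = independent.select_dtypes(include="number")
print(np.asarray(numeric_only).dtype)
```

So the weekday column has to be encoded as numbers before it reaches sm.OLS().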
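You can use pandas categoricals to create the dummy variables or, more simply, use the formula interface, where patsy converts all non-numeric columns into dummy variables (or another factor encoding).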
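Using the formula interface in this case (the same as the lowercase ols in statsmodels.formula.api) gives the results shown below.
Patsy sorts the levels of a categorical variable alphabetically. 'Friday' is missing from the list of variables because it has been selected as the reference category.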
>>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
>>> print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.301
Model: OLS Adj. R-squared: 0.029
Method: Least Squares F-statistic: 1.105
Date: Thu, 03 May 2018 Prob (F-statistic): 0.401
Time: 15:26:02 Log-Likelihood: -97.898
No. Observations: 26 AIC: 211.8
Df Residuals: 18 BIC: 221.9
Df Model: 7
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept -1.4717 19.343 -0.076 0.940 -42.110 39.167
weekday[T.Monday] 2.5837 9.857 0.262 0.796 -18.124 23.291
weekday[T.Saturday] -6.5889 9.599 -0.686 0.501 -26.755 13.577
weekday[T.Sunday] 9.2287 9.616 0.960 0.350 -10.975 29.432
weekday[T.Thursday] -1.7610 10.321 -0.171 0.866 -23.445 19.923
weekday[T.Tuesday] 2.6507 9.664 0.274 0.787 -17.652 22.953
weekday[T.Wednesday] -6.9320 9.911 -0.699 0.493 -27.754 13.890
B 0.4047 0.258 1.566 0.135 -0.138 0.948
==============================================================================
Omnibus: 1.039 Durbin-Watson: 2.313
Prob(Omnibus): 0.595 Jarque-Bera (JB): 0.532
Skew: -0.350 Prob(JB): 0.766
Kurtosis: 3.007 Cond. No. 638.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
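For the available categorical encoding options, see the patsy documentation: http://patsy.readthedocs.io/en/latest/categorical-coding.html
For example, the reference level can be specified explicitly in the formula: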
"A ~ B + C(weekday, Treatment('Sunday'))"
http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
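As a rough illustration of that formula in use, the sketch below fits it with the lowercase ols from statsmodels.formula.api. It again uses only the first 14 rows of the question's data as a stand-in, so the estimates are not comparable with the summary above; the point is only which level disappears from the parameter list.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First 14 rows of the question's sample (two full weeks)
df = pd.DataFrame({
    "A": [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22,
          19.04, 18.53, 11.90, 47.64, 27.47, 11.20, 25.10],
    "B": [88.51, 67.99, 75.19, 86.99, 87.94, 83.10, 62.16,
          58.79, 75.27, 75.43, 64.76, 91.65, 59.83, 67.47],
    "weekday": ["Saturday", "Sunday", "Monday", "Tuesday",
                "Wednesday", "Thursday", "Friday"] * 2,
})

# Treatment('Sunday') makes Sunday the reference level instead of the
# alphabetically first level
res = smf.ols("A ~ B + C(weekday, Treatment('Sunday'))", data=df).fit()

# Sunday has no coefficient of its own; Friday now has one
print(res.params)
```

Each weekday coefficient is then read as the difference from Sunday rather than from Friday.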