Pandas OLS 随机效应模型 - 奇怪的预测值
Pandas OLS Random Effects model - weird prediction values
我正在尝试使用 Pandas 创建 RE 模型。我以前只有 Stata 经验。
我的目标是通过对一组假人的温度数据进行回归来生成预测温度。第一组假人代表一年中的月份,而第二组假人代表车站。
我怀疑我做错了什么 id = 1 和 month = 1 的预测值为零。在 Stata 中执行相同的 RE 模型时,我收到的预测值为 -23.02825。现在,Pandas 和 Stata 之间的 RE 模型当然可能存在一些我不知道的差异,但我怀疑我在某种程度上指定了错误的代码。有人对此有见解吗?
我不是 Whosebug 上的经验丰富的发帖人,所以如果我需要添加任何其他信息,请告诉我!
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from linearmodels.panel import RandomEffects
data = pd.read_stata("/Users/user/Desktop/dataset2.dta")
print(data.head())
id value dates month antal year iid
0 1 0 2000-01-01 1.0 62.0 2000-01-01 1
1 1 21 2000-01-02 1.0 62.0 2000-01-01 1
2 1 32 2000-01-03 1.0 62.0 2000-01-01 1
3 1 31 2000-01-04 1.0 62.0 2000-01-01 1
4 1 11 2000-01-05 1.0 62.0 2000-01-01 1
... .. ... ... ... ... ... ...
174903 55 -56 2010-12-27 12.0 31.0 2010-01-01 49
174904 55 -76 2010-12-28 12.0 31.0 2010-01-01 49
174905 55 -93 2010-12-29 12.0 31.0 2010-01-01 49
174906 55 -90 2010-12-30 12.0 31.0 2010-01-01 49
174907 55 10 2010-12-31 12.0 31.0 2010-01-01 49
[174908 rows x 7 columns]
data = data.astype({'iid': 'int16'})
print(data)
data = data.astype({'id': 'int16'})
data = data.astype({'antal': 'int8'})
data = data.astype({'month': 'int8'})
data.dtypes
id int16
value int32
dates datetime64[ns]
month int8
antal int8
year datetime64[ns]
iid int16
dtype: object
d_iid = pd.get_dummies(data.iid, prefix='id', dtype='int8').iloc[:,1:]
data = pd.concat([data, d_iid], axis=1)
d_m = pd.get_dummies(data.month, prefix='m').iloc[:,1:]
data = pd.concat([data, d_m], axis=1)
data = data.set_index(["iid", "dates"])
print(data)
id value month antal year id_2 id_3 id_4 id_5 \
iid dates
1 2000-01-01 1 0 1 62 2000-01-01 0 0 0 0
2000-01-02 1 21 1 62 2000-01-01 0 0 0 0
2000-01-03 1 32 1 62 2000-01-01 0 0 0 0
2000-01-04 1 31 1 62 2000-01-01 0 0 0 0
2000-01-05 1 11 1 62 2000-01-01 0 0 0 0
... .. ... ... ... ... ... ... ... ...
49 2010-12-27 55 -56 12 31 2010-01-01 0 0 0 0
2010-12-28 55 -76 12 31 2010-01-01 0 0 0 0
2010-12-29 55 -93 12 31 2010-01-01 0 0 0 0
2010-12-30 55 -90 12 31 2010-01-01 0 0 0 0
2010-12-31 55 10 12 31 2010-01-01 0 0 0 0
id_6 ... m_3 m_4 m_5 m_6 m_7 m_8 m_9 m_10 m_11 m_12
iid dates ...
1 2000-01-01 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-02 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-03 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-04 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-05 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
49 2010-12-27 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-28 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-29 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-30 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-31 0 ... 0 0 0 0 0 0 0 0 0 1
[174908 rows x 62 columns]
var=data.loc[:, 'id_2':'m_12']
mod = RandomEffects(data.value, var)
re_res = mod.fit()
print(re_res)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: value R-squared: 0.9129
Estimator: RandomEffects R-squared (Between): 0.9993
No. Observations: 174908 R-squared (Within): 0.7416
Date: Fri, Oct 09 2020 R-squared (Overall): 0.9129
Time: 18:15:09 Log-likelihood -8.937e+05
Cov. Estimator: Unadjusted
F-statistic: 3.216e+04
Entities: 47 P-value 0.0000
Avg Obs: 3721.4 Distribution: F(57,174851)
Min Obs: 1096.0
Max Obs: 4018.0 F-statistic (robust): 3.216e+04
P-value 0.0000
Time periods: 4018 Distribution: F(57,174851)
Avg Obs: 43.531
Min Obs: 41.000
Max Obs: 46.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
id_2 -32.954 0.7098 -46.425 0.0000 -34.345 -31.563
id_3 -71.447 0.9984 -71.562 0.0000 -73.404 -69.491
id_4 -20.982 0.6981 -30.056 0.0000 -22.350 -19.614
id_5 -20.982 0.6981 -30.056 0.0000 -22.350 -19.614
id_6 -17.287 0.7680 -22.511 0.0000 -18.792 -15.782
id_7 -26.455 0.6981 -37.897 0.0000 -27.824 -25.087
id_8 -51.832 0.6981 -74.248 0.0000 -53.200 -50.464
id_9 -51.832 0.6981 -74.248 0.0000 -53.200 -50.464
id_10 -11.651 0.6981 -16.690 0.0000 -13.019 -10.283
id_11 8.3476 0.7003 11.921 0.0000 6.9751 9.7201
id_12 20.876 0.6982 29.902 0.0000 19.508 22.245
id_13 12.719 0.6982 18.218 0.0000 11.351 14.088
id_14 10.063 0.6981 14.415 0.0000 8.6947 11.431
id_15 -137.28 0.6981 -196.66 0.0000 -138.65 -135.92
id_16 23.770 0.6981 34.050 0.0000 22.402 25.138
id_17 38.704 0.6981 55.443 0.0000 37.336 40.073
id_18 48.107 0.9304 51.706 0.0000 46.283 49.930
id_19 112.55 0.6991 161.00 0.0000 111.18 113.92
id_20 111.20 0.6981 159.30 0.0000 109.84 112.57
id_21 107.49 0.6981 153.98 0.0000 106.12 108.86
id_22 112.01 0.9761 114.75 0.0000 110.10 113.92
id_23 22.266 0.6981 31.895 0.0000 20.898 23.634
id_24 -26.090 0.6981 -37.373 0.0000 -27.458 -24.722
id_25 -52.292 0.6982 -74.900 0.0000 -53.660 -50.924
id_26 -86.101 0.6981 -123.34 0.0000 -87.469 -84.733
id_27 73.427 1.2464 58.911 0.0000 70.984 75.869
id_28 30.520 0.6981 43.719 0.0000 29.152 31.888
id_29 50.253 0.6981 71.986 0.0000 48.884 51.621
id_30 50.380 0.6981 72.167 0.0000 49.011 51.748
id_31 69.253 0.6981 99.204 0.0000 67.885 70.622
id_32 36.982 0.6981 52.976 0.0000 35.614 38.350
id_33 68.688 0.6981 98.394 0.0000 67.320 70.056
id_34 5.7040 0.6981 8.1709 0.0000 4.3358 7.0723
id_35 7.8369 0.6981 11.226 0.0000 6.4687 9.2052
id_36 7.8414 0.6981 11.233 0.0000 6.4732 9.2097
id_37 6.1834 0.6981 8.8576 0.0000 4.8151 7.5516
id_38 19.922 0.6981 28.538 0.0000 18.554 21.290
id_39 9.7777 0.6986 13.995 0.0000 8.4084 11.147
id_41 7.0468 0.6981 10.094 0.0000 5.6785 8.4150
id_42 -14.053 0.6981 -20.131 0.0000 -15.421 -12.685
id_43 11.974 0.6981 17.152 0.0000 10.605 13.342
id_44 10.657 0.9111 11.697 0.0000 8.8716 12.443
id_45 23.552 0.6981 33.737 0.0000 22.183 24.920
id_46 9.1107 0.6981 13.051 0.0000 7.7425 10.479
id_48 6.9631 0.6981 9.9745 0.0000 5.5949 8.3314
id_49 4.3713 0.6981 6.2618 0.0000 3.0030 5.7395
m_2 6.2046 0.4627 13.411 0.0000 5.2978 7.1115
m_3 36.329 0.4509 80.562 0.0000 35.445 37.212
m_4 84.415 0.4549 185.57 0.0000 83.524 85.307
m_5 129.82 0.4509 287.93 0.0000 128.94 130.71
m_6 166.63 0.4546 366.52 0.0000 165.74 167.52
m_7 188.15 0.4509 417.30 0.0000 187.26 189.03
m_8 181.04 0.4509 401.55 0.0000 180.16 181.93
m_9 138.76 0.4543 305.42 0.0000 137.87 139.65
m_10 94.589 0.4488 210.78 0.0000 93.709 95.468
m_11 47.314 0.4514 104.81 0.0000 46.429 48.199
m_12 5.9907 0.4483 13.363 0.0000 5.1121 6.8694
==============================================================================
re_res.predict()
fitted_values
iid dates
1 2000-01-01 0.000000
2000-01-02 0.000000
2000-01-03 0.000000
2000-01-04 0.000000
2000-01-05 0.000000
... ... ...
49 2010-12-27 10.362003
2010-12-28 10.362003
2010-12-29 10.362003
2010-12-30 10.362003
2010-12-31 10.362003
174908 rows × 1 columns
所以,我发现 Python 没有添加截距 (Pandas/Statsmodel OLS predicting future values)。
因此,解决方案:
data['intercept'] = 1
var=data.loc[:,'intercept':'m_12']
我正在尝试使用 Pandas 创建 RE 模型。我以前只有 Stata 经验。
我的目标是通过对一组假人的温度数据进行回归来生成预测温度。第一组假人代表一年中的月份,而第二组假人代表车站。
我怀疑我做错了什么 id = 1 和 month = 1 的预测值为零。在 Stata 中执行相同的 RE 模型时,我收到的预测值为 -23.02825。现在,Pandas 和 Stata 之间的 RE 模型当然可能存在一些我不知道的差异,但我怀疑我在某种程度上指定了错误的代码。有人对此有见解吗?
我不是 Whosebug 上的经验丰富的发帖人,所以如果我需要添加任何其他信息,请告诉我!
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from linearmodels.panel import RandomEffects
data = pd.read_stata("/Users/user/Desktop/dataset2.dta")
print(data.head())
id value dates month antal year iid
0 1 0 2000-01-01 1.0 62.0 2000-01-01 1
1 1 21 2000-01-02 1.0 62.0 2000-01-01 1
2 1 32 2000-01-03 1.0 62.0 2000-01-01 1
3 1 31 2000-01-04 1.0 62.0 2000-01-01 1
4 1 11 2000-01-05 1.0 62.0 2000-01-01 1
... .. ... ... ... ... ... ...
174903 55 -56 2010-12-27 12.0 31.0 2010-01-01 49
174904 55 -76 2010-12-28 12.0 31.0 2010-01-01 49
174905 55 -93 2010-12-29 12.0 31.0 2010-01-01 49
174906 55 -90 2010-12-30 12.0 31.0 2010-01-01 49
174907 55 10 2010-12-31 12.0 31.0 2010-01-01 49
[174908 rows x 7 columns]
data = data.astype({'iid': 'int16'})
print(data)
data = data.astype({'id': 'int16'})
data = data.astype({'antal': 'int8'})
data = data.astype({'month': 'int8'})
data.dtypes
id int16
value int32
dates datetime64[ns]
month int8
antal int8
year datetime64[ns]
iid int16
dtype: object
d_iid = pd.get_dummies(data.iid, prefix='id', dtype='int8').iloc[:,1:]
data = pd.concat([data, d_iid], axis=1)
d_m = pd.get_dummies(data.month, prefix='m').iloc[:,1:]
data = pd.concat([data, d_m], axis=1)
data = data.set_index(["iid", "dates"])
print(data)
id value month antal year id_2 id_3 id_4 id_5 \
iid dates
1 2000-01-01 1 0 1 62 2000-01-01 0 0 0 0
2000-01-02 1 21 1 62 2000-01-01 0 0 0 0
2000-01-03 1 32 1 62 2000-01-01 0 0 0 0
2000-01-04 1 31 1 62 2000-01-01 0 0 0 0
2000-01-05 1 11 1 62 2000-01-01 0 0 0 0
... .. ... ... ... ... ... ... ... ...
49 2010-12-27 55 -56 12 31 2010-01-01 0 0 0 0
2010-12-28 55 -76 12 31 2010-01-01 0 0 0 0
2010-12-29 55 -93 12 31 2010-01-01 0 0 0 0
2010-12-30 55 -90 12 31 2010-01-01 0 0 0 0
2010-12-31 55 10 12 31 2010-01-01 0 0 0 0
id_6 ... m_3 m_4 m_5 m_6 m_7 m_8 m_9 m_10 m_11 m_12
iid dates ...
1 2000-01-01 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-02 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-03 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-04 0 ... 0 0 0 0 0 0 0 0 0 0
2000-01-05 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
49 2010-12-27 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-28 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-29 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-30 0 ... 0 0 0 0 0 0 0 0 0 1
2010-12-31 0 ... 0 0 0 0 0 0 0 0 0 1
[174908 rows x 62 columns]
var=data.loc[:, 'id_2':'m_12']
mod = RandomEffects(data.value, var)
re_res = mod.fit()
print(re_res)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: value R-squared: 0.9129
Estimator: RandomEffects R-squared (Between): 0.9993
No. Observations: 174908 R-squared (Within): 0.7416
Date: Fri, Oct 09 2020 R-squared (Overall): 0.9129
Time: 18:15:09 Log-likelihood -8.937e+05
Cov. Estimator: Unadjusted
F-statistic: 3.216e+04
Entities: 47 P-value 0.0000
Avg Obs: 3721.4 Distribution: F(57,174851)
Min Obs: 1096.0
Max Obs: 4018.0 F-statistic (robust): 3.216e+04
P-value 0.0000
Time periods: 4018 Distribution: F(57,174851)
Avg Obs: 43.531
Min Obs: 41.000
Max Obs: 46.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
id_2 -32.954 0.7098 -46.425 0.0000 -34.345 -31.563
id_3 -71.447 0.9984 -71.562 0.0000 -73.404 -69.491
id_4 -20.982 0.6981 -30.056 0.0000 -22.350 -19.614
id_5 -20.982 0.6981 -30.056 0.0000 -22.350 -19.614
id_6 -17.287 0.7680 -22.511 0.0000 -18.792 -15.782
id_7 -26.455 0.6981 -37.897 0.0000 -27.824 -25.087
id_8 -51.832 0.6981 -74.248 0.0000 -53.200 -50.464
id_9 -51.832 0.6981 -74.248 0.0000 -53.200 -50.464
id_10 -11.651 0.6981 -16.690 0.0000 -13.019 -10.283
id_11 8.3476 0.7003 11.921 0.0000 6.9751 9.7201
id_12 20.876 0.6982 29.902 0.0000 19.508 22.245
id_13 12.719 0.6982 18.218 0.0000 11.351 14.088
id_14 10.063 0.6981 14.415 0.0000 8.6947 11.431
id_15 -137.28 0.6981 -196.66 0.0000 -138.65 -135.92
id_16 23.770 0.6981 34.050 0.0000 22.402 25.138
id_17 38.704 0.6981 55.443 0.0000 37.336 40.073
id_18 48.107 0.9304 51.706 0.0000 46.283 49.930
id_19 112.55 0.6991 161.00 0.0000 111.18 113.92
id_20 111.20 0.6981 159.30 0.0000 109.84 112.57
id_21 107.49 0.6981 153.98 0.0000 106.12 108.86
id_22 112.01 0.9761 114.75 0.0000 110.10 113.92
id_23 22.266 0.6981 31.895 0.0000 20.898 23.634
id_24 -26.090 0.6981 -37.373 0.0000 -27.458 -24.722
id_25 -52.292 0.6982 -74.900 0.0000 -53.660 -50.924
id_26 -86.101 0.6981 -123.34 0.0000 -87.469 -84.733
id_27 73.427 1.2464 58.911 0.0000 70.984 75.869
id_28 30.520 0.6981 43.719 0.0000 29.152 31.888
id_29 50.253 0.6981 71.986 0.0000 48.884 51.621
id_30 50.380 0.6981 72.167 0.0000 49.011 51.748
id_31 69.253 0.6981 99.204 0.0000 67.885 70.622
id_32 36.982 0.6981 52.976 0.0000 35.614 38.350
id_33 68.688 0.6981 98.394 0.0000 67.320 70.056
id_34 5.7040 0.6981 8.1709 0.0000 4.3358 7.0723
id_35 7.8369 0.6981 11.226 0.0000 6.4687 9.2052
id_36 7.8414 0.6981 11.233 0.0000 6.4732 9.2097
id_37 6.1834 0.6981 8.8576 0.0000 4.8151 7.5516
id_38 19.922 0.6981 28.538 0.0000 18.554 21.290
id_39 9.7777 0.6986 13.995 0.0000 8.4084 11.147
id_41 7.0468 0.6981 10.094 0.0000 5.6785 8.4150
id_42 -14.053 0.6981 -20.131 0.0000 -15.421 -12.685
id_43 11.974 0.6981 17.152 0.0000 10.605 13.342
id_44 10.657 0.9111 11.697 0.0000 8.8716 12.443
id_45 23.552 0.6981 33.737 0.0000 22.183 24.920
id_46 9.1107 0.6981 13.051 0.0000 7.7425 10.479
id_48 6.9631 0.6981 9.9745 0.0000 5.5949 8.3314
id_49 4.3713 0.6981 6.2618 0.0000 3.0030 5.7395
m_2 6.2046 0.4627 13.411 0.0000 5.2978 7.1115
m_3 36.329 0.4509 80.562 0.0000 35.445 37.212
m_4 84.415 0.4549 185.57 0.0000 83.524 85.307
m_5 129.82 0.4509 287.93 0.0000 128.94 130.71
m_6 166.63 0.4546 366.52 0.0000 165.74 167.52
m_7 188.15 0.4509 417.30 0.0000 187.26 189.03
m_8 181.04 0.4509 401.55 0.0000 180.16 181.93
m_9 138.76 0.4543 305.42 0.0000 137.87 139.65
m_10 94.589 0.4488 210.78 0.0000 93.709 95.468
m_11 47.314 0.4514 104.81 0.0000 46.429 48.199
m_12 5.9907 0.4483 13.363 0.0000 5.1121 6.8694
==============================================================================
re_res.predict()
fitted_values
iid dates
1 2000-01-01 0.000000
2000-01-02 0.000000
2000-01-03 0.000000
2000-01-04 0.000000
2000-01-05 0.000000
... ... ...
49 2010-12-27 10.362003
2010-12-28 10.362003
2010-12-29 10.362003
2010-12-30 10.362003
2010-12-31 10.362003
174908 rows × 1 columns
所以,我发现 Python 没有添加截距 (Pandas/Statsmodel OLS predicting future values)。
因此,解决方案:
data['intercept'] = 1
var=data.loc[:,'intercept':'m_12']