How to estimate an lm dummy-variable regression while avoiding multicollinearity?
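I am having trouble running a regression with lm on dummy variables. I want to find out whether the seasonal effects (seasonality) change over time. To do that, I set up the following regression: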
AT.trendinseason.lm <- lm(DTR.detrended~0+dum.jan+dum.feb+dum.mar+dum.apr+dum.may+dum.jun+dum.jul+dum.aug+dum.sep+dum.oct+dum.nov+dum.dec+dum.jan*t+dum.feb*t+dum.mar*t+dum.apr*t+dum.may*t+dum.jun*t+dum.jul*t+dum.aug*t+dum.sep*t+dum.oct*t+dum.nov*t+dum.dec*t)
The output I get is as follows:
summary(AT.trendinseason.lm)
Call:
lm(formula = DTR.detrended ~ 0 + dum.jan + dum.feb + dum.mar +
dum.apr + dum.may + dum.jun + dum.jul + dum.aug + dum.sep +
dum.oct + dum.nov + dum.dec + dum.jan * t + dum.feb * t +
dum.mar * t + dum.apr * t + dum.may * t + dum.jun * t + dum.jul *
t + dum.aug * t + dum.sep * t + dum.oct * t + dum.nov * t +
dum.dec * t)
Residuals:
Min 1Q Median 3Q Max
-9.4047 -2.2737 -0.3229 2.0987 18.9906
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
dum.jan -2.495e+00 1.121e-01 -22.262 < 2e-16 ***
dum.feb -1.527e+00 1.176e-01 -12.983 < 2e-16 ***
dum.mar 2.493e-01 1.124e-01 2.218 0.026552 *
dum.apr 1.266e+00 1.144e-01 11.073 < 2e-16 ***
dum.may 1.785e+00 1.127e-01 15.844 < 2e-16 ***
dum.jun 1.597e+00 1.147e-01 13.926 < 2e-16 ***
dum.jul 1.882e+00 1.131e-01 16.640 < 2e-16 ***
dum.aug 1.544e+00 1.126e-01 13.721 < 2e-16 ***
dum.sep 1.335e+00 1.134e-01 11.780 < 2e-16 ***
dum.oct 8.306e-02 1.117e-01 0.744 0.456961
dum.nov -2.545e+00 1.137e-01 -22.390 < 2e-16 ***
dum.dec -3.101e+00 1.119e-01 -27.703 < 2e-16 ***
t -1.343e-05 5.431e-06 -2.473 0.013389 *
dum.jan:t -8.571e-06 7.681e-06 -1.116 0.264444
dum.feb:t -3.094e-06 7.866e-06 -0.393 0.694090
dum.mar:t 5.346e-06 7.681e-06 0.696 0.486406
dum.apr:t 3.850e-05 7.744e-06 4.971 6.69e-07 ***
dum.may:t 2.748e-05 7.681e-06 3.578 0.000346 ***
dum.jun:t 2.959e-05 7.744e-06 3.821 0.000133 ***
dum.jul:t 3.384e-05 7.698e-06 4.396 1.10e-05 ***
dum.aug:t 4.494e-05 7.711e-06 5.828 5.67e-09 ***
dum.sep:t -1.921e-06 7.744e-06 -0.248 0.804105
dum.oct:t -1.526e-05 7.681e-06 -1.987 0.046943 *
dum.nov:t 8.864e-07 7.744e-06 0.114 0.908876
dum.dec:t NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.093 on 35745 degrees of freedom
Multiple R-squared: 0.3145, Adjusted R-squared: 0.314
F-statistic: 683.2 on 24 and 35745 DF, p-value: < 2.2e-16
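However, I know there should be no multicollinearity problem here, yet R still drops one of my variables (dum.dec:t is NA). Is there any way to stop R from doing this?
The model I want to follow comes from a paper I read, where it appears to work:
This is the approach I want to take, but it doesn't seem to work.
Please help.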
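One way to check whether the design really is free of collinearity is to build the columns by hand and compare the number of columns with the rank of the matrix. The sketch below uses simulated data, not the poster's series: `month`, `D`, `X` and `y` are hypothetical stand-ins, and only `t` follows the naming in the question.

```r
# Simulated stand-in for the poster's data (hypothetical): 10 years of monthly observations.
set.seed(1)
n     <- 120
t     <- seq_len(n)
month <- factor(month.abb[(t - 1) %% 12 + 1], levels = month.abb)
y     <- rnorm(n)

# One 0/1 column per month, playing the role of dum.jan ... dum.dec.
D <- model.matrix(~ 0 + month)

# The twelve dummies sum to 1 in every row, so the twelve
# dummy-by-t products sum to t itself.
stopifnot(all(rowSums(D) == 1))
stopifnot(isTRUE(all.equal(unname(rowSums(D * t)), as.numeric(t))))

# Design with 12 dummies, t on its own, and 12 dummy-by-t columns.
X <- cbind(D, t = t, D * t)
ncol(X)      # number of columns requested
qr(X)$rank   # number of linearly independent columns
```

If the rank is smaller than the number of columns, at least one column is an exact linear combination of the others, and lm reports the corresponding coefficient as NA rather than failing.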
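I solved it: the problem was entirely in how I wrote the interaction terms. I replaced * with : and it worked. At first I thought R was having trouble with the * symbol, but * is not broken: in an R formula, dum.jan*t means dum.jan + t + dum.jan:t, so my first model also contained t on its own (see the t row in the first output), and t is exactly the sum of the twelve dum.X:t terms. Thankfully I found a solution. The new code is: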
AT.trendinseason.lm <- lm(DTR.detrended~0+dum.jan+dum.feb+dum.mar+dum.apr+dum.may+dum.jun+dum.jul+dum.aug+dum.sep+dum.oct+dum.nov+dum.dec+dum.jan:t+dum.feb:t+dum.mar:t+dum.apr:t+dum.may:t+dum.jun:t+dum.jul:t+dum.aug:t+dum.sep:t+dum.oct:t+dum.nov:t+dum.dec:t)
which gives me the result I wanted:
Call:
lm(formula = DTR.detrended ~ 0 + dum.jan + dum.feb + dum.mar +
dum.apr + dum.may + dum.jun + dum.jul + dum.aug + dum.sep +
dum.oct + dum.nov + dum.dec + dum.jan:t + dum.feb:t + dum.mar:t +
dum.apr:t + dum.may:t + dum.jun:t + dum.jul:t + dum.aug:t +
dum.sep:t + dum.oct:t + dum.nov:t + dum.dec:t)
Residuals:
Min 1Q Median 3Q Max
-9.4047 -2.2737 -0.3229 2.0987 18.9906
Coefficients:
Estimate Std. Error t value Pr(>|t|)
dum.jan -2.495e+00 1.121e-01 -22.262 < 2e-16 ***
dum.feb -1.527e+00 1.176e-01 -12.983 < 2e-16 ***
dum.mar 2.493e-01 1.124e-01 2.218 0.026552 *
dum.apr 1.266e+00 1.144e-01 11.073 < 2e-16 ***
dum.may 1.785e+00 1.127e-01 15.844 < 2e-16 ***
dum.jun 1.597e+00 1.147e-01 13.926 < 2e-16 ***
dum.jul 1.882e+00 1.131e-01 16.640 < 2e-16 ***
dum.aug 1.544e+00 1.126e-01 13.721 < 2e-16 ***
dum.sep 1.335e+00 1.134e-01 11.780 < 2e-16 ***
dum.oct 8.306e-02 1.117e-01 0.744 0.456961
dum.nov -2.545e+00 1.137e-01 -22.390 < 2e-16 ***
dum.dec -3.101e+00 1.119e-01 -27.703 < 2e-16 ***
dum.jan:t -2.200e-05 5.431e-06 -4.052 5.10e-05 ***
dum.feb:t -1.653e-05 5.691e-06 -2.904 0.003685 **
dum.mar:t -8.087e-06 5.431e-06 -1.489 0.136489
dum.apr:t 2.507e-05 5.521e-06 4.540 5.64e-06 ***
dum.may:t 1.405e-05 5.431e-06 2.587 0.009688 **
dum.jun:t 1.616e-05 5.521e-06 2.927 0.003422 **
dum.jul:t 2.041e-05 5.455e-06 3.741 0.000184 ***
dum.aug:t 3.150e-05 5.474e-06 5.755 8.73e-09 ***
dum.sep:t -1.535e-05 5.521e-06 -2.781 0.005420 **
dum.oct:t -2.869e-05 5.431e-06 -5.283 1.28e-07 ***
dum.nov:t -1.255e-05 5.521e-06 -2.273 0.023056 *
dum.dec:t -1.343e-05 5.431e-06 -2.473 0.013389 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.093 on 35745 degrees of freedom
Multiple R-squared: 0.3145, Adjusted R-squared: 0.314
F-statistic: 683.2 on 24 and 35745 DF, p-value: < 2.2e-16
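The two outputs above describe the same fit (identical residuals, R-squared and F-statistic); only the parameterization differs. In the first output the t coefficient (-1.343e-05) equals dum.dec:t in the second, and adding it to the first dum.jan:t (-8.571e-06) gives the second dum.jan:t (-2.200e-05). The following is a minimal sketch of that relationship on simulated data, using three hypothetical seasons (`d1`, `d2`, `d3`) instead of twelve months to keep it short.

```r
# How R expands the two operators in a formula (d stands in for one dummy).
attr(terms(y ~ 0 + d * t), "term.labels")   # main effects plus interaction
attr(terms(y ~ 0 + d:t), "term.labels")     # interaction only

# Simulated data (hypothetical): three seasons, each with its own level and trend.
set.seed(42)
n  <- 240
t  <- seq_len(n)
m  <- (t - 1) %% 3 + 1
d1 <- as.numeric(m == 1)
d2 <- as.numeric(m == 2)
d3 <- as.numeric(m == 3)
y  <- c(-1, 0, 1)[m] + c(0.01, 0.02, 0.03)[m] * t + rnorm(n, sd = 0.1)

# `*` adds t as a main effect, which duplicates d1:t + d2:t + d3:t.
fit_star  <- lm(y ~ 0 + d1 + d2 + d3 + d1 * t + d2 * t + d3 * t)
# `:` adds only the season-specific slopes.
fit_colon <- lm(y ~ 0 + d1 + d2 + d3 + d1:t + d2:t + d3:t)

coef(fit_star)    # contains a `t` row; the last interaction is NA
coef(fit_colon)   # one slope per season, nothing dropped

# Same fitted values; the slopes differ only by the common `t` term.
all.equal(fitted(fit_star), fitted(fit_colon))
all.equal(coef(fit_colon)[["d1:t"]],
          coef(fit_star)[["t"]] + coef(fit_star)[["d1:t"]])
```

In the `*` version each interaction coefficient is the difference between that season's slope and the slope of the dropped season, so the p-values test something different from those in the `:` version.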
Anyway, now you know one way to solve this problem. I hope it helps someone.
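As a possible alternative to typing out twenty-four terms, the same kind of model can be written with a single factor. This is only a sketch on simulated data: the `month` factor and the simulated `y` are assumptions, since the original data are not shown.

```r
# Simulated stand-in data (hypothetical): 20 years of monthly observations.
set.seed(7)
n     <- 240
t     <- seq_len(n)
month <- factor(month.abb[(t - 1) %% 12 + 1], levels = month.abb)
y     <- rnorm(n)

# `0 + month` gives one intercept per month;
# `month:t` (with no separate t term) gives one slope per month.
fit <- lm(y ~ 0 + month + month:t)

length(coef(fit))   # 24 coefficients: 12 intercepts and 12 slopes
anyNA(coef(fit))    # FALSE: no coefficient dropped for singularity
```

Because t never enters as a main effect, there is no column that duplicates the sum of the month-specific trends.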