手动计算多元线性回归(OLS)的置信区间
Manually calculating the confidence interval of a multiple linear regression(OLS)
我想了解如何手动计算多元线性回归 (OLS) 的置信区间。我的问题是我不知道如何计算所有单个系数的标准误差。
对于只有一个自变量的回归,我遵循了以下教程:http://stattrek.com/regression/slope-confidence-interval.aspx。本教程提供以下公式:
事实证明,公式有效。但是,我没有完全理解这个公式。例如,为什么 (-2) 位于公式的顶部。为了验证正确性,我编写了以下已经显示标准错误的 r 代码:
x<-1:50
y<-c(x[1:48]+rnorm(48,0,5),rnorm(2,150,5))
QR <- rq(y~x, tau=0.5)
summary(QR, se='boot')
LM<-lm(y~x)
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ ., data = alligator)
newdata = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78)
)
y_predicted = predict(alli.mod1, newdata, interval="predict")[,1]
length_mean = mean(alligator$lnLength)
> summary(alli.mod1)
Call:
lm(formula = lnWeight ~ ., data = alligator)
Residuals:
1 2 3 4
0.08526 -0.02368 0.03316 -0.09474
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5279 5.7561 1.308 0.321
lnLength -0.8421 1.5349 -0.549 0.638
Residual standard error: 0.09462 on 2 degrees of freedom
Multiple R-squared: 0.1308, Adjusted R-squared: -0.3038
F-statistic: 0.301 on 1 and 2 DF, p-value: 0.6383
然后我使用以下r代码手动计算了SE(根据上面的公式):
rss = (alligator$lnWeight[1] - y_predicted[1])^2 +
(alligator$lnWeight[2] - y_predicted[2])^2 +
(alligator$lnWeight[3] - y_predicted[3])^2 +
(alligator$lnWeight[4] - y_predicted[4])^2
a = sqrt(rss/(length(y_predicted)-2))
b = sqrt((alligator$lnLength[1] - length_mean)^2 +
(alligator$lnLength[2] - length_mean)^2 +
(alligator$lnLength[3] - length_mean)^2 +
(alligator$lnLength[4] - length_mean)^2)
a/b
1.534912
这导致了与 summary(alli.mod1) 的 SE 相同的值。所以,我想,当我尝试使用 2 个变量时,它可能会起作用。不幸的是,这导致了错误的答案。如下代码所示:
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ ., data = alligator)
newdata = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22)
)
y_predicted = predict(alli.mod1, newdata, interval="predict")[,1]
length_mean = mean(alligator$lnLength)
color_mean = mean(alligator$lnColor)
rss = (alligator$lnWeight[1] - y_predicted[1])^2 +
(alligator$lnWeight[2] - y_predicted[2])^2 +
(alligator$lnWeight[3] - y_predicted[3])^2 +
(alligator$lnWeight[4] - y_predicted[4])^2
a = sqrt(rss/(length(y_predicted)-2))
b = sqrt((alligator$lnColor[1] - color_mean)^2 +
(alligator$lnColor[2] - color_mean)^2 +
(alligator$lnColor[3] - color_mean)^2 +
(alligator$lnColor[4] - color_mean)^2)
b1 = sqrt((alligator$lnLength[1] - length_mean)^2 +
(alligator$lnLength[2] - length_mean)^2 +
(alligator$lnLength[3] - length_mean)^2 +
(alligator$lnLength[4] - length_mean)^2)
> summary(alli.mod1)
Call:
lm(formula = lnWeight ~ ., data = alligator)
Residuals:
1 2 3 4
0.006725 -0.041534 0.058147 -0.023338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.5746 8.8650 -0.403 0.756
lnLength 1.6569 2.1006 0.789 0.575
lnColor 0.7140 0.4877 1.464 0.381
Residual standard error: 0.07547 on 1 degrees of freedom
Multiple R-squared: 0.7235, Adjusted R-squared: 0.1705
F-statistic: 1.308 on 2 and 1 DF, p-value: 0.5258
> a/b
1
0.2009918
> a/b1
1
0.8657274
有没有我可以遵循的通用方法来计算标准误差?
我建议阅读一些关于 OLS 的一般性阅读材料,包括多元回归。有几个免费的信息来源;一个起点可能是来自麻省理工学院开放课件的 Penn State's STAT 501 website. You can find a derivation of the formula for OLS β standard errors on slides 8 and 9 of these slides。
本质上,标准误差是 β 方差的平方根。正如您在我链接的幻灯片中看到的那样,系数方差-协方差矩阵的公式为 σ2(X'X)-1,其中σ2是误差项的方差。然后每个 βj 的方差是该矩阵的第 j 对角线。由于我们不知道真实的 σ2,我们像上面那样估计它——我们取误差平方和除以 n 的平方根- p,其中 p 是解释变量的数量(包括/+截距)——在简单回归中 p = 2. 然而,在简单回归的情况下,(X'X)-1[=30的对角线=] 可以通过上面公式的分母找到,多元回归中不会出现这种情况;你需要做矩阵代数。幸运的是,这在 R 中非常简单:
# First we make the example data
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
# Then we use lm() for a check on our answers later
alli.mod1 = lm(lnWeight ~ ., data = alligator)
# Find the sum of the squared residuals
rss <- sum(alli.mod1$residuals^2)
# And use that to find the estimate of sigma^2, commonly called S
S <- sqrt(rss / (length(alli.mod1$residuals) - length(alli.mod1$coefficients)))
# Make the X matrix; a column of 1s for the intercept and one for each variable
X <- cbind(rep(1, nrow(alligator)), alligator$lnLength, alligator$lnColor)
# We can multiply matrices using %*%, transpose them with t(),
# and invert them with solve(); so we directly apply the formula above with:
std.errors <- S * sqrt(diag(solve(t(X) %*% X)))
# Now we check our answers:
summary(alli.mod1)$coefficients[ , 2] # the second column is the std. errors
# (Intercept) lnLength lnColor
# 8.8650459 2.1005738 0.4876803
std.errors
# [1] 8.8650459 2.1005738 0.4876803
我想了解如何手动计算多元线性回归 (OLS) 的置信区间。我的问题是我不知道如何计算所有单个系数的标准误差。
对于只有一个自变量的回归,我遵循了以下教程:http://stattrek.com/regression/slope-confidence-interval.aspx。本教程提供以下公式:
事实证明,公式有效。但是,我没有完全理解这个公式。例如,为什么 (-2) 位于公式的顶部。为了验证正确性,我编写了以下已经显示标准错误的 r 代码:
x<-1:50
y<-c(x[1:48]+rnorm(48,0,5),rnorm(2,150,5))
QR <- rq(y~x, tau=0.5)
summary(QR, se='boot')
LM<-lm(y~x)
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ ., data = alligator)
newdata = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78)
)
y_predicted = predict(alli.mod1, newdata, interval="predict")[,1]
length_mean = mean(alligator$lnLength)
> summary(alli.mod1)
Call:
lm(formula = lnWeight ~ ., data = alligator)
Residuals:
1 2 3 4
0.08526 -0.02368 0.03316 -0.09474
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5279 5.7561 1.308 0.321
lnLength -0.8421 1.5349 -0.549 0.638
Residual standard error: 0.09462 on 2 degrees of freedom
Multiple R-squared: 0.1308, Adjusted R-squared: -0.3038
F-statistic: 0.301 on 1 and 2 DF, p-value: 0.6383
然后我使用以下r代码手动计算了SE(根据上面的公式):
rss = (alligator$lnWeight[1] - y_predicted[1])^2 +
(alligator$lnWeight[2] - y_predicted[2])^2 +
(alligator$lnWeight[3] - y_predicted[3])^2 +
(alligator$lnWeight[4] - y_predicted[4])^2
a = sqrt(rss/(length(y_predicted)-2))
b = sqrt((alligator$lnLength[1] - length_mean)^2 +
(alligator$lnLength[2] - length_mean)^2 +
(alligator$lnLength[3] - length_mean)^2 +
(alligator$lnLength[4] - length_mean)^2)
a/b
1.534912
这导致了与 summary(alli.mod1) 的 SE 相同的值。所以,我想,当我尝试使用 2 个变量时,它可能会起作用。不幸的是,这导致了错误的答案。如下代码所示:
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ ., data = alligator)
newdata = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22)
)
y_predicted = predict(alli.mod1, newdata, interval="predict")[,1]
length_mean = mean(alligator$lnLength)
color_mean = mean(alligator$lnColor)
rss = (alligator$lnWeight[1] - y_predicted[1])^2 +
(alligator$lnWeight[2] - y_predicted[2])^2 +
(alligator$lnWeight[3] - y_predicted[3])^2 +
(alligator$lnWeight[4] - y_predicted[4])^2
a = sqrt(rss/(length(y_predicted)-2))
b = sqrt((alligator$lnColor[1] - color_mean)^2 +
(alligator$lnColor[2] - color_mean)^2 +
(alligator$lnColor[3] - color_mean)^2 +
(alligator$lnColor[4] - color_mean)^2)
b1 = sqrt((alligator$lnLength[1] - length_mean)^2 +
(alligator$lnLength[2] - length_mean)^2 +
(alligator$lnLength[3] - length_mean)^2 +
(alligator$lnLength[4] - length_mean)^2)
> summary(alli.mod1)
Call:
lm(formula = lnWeight ~ ., data = alligator)
Residuals:
1 2 3 4
0.006725 -0.041534 0.058147 -0.023338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.5746 8.8650 -0.403 0.756
lnLength 1.6569 2.1006 0.789 0.575
lnColor 0.7140 0.4877 1.464 0.381
Residual standard error: 0.07547 on 1 degrees of freedom
Multiple R-squared: 0.7235, Adjusted R-squared: 0.1705
F-statistic: 1.308 on 2 and 1 DF, p-value: 0.5258
> a/b
1
0.2009918
> a/b1
1
0.8657274
有没有我可以遵循的通用方法来计算标准误差?
我建议阅读一些关于 OLS 的一般性阅读材料,包括多元回归。有几个免费的信息来源;一个起点可能是来自麻省理工学院开放课件的 Penn State's STAT 501 website. You can find a derivation of the formula for OLS β standard errors on slides 8 and 9 of these slides。
本质上,标准误差是 β 方差的平方根。正如您在我链接的幻灯片中看到的那样,系数方差-协方差矩阵的公式为 σ2(X'X)-1,其中σ2是误差项的方差。然后每个 βj 的方差是该矩阵的第 j 对角线。由于我们不知道真实的 σ2,我们像上面那样估计它——我们取误差平方和除以 n 的平方根- p,其中 p 是解释变量的数量(包括/+截距)——在简单回归中 p = 2. 然而,在简单回归的情况下,(X'X)-1[=30的对角线=] 可以通过上面公式的分母找到,多元回归中不会出现这种情况;你需要做矩阵代数。幸运的是,这在 R 中非常简单:
# First we make the example data
alligator = data.frame(
lnLength = c(3.78, 3.71, 3.73, 3.78),
lnColor = c(2.43, 2.59, 2.46, 2.22),
lnWeight = c(4.43, 4.38, 4.42, 4.25)
)
# Then we use lm() for a check on our answers later
alli.mod1 = lm(lnWeight ~ ., data = alligator)
# Find the sum of the squared residuals
rss <- sum(alli.mod1$residuals^2)
# And use that to find the estimate of sigma^2, commonly called S
S <- sqrt(rss / (length(alli.mod1$residuals) - length(alli.mod1$coefficients)))
# Make the X matrix; a column of 1s for the intercept and one for each variable
X <- cbind(rep(1, nrow(alligator)), alligator$lnLength, alligator$lnColor)
# We can multiply matrices using %*%, transpose them with t(),
# and invert them with solve(); so we directly apply the formula above with:
std.errors <- S * sqrt(diag(solve(t(X) %*% X)))
# Now we check our answers:
summary(alli.mod1)$coefficients[ , 2] # the second column is the std. errors
# (Intercept) lnLength lnColor
# 8.8650459 2.1005738 0.4876803
std.errors
# [1] 8.8650459 2.1005738 0.4876803