R:lm()结果在使用“weights”参数和使用手动重新加权数据时不同
R: lm() result differs when using `weights` argument and when using manually reweighted data
为了纠正误差项的异方差性,我运行在 R 中使用以下加权最小二乘回归:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
其中 "weighting" 是一个变量(变量 q
的函数),用于对观察值进行加权。 q2
就是 q^2
.
现在,为了仔细检查我的结果,我通过创建新的加权变量手动对我的变量进行加权:
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
和运行以下回归,没有权重选项,也没有常量 - 因为常量是加权的,所以原始预测矩阵中 1 的列现在应该等于变量权重:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
如您所见,结果相似但不完全相同。我是在手动对变量加权时做错了什么,还是选项 "weights" 做的不仅仅是简单地将变量乘以加权向量?
只要正确地进行手动加权,就不会出现差异。
所以正确的做法是:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
虽然一般建议直接传入矩阵时使用lm.fit
和lm.wfit
:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
但是在使用lm.fit
和lm.wfit
这些内部子程序时,要求所有输入都是完整的没有NA
的情况,否则底层C程序stats:::C_Cdqrls
将抱怨。
如果您仍然想使用公式界面而不是矩阵,您可以执行以下操作:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
可重现的例子
让我们使用 R 的内置数据集 trees
。使用 head(trees)
检查此数据集。此数据集中没有 NA
。我们的目标是拟合一个模型:
Height ~ Girth + Volume
1 和 2 之间的一些随机权重:
set.seed(0); w <- runif(nrow(trees), 1, 2)
我们通过加权回归来拟合这个模型,方法是将权重传递给 lm
,或者手动转换数据并调用 lm
没有权重:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
确实,我们看到了相同的结果。
或者,我们可以使用 lm.fit
和 lm.wfit
:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
我们可以通过以下方式检查系数:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
结果还是一样。
为了纠正误差项的异方差性,我运行在 R 中使用以下加权最小二乘回归:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
其中 "weighting" 是一个变量(变量 q
的函数),用于对观察值进行加权。 q2
就是 q^2
.
现在,为了仔细检查我的结果,我通过创建新的加权变量手动对我的变量进行加权:
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
和运行以下回归,没有权重选项,也没有常量 - 因为常量是加权的,所以原始预测矩阵中 1 的列现在应该等于变量权重:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
如您所见,结果相似但不完全相同。我是在手动对变量加权时做错了什么,还是选项 "weights" 做的不仅仅是简单地将变量乘以加权向量?
只要正确地进行手动加权,就不会出现差异。
所以正确的做法是:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
虽然一般建议直接传入矩阵时使用lm.fit
和lm.wfit
:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
但是在使用lm.fit
和lm.wfit
这些内部子程序时,要求所有输入都是完整的没有NA
的情况,否则底层C程序stats:::C_Cdqrls
将抱怨。
如果您仍然想使用公式界面而不是矩阵,您可以执行以下操作:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
可重现的例子
让我们使用 R 的内置数据集 trees
。使用 head(trees)
检查此数据集。此数据集中没有 NA
。我们的目标是拟合一个模型:
Height ~ Girth + Volume
1 和 2 之间的一些随机权重:
set.seed(0); w <- runif(nrow(trees), 1, 2)
我们通过加权回归来拟合这个模型,方法是将权重传递给 lm
,或者手动转换数据并调用 lm
没有权重:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
确实,我们看到了相同的结果。
或者,我们可以使用 lm.fit
和 lm.wfit
:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
我们可以通过以下方式检查系数:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
结果还是一样。