lm: 1 coefficient not defined because of singularities, but no major collinearity issues
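I've seen similar posts saying that the message Coefficients: (1 not defined because of singularities) appears because of nearly perfect correlation among the predictors used in the lm() call.
In my case, however, there is no nearly perfect correlation between any pair of predictors, yet one coefficient (X_wthn_outcome) still returns NA in the output of lm().
What is wrong with the coefficient that returns NA?
For reproducibility, the exact data and code are provided below.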
library(dplyr)
set.seed(132)
(data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2)))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e
dat <- data %>%
group_by(study) %>%
mutate(X_btw_study = mean(X), X_wthn_study = X-X_btw_study) %>%
group_by(outcome, .add = TRUE) %>%
mutate(X_btw_outcome = mean(X), X_wthn_outcome = X-X_btw_outcome) %>% ungroup()
round(cor(select(dat,-study,-outcome,-X,-yi)),3)
# X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study 1.000 0.000 0.141 0.00
#X_wthn_study 0.000 1.000 0.698 0.71
#X_btw_outcome 0.141 0.698 1.000 0.00
#X_wthn_outcome 0.000 0.710 0.000 1.00
summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
+ X_wthn_outcome, data = dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#X_btw_study 0.524093 0.069610 7.529 5.15e-14 ***
#X_btw_outcome 0.014557 0.013694 1.063 0.288
#X_wthn_study 0.589517 0.009649 61.096 < 2e-16 ***
#X_wthn_outcome NA NA NA NA ## What's wrong with this variable
You've constructed a problem in which a three-way combination of X_btw_study, X_btw_outcome and X_wthn_study perfectly predicts X_wthn_outcome:
lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
#------------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Coefficients:
(Intercept) X_btw_study X_btw_outcome X_wthn_study
1.165e-17 1.000e+00 -1.000e+00 1.000e+00
#--------------
summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )
#---------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-3.901e-14 -6.000e-17 0.000e+00 5.000e-17 3.195e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.165e-17 3.242e-18 3.594e+00 0.000326 ***
X_btw_study 1.000e+00 3.312e-17 3.020e+16 < 2e-16 ***
X_btw_outcome -1.000e+00 6.515e-18 -1.535e+17 < 2e-16 ***
X_wthn_study 1.000e+00 4.590e-18 2.178e+17 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025e-15 on 99996 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.582e+34 on 3 and 99996 DF, p-value: < 2.2e-16
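Your adjusted R^2 is 1 with three predictors, so this is multicollinearity, just not pairwise (two-way) collinearity, which is why the correlation matrix looks harmless. (R has caught on to your tricks and won't let you get away with this "hidden dependency" dplyr game.) I think the dependency might have been more obvious if you had built the variables in a sequence of separate steps rather than in a piped chain.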
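To make the dependency concrete, here is a minimal sketch that builds the same four variables one step at a time in base R instead of a pipe. The smaller grid, the data frame name d, and the helper names identity_holds and design_rank are assumptions made for illustration and are not part of the original post. By construction, X_wthn_outcome = X - X_btw_outcome and X = X_btw_study + X_wthn_study, so X_wthn_outcome = X_btw_study + X_wthn_study - X_btw_outcome, which matches the coefficients 1, -1, 1 in the regression above.

```r
set.seed(132)

# Smaller grid than in the question, only to keep the sketch fast (assumption)
d <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
d$X <- rnorm(nrow(d))

# Step 1: study-level mean of X and the deviation from it
d$X_btw_study  <- ave(d$X, d$study)
d$X_wthn_study <- d$X - d$X_btw_study

# Step 2: mean of X within each study-by-outcome cell and the deviation from it
d$X_btw_outcome  <- ave(d$X, d$study, d$outcome)
d$X_wthn_outcome <- d$X - d$X_btw_outcome

# The fourth column is an exact linear combination of the other three
identity_holds <- isTRUE(all.equal(
  d$X_wthn_outcome,
  d$X_btw_study + d$X_wthn_study - d$X_btw_outcome
))

# Numerical rank of the four-column design matrix; a rank below 4
# is what makes lm() report one coefficient as NA
M <- as.matrix(d[, c("X_btw_study", "X_btw_outcome",
                     "X_wthn_study", "X_wthn_outcome")])
design_rank <- qr(M)$rank

print(identity_holds)
print(design_rank)
```

Written this way, each deviation is visibly X minus a mean, so the two decompositions of X share one degree of freedom. Dropping any one of the four columns, or fitting X_btw_study, X_btw_outcome and X_wthn_outcome only, should remove the NA. Calling alias() on the fitted lm object is another way to have R report which term is aliased.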