lm: 1 coefficient not defined because of singularities, but no major collinearity issues

I have seen similar posts saying that the message Coefficients: (1 not defined because of singularities) appears because the predictors used in the lm() call are almost perfectly correlated.

In my case, however, no pair of predictors is anywhere near perfectly correlated, yet one coefficient (X_wthn_outcome) in the lm() output still comes back as NA.

What is wrong with the coefficient that returns NA?

For reproducibility, the exact data and code are provided below.

library(dplyr)

set.seed(132)
(data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2)))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e

dat <- data %>% 
  group_by(study) %>% 
  mutate(X_btw_study = mean(X), X_wthn_study = X-X_btw_study) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_btw_outcome = mean(X), X_wthn_outcome = X-X_btw_outcome) %>% ungroup()
  
round(cor(select(dat,-study,-outcome,-X,-yi)),3)

#               X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study          1.000        0.000         0.141           0.00
#X_wthn_study         0.000        1.000         0.698           0.71
#X_btw_outcome        0.141        0.698         1.000           0.00
#X_wthn_outcome       0.000        0.710         0.000           1.00

summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
   + X_wthn_outcome, data = dat))

#Coefficients: (1 not defined because of singularities)
#               Estimate Std. Error t value Pr(>|t|)    
#X_btw_study    0.524093   0.069610   7.529 5.15e-14 ***
#X_btw_outcome  0.014557   0.013694   1.063    0.288    
#X_wthn_study   0.589517   0.009649  61.096  < 2e-16 ***
#X_wthn_outcome       NA         NA      NA       NA  ## What's wrong with this variable

You have constructed a problem in which a three-way combination of X_btw_study, X_btw_outcome and X_wthn_study perfectly predicts X_wthn_outcome:

lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
#------------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
    data = dat)

Coefficients:
  (Intercept)    X_btw_study  X_btw_outcome   X_wthn_study  
    1.165e-17      1.000e+00     -1.000e+00      1.000e+00  
#--------------
summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )

#---------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
    data = dat)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.901e-14 -6.000e-17  0.000e+00  5.000e-17  3.195e-13 

Coefficients:
                Estimate Std. Error    t value Pr(>|t|)    
(Intercept)    1.165e-17  3.242e-18  3.594e+00 0.000326 ***
X_btw_study    1.000e+00  3.312e-17  3.020e+16  < 2e-16 ***
X_btw_outcome -1.000e+00  6.515e-18 -1.535e+17  < 2e-16 ***
X_wthn_study   1.000e+00  4.590e-18  2.178e+17  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025e-15 on 99996 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.582e+34 on 3 and 99996 DF,  p-value: < 2.2e-16

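Your adjusted R^2 is 1 with three predictors, so this is multicollinearity, just not pairwise collinearity. Both mutate() steps split the same X into a mean and a deviation, so X_btw_study + X_wthn_study = X_btw_outcome + X_wthn_outcome, which is exactly the 1, -1, 1 coefficient pattern in the fit. A pairwise correlation matrix cannot reveal a dependency that involves three variables at once. (R caught your trick and will not let you get away with this dplyr game of "hidden dependencies".) I think the dependency would be more apparent if you built the variables in a sequence of separate steps rather than in one pipe chain.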
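Here is a rough sketch of what building the variables in separate steps might look like. Two things in it are my substitutions, not your code: base R ave() stands in for the group_by()/mutate() chain, and the grid is much smaller (20 studies, 5 outcomes) so it runs instantly. The names d, dep_ok, M, design_rank and fit are just illustrative.

```r
# Smaller version of the design in the question (assumed sizes, not the original 1e3 x 50)
set.seed(132)
d <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
d$X  <- rnorm(nrow(d))
d$yi <- 0.8 + 0.6 * d$X + rnorm(nrow(d), 0, 2)

# Step 1: study-level mean of X and the deviation from it
d$X_btw_study  <- ave(d$X, d$study)
d$X_wthn_study <- d$X - d$X_btw_study

# Step 2: mean of X within each study-by-outcome cell and the deviation from it
d$X_btw_outcome  <- ave(d$X, d$study, d$outcome)
d$X_wthn_outcome <- d$X - d$X_btw_outcome

# Both steps decompose the same X, so one column is a linear
# combination of the other three
dep_ok <- isTRUE(all.equal(
  d$X_wthn_outcome,
  d$X_btw_study + d$X_wthn_study - d$X_btw_outcome
))

# The rank of the predictor matrix shows the problem even though
# no single pairwise correlation does
M <- as.matrix(d[, c("X_btw_study", "X_btw_outcome",
                     "X_wthn_study", "X_wthn_outcome")])
design_rank <- qr(M)$rank

# alias() reports which linear dependency caused lm() to drop a term
fit <- lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study +
            X_wthn_outcome, data = d)
alias(fit)$Complete
```

Written this way, the two subtractions from the same X sit right next to each other, which makes the redundancy easier to spot. For a fitted model, comparing qr(M)$rank with ncol(M) or calling alias() is a quicker check for this kind of dependency than reading a correlation matrix.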