R model.matrix: drop multicollinear variables

Question

Is there a way to force model.matrix.lm to drop multicollinear variables, as lm() does at the estimation stage?

Here is an example:

library(fixest)

N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)

df <- data.frame(y = y, x1 = x1, x2 = x2)

fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
#   lm(formula = y ~ x1 + x2, data = df)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -1.82680 -0.41503  0.05499  0.67185  0.97830 
# 
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.7494     0.2885   2.598   0.0317 *  
#   x1            2.3905     0.3157   7.571 6.48e-05 ***
#   x2                NA         NA      NA       NA    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared:  0.8775,  Adjusted R-squared:  0.8622 
# F-statistic: 57.33 on 1 and 8 DF,  p-value: 6.476e-05

Note that lm() drops the collinear variable x2 from the model (its coefficient is reported as NA). But model.matrix() keeps it:

model.matrix(fit_lm)
#   (Intercept)          x1          x2
#1            1  1.41175158  1.41175158
#2            1  0.06164133  0.06164133
#3            1  0.09285047  0.09285047
#4            1 -0.63202909 -0.63202909
#5            1  0.25189850  0.25189850
#6            1 -0.18553830 -0.18553830
#7            1  0.65630180  0.65630180
#8            1 -1.77536852 -1.77536852
#9            1 -0.30571009 -0.30571009
#10           1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
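
As a side note, the fitted lm object itself records which columns were treated as aliased, which is why the mismatch with model.matrix() can be detected after the fact. The following is a minimal sketch using only base R; the seed and the object name `fit` are illustrative assumptions, not part of the original example.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
N <- 10
x1 <- rnorm(N)
x2 <- x1                      # exact copy, so x2 is perfectly collinear with x1
y <- 1 + x1 + x2 + rnorm(N)

fit <- lm(y ~ x1 + x2, data = data.frame(y, x1, x2))

is.na(coef(fit))   # TRUE marks coefficients lm() could not estimate
fit$rank           # numerical rank of the design matrix
fit$qr$pivot       # column order used by the QR step; aliased columns come last
```

Here the rank is 2 while the design matrix has 3 columns, so exactly one column was set aside during estimation.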

The model.matrix method in fixest allows x2 to be dropped:

fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept)          x1
# [1,]           1  1.41175158
# [2,]           1  0.06164133
# [3,]           1  0.09285047
# [4,]           1 -0.63202909
# [5,]           1  0.25189850
# [6,]           1 -0.18553830
# [7,]           1  0.65630180
# [8,]           1 -1.77536852
# [9,]           1 -0.30571009
# [10,]           1 -1.47296229

Is there a way to drop x2 when calling model.matrix.lm()?

Answer 1

As long as the overhead of fitting the linear model is not too high, you could write a small function like the one below to do it:

N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)

df <- data.frame(y = y, x1 = x1, x2 = x2)

fit_lm <- lm(y ~ x1 + x2, data = df)

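model.matrix2 <- function(model){
  # names of the coefficients lm() actually estimated (non-NA)
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  # keep only those columns; drop = FALSE keeps a matrix even if one column remains
  X[, colnames(X) %in% bn, drop = FALSE]
}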

model.matrix2(fit_lm)
#>    (Intercept)          x1
#> 1            1 -0.04654473
#> 2            1  2.14473751
#> 3            1  0.02688125
#> 4            1  0.95071038
#> 5            1 -1.41621259
#> 6            1  1.47840480
#> 7            1  0.56580182
#> 8            1  0.14480401
#> 9            1 -0.02404072
#> 10           1 -0.14393258

Created on 2022-05-02 by the reprex package (v2.0.1)

In the code above, model.matrix2() is a function that post-processes the model matrix so that it contains only the variables with non-missing coefficients in the linear model.
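
If fitting the model just to find the aliased columns is too expensive, one possible alternative is to run a pivoted QR decomposition on the design matrix directly, which is roughly what lm() does internally. The sketch below is an assumption-laden illustration, not a built-in feature: the helper name `drop_collinear` and its `tol` argument are hypothetical, and it relies on base R's qr() with its default LINPACK routine, which moves near-collinear columns to the end of the pivot.

```r
# Hypothetical helper: keep only the linearly independent columns of X.
drop_collinear <- function(X, tol = 1e-7) {
  qr_x <- qr(X, tol = tol)                      # default (LINPACK) pivoted QR
  keep <- sort(qr_x$pivot[seq_len(qr_x$rank)])  # indices of independent columns, original order
  X[, keep, drop = FALSE]                       # drop = FALSE keeps the result a matrix
}

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(x1 = rnorm(10))
dat$x2 <- dat$x1                                # exact duplicate of x1

X <- model.matrix(~ x1 + x2, data = dat)
X_reduced <- drop_collinear(X)
colnames(X_reduced)
```

With an exact duplicate such as x2, the reduced matrix keeps the intercept and x1. For columns that are only approximately collinear, the result depends on `tol`, so it may not always match what lm() drops.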
