How to simulate data for multiple regression analysis with a dummy variable

Question

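I want to simulate data for a regression analysis that includes a dummy variable. The regression recovers the slopes, but it does not recover the intercept: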

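# True coefficients: intercept, slope for x1, effect of the dummy x2
beta <- c(2,3,4)
x1   <- rnorm(100,50,5)                     # continuous predictor, mean 50, sd 5
x2   <- sample(c(0,1), 100, replace=TRUE)   # dummy variable (0 or 1)
eps  <- rnorm(100, 0, 5)                    # error term, sd 5

y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps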

summary(lm(y~x1 + x2))

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6598  -2.7433  -0.2873   2.4616  13.2250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.2858     5.3470  -1.550    0.124    
x1            3.2216     0.1070  30.109  < 2e-16 ***
x2            3.9209     0.9065   4.325  3.7e-05 ***

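I know that the dummy variable shifts the intercept up or down, but I am not clear on what adjustment to make in order to create a data set from which the intercept can be recovered. Any advice is much appreciated, thanks.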

Answer 1

It is just a matter of the number of data points. The more data you have, the closer the estimate gets to the true intercept.

set.seed(1111)
beta <- c(2,3,4)
x1   <- rnorm(1E6,50,5)
x2   <- sample(c(0,1), 1E6, replace=TRUE)
eps  <- rnorm(1E6, 0, 5)

y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

summary(lm(y~x1 + x2))

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.6565  -3.3651  -0.0003   3.3694  25.8225 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.9914611  0.0504585   39.47   <2e-16 ***
x1          3.0003014  0.0009994 3002.14   <2e-16 ***
x2          3.9902120  0.0099931  399.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.997 on 999997 degrees of freedom
Multiple R-squared:  0.9017,    Adjusted R-squared:  0.9017 
F-statistic: 4.587e+06 on 2 and 999997 DF,  p-value: < 2.2e-16
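
To see how the intercept estimate tightens as the sample grows, here is a minimal sketch that is not part of the original answer. The helper `intercept_row()` and the seed are assumptions introduced only for illustration; the model and parameter values are the ones from the question.

```r
# Hypothetical helper: simulate one data set of size n, fit the model,
# and return the intercept's estimate and standard error.
intercept_row <- function(n, beta = c(2, 3, 4), eps.sd = 5) {
  x1  <- rnorm(n, 50, 5)
  x2  <- sample(c(0, 1), n, replace = TRUE)
  eps <- rnorm(n, 0, eps.sd)
  y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps
  coef(summary(lm(y ~ x1 + x2)))["(Intercept)", c("Estimate", "Std. Error")]
}

set.seed(2024)  # arbitrary seed, chosen only for reproducibility
sizes <- c(100, 1000, 10000, 100000)

# One row per sample size, columns "Estimate" and "Std. Error"
res <- t(sapply(sizes, intercept_row))
rownames(res) <- sizes
print(res)
```

The standard error should shrink roughly in proportion to 1/sqrt(n), so each tenfold increase in sample size cuts it by a factor of about three.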

Answer 2

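You should expect to recover the true coefficient values on average. On any single simulated data set, however, your coefficient estimates will be off by some amount. Here I repeat your simulation setup 1000 times and then take the mean of the coefficient estimates.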

beta <- c(2,3,4)
do_experiment <- function(n = 100, eps.sd = 5) {
  x1   <- rnorm(n, 50, 5)
  x2   <- sample(c(0,1), n, replace=TRUE)
  eps  <- rnorm(n, 0, eps.sd)

  y <- beta[1] + beta[2]*x1 + beta[3]*x2 + eps

  return(coef(lm(y~x1 + x2)))
}

set.seed(1212)
coefEstimates <- replicate(1000, do_experiment(n = 100))
rowMeans(coefEstimates)

(Intercept)          x1          x2 
   1.972531    3.000327    4.010408

apply(coefEstimates, 1, sd)

(Intercept)          x1          x2 
 5.05588111  0.09988136  1.00523822
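
A rough calculation suggests why the intercept varies so much more than the slopes. This is a back-of-the-envelope sketch, assuming x1 and x2 are approximately uncorrelated in the sample and that the sample moments are close to their population values; the symbols S_11 and S_22 are notation introduced here, not taken from the answer.

```latex
\hat\beta_0 = \bar y - \hat\beta_1 \bar x_1 - \hat\beta_2 \bar x_2
\quad\Longrightarrow\quad
\operatorname{Var}(\hat\beta_0) \approx \sigma^2 \left( \frac{1}{n} + \frac{\bar x_1^2}{S_{11}} + \frac{\bar x_2^2}{S_{22}} \right),
\qquad
S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar x_j)^2

% Plugging in n = 100, \sigma = 5, \bar x_1 \approx 50, S_{11} \approx 100 \cdot 5^2 = 2500,
% \bar x_2 \approx 0.5, S_{22} \approx 100 \cdot 0.25 = 25:
\operatorname{SE}(\hat\beta_0) \approx 5 \sqrt{0.01 + \frac{50^2}{2500} + \frac{0.5^2}{25}} = 5\sqrt{1.02} \approx 5.05
```

The dominant term is the one involving x1: its values are centered at 50 with a standard deviation of only 5, so the intercept is an extrapolation to x1 = 0, about ten standard deviations away from the data. The result is close to the simulated standard deviation of about 5.06 shown above.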

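If you want less variability in the intercept estimate from one simulation to the next, you can reduce the variance of the error term. As @rookie mentioned, you can also increase the sample size.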

set.seed(1213)
coefEstimates2 <- replicate(1000, do_experiment(n = 100, eps.sd = 1))
rowMeans(coefEstimates2)

(Intercept)          x1          x2 
   2.046227    2.999186    3.996409 

apply(coefEstimates2, 1, sd)

(Intercept)          x1          x2 
  1.0459009   0.0205488   0.1995421
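
Another option, not mentioned in the answers and offered here only as a hedged sketch, is to center x1 before fitting. This does not recover the original intercept of 2: the intercept of the centered model estimates the expected value of y at x1 = 50 and x2 = 0 (that is, 2 + 3*50 = 152 under the question's parameters), but it is estimated far more precisely. The seed and the names `x1c`, `fit_raw`, and `fit_centered` are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility
n    <- 100
beta <- c(2, 3, 4)
x1   <- rnorm(n, 50, 5)
x2   <- sample(c(0, 1), n, replace = TRUE)
eps  <- rnorm(n, 0, 5)
y    <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps

# Original parameterization: intercept is the expected y at x1 = 0, x2 = 0
fit_raw <- lm(y ~ x1 + x2)

# Centered parameterization: intercept is the expected y at x1 = 50, x2 = 0
x1c <- x1 - 50
fit_centered <- lm(y ~ x1c + x2)

se_raw      <- coef(summary(fit_raw))["(Intercept)", "Std. Error"]
se_centered <- coef(summary(fit_centered))["(Intercept)", "Std. Error"]
print(c(raw = se_raw, centered = se_centered))

# The slope estimates are the same in both fits; only the intercept changes
print(rbind(raw = coef(fit_raw), centered = coef(fit_centered)))
```

Equivalently, simulating x1 with mean 0 instead of 50 would make the intercept in the question's model much easier to recover, at the cost of changing the design of the simulated data.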

simulation

regression

r