使用 R 中的精确参数模拟来自回归模型的数据

Question

如何模拟数据，使 lm 恢复的系数被确定为特定的预定值和具有正态分布的残差？例如，我是否可以生成数据以使 lm(y ~ 1 + x) 产生 (Intercept) = 1.500 和 x = 4.000？我希望该解决方案足够通用，可以用于具有连续 x 的多元回归（例如，lm(y ~ 1 + x1 + x2)），但如果它也适用于交互（lm(y ~ 1 + x1 + x2 + x1*x2)），则会有加分。此外，它应该适用于小 N（例如，N < 200）。

我知道如何模拟由这些参数生成的随机数据（参见 here），但随机性会影响估计系数的变化，例如，Intercept = 1.488 和 x = 4.067。

相关：可以生成产生预定相关系数的数据（参见 here and here）。所以我问这是否可以用于多元回归？

Answer 1

您可以进行拒绝抽样：

set.seed(42)
tol <- 1e-8

x <- 1:100
continue <- TRUE
while(continue) {
  y <- cbind(1,x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}

coef(lm(y ~ x))
#(Intercept)           x 
#   1.500013    4.000023

显然，这是一种蛮力方法，公差越小，模型越复杂，所需时间就越长。通过提供残差作为输入，然后使用一些矩阵代数来计算 y 值，应该可以采用更有效的方法。但这更像是一个数学问题...

Answer 2

一种方法是使用完全对称的噪声。噪声会自行抵消，因此估计的参数正是输入参数，但残差呈正态分布。

x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)

x <- c(x, x)
y <- c(y + eps, y - eps)

fit <- lm(y ~ x)
# (Intercept)            x  
#         1.5          4.0 

plot(fit)

残差服从正态分布...

...但表现出异常完美的对称性！

OP 编辑：我编写了一个利用对称残差技巧的通用代码。它可以很好地适应更复杂的模型。此示例还表明它适用于分类预测变量和交互作用。

library(dplyr)

# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,  # Continuous
  x2 = rep(c(0, 1), each=50),  # Dummy-coded categorical

  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)  # Residuals
)

# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2 
#     1.50000     4.00000    -2.10000     8.76543

使用 R 中的精确参数模拟来自回归模型的数据

Simulate data from regression model with exact parameters in R

simulation

r

linear-regression

coefficients