Does the bootstrapped sample outcome variable link up with the x values in a regression in R?

Question

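I am trying to run a regression on a bootstrapped sample in R.

The original sample looks like this data frame (called df) and has several hundred entries. y is the outcome variable and treat is 0 or 1.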

y  treat
3  0
5  1
2  0
4  1

I generated 900 observations by sampling with replacement from df$y.

set.seed(5)
b1 <- sample(df$y, 900, replace = TRUE, prob = NULL)

Then I ran the following regression.

lm(b1 ~ treat, df)

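When I use the sample b1 as the regression outcome, does R automatically match each value of b1 with the correct treatment value in the original data frame? If I want the outcome values in b1 to correspond to the correct treatment values from the original data frame, do I need to do something differently? How can I check that this is the regression I am trying to run?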

Answer 1

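We can sample the sequence of rows instead of a single column. The OP's code samples only 'y', so 'treat' is left with its original 4 elements (in the example data). When we apply the formula method, this results in an error because the two objects have different lengths.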

lm(b1 ~ treat, df)

Error in model.frame.default(formula = b1 ~ treat, data = df, drop.unused.levels = TRUE) : variable lengths differ (found for 'treat')
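
To see where the mismatch comes from, here is a minimal sketch using the 4-row example data from the question. The name msg and the tryCatch wrapper are only for illustration; they capture the error instead of stopping the script.

```r
# Example data as shown in the question (4 rows)
df <- data.frame(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L))

set.seed(5)
b1 <- sample(df$y, 900, replace = TRUE)

# b1 has 900 values, but treat inside df still has nrow(df) values
length(b1)
length(df$treat)

# Capture the error message rather than halting
msg <- tryCatch(
  {
    lm(b1 ~ treat, df)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Note that even if the two lengths happened to be equal, lm would pair b1 and treat purely by position, so the sampled outcomes would no longer belong to their original treatment values.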

Instead, we sample the sequence of rows, which keeps each 'y' together with its own 'treat' value:

set.seed(5)
df1 <- df[sample(seq_len(nrow(df)), 900, replace = TRUE),]
lm(y ~ treat, df1)

Data used above:

df <- structure(list(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L
)), class = "data.frame", row.names = c(NA, -4L))
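
On the question of how to check that the regression uses correctly matched pairs, one possible sanity check is sketched below, again on the 4-row example data. The names idx, pairs_ok and fit are made up for this illustration.

```r
df <- data.frame(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L))

set.seed(5)
idx <- sample(seq_len(nrow(df)), 900, replace = TRUE)
df1 <- df[idx, ]

# Every (y, treat) combination in the resample should also exist in df
pairs_ok <- all(paste(df1$y, df1$treat) %in% paste(df$y, df$treat))
pairs_ok

# Cross-tabulation of outcome by treatment in the resample;
# in this toy data, y = 2 and 3 should only appear under treat = 0
table(df1$y, df1$treat)

fit <- lm(y ~ treat, data = df1)
nobs(fit)  # number of observations actually used in the fit
```

If pairs_ok is TRUE and nobs(fit) equals the number of rows drawn, the model was fitted on whole resampled rows rather than on an outcome vector detached from its treatment values.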
