Bootstrap resampling: different input structures produce different results

It seems that passing a list rather than a data frame to the bootstrap resampling function in R can produce different results.

library(dplyr)

ctrl <- iris %>% dplyr::filter(Species == 'virginica')
ctrl <- ctrl$Sepal.Length
      
test <- iris %>% dplyr::filter(Species == 'setosa')
test <- test$Sepal.Length

input_list1 <- data.frame(control=ctrl, test=test)
input_list2 <- list(control=ctrl, test=test)


mean_d <- function(data, indices) {
  control <- data$control[indices]
  test <- data$test[indices]

  return(mean(test) - mean(control))
}



set.seed(12345)
boot_result1 <- boot::boot(input_list1,
                           mean_d,
                           R = 5000)
set.seed(NULL)


set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)

The true mean difference in sepal length between setosa (test) and virginica (control) is, of course:

> mean(test) - mean(control)
[1] -1.582

Only boot_result1, which received the data.frame, produces the correct result:

> boot_result1
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list1, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original    bias    std. error
t1*   -1.582 -0.000972  0.09649542

boot_result2, which received the list as input, produces an inaccurate mean difference:

> boot_result2
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original  bias    std. error
t1*    -1.05  -3e-05    0.106013

Why does this happen?

If you read the documentation for boot():

In all other cases ‘statistic’ must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.

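In your example the list has length 2, so boot() samples from 1:2, and those become your indices. If you look at your t0, it is the difference between the means of the first 2 entries of each list element (test minus control):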

mean(c(5.1,4.9))-mean(c(6.3,5.8))
[1] -1.05
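To see the difference directly, here is a minimal sketch. It assumes that boot() takes the number of observations from the number of rows of the data, or from its length when the object has no dimensions, which is what NROW() returns. The names input_df, n_df, n_list and t0_list are only illustrative.

```r
# Rebuild the two inputs from the question.
ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]

input_df   <- data.frame(control = ctrl, test = test)
input_list <- list(control = ctrl, test = test)

# Number of "observations" each structure presents (assumed to be what boot() resamples).
n_df   <- NROW(input_df)    # 50: one per row
n_list <- NROW(input_list)  # 2: one per list element

# The statistic evaluated on indices 1:2 only, as happens with the list input.
t0_list <- mean(test[seq_len(n_list)]) - mean(ctrl[seq_len(n_list)])
t0_list
```

With the list, only the first two values of each vector can ever be drawn, so every bootstrap replicate is built from those four numbers.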

To make it work with a list, do:

mean_d <- function(data, indices) {
  control <- sapply(data[indices],"[[","control")
  test <- sapply(data[indices],"[[","test")

  return(mean(test) - mean(control))
}

input_list2 <- asplit(data.frame(control=ctrl, test=test),1)

set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original    bias    std. error
t1*   -1.582 -0.000972  0.09649542
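For reference, a small sketch of what asplit() produces and why the sapply() calls above work. The three-row data frame is made up for illustration (its first two rows reuse the values shown earlier); it requires R 3.6.0 or later, where asplit() was introduced.

```r
# A tiny stand-in for the real data: one row per observation.
df <- data.frame(control = c(6.3, 5.8, 7.1), test = c(5.1, 4.9, 4.7))

# Split by rows: one list element per observation,
# each holding that row's named values.
rows <- asplit(df, 1)
n_rows <- length(rows)

# Pick observations 1 and 2, then pull one named value out of each.
picked_control <- unname(sapply(rows[1:2], "[[", "control"))
picked_test    <- unname(sapply(rows[1:2], "[[", "test"))
```

Because the list now has one element per observation, its length matches the number of data points and the indices select whole observations.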

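I think this is more complicated than it needs to be, though maybe you need it for some other data. Essentially, your list has to be structured so that each element is one data point to bootstrap.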
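If the list elements all have the same length, a simpler route may be to convert the list to a data frame before calling boot() and keep the original statistic. This is only a sketch under that assumption; input_df and res are illustrative names, and R is reduced to 1000 to keep it quick.

```r
library(boot)  # assumed to be installed (it ships with standard R distributions)

ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]
input_list <- list(control = ctrl, test = test)

# Equal-length vectors become columns, so each row is one observation.
input_df <- as.data.frame(input_list)

# Same statistic as in the question: indices now select rows.
mean_d <- function(data, indices) {
  mean(data$test[indices]) - mean(data$control[indices])
}

set.seed(12345)
res <- boot(input_df, mean_d, R = 1000)
res$t0  # matches mean(test) - mean(control)
```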