Bootstrap Resampling: Different results with different input structures

Question

It seems that passing a list versus a data frame to the bootstrap resampling function in R can produce different results.

library(dplyr)

ctrl <- iris %>% dplyr::filter(Species == 'virginica')
ctrl <- ctrl$Sepal.Length
      
test <- iris %>% dplyr::filter(Species == 'setosa')
test <- test$Sepal.Length

input_list1 <- data.frame(control=ctrl, test=test)
input_list2 <- list(control=ctrl, test=test)


mean_d <- function(data, indices) {
  control <- data$control[indices]
  test <- data$test[indices]

  return(mean(test) - mean(control))
}



set.seed(12345)
boot_result1 <- boot::boot(input_list1,
                           mean_d,
                           R = 5000)
set.seed(NULL)


set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)

The true mean difference between the setosa (test) and virginica (control) sepal lengths is, of course,

> mean(test) - mean(control)

[1] -1.582

Only boot_result1, which received a data.frame, produces the correct result:

> boot_result1

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list1, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original    bias    std. error
t1*   -1.582 -0.000972  0.09649542

boot_result2, which received a list as input, produces an inaccurate mean difference.

> boot_result2

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original  bias    std. error
t1*    -1.05  -3e-05    0.106013

Why does this happen?

Answer 1

If you read the documentation for boot(), it says this about the statistic argument:

In all other cases ‘statistic’ must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.

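In your example, the list has length 2, so boot() samples from 1:2, and those become your indices. If you look at your t0, it is the difference between the means of the first 2 entries of each list element (test minus control):

mean(c(5.1,4.9))-mean(c(6.3,5.8))
[1] -1.05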
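A quick way to check this is to compare how many observations each input structure appears to contain, and to look at the indices boot() actually hands to the statistic. The sketch below assumes the boot package is installed; as_df, as_list and max_index are names made up for this illustration.

```r
# Rebuild the two inputs from the question.
ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]

as_df   <- data.frame(control = ctrl, test = test)
as_list <- list(control = ctrl, test = test)

NROW(as_df)    # 50: one observation per row
NROW(as_list)  # 2: one "observation" per list element

# Hypothetical helper statistic: report the largest index it was given.
max_index <- function(data, indices) max(indices)

set.seed(12345)
b <- boot::boot(as_list, max_index, R = 200)
max(b$t)       # never exceeds 2, so only the first two values are ever used
```

With the list input, no index above 2 is ever drawn, which is why only the first two entries of control and test enter the statistic.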

To write it using a list, do:

mean_d <- function(data, indices) {
  control <- sapply(data[indices],"[[","control")
  test <- sapply(data[indices],"[[","test")

  return(mean(test) - mean(control))
}

input_list2 <- asplit(data.frame(control=ctrl, test=test),1)

set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)


Bootstrap Statistics :
    original    bias    std. error
t1*   -1.582 -0.000972  0.09649542

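I think this is a bit more complicated than it should be. Maybe you need to use it for some other data. Essentially, your list needs to be structured so that each element is one data point to bootstrap.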
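If the column-wise list is all you have, a simpler option may be to convert it to a data frame at the call site and keep the original statistic. This is only a sketch, and it assumes both vectors have the same length (as.data.frame() would fail otherwise); input_list and res are illustrative names.

```r
# Column-wise list, as in the question.
ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]
input_list <- list(control = ctrl, test = test)

# Same statistic as in the question: resample rows, then take the difference in means.
mean_d <- function(data, indices) {
  mean(data$test[indices]) - mean(data$control[indices])
}

set.seed(12345)
# Converting to a data frame makes boot() see 50 rows instead of 2 list elements.
res <- boot::boot(as.data.frame(input_list), mean_d, R = 5000)
res$t0  # observed difference, equal to mean(test) - mean(control)
```

Here res$t0 matches the full-sample difference of about -1.582, because the indices now run over all 50 rows.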


statistics

r

resampling

statistics-bootstrap