Bootstrap Resampling: Different results with different input structures
It seems that passing a list rather than a data frame to a bootstrap resampling function in R can produce different results.
library(dplyr)

# Sepal lengths for the two groups (50 observations each)
ctrl <- iris %>% dplyr::filter(Species == 'virginica')
ctrl <- ctrl$Sepal.Length
test <- iris %>% dplyr::filter(Species == 'setosa')
test <- test$Sepal.Length

# Same data, stored once as a data.frame and once as a plain list
input_list1 <- data.frame(control=ctrl, test=test)
input_list2 <- list(control=ctrl, test=test)

# Statistic: difference in means for the resampled indices
mean_d <- function(data, indices) {
  control <- data$control[indices]
  test <- data$test[indices]
  return(mean(test) - mean(control))
}

set.seed(12345)
boot_result1 <- boot::boot(input_list1,
                           mean_d,
                           R = 5000)

set.seed(NULL)
set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)
The true mean difference in sepal length between virginica and setosa is, of course:
> mean(test) - mean(control)
[1] -1.582
Only boot_result1, which received the data.frame, produces the correct result:
> boot_result1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot::boot(data = input_list1, statistic = mean_d, R = 5000)
Bootstrap Statistics :
    original     bias    std. error
t1*   -1.582 -0.000972  0.09649542
boot_result2, which received the list as input, produces an inaccurate mean difference:
> boot_result2
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)
Bootstrap Statistics :
    original  bias    std. error
t1*    -1.05  -3e-05    0.106013
Why is this the case?
The documentation for boot() says this about the statistic argument:
In all other cases ‘statistic’ must take at least two arguments. The
first argument passed will always be the original data. The second
will be a vector of indices, frequencies or weights which define the
bootstrap sample.
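In your example the list has length 2, so boot() treats it as having two observations and samples from 1:2; those are the indices your function receives. If you look at your t0, it is the difference between the means of the first 2 entries of each vector:
mean(c(5.1,4.9)) - mean(c(6.3,5.8))
[1] -1.05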
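A quick check of the point above, as a minimal sketch. It assumes boot() takes the number of observations from NROW(data), which is how I read its source; the variable names as_df, as_list, n_df, n_list and t0_list are made up for illustration.

```r
# Rebuild the two inputs from the question using base R only
ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]

as_df   <- data.frame(control = ctrl, test = test)
as_list <- list(control = ctrl, test = test)

# A data frame counts its rows; a plain list counts its elements
n_df   <- NROW(as_df)
n_list <- NROW(as_list)

# With indices limited to 1:2, the statistic only ever sees
# the first two values of each vector
t0_list <- mean(as_list$test[1:2]) - mean(as_list$control[1:2])

print(c(n_df = n_df, n_list = n_list))
print(t0_list)
```

Here n_df is 50 and n_list is 2, which is why the list version resamples only two "observations".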
To make it work with a list, do:
mean_d <- function(data, indices) {
  control <- sapply(data[indices],"[[","control")
  test <- sapply(data[indices],"[[","test")
  return(mean(test) - mean(control))
}
input_list2 <- asplit(data.frame(control=ctrl, test=test),1)
set.seed(12345)
boot_result2 <- boot::boot(input_list2,
                           mean_d,
                           R = 5000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot::boot(data = input_list2, statistic = mean_d, R = 5000)
Bootstrap Statistics :
    original     bias    std. error
t1*   -1.582 -0.000972  0.09649542
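To see why the rewritten mean_d works, it may help to look at what asplit() returns. The sketch below uses a made-up three-row data frame (small_df and the other names are illustrative only) and assumes R 3.6.0 or later, where asplit() is available.

```r
# Tiny illustrative data frame, not the full iris data
small_df <- data.frame(control = c(6.3, 5.8, 7.1),
                       test    = c(5.1, 4.9, 4.7))

# Split by rows: one list element per observation
rows <- asplit(small_df, 1)
n_rows <- length(rows)

# Each element is a named numeric vector holding one row
first_control <- rows[[1]][["control"]]

# Resampling indices now pick whole rows; sapply() pulls one column back out
picked_control <- unname(sapply(rows[c(1, 3)], "[[", "control"))

print(n_rows)
print(first_control)
print(picked_control)
```

Because the list now has one element per row, its length equals the number of observations, so the indices boot() generates cover the whole sample.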
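I think this is more complicated than it needs to be, though perhaps you need the list form for some other data. Essentially, your list has to be structured so that each element is one data point to bootstrap.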
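If the list form is not actually required, a simpler route (a sketch, not the only option) is to convert the list to a data frame before calling boot() and keep the original statistic unchanged. This assumes all list elements have the same length; input_list, input_df and boot_result are illustrative names.

```r
# Data as in the question, built with base R only
ctrl <- iris$Sepal.Length[iris$Species == "virginica"]
test <- iris$Sepal.Length[iris$Species == "setosa"]
input_list <- list(control = ctrl, test = test)

# Original statistic from the question: indices select rows
mean_d <- function(data, indices) {
  mean(data$test[indices]) - mean(data$control[indices])
}

# Convert the list so that each row is one observation
input_df <- as.data.frame(input_list)

set.seed(12345)
boot_result <- boot::boot(input_df, mean_d, R = 5000)

# t0 is the statistic evaluated on the original data
print(boot_result$t0)
```

With the data frame, t0 matches mean(test) - mean(control) from the question.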