Bootstrap p-value for correlation coefficient (resampling methods)

Question

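I have a large dataset (N = 300,000), and from a power analysis I concluded that I only need 250 observations to detect the correlation, if it exists.

So I want to use the bootstrap to draw 1000 samples of size n = 250 and find the range of p-values across these 1000 samples. I am not very familiar with the bootstrap method, but here is an example showing how far I got with the boot package. I use the Iris dataset to illustrate.

My desired output is a histogram showing the frequency distribution of the 1000 p-values obtained, plus a 95% confidence interval of the possible p-values.

Can anyone help me with my script?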

#load the boot package and the iris dataset
library(boot)
library(datasets)

#create function to retrieve p-value
boot.fn <- function(data, sample) {
           x <- iris$Petal.Length[i]
           y <- iris$Sepal.Length[i]
           boot.p <- cor.test(iris$Petal.Length[i],
                              iris$Sepal.Length[i])$p.value
           }

#create 1000 samples with bootstrap function
bootstr <- boot(iris, boot.fn, 1000)

Answer 1

The boot function does not provide the desired behavior. However, it is quite simple to implement this yourself.

First, some data:

x1 <- rnorm(1e5)
y1 <- x1 + rnorm(1e5, 0.5)

cor.test(x1, y1)
#output
    Pearson's product-moment correlation

data:  x1 and y1
t = 315.97, df = 99998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7037121 0.7099151
sample estimates:
      cor 
0.7068272

Sample 250 indices, 1000 times:

#set.seed(1)
z1 <- replicate(1000, sample(seq_along(x1), 250, replace = TRUE))

If you do not want sampling with replacement, just remove the replace argument.
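As a minimal sketch of the without-replacement variant: each column of the index matrix then holds 250 distinct indices. The name z2 is introduced here only for illustration, and x1 is regenerated so the snippet runs on its own.

```r
# x1 is regenerated here only so that this snippet is self-contained
x1 <- rnorm(1e5)

# z2 is a hypothetical name: 1000 columns of 250 indices drawn without replacement
z2 <- replicate(1000, sample(seq_along(x1), 250, replace = FALSE))

dim(z2)                                # 250 rows, 1000 columns
any(apply(z2, 2, anyDuplicated) > 0)   # FALSE: no index repeats within a column
```

Indices can still repeat across columns, since each of the 1000 draws is made independently from the full data.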

Now go over the columns, use the indices to subset x1 and y1, compute the statistic, and plot a histogram of the unlisted result.

hist(unlist(apply(z1, 2, function(x){
  cor.test(x1[x], y1[x])$p.value
})), xlab = "p value", main = "Uh")
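The question also asks for a 95% interval of the p-values. One way to get it, sketched here under the assumption that a simple percentile interval is what is wanted, is to store the p-values and take their 2.5% and 97.5% quantiles. The names p_vals and p_interval are introduced for illustration and are not part of the code above.

```r
set.seed(1)                          # fixed seed, only so this sketch is reproducible
x1 <- rnorm(1e5)
y1 <- x1 + rnorm(1e5, 0.5)
z1 <- replicate(1000, sample(seq_along(x1), 250, replace = TRUE))

# one p-value per column of indices (p_vals is a hypothetical name)
p_vals <- apply(z1, 2, function(idx) cor.test(x1[idx], y1[idx])$p.value)

# range that contains the central 95% of the 1000 p-values
p_interval <- quantile(p_vals, probs = c(0.025, 0.975))
p_interval
```

This interval describes the spread of the p-values across subsamples of size 250. It is not a confidence interval for the correlation itself.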

Perhaps more informative is the histogram of the correlation estimates:

hist(unlist(apply(z1, 2, function(x){
  cor.test(x1[x], y1[x])$estimate
})), xlab = "cor", main ="Uh")
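For reference, the script in the question fails because boot() calls the statistic function with the data and a vector of resampled row indices, while boot.fn names its second argument sample and then uses an undefined i. A minimal sketch of a corrected statistic function follows. Note that boot() resamples as many rows as the data contain (150 for iris) rather than subsets of 250, which is presumably why it does not give the behavior asked for. The plot title and the seed are assumptions made for this example.

```r
library(boot)

# statistic function: boot() passes the data and the resampled row indices
boot.fn <- function(data, i) {
  cor.test(data$Petal.Length[i], data$Sepal.Length[i])$p.value
}

set.seed(1)                               # fixed seed, only for reproducibility
bootstr <- boot(iris, boot.fn, R = 1000)  # 1000 resamples of all 150 rows

# bootstr$t holds one p-value per bootstrap replicate
hist(bootstr$t, xlab = "p value", main = "Bootstrap p-values (iris)")
quantile(bootstr$t, c(0.025, 0.975))      # percentile interval of the p-values
```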
