Manual bootstrapping for confidence intervals using tidyverse only

Question

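I have a grouped dataset, and I am interested in summarising a column of counts (number of ___). To calculate the standard error of the summary, I want to bootstrap within groups and take the standard deviation of the bootstrapped medians. I am struggling to work out how to code this manually (resampling with replacement, rather than using a function like boot()) without a for loop (i.e., I am hoping for a pure tidyverse solution). If there is a way to do it other than with *apply(), that would be preferred. Wrapping the whole process into a function would be great, either for use inside a pipe, say within summarise(), or as a standalone function applied to the grouped data.

A provisional dataset could be mtcars, which I have grouped by gear. I am now interested in summarising the hp column using the median and obtaining confidence intervals for it. I have tried a few solutions suggested by loosely related threads on SO, such as replicate() + across(), map()/pmap(), etc., but could not get them to work for my specific case.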

library(tidyverse)

data <- mtcars %>% 
  select(gear, hp) %>% 
  group_by(gear)

> data
# A tibble: 32 x 2
# Groups:   gear [3]
    gear    hp
   <dbl> <dbl>
 1     4   110
 2     4   110
 3     4    93
 4     3   110
 5     3   175
 6     3   105
 7     3   245
 8     4    62
 9     4    95
10     4   123
# ... with 22 more rows

I am hoping to find a way to integrate the bootstrap results into the simple summary as another column (the SE for each group):

data2 <- data %>% 
  summarise(hp = median(hp))

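Although summarising horsepower by number of gears may not make much sense, and the distribution of hp may not be typically Poisson, I think a coding solution for this example will nevertheless carry over to my specific case.

Edit 1

The solution does not need to be a clean, robust function. For this specific case, it could just be the lines of code needed to obtain the bootstrapped SE values within each group. The desired output is simply the data2 object, with hp as the median column and hpse as the SE column.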


    data2 <- data %>% 
      summarise(hp = median(hp),
            ### hpse = workingcode()
                )

If it is not possible to do this directly inside the summarise() call in this way, it must at least be possible to join the values to data2 afterwards.

Related threads

Using `boot()`

How to perform a bootstrap and find 95% confidence interval for the median of a dataset

Using `*apply()`

Bootstrap a large data set

Using `for` loops

How to perform a bootstrap and find 95% confidence interval for the median of a dataset

Other

Creating bootstrap samples and storing sampled data in different names

Answer 1

First we can write a bootstrap function:

boot_fn = function(x, fn = median, B = 1000) {
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}

Notice how I gave the argument fn a default value of median, which gives you the option of passing any numeric function you like to boot_fn().
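To illustrate the fn argument, here is a minimal sketch that calls the function directly on a plain vector with other statistics. The definition is repeated only so the snippet runs on its own; the seed value and the choice of mean and IQR are arbitrary assumptions for illustration.

```r
library(tidyverse)

# Same idea as boot_fn() above, repeated so this snippet is self-contained
boot_fn <- function(x, fn = median, B = 1000) {
  1:B %>%
    # Resample the positions of x with replacement
    map(~ x[sample(seq_along(x), replace = TRUE)]) %>%
    # Apply the chosen statistic to each bootstrap sample
    map_dbl(fn) %>%
    # The SD of the bootstrap estimates is the standard error
    sd()
}

set.seed(1)  # arbitrary seed, only to make the run repeatable

# Bootstrap SE of the mean instead of the median
se_mean <- boot_fn(mtcars$hp, fn = mean)

# Any function that returns a single number should work, e.g. the IQR
se_iqr <- boot_fn(mtcars$hp, fn = IQR, B = 500)

se_mean
se_iqr
```

Because the resampling is random, the values change from run to run unless a seed is set.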

Now we can use the function the way you originally asked:

mtcars %>% 
  group_by(gear) %>%
  summarise(
    hp_median = median(hp), 
    se = boot_fn(hp, fn = median)
  )

# A tibble: 3 x 3
   gear hp_median    se
  <dbl>     <dbl> <dbl>
1     3       180  13.2
2     4        94  15.2
3     5       175  70.3

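This works because our data is grouped. For each group, a new value of x is sent to boot_fn(). In this case, three different values of x are passed, each being the hp values that correspond to one distinct value of gear.

This is easy to confirm if we simply add a cat() statement to our function: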

boot_fn = function(x, fn = median, B = 1000, verbose = FALSE) {
  if (verbose) cat("Hello, x is ", x, "\n")
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}

data %>%
  summarise(
    hp_median = median(hp), 
    se = boot_fn(hp, fn = median, verbose = TRUE)
  )

Output:

Hello, x is 110 175 105 245 180 180 180 205 215 230 97 150 150 245 175 
Hello, x is 110 110 93 62 95 123 123 66 52 65 66 109 
Hello, x is 91 113 264 175 335 
# A tibble: 3 x 3
   gear hp_median    se
  <dbl>     <dbl> <dbl>
1     3       180  13.5
2     4        94  14.9
3     5       175  69.6

This function may break when used on real data (because of things like NAs), but it is a good start.
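As one possible way to handle the NA case, the sketch below drops missing values before resampling. The name boot_fn_na, the rule of returning NA for groups with fewer than two observations, and the toy data are all assumptions made for illustration, not part of the original answer.

```r
library(tidyverse)

# Hypothetical variant of boot_fn(): drops NAs once, then resamples what is left
boot_fn_na <- function(x, fn = median, B = 1000) {
  x <- x[!is.na(x)]
  # With fewer than two observations a bootstrap SE is not meaningful
  if (length(x) < 2) return(NA_real_)
  1:B %>%
    map(~ x[sample(seq_along(x), replace = TRUE)]) %>%
    map_dbl(fn) %>%
    sd()
}

# Toy data with missing values (made up for illustration)
toy <- tibble(
  g = c("a", "a", "a", "a", "b", "b", "b", "c"),
  y = c(1, 4, NA, 7, 2, NA, 9, NA)
)

res <- toy %>%
  group_by(g) %>%
  summarise(
    y_median = median(y, na.rm = TRUE),
    se = boot_fn_na(y)
  )

res
```

Whether dropping NAs is appropriate depends on why the values are missing, so this is a convenience rather than a general recommendation.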

Answer 2

An alternative to @kybazzi's solution that fits a piped workflow is:

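boot_se <- function(x, fn = median, B = 100) {
  # Draw B resamples of x with replacement and apply fn to each one;
  # length(x) is used instead of n() so the function also works outside dplyr verbs
  replicate(B,
            fn(sample(x, length(x), replace = TRUE)),
            simplify = FALSE) %>% 
    unlist() %>% 
    # The SD of the bootstrap estimates is the standard error
    sd()
}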

It does seem to be somewhat slower at times:


boot_fn = function(x, fn = median, B = 100) {
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}


data <- mtcars %>% 
  select(gear, hp) %>% 
  group_by(gear)

data2 <- data %>% 
  summarise(hpmed = median(hp),
            hpse = boot_se(hp))

data3 <- data %>% 
  summarise(hpmed = median(hp),
            hpse = boot_fn(hp))

#######################################

library(microbenchmark)

microbenchmark((data %>% 
                 summarise(hpmed = median(hp),
                           hpse = boot_fn(hp))),
               (data %>% 
                  summarise(hpmed = median(hp),
                            hpse = boot_se(hp))))

# Output:

Unit: milliseconds
                                                          expr     min       lq
  (data %>% summarise(hpmed = median(hp), hpse = boot_fn(hp))) 14.5737 15.63690
  (data %>% summarise(hpmed = median(hp), hpse = boot_se(hp))) 20.6675 21.64715
     mean   median       uq     max neval
 22.23120 16.78140 25.85675 91.4154   100
 29.15338 22.68525 32.01430 87.6299   100

#######################################

microbenchmark(data2, data3, times = 1000)

# Output:

Unit: nanoseconds
  expr min    lq   mean median  uq  max neval
 data2   0 100.0 95.986    101 101 3501  1000
 data3   0   1.5 92.318    101 101 2700  1000
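The question title asks for confidence intervals, whereas the functions above return only a standard error. As a rough sketch, a percentile interval can be taken from the same bootstrap estimates. The helper name boot_ci, the level argument, and the seed are assumptions for illustration, and the unnamed tibble inside summarise() relies on dplyr 1.0 or later unpacking it into columns.

```r
library(tidyverse)

# Hypothetical helper: returns the bootstrap SE and a percentile interval
boot_ci <- function(x, fn = median, B = 1000, level = 0.95) {
  # One estimate of fn per bootstrap resample of x
  est <- map_dbl(1:B, ~ fn(x[sample.int(length(x), replace = TRUE)]))
  alpha <- 1 - level
  tibble(
    se = sd(est),
    lower = unname(quantile(est, alpha / 2)),
    upper = unname(quantile(est, 1 - alpha / 2))
  )
}

set.seed(42)  # arbitrary seed, only to make the run repeatable

ci_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(hp_median = median(hp), boot_ci(hp))

ci_by_gear
```

The percentile method is only one of several bootstrap interval types, and with groups as small as those in mtcars the intervals are likely to be coarse.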


Tags: statistics, r, resampling, tidyverse
