R：删除嵌套的 for 循环以使自定义 bootstrap 更高效

Question

我正在尝试从数据集中收集一些 bootstrapped 估计值以汇总统计数据，但我想以不同的速率对数据集的各个部分重新采样，这导致我依赖于嵌套的 for 循环。

具体来说，假设我的数据集中有两组，每组又分为测试组和控制组。第 1 组的测试控制比率为 75% / 25%，第 2 组的测试控制比率为 50% / 50%。

我想重新采样以使数据集大小相同，但两组的测试控制比均为 90% / 10%...换句话说，以不同的速率对不同的子组重新采样，这让我印象深刻与 boot 软件包通常所做的不同。

在我的数据集中，我创建了一个代表组的 group 变量，以及一个代表与 test/control 连接的组的 groupT 变量，例如：

    id     group     groupT
     1         1         1T
     2         1         1T
     3         2         2T
     4         1         1C
     5         2         2C

这就是我现在的运行，nreps 任意设置为我的 bootstrap 复制次数：

for (j in 1:nreps){

  bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset

  for (i in unique(datafile$groups)){

    tstring<-paste0(i,"T") ## e.g. 1T
    cstring<-paste0(i,"C") ## e.g. 1C

    ## Size of test group resample should be ~90% of total group size

    tsize<-round(.90*length(which(datafile$groups==i)),0)

    ## Size of control group resample should be total group size minus test group size

    csize<-length(which(datafile$groups==i))-tsize

    ## Continue building bootdat by rbinding the test and control resample

    ## before moving on to the next group
    ## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
      replace=TRUE),])

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
      replace=TRUE),])
  }

  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}

对于大约 100 万条总记录的数据集大小，每次复制需要 3-4 分钟。我确信有更好的方法可以使用 sapply 或某些 dplyr 函数来执行此操作，但到目前为止我的尝试都是空的。如有任何帮助，我们将不胜感激！

Answer 1

我强烈建议您研究 data.table 和 foreach，使用键搜索 bootstrap。它将允许您非常快速地执行单个 bootstrap，并且您可以运行每个 bootstrap 在不同的核心上独立。以下每个 bootstrap 在我的机器上花费 0.5 秒，搜索 100 万行的 table。像下面这样的东西应该让你开始：

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)

# example data
    dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))


# define number of bootstraps
    nBootstraps <- 1000

# define sampling fractions
    fraction_test <- 0.90
    fraction_control <- 1 - fraction_test

# get number that you want to sample from each group
    N.test <- round(fraction_test * dim(dat)[1])
    N.control <- round(fraction_control * dim(dat)[1])

# key data by id
    setkey(dat, id)

# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {

    # sample each group with the defined sizes, with replacement
    g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
    g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
    g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
    g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
    dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
    dat.all[, bootstrap := n]

    # do summary stats here with dat.all, return the summary stats data.table object
        return(dat.summarized)

}

编辑：下面的示例包括对任意数量的唯一组中的每一个的查找table。为简单起见，可以在 foreach 循环中引用与组 +（测试或控制）的每个组合对应的 ID。 N.test 和 N.control（900 和 100）的数字较小，它会在

中吐出 1000 bootstraps 的结果

library(data.table)
library(foreach)

# example data
    dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))

# save vector of all group values & change group to character vector for hashed environment lookup
    all_groups <- as.character(sort(unique(dat$group)))
    dat[, group := as.character(group)]


# define number of bootstraps
    nBootstraps <- 100

# get number that you want to sample from each group
    N.test <- 900
    N.control <- 100

# key data by id
    setkey(dat, id)

# all values for group

# Set up lookup table for every combination of group + test/control
    control.ids <- new.env()
    test.ids <- new.env()

    for(i in all_groups) {
        control.ids[[i]] <- dat[group==i & test_control=="C"]$id
        test.ids[[i]] <- dat[group==i & test_control=="T"]$id
    }


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}

屈服

> results
              id group test_control bootstrap
       1: 701570     1            C         1
       2: 424018     1            C         1
       3: 909932     1            C         1
       4:  15354     1            C         1
       5: 514882     1            C         1
      ---
23999996: 898651    24            T      1000
23999997: 482374    24            T      1000
23999998: 845577    24            T      1000
23999999: 862359    24            T      1000
24000000: 602078    24            T      1000

这不涉及任何汇总统计计算时间，但这里 1000 bootstraps 在 1 个核心上连续拉出

   user  system elapsed
 62.574   1.267  63.844

如果您需要为每个组手动编码 N 不同，您可以执行与 id 查找相同的操作

# create environments
control.Ns <- new.env()
test.Ns <- new.env()

# assign size values
control.Ns[["1"]]   <- 900
   test.Ns[["1"]]   <- 100
control.Ns[["2"]]   <- 400
   test.Ns[["2"]]   <- 50
    ...             ...
control.Ns[["24"]]   <- 200
   test.Ns[["24"]]   <- 5

然后更改大 bootstrap 循环以根据循环的当前组查找这些值：

results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # get size values
        N.control <- control.Ns[[group.i]]
        N.test <- test.Ns[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}

Answer 2

就像caw5cv一样，我建议看看data.table，它通常在解决此类问题时非常有效，但是如果你想选择使用dplyr，那么你可以尝试做这样的事情：

summary_of_boot_data <- lapply(1:nreps, 
                           function(y){
                             # get bootdata
                             bootdata <- lapply(unique(datafile$group), 
                                                function(x){
                                                  tstring<-paste0(x,"T")
                                                  cstring<-paste0(x,"C")
                                                  tsize<-round(.90*length(which(datafile$group==x)),0)
                                                  csize<-length(which(datafile$group==x))-tsize
                                                  df <-rbind(datafile[sample(which(datafile$groupT==tstring),
                                                                             size=tsize, 
                                                                             replace=TRUE),],
                                                             datafile[sample(which(datafile$groupT==cstring),
                                                                             size=csize,
                                                                             replace=TRUE),])
                                                  return(df)
                                                }) %>% do.call(rbind, .)
                             # return your summary thing for bootdata e.g. summary(bootdata)
                             summary(bootdata)

                           })
summary_of_boot_data

我尽量不对你的代码做太多改动，我只是将 for 的使用替换为 lapply

希望这对您有所帮助

编辑：根据 Hugh 的评论，您可能想尝试使用 data.table::rbindlist() 而不是 do.call(rbind, .)

R：删除嵌套的 for 循环以使自定义 bootstrap 更高效

R: Remove nested for loops in order to make a custom bootstrap more efficient

for-loop

r

sapply

statistics-bootstrap