R:删除嵌套的 for 循环以使自定义 bootstrap 更高效
R: Remove nested for loops in order to make a custom bootstrap more efficient
我正在尝试从数据集中收集一些 bootstrapped 估计值以汇总统计数据,但我想以不同的速率对数据集的各个部分重新采样,这导致我依赖于嵌套的 for 循环。
具体来说,假设我的数据集中有两组,每组又分为测试组和控制组。第 1 组的测试控制比率为 75% / 25%,第 2 组的测试控制比率为 50% / 50%。
我想重新采样以使数据集大小相同,但两组的测试控制比均为 90% / 10%...换句话说,以不同的速率对不同的子组重新采样,这让我印象深刻与 boot
软件包通常所做的不同。
在我的数据集中,我创建了一个代表组的 group
变量,以及一个代表与 test/control 连接的组的 groupT
变量,例如:
id group groupT
1 1 1T
2 1 1T
3 2 2T
4 1 1C
5 2 2C
这就是我现在的 运行,nreps
任意设置为我的 bootstrap 复制次数:
for (j in 1:nreps){
bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset
for (i in unique(datafile$groups)){
tstring<-paste0(i,"T") ## e.g. 1T
cstring<-paste0(i,"C") ## e.g. 1C
## Size of test group resample should be ~90% of total group size
tsize<-round(.90*length(which(datafile$groups==i)),0)
## Size of control group resample should be total group size minus test group size
csize<-length(which(datafile$groups==i))-tsize
## Continue building bootdat by rbinding the test and control resample
## before moving on to the next group
## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
replace=TRUE),])
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
replace=TRUE),])
}
## Here, there is code to grab some summary statistics from bootdat
## and store them in statVector[j] before moving on to the next replication
}
对于大约 100 万条总记录的数据集大小,每次复制需要 3-4 分钟。我确信有更好的方法可以使用 sapply
或某些 dplyr 函数来执行此操作,但到目前为止我的尝试都是空的。如有任何帮助,我们将不胜感激!
我强烈建议您研究 data.table 和 foreach,使用键搜索 bootstrap。它将允许您非常快速地执行单个 bootstrap,并且您可以 运行 每个 bootstrap 在不同的核心上独立。以下每个 bootstrap 在我的机器上花费 0.5 秒,搜索 100 万行的 table。像下面这样的东西应该让你开始:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {
# sample each group with the defined sizes, with replacement
g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
dat.all[, bootstrap := n]
# do summary stats here with dat.all, return the summary stats data.table object
return(dat.summarized)
}
编辑:下面的示例包括对任意数量的唯一组中的每一个的查找table。为简单起见,可以在 foreach 循环中引用与组 +(测试或控制)的每个组合对应的 ID。 N.test 和 N.control(900 和 100)的数字较小,它会在
中吐出 1000 bootstraps 的结果
library(data.table)
library(foreach)
# example data
dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 100
# get number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# all values for group
# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for(i in all_groups) {
control.ids[[i]] <- dat[group==i & test_control=="C"]$id
test.ids[[i]] <- dat[group==i & test_control=="T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
屈服
> results
id group test_control bootstrap
1: 701570 1 C 1
2: 424018 1 C 1
3: 909932 1 C 1
4: 15354 1 C 1
5: 514882 1 C 1
---
23999996: 898651 24 T 1000
23999997: 482374 24 T 1000
23999998: 845577 24 T 1000
23999999: 862359 24 T 1000
24000000: 602078 24 T 1000
这不涉及任何汇总统计计算时间,但这里 1000 bootstraps 在 1 个核心上连续拉出
user system elapsed
62.574 1.267 63.844
如果您需要为每个组手动编码 N 不同,您可以执行与 id 查找相同的操作
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
然后更改大 bootstrap 循环以根据循环的当前组查找这些值:
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# get size values
N.control <- control.Ns[[group.i]]
N.test <- test.Ns[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
就像caw5cv一样,我建议看看data.table
,它通常在解决此类问题时非常有效,但是如果你想选择使用dplyr
,那么你可以尝试做这样的事情:
summary_of_boot_data <- lapply(1:nreps,
function(y){
# get bootdata
bootdata <- lapply(unique(datafile$group),
function(x){
tstring<-paste0(x,"T")
cstring<-paste0(x,"C")
tsize<-round(.90*length(which(datafile$group==x)),0)
csize<-length(which(datafile$group==x))-tsize
df <-rbind(datafile[sample(which(datafile$groupT==tstring),
size=tsize,
replace=TRUE),],
datafile[sample(which(datafile$groupT==cstring),
size=csize,
replace=TRUE),])
return(df)
}) %>% do.call(rbind, .)
# return your summary thing for bootdata e.g. summary(bootdata)
summary(bootdata)
})
summary_of_boot_data
我尽量不对你的代码做太多改动,我只是将 for
的使用替换为 lapply
希望这对您有所帮助
编辑:根据 Hugh 的评论,您可能想尝试使用 data.table::rbindlist()
而不是 do.call(rbind, .)
我正在尝试从数据集中收集一些 bootstrapped 估计值以汇总统计数据,但我想以不同的速率对数据集的各个部分重新采样,这导致我依赖于嵌套的 for 循环。
具体来说,假设我的数据集中有两组,每组又分为测试组和控制组。第 1 组的测试控制比率为 75% / 25%,第 2 组的测试控制比率为 50% / 50%。
我想重新采样以使数据集大小相同,但两组的测试控制比均为 90% / 10%...换句话说,以不同的速率对不同的子组重新采样,这让我印象深刻与 boot
软件包通常所做的不同。
在我的数据集中,我创建了一个代表组的 group
变量,以及一个代表与 test/control 连接的组的 groupT
变量,例如:
id group groupT
1 1 1T
2 1 1T
3 2 2T
4 1 1C
5 2 2C
这就是我现在的 运行,nreps
任意设置为我的 bootstrap 复制次数:
for (j in 1:nreps){
bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset
for (i in unique(datafile$groups)){
tstring<-paste0(i,"T") ## e.g. 1T
cstring<-paste0(i,"C") ## e.g. 1C
## Size of test group resample should be ~90% of total group size
tsize<-round(.90*length(which(datafile$groups==i)),0)
## Size of control group resample should be total group size minus test group size
csize<-length(which(datafile$groups==i))-tsize
## Continue building bootdat by rbinding the test and control resample
## before moving on to the next group
## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
replace=TRUE),])
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
replace=TRUE),])
}
## Here, there is code to grab some summary statistics from bootdat
## and store them in statVector[j] before moving on to the next replication
}
对于大约 100 万条总记录的数据集大小,每次复制需要 3-4 分钟。我确信有更好的方法可以使用 sapply
或某些 dplyr 函数来执行此操作,但到目前为止我的尝试都是空的。如有任何帮助,我们将不胜感激!
我强烈建议您研究 data.table 和 foreach,使用键搜索 bootstrap。它将允许您非常快速地执行单个 bootstrap,并且您可以 运行 每个 bootstrap 在不同的核心上独立。以下每个 bootstrap 在我的机器上花费 0.5 秒,搜索 100 万行的 table。像下面这样的东西应该让你开始:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {
# sample each group with the defined sizes, with replacement
g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
dat.all[, bootstrap := n]
# do summary stats here with dat.all, return the summary stats data.table object
return(dat.summarized)
}
编辑:下面的示例包括对任意数量的唯一组中的每一个的查找table。为简单起见,可以在 foreach 循环中引用与组 +(测试或控制)的每个组合对应的 ID。 N.test 和 N.control(900 和 100)的数字较小,它会在
中吐出 1000 bootstraps 的结果library(data.table)
library(foreach)
# example data
dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 100
# get number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# all values for group
# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for(i in all_groups) {
control.ids[[i]] <- dat[group==i & test_control=="C"]$id
test.ids[[i]] <- dat[group==i & test_control=="T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
屈服
> results
id group test_control bootstrap
1: 701570 1 C 1
2: 424018 1 C 1
3: 909932 1 C 1
4: 15354 1 C 1
5: 514882 1 C 1
---
23999996: 898651 24 T 1000
23999997: 482374 24 T 1000
23999998: 845577 24 T 1000
23999999: 862359 24 T 1000
24000000: 602078 24 T 1000
这不涉及任何汇总统计计算时间,但这里 1000 bootstraps 在 1 个核心上连续拉出
user system elapsed
62.574 1.267 63.844
如果您需要为每个组手动编码 N 不同,您可以执行与 id 查找相同的操作
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
然后更改大 bootstrap 循环以根据循环的当前组查找这些值:
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# get size values
N.control <- control.Ns[[group.i]]
N.test <- test.Ns[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
就像caw5cv一样,我建议看看data.table
,它通常在解决此类问题时非常有效,但是如果你想选择使用dplyr
,那么你可以尝试做这样的事情:
summary_of_boot_data <- lapply(1:nreps,
function(y){
# get bootdata
bootdata <- lapply(unique(datafile$group),
function(x){
tstring<-paste0(x,"T")
cstring<-paste0(x,"C")
tsize<-round(.90*length(which(datafile$group==x)),0)
csize<-length(which(datafile$group==x))-tsize
df <-rbind(datafile[sample(which(datafile$groupT==tstring),
size=tsize,
replace=TRUE),],
datafile[sample(which(datafile$groupT==cstring),
size=csize,
replace=TRUE),])
return(df)
}) %>% do.call(rbind, .)
# return your summary thing for bootdata e.g. summary(bootdata)
summary(bootdata)
})
summary_of_boot_data
我尽量不对你的代码做太多改动,我只是将 for
的使用替换为 lapply
希望这对您有所帮助
编辑:根据 Hugh 的评论,您可能想尝试使用 data.table::rbindlist()
而不是 do.call(rbind, .)