如何 bootstrap R 中被一个因素阻塞的数据集?
How to bootstrap dataset in R which is blocked by a factor?
我想对这个数据集执行bootstrap。请注意,数据有两个因子:replicate
和 level
,以及两个需要回归的变量 high.density
和 low.density
。我想在此数据集上执行 bootstrap,但替换只能在复制和级别的嵌套因子内发生。
replicate level high.density low.density
1 low 14 36
1 low 54 31
1 mid 82 10
1 mid 24 NA
2 low 12 28
2 low 11 45
2 mid 12 17
2 mid NA 24
2 up 40 10
2 up NA 5
2 up 20 2
例如,在 replicate/ level
中:1/low
low.density
31 和 36 可以互换(或 high.density
互换),因此该数据集的头部可能看起来像:
replicate level high.density low.density
1 low 14 31
1 low 54 36
1 mid 82 10
1 mid 24 NA
然后我想从这个数据集中估计线性回归 (glm)。对于尝试实现此目标的任何反馈,我将不胜感激。
##DATA FRAME (credits: caldwellst)
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level)
)
这是使用 dplyr
、purrr
和 tidyr
的解决方案。首先嵌套数值型数据,然后对数据中每个 replicate
和 level
的唯一组合进行采样。然后在其中,bootstrap 密度的唯一值,然后为最终数据框取消嵌套。
# library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)
df %>%
nest(data = ends_with("density")) %>%
slice_sample(n = 500, replace = TRUE) %>%
mutate(data = map(data, ~summarize(.x, across(.fns = sample, size = 1)))) %>%
unnest(cols = data)
#> # A tibble: 500 × 4
#> replicate level high.density low.density
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 low 54 31
#> 2 2 mid 12 24
#> 3 1 mid 24 10
#> 4 2 up 20 2
#> 5 2 mid 12 24
#> 6 2 mid 12 24
#> 7 1 mid 82 10
#> 8 2 up NA 2
#> 9 1 low 14 36
#> 10 2 mid 12 17
#> # … with 490 more rows
数据
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
level = c("low", "low", "mid", "mid", "low", "low", "mid",
"mid", "up", "up", "up"), high.density = c(14, 54, 82, 24,
12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(replicate = structure(list(), class = c("collector_double",
"collector")), level = structure(list(), class = c("collector_character",
"collector")), high.density = structure(list(), class = c("collector_double",
"collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
我们可以利用 split
并根据 replicate
和 level
的独特组合进行采样。我们可以重复此过程 B
次。
df_shuffle <- function(DF) {
my_split <- split(DF, f = ~ DF$replicate + DF$level)
shuffle <- lapply(my_split, function(x) {
nrX <- nrow(x)
cbind(x[, c('replicate', 'level')],
high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
})
DF_new <- do.call(rbind, shuffle)
rownames(DF_new) <- NULL
return(DF_new)
}
B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
replicate level high.density low.density
1 1 low 54 36
2 1 low 54 36
3 2 low 12 45
4 2 low 12 28
5 1 mid 24 10
6 1 mid 82 10
7 2 mid NA 17
8 2 mid 12 17
9 2 up 20 10
10 2 up 40 10
11 2 up 20 5
因为原始数据包含缺失的观测值,我们要么必须对它们进行乘法插补,要么选择 lisewise 删除它们。现在,让我们执行后一个选项。
# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
最后,我们对每个打乱后的数据集执行线性回归,并将 B
系数存储在 out
.
中
row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)
## out <- row_bind(
## lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
## family = binomial())$coef)
## )
# -------------------------------------------------------------------
> dim(out)
[1] 1000 2
输出
> head(out)
X.Intercept. low.density
1 13.74881 0.2804738
2 20.01074 -0.2095672
3 30.26643 -0.2946373
4 29.19541 -0.2752761
5 37.76273 -0.4555651
6 37.72250 -0.1548349
可以找到创建此图像所需的代码 。
我想对这个数据集执行bootstrap。请注意,数据有两个因子:replicate
和 level
,以及两个需要回归的变量 high.density
和 low.density
。我想在此数据集上执行 bootstrap,但替换只能在复制和级别的嵌套因子内发生。
replicate level high.density low.density
1 low 14 36
1 low 54 31
1 mid 82 10
1 mid 24 NA
2 low 12 28
2 low 11 45
2 mid 12 17
2 mid NA 24
2 up 40 10
2 up NA 5
2 up 20 2
例如,在 replicate/ level
中:1/low
low.density
31 和 36 可以互换(或 high.density
互换),因此该数据集的头部可能看起来像:
replicate level high.density low.density
1 low 14 31
1 low 54 36
1 mid 82 10
1 mid 24 NA
然后我想从这个数据集中估计线性回归 (glm)。对于尝试实现此目标的任何反馈,我将不胜感激。
##DATA FRAME (credits: caldwellst)
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level)
)
这是使用 dplyr
、purrr
和 tidyr
的解决方案。首先嵌套数值型数据,然后对数据中每个 replicate
和 level
的唯一组合进行采样。然后在其中,bootstrap 密度的唯一值,然后为最终数据框取消嵌套。
# library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)
df %>%
nest(data = ends_with("density")) %>%
slice_sample(n = 500, replace = TRUE) %>%
mutate(data = map(data, ~summarize(.x, across(.fns = sample, size = 1)))) %>%
unnest(cols = data)
#> # A tibble: 500 × 4
#> replicate level high.density low.density
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 low 54 31
#> 2 2 mid 12 24
#> 3 1 mid 24 10
#> 4 2 up 20 2
#> 5 2 mid 12 24
#> 6 2 mid 12 24
#> 7 1 mid 82 10
#> 8 2 up NA 2
#> 9 1 low 14 36
#> 10 2 mid 12 17
#> # … with 490 more rows
数据
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
level = c("low", "low", "mid", "mid", "low", "low", "mid",
"mid", "up", "up", "up"), high.density = c(14, 54, 82, 24,
12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(replicate = structure(list(), class = c("collector_double",
"collector")), level = structure(list(), class = c("collector_character",
"collector")), high.density = structure(list(), class = c("collector_double",
"collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
我们可以利用 split
并根据 replicate
和 level
的独特组合进行采样。我们可以重复此过程 B
次。
df_shuffle <- function(DF) {
my_split <- split(DF, f = ~ DF$replicate + DF$level)
shuffle <- lapply(my_split, function(x) {
nrX <- nrow(x)
cbind(x[, c('replicate', 'level')],
high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
})
DF_new <- do.call(rbind, shuffle)
rownames(DF_new) <- NULL
return(DF_new)
}
B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
replicate level high.density low.density
1 1 low 54 36
2 1 low 54 36
3 2 low 12 45
4 2 low 12 28
5 1 mid 24 10
6 1 mid 82 10
7 2 mid NA 17
8 2 mid 12 17
9 2 up 20 10
10 2 up 40 10
11 2 up 20 5
因为原始数据包含缺失的观测值,我们要么必须对它们进行乘法插补,要么选择 lisewise 删除它们。现在,让我们执行后一个选项。
# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
最后,我们对每个打乱后的数据集执行线性回归,并将 B
系数存储在 out
.
row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)
## out <- row_bind(
## lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
## family = binomial())$coef)
## )
# -------------------------------------------------------------------
> dim(out)
[1] 1000 2
输出
> head(out)
X.Intercept. low.density
1 13.74881 0.2804738
2 20.01074 -0.2095672
3 30.26643 -0.2946373
4 29.19541 -0.2752761
5 37.76273 -0.4555651
6 37.72250 -0.1548349
可以找到创建此图像所需的代码