Bootstrapping with replacement by group, but creating a new identifier for resampled units

Question

I am trying to bootstrap with replacement by group from a data.table in R.

Here is an example data.table:

library(data.table)
dat = data.table(n = c(1,1,1,2,2,2,2,3,4,4,4,4,4), y = round(rnorm(13, 0, 1), 1))


   n    y
 1: 1 -0.8
 2: 1  0.5
 3: 1 -0.1
 4: 2  0.2
 5: 2 -0.1
 6: 2 -2.7
 7: 2  0.1
 8: 3  0.3
 9: 4 -0.7
10: 4 -0.2
11: 4  1.2
12: 4  1.2
13: 4 -0.1

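A bootstrapped sample randomly draws 4 groups of 'n' with replacement, so the result might look like this (in this realization, groups 1 and 4 were each drawn once and group 3 was drawn twice):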

   n    y
 1: 4 -0.7
 2: 4 -0.2
 3: 4  1.2
 4: 4  1.2
 5: 4 -0.1
 6: 3  0.3
 7: 3  0.3
 8: 1 -0.8
 9: 1  0.5
10: 1 -0.1

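My problem is that if I now group by 'n', rows 6 and 7 are treated as the same group, when in fact they are separate resampled copies. I want to treat them as distinct, for example by adding a third column that says "this is the SECOND group pulled from 3" (e.g. 3.1 and 3.2), or anything else that accomplishes the same thing.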

Answer 1

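You can do this with a join (and most likely in other ways too).

First we generate a bootstrap sample. It contains two variables: the new group ID bid and the sampled group n.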

set.seed(84)
bootsample = data.table(n=sample(1:4, 4, replace=TRUE), bid=1:4)
bootsample

   n bid
1: 4   1
2: 2   2
3: 4   3
4: 4   4

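Then we need to merge this back onto the original data.table. Because the sampled groups can repeat, we should pass the allow.cartesian=TRUE argument. You can then group by the bid variable in any subsequent analysis.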

merge(bootsample, dat, allow.cartesian=TRUE)

    n bid    y
 1: 2   2  1.1
 2: 2   2  2.2
 3: 2   2 -0.8
 4: 2   2 -1.4
 5: 4   1 -1.3
 6: 4   1 -0.4
 7: 4   1 -1.0
 8: 4   1  0.9
 9: 4   1 -0.3
10: 4   3 -1.3
11: 4   3 -0.4
12: 4   3 -1.0
13: 4   3  0.9
14: 4   3 -0.3
15: 4   4 -1.3
16: 4   4 -0.4
17: 4   4 -1.0
18: 4   4  0.9
19: 4   4 -0.3
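To illustrate grouping by bid, here is a minimal sketch. It assumes the y values printed in the question and hard-codes the draw from the question (group 4, group 3 twice, group 1) instead of calling sample(), so the result does not depend on the random seed. The names boot, by_bid and by_n are only for illustration.

```r
library(data.table)

# y values copied from the table printed in the question
dat <- data.table(n = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
                  y = c(-0.8, 0.5, -0.1, 0.2, -0.1, -2.7, 0.1, 0.3,
                        -0.7, -0.2, 1.2, 1.2, -0.1))

# Fixed draw for reproducibility (assumption): 4, 3, 3, 1
bootsample <- data.table(n = c(4, 3, 3, 1), bid = 1:4)

boot <- merge(bootsample, dat, by = "n", allow.cartesian = TRUE)

# One summary row per resampled unit: the two copies of group 3 stay separate
by_bid <- boot[, .(n = n[1], rows = .N, mean_y = mean(y)), by = bid]

# Grouping by n instead collapses the two copies of group 3 into one group
by_n <- boot[, .(rows = .N), by = n]
```

Here by_bid has four rows (one per draw), while by_n has only three, which is exactly the problem described in the question.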

There may be more compact solutions. Note that, depending on how you use the bootstrapped data, resampled groups of different sizes may cause you various problems.
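One possibly more compact variant, sketched here under the same assumptions (the y values from the question and a hard-coded draw), uses data.table's X[Y, on=] join and also builds the "3.1" / "3.2" style label the question asks for. The column names draw and unit are made up for this example.

```r
library(data.table)

# y values copied from the table printed in the question
dat <- data.table(n = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
                  y = c(-0.8, 0.5, -0.1, 0.2, -0.1, -2.7, 0.1, 0.3,
                        -0.7, -0.2, 1.2, 1.2, -0.1))

# Fixed draw for reproducibility (assumption): 4, 3, 3, 1
bootsample <- data.table(n = c(4, 3, 3, 1))

# Count how many times each group has been drawn so far (1st, 2nd, ...)
bootsample[, draw := seq_len(.N), by = n]

# Label each resampled unit as "<group>.<draw>", e.g. "3.1" and "3.2"
bootsample[, unit := paste(n, draw, sep = ".")]

# Join the rows of dat onto every drawn group
boot <- dat[bootsample, on = "n", allow.cartesian = TRUE]

# Rows per resampled unit; sizes differ because the original groups differ
sizes <- boot[, .(rows = .N), by = unit]
```

The sizes table also makes the caveat above concrete: units 3.1 and 3.2 have one row each while unit 4.1 has five, so the total number of rows changes from one bootstrap replicate to the next.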


Tags: r, data.table, statistics-bootstrap