在一个样本中随机和非随机抽样

Question

有没有办法在单个样本中抽样 X 个随机行和 X 个非随机行？例如，我想获得 4 行 iris 的 1,000 个样本。我想随机抽取 3 行 iris，第四行在每个样本中都是相同的（这是为了模仿混合抽样设计）。

我可以对 3 个随机行 1000x 和固定行 1000x 进行采样，然后将两个数据帧合并在一起，但出于某些原因，这不是理想情况。执行此操作的代码如下所示：

df<- iris

fixed_sample<- iris[7,]

random<- list()
fixed<- list()

counter<- 0
for (i in 1:1000) {
  # sample 4 randomly selected transects 100 time
  tempsample_random<- df[sample(1:nrow(df), 3, replace=F),]
  tempsample_fixed<- fixed_sample[sample(1:nrow(fixed_sample), 1, replace=F), ]
  
  random[[i]]=tempsample_random
  fixed[[i]]=tempsample_fixed
  
  
  counter<- counter+1
  print(counter)
}


random_results<- do.call(rbind, random)
fixed_results<- do.call(rbind, fixed)

从这里开始，我将创建一个新列作为分组变量，然后根据该组将它们合并在一起。所以最终数据帧的每四行在每个样本中有 3 个随机行和行号 7 (fixed_sample)。

我研究过使用 splitstackshape::stratified，但还没有按照我需要的方式使用它。我将在多个级别的抽样工作（样本 2、3、4、5 行等，每行 1,000 倍）中执行此操作，因此最好能够从开始。

如有任何帮助，我们将不胜感激。

Answer 1

我认为您可以使用 lapply 在一行中完成此操作。在这种情况下，我们将抽取 3 个样本，但您可以将 seq(3) 更改为 seq(1000) 以获得 1000 个样本。我按照你的例子选择了第7行作为固定行。

lapply(seq(3), function(i) iris[c(sample(seq(nrow(iris))[-7], 3), 7),])
#> [[1]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 67           5.6         3.0          4.5         1.5 versicolor
#> 105          6.5         3.0          5.8         2.2  virginica
#> 111          6.5         3.2          5.1         2.0  virginica
#> 7            4.6         3.4          1.4         0.3     setosa
#> 
#> [[2]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 147          6.3         2.5          5.0         1.9 virginica
#> 131          7.4         2.8          6.1         1.9 virginica
#> 126          7.2         3.2          6.0         1.8 virginica
#> 7            4.6         3.4          1.4         0.3    setosa
#> 
#> [[3]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 143          5.8         2.7          5.1         1.9  virginica
#> 145          6.7         3.3          5.7         2.5  virginica
#> 60           5.2         2.7          3.9         1.4 versicolor
#> 7            4.6         3.4          1.4         0.3     setosa

^{由 reprex package (v2.0.1)}

于 2022-05-18 创建

Answer 2

这是一个方法：

fixed_row <- 7
set.seed(42)
random <- replicate(1000, df[c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3)),], simplify = FALSE)
random[1:3]
# [[1]]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 7           4.6         3.4          1.4         0.3     setosa
# 50          5.0         3.3          1.4         0.2     setosa
# 66          6.7         3.1          4.4         1.4 versicolor
# 75          6.4         2.9          4.3         1.3 versicolor
# [[2]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 123          7.7         2.8          6.7         2.0 virginica
# 50           5.0         3.3          1.4         0.2    setosa
# [[3]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 129          6.4         2.8          5.6         2.1 virginica
# 48           4.6         3.2          1.4         0.2    setosa
# 25           4.8         3.4          1.9         0.2    setosa

目的是我们对所有行进行采样除了您打算包含在所有样本中的固定行，然后将其添加到行索引列表中。使用 setdiff(.., fixed_row) 的前提允许您在此处使用任意集合，因此 fixed_row 具有 零个或多个 具有所需最终结果的行索引是可行的.

set.seed(42)
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]  7 50 66 75
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 147 123  50
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 129  48  25

（请注意，使用 set.seed 只是为了在 Whosebug 上实现可重复性，您可能不应该在生产中使用它。）

Answer 3

df <- iris

fixed_row = 2
resample_count = 1000

keep_rows <- unlist(
  Map(1:resample_count,
      f = function(x) c(fixed_row, sample(1:nrow(df),3))
      )
)

resamples <- iris[keep_rows,]

在一个样本中随机和非随机抽样

Sampling randomly and non-randomly in one sample

for-loop

r

sampling