Improve efficiency of filling dataset with sampling

Question

Given myletters:

library(tidyverse)
myletters <- letters
myletters
#  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

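I want to sample 4 letters at a time from myletters (without replacement), repeat this X times, and work out the probability of having sampled every letter at least once in X draws.

For example, if X = 10 we could get: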

set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
#  [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"

#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE
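Note that purrr's rerun() is deprecated as of purrr 1.0.0. As a minimal sketch, the same draw-and-check step can be written with base R's replicate(); the names draws and covers_all are illustrative and not part of the original code, and the sampled letters are not claimed to match the output shown above.

```r
myletters <- letters

set.seed(10)
# replicate() returns a 4 x 10 character matrix: one column per draw of 4 letters
draws <- replicate(10, sample(myletters, 4, replace = FALSE))

# Flatten to a plain character vector of 40 letters, like X above
X <- as.vector(draws)

# TRUE only if every one of the 26 letters appears at least once
covers_all <- all(myletters %in% X)
covers_all
```

Keeping the draws as a matrix (one column per draw) avoids the list-then-unlist step and needs only base R.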

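The following approach does what I want in a simulation, but it does not scale well: it fills a single cell of a data frame column with up to 400 letters, so it is clunky and inefficient: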

output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output

#plot
output %>%
  group_by(drawsX) %>%
  summarise(prob_of_all_letters = mean(all_letters)) %>% 
  ggplot(., aes(drawsX, prob_of_all_letters)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability")

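Ideally I would like to simulate many more times, e.g. trial = 1:100000, but the approach above is too inefficient for that.

1) Is there a more efficient way to populate my dataset with the samples (or to use a matrix instead)?

2) Also, is there an analytical way to solve this in R rather than by simulation? E.g. what is the probability of getting all 26 letters from 10 draws of 4 samples each?

Thanks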

Answer 1

Here is a slightly improved version. The code is more efficient and certainly cleaner:

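# TRUE if `replicates` draws of `k` items (without replacement within a draw) cover every element of `set`
sample_sets = function(replicates, k, set = letters) {
  # k x replicates integer matrix of sampled indices, one column per draw
  draws = vapply(seq_len(replicates), function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
  all(seq_along(set) %in% draws)
}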

## example use
output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )

## timing
system.time({output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )
})
# user  system elapsed 
# 2.79    0.04    2.95 


## original way
system.time({output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user  system elapsed 
# 4.96    0.06    5.18

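So it takes roughly 40% less elapsed time on this data (2.95 s versus 5.18 s); hopefully the improvement will keep growing as the number of draws increases.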
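For the second part of the question (an analytical solution), one possible sketch uses inclusion-exclusion. This is not from the answer above; the function name prob_all_letters and its arguments are hypothetical. It assumes each draw of k letters is uniform over all k-subsets and that draws are independent. The probability that a single draw misses a fixed set of j letters is choose(n - j, k) / choose(n, k), so the probability that all n letters are covered after n_draws draws is the alternating sum over j of choose(n, j) times that value raised to the power n_draws.

```r
# Probability that n_draws independent draws of k letters each (without
# replacement within a draw) together cover all n_letters letters.
# Inclusion-exclusion over the number j of letters that are never drawn.
prob_all_letters <- function(n_draws, n_letters = 26, k = 4) {
  j <- 0:n_letters
  # P(one draw avoids a fixed set of j letters); choose() returns 0 when
  # fewer than k letters remain, so those terms vanish
  p_miss <- choose(n_letters - j, k) / choose(n_letters, k)
  sum((-1)^j * choose(n_letters, j) * p_miss^n_draws)
}

# Probability of having seen all 26 letters after 10 draws of 4
prob_all_letters(10)

# Whole curve for 1 to 100 draws, comparable to the simulated plot
probs <- vapply(1:100, prob_all_letters, numeric(1))
```

Because the sum alternates in sign, results for very small numbers of draws may differ from exactly 0 by floating-point rounding error; comparing probs against the simulated prob_of_all_letters is a reasonable sanity check.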
