Conditional Random Sample in R

Question

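I'd like to know the best way to approach this problem. Essentially, I want to generate 20 samples, each consisting of four numbers that add up to 100 and that also satisfy x1 + x2 > 20. I'm trying to get something fast and efficient. I realize I could filter out the rows that don't meet this condition, but that is inefficient if I generate 10,000 samples instead of 20.

The code is as follows: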

n = 20
x1 = sample(0:100, n, replace = TRUE)
x2 = sample(0:100, n, replace = TRUE)
x3 = sample(0:100, n, replace = TRUE)
# rows that break a constraint: sum above 100, or x1 + x2 not above 20
bad = (x1 + x2 + x3) > 100 | (x1 + x2) <= 20
while (any(bad)) {
  k = sum(bad)
  # redraw only the offending rows
  x1[bad] = sample(0:100, k, replace = TRUE)
  x2[bad] = sample(0:100, k, replace = TRUE)
  x3[bad] = sample(0:100, k, replace = TRUE)
  bad = (x1 + x2 + x3) > 100 | (x1 + x2) <= 20
}
x4 = rep(100, n) - x1 - x2 - x3

df <- data.frame(x1,x2,x3,x4)

Thanks in advance.

Answer 1

To generate one such vector, you can do this:

# generate x1+x2, uniform on 21..100 (sample.int(80, 1) gives 1..80)
x1_plus_x2 <- sample.int(80,1) + 20
# generate x1 and x2 
x1x2 <- rmultinom(1, x1_plus_x2, c(1,1))
# generate x3 and x4
x3x4 <- rmultinom(1, 100-x1_plus_x2, c(1,1))
# generated x1,x2,x3,x4
x <- c(x1x2, x3x4)

You can loop to generate n samples. You can speed things up by generating all n values of x1+x2 up front:

n <- 20
# matrix to store the simulations
x <- matrix(NA_integer_, nrow=n, ncol=4)
# generate all the x1+x2's, each uniform on 21..100
x1_plus_x2 <- sample.int(80, n, replace=TRUE) + 20
# loop 
for(j in 1:n){
  # generate x1 and x2 
  x1x2 <- rmultinom(1, x1_plus_x2[j], c(1,1))
  # generate x3 and x4
  x3x4 <- rmultinom(1, 100-x1_plus_x2[j], c(1,1))
  #
  x[j,] <- c(x1x2,x3x4)
}

> rowSums(x)
 [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
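
The loop itself can be avoided. With two equally weighted categories, the first count returned by rmultinom is a Binomial(size, 0.5) draw and the second count is the remainder, and rbinom is vectorized over size. A minimal sketch under that observation follows; the name sample_binom_split and its arguments are made up for illustration, and x1 + x2 is drawn from 21..100 so that x1 + x2 = 100 is also reachable.

```r
# Hypothetical helper: vectorized version of the loop, no rmultinom needed.
sample_binom_split <- function(n, total = 100L, threshold = 20L) {
  # x1 + x2 for every sample, uniform on (threshold + 1)..total
  s <- sample.int(total - threshold, n, replace = TRUE) + threshold
  # split s into x1 and x2, and total - s into x3 and x4, with fair coin flips
  x1 <- rbinom(n, s, 0.5)
  x3 <- rbinom(n, total - s, 0.5)
  cbind(x1 = x1, x2 = s - x1, x3 = x3, x4 = total - s - x3)
}

set.seed(1)
x_vec <- sample_binom_split(10000)
```

Like the loop, this does not sample uniformly over all valid combinations: the binomial splits concentrate x1 and x2 around half of their sum.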

Answer 2

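Another possibility: pick three break points in the sequence 0:100, then take x1, x2, x3 and x4 as the gaps between those breaks. If x1 + x2 is at most 20, then x3 + x4 is at least 80, so we can swap the two pairs.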

generate_four_numbers <- function(from = 0, to = 100) {
    # three sorted break points cut the range into four gaps
    breaks <- sort(sample(seq(from, to), 3, replace = TRUE))
    x1 <- breaks[1]
    x2 <- breaks[2] - breaks[1]
    x3 <- breaks[3] - breaks[2]
    x4 <- 100 - breaks[3]

    # x1 + x2 <= 20 implies x3 + x4 >= 80, so mirror the sample
    if (x1 + x2 <= 20) {
        return(data.frame(x1 = x4, x2 = x3, x3 = x2, x4 = x1))
    }

    data.frame(x1, x2, x3, x4)
}

res <- do.call(rbind, lapply(1:10000, function(x) generate_four_numbers()))

table(rowSums(res)) # all at 100

length(which(res$x1 + res$x2 > 20)) / nrow(res) # 100 % acceptable
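
Building 10,000 one-row data frames and binding them is slow. The same break-point idea can be applied to a whole matrix at once. This is only a sketch: generate_many and its arguments are hypothetical names, and the mirroring step assumes the threshold is below half of the total, as it is for 20 and 100.

```r
# Hypothetical helper: break-point method applied to n samples at once.
generate_many <- function(n, total = 100, threshold = 20) {
  # three break points per row, sorted within each row
  raw <- matrix(sample(0:total, 3 * n, replace = TRUE), ncol = 3)
  b <- t(apply(raw, 1, sort))
  # the four gaps between 0, the break points, and total
  x <- cbind(b[, 1], b[, 2] - b[, 1], b[, 3] - b[, 2], total - b[, 3])
  # mirror the rows whose first two numbers are too small
  swap <- x[, 1] + x[, 2] <= threshold
  x[swap, ] <- x[swap, 4:1, drop = FALSE]
  colnames(x) <- paste0("x", 1:4)
  x
}

set.seed(1)
res_mat <- generate_many(10000)
```

The result is a numeric matrix; wrap it in as.data.frame() if a data frame is needed.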

Answer 3

Here is an unbiased way to pick k numbers in the range 0:n that sum to n. It is based on the stars and bars encoding:

#picks k random numbers in range 0:n which sum to n:

pick <- function(k,n){
  m <- n + k - 1 #number of stars and bars
  bars <- sort(sample(1:m,k-1)) #positions of the bars
  c(bars,m+1)-c(0,bars)-1
}

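This generates one sample, returned as a vector. As @Guillaume Devailly observed in their answer, most samples will satisfy the additional constraint on the sum of the first 2 numbers, so you can simply filter out those that don't.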
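
How much filtering costs can be checked by counting rather than simulating. For a fixed s = x1 + x2 there are s + 1 choices for x1 and 101 - s choices for x3, so the share of accepted samples follows from a short sum. The variable names below are only illustrative.

```r
# Count the ways to write 100 as four non-negative integers, grouped by s = x1 + x2.
s <- 0:100
ways <- (s + 1) * (101 - s)       # (s + 1) choices for x1, (101 - s) for x3
total <- sum(ways)                # equals choose(103, 3)
accept <- sum(ways[s > 20]) / total
accept                            # roughly 0.885
```

So, under uniform sampling over all such quadruples, a little under 9 in 10 candidates pass the x1 + x2 > 20 check.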

Note that if you want 4 numbers in the range 1:100 that sum to 100, you can just use 1 + pick(4,96).

To enforce the constraint on the first two numbers:

pick.sample <- function(){
  while(TRUE){
    x <- pick(4,100)
    if(sum(x[1:2]) >20) return(x)
  }
}

Then

df <- data.frame(t(replicate(10000,pick.sample())))

will create a data frame with 10,000 rows, in which each row is a sample that satisfies the constraint.
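
The per-sample while loop can also be replaced by filtering whole batches. The sketch below repeats pick so that it runs on its own; pick_batch and its oversample argument are hypothetical names, and the default of 1.3 is just an assumed margin for the candidates that get rejected.

```r
# Stars and bars picker, repeated here so the block is self-contained.
pick <- function(k, n) {
  m <- n + k - 1
  bars <- sort(sample(1:m, k - 1))
  c(bars, m + 1) - c(0, bars) - 1
}

# Hypothetical helper: draw candidates in batches and keep the valid ones.
pick_batch <- function(n_samples, oversample = 1.3) {
  out <- matrix(numeric(0), ncol = 4)
  while (nrow(out) < n_samples) {
    need <- n_samples - nrow(out)
    # draw somewhat more candidates than are still needed
    cand <- t(replicate(ceiling(need * oversample) + 1, pick(4, 100)))
    keep <- cand[, 1] + cand[, 2] > 20
    out <- rbind(out, cand[keep, , drop = FALSE])
  }
  # accepted rows are independent draws, so truncating to n_samples is safe
  out[seq_len(n_samples), , drop = FALSE]
}

set.seed(1)
batch <- pick_batch(10000)
```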

Answer 4

You can brute-force this fairly easily, as shown below

#####
# Brute force solution
set.seed(28550697)
n <- 100000L
time. <- proc.time() # to measure time difference
brute <- t(replicate(
  n, {
    repeat {
      xs <- sample.int(101L, 4, replace = TRUE) - 1L
      if(xs[1] + xs[2] > 20L && sum(xs) == 100L)
        break
    }
    xs
  }))
proc.time() - time. # time taken
#R   user  system elapsed 
#R 192.76    0.13  196.74 

# check result
stopifnot(
  all(rowSums(brute) == 100L),
  all(brute %in% 0:100),
  all(brute[, 1] + brute[, 2] > 20L))

# only the first two columns should be able to take values in 0:100
apply(brute, 2, range)
#R      [,1] [,2] [,3] [,4]
#R [1,]    0    0    0    0
#R [2,]   99   99   79   79

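Above I simulate 100,000 samples in a reasonable time (10 times more than you asked for). You can of course do better with a smarter method, but here it is obvious that the distribution is correct.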
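
One possible smarter method, sketched here under the assumption that the target is the same uniform distribution over all valid quadruples that the brute-force loop produces: draw s = x1 + x2 with probability proportional to the number of quadruples that have that s, then split s and 100 - s uniformly. The name sample_direct is made up for illustration.

```r
# Hypothetical helper: uniform sampling over valid quadruples without rejection.
sample_direct <- function(n) {
  s_values <- 21:100
  # number of valid (x1, x2, x3, x4) for each value of s = x1 + x2
  weights <- (s_values + 1) * (101 - s_values)
  s <- sample(s_values, n, replace = TRUE, prob = weights)
  x1 <- floor(runif(n) * (s + 1))      # uniform on 0..s
  x3 <- floor(runif(n) * (101 - s))    # uniform on 0..(100 - s)
  cbind(x1 = x1, x2 = s - x1, x3 = x3, x4 = 100 - s - x3)
}

set.seed(1)
direct <- sample_direct(100000)
```

Every step is vectorized, so no candidates are thrown away and the run time does not depend on how rare valid samples are.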
