直接生成 powerset 的随机子集

Question

如果我们能够先计算出幂集的所有元素，然后从中随机抽取样本，就很容易生成幂集的随机子集:

set.seed(12)
x = 1:4
n.samples = 3

library(HapEstXXR)
power.set = HapEstXXR::powerset(x)
sample(power.set, size = n.samples, replace = FALSE)
# [[1]]
# [1] 2
# 
# [[2]]
# [1] 3 4
# 
# [[3]]
# [1] 1 3 4

但是，如果x的长度很大，幂集的元素就会太多。因此，我正在寻找一种直接计算随机子集的方法。一种可能性是先绘制 "random length" 然后使用 "random length":

绘制 x 的随机子集

len = sample(1:length(x), size = n.samples, replace = TRUE)
len
# [1] 2 1 1

lapply(len, function(l) sort(sample(x, size = l)))
# [[1]]
# [1] 1 2
# 
# [[2]]
# [1] 1
# 
# [[3]]
# [1] 1

但是，这会生成重复项。当然，我现在可以删除重复项并使用 while 循环重复之前的采样，直到我最终得到幂集的 n.samples 个非重复随机子集：

drawSubsetOfPowerset = function(x, n) {
  ret = list()
  while(length(ret) < n) {
    # draw a "random length" with some meaningful prob to reduce number of loops
    len = sample(0:n, size = n, replace = TRUE, prob = choose(n, 0:n)/2^n)
    # draw random subset of x using the "random length" and sort it to better identify duplicates
    random.subset = lapply(len, function(l) sort(sample(x, size = l)))
    # remove duplicates
    ret = unique(c(ret, random.subset))
  }
  return(ret)
}

drawSubsetOfPowerset(x, n.samples)

当然，我现在可以尝试优化 drawSubsetOfPowerset 函数的几个组件，例如(1) 尽量避免在循环的每次迭代中复制对象 ret，(2) 使用更快的排序，(3) 使用更快的方法删除列表的重复项，...

我的问题是：是否有不同的方法（效率更高）？

Answer 1

使用二进制表示怎么样？这样我们就可以从2^length(v)给定的幂集总数的长度生成一个随机的整数子集。从那里我们可以利用 intToBits 和索引来保证我们以有序的方式生成幂集的随机唯一子集。

randomSubsetOfPowSet <- function(v, n, mySeed) {
    set.seed(mySeed)
    lapply(sample(2^length(v), n) - 1, function(x) v[intToBits(x) > 0])
}

采用 x = 1:4、n.samples = 5 和随机种子 42，我们有：

randomSubsetOfPowSet(1:4, 5, 42)
[[1]]
[1] 2 3 4

[[2]]
[1] 1 2 3 4

[[3]]
[1] 3

[[4]]
[1] 2 4

[[5]]
[1] 1 2 3

说明

二进制表示与幂集有什么关系？

事实证明，给定一个集合，我们可以通过转向位（是的，0 和 1）找到所有子集。通过将子集中的元素视为原始集合中的 on 个元素，将不在该子集中的元素视为 off，我们现在有了一种非常切实的方式来思考如何生成每个子集。观察：

       Original set: {a,  b,  c,  d}
                      |   |   |   |
                      V   V   V   V        b & d 
Existence in subset: 1/0 1/0 1/0 1/0       are on
                                            / \
                                           /   \
                                          |     |
                                          V     V
Example subset: {b, d} gets mapped to {0, 1, 0, 1}
                                       |  \   \  \_______
                                       |   |   \__       \
                                       |   |___   \____   \____
                                       |       |       |       |
                                       V       V       V       V
Thus, {b, d} is mapped to the integer  0*2^0 + 1*2^1 + 0*2^2 + 1*2^3 = 10

现在这是一个长度为 n 的位组合问题。如果您将此映射到 A = {a, b, c, d} 的每个子集，您将获得 0:15。因此，要获得 A 的幂集的随机子集，我们只需生成 0:15 的随机子集并将每个整数映射到 A 的子集。我们该怎么做？

sample comes to mind.

现在，反过来也很容易（即从整数到我们原始集合的子集）

观察：

Given the integer 10 and set A given above (i.e. {a, b, c, d}) we have:

10 in bits is -->> {0, 1, 0, 1}

Which indices are greater than 0?

Answer: 2 and 4

Taking the 2nd the 4th element of our set gives: {b, d} et Voila!

直接生成 powerset 的随机子集

Generate a random subset of the powerset directly

combinations

r

set

powerset

说明