Distribute `n` among `k` units without repetition and zero structures in R

I would like to know whether there is a way in R to distribute n among k units without repetition (e.g. 3 5 2 is the same as 5 3 2, and 2 3 5 the same as 5 2 3), without considering combinations containing 0 (i.e. no 9 1 0), and to see the make-up of this distribution.

For example, if n = 9 and k = 3, we expect the make-ups to be:

(Note: k is always the number of columns.)

3 3 3
4 3 2
4 1 4
5 2 2
5 1 3
6 2 1
7 1 1

makeup <- function(n, k){
  # your suggested solution #
}

Using a matrix in base R:

myfun1 <- function(n, k){
  # all k-tuples of values 1..n
  x <- as.matrix(expand.grid(rep(list(seq_len(n)), k)))
  # keep only tuples that sum to n
  x <- x[rowSums(x) == n, ]
  # drop tuples that are permutations of one another
  x[!duplicated(t(apply(x, 1, sort))), ]
}
myfun1( n = 9, k = 3 )

Or possibly this one, using data.table:

myfun2 <- function(n, k){
  require('data.table')
  # cross join of k copies of 1..n
  dt <- do.call(CJ, rep(list(seq_len(n)), k))
  # keep only rows that sum to n
  dt <- dt[rowSums(dt) == n, ]
  # drop rows that are permutations of one another (compare row-wise sorted values)
  dt[which(!duplicated(dt[, transpose(lapply(transpose(.SD), sort))])), ]
}

myfun2( n = 9, k = 3 )
#    V1 V2 V3
# 1:  7  1  1
# 2:  6  2  1
# 3:  5  3  1
# 4:  4  4  1
# 5:  5  2  2
# 6:  4  3  2
# 7:  3  3  3

Here is a base-R solution using expand.grid. I wouldn't recommend it for large n, but it works:

makeup <- function(n, k) {
  x <- expand.grid(rep(list(1:n), k))        # generate all k-tuples of 1..n
  x <- x[rowSums(x) == n, ]                  # filter out rows that don't sum to n
  x <- as.data.frame(t(apply(x, 1, sort)))   # sort each row
  unique(x)                                  # keep non-duplicates
}
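
To get a rough sense of why this doesn't scale (simple arithmetic on my part, not a benchmark): the grid built by expand.grid has n^k rows before any filtering.

9^3    # 729 rows for n = 9, k = 3
20^5   # already 3,200,000 rows for n = 20, k = 5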

A little thought simplifies this considerably. If we have a vector of n objects, there are n-1 positions where it can be split, and choosing k-1 of those split points gives a composition of n into k positive parts. From there we can dramatically reduce the amount of work:

makeup <- function(n, k) {
  splits <- combn(n - 1, k - 1) # locations where to split up the data

  bins <- rbind(rep(0, ncol(splits)), splits)       # add an extra "split" before the 1st element
  x <- apply(bins, 2, function(x) c(x[-1], n) - x)  # count how many items fall in each bin

  x <- as.data.frame(t(apply(x, 2, sort))) # order everything
  unique(x)                                # keep non-duplicates
}
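
A quick sanity check of this version (a sketch I'm adding here; the row order may differ from the listing in the question):

res <- makeup(9, 3)
nrow(res)               # 7 distinct make-ups, matching the listing above
all(rowSums(res) == 9)  # TRUE: every row sums to n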

You can try gtools::combinations with the repeats.allowed = TRUE option to do this job, like so:

m <- gtools::combinations(9, 3, repeats.allowed = TRUE)
m[rowSums(m) == 9,]

A possible function could be the following. With options(expressions = 500000) it scales up to around n = 500 (it successfully ran n = 500, r = 3 on my machine):

mycomb <- function(n, r, sumval){
    m <- gtools::combinations(n, r, repeats.allowed = TRUE)
    m[rowSums(m) == sumval, ]
}
mycomb(9,3,9)

Output:

#     [,1] [,2] [,3]
#[1,]    1    1    7
#[2,]    1    2    6
#[3,]    1    3    5
#[4,]    1    4    4
#[5,]    2    2    5
#[6,]    2    3    4
#[7,]    3    3    3
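
The options(expressions = 500000) setting mentioned above raises R's limit on nested expressions, presumably because gtools::combinations builds the repeats.allowed = TRUE result recursively. A minimal sketch of the large run reported above (it may take a while and a fair amount of memory):

options(expressions = 500000)   # raise the nested-expression limit first
# mycomb(500, 3, 500)           # the n = 500, r = 3 run reported above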

These are known as integer partitions (more specifically, restricted integer partitions) and can be generated efficiently with the partitions and arrangements packages, like so:

partitions::restrictedparts(9, 3, include.zero = FALSE)

[1,] 7 6 5 4 5 4 3
[2,] 1 2 3 4 2 3 3
[3,] 1 1 1 1 2 2 3

arrangements::partitions(9, 3)
     [,1] [,2] [,3]
[1,]    1    1    7
[2,]    1    2    6
[3,]    1    3    5
[4,]    1    4    4
[5,]    2    2    5
[6,]    2    3    4
[7,]    3    3    3
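
Note that both packages return the same seven partitions, just oriented differently: partitions puts one partition per column, arrangements one per row. If you prefer the partitions output row-wise like the other solutions, a simple transpose does it (a small convenience sketch, assuming the call above):

t(as.matrix(partitions::restrictedparts(9, 3, include.zero = FALSE)))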

They are much faster than the solutions offered so far:

library(microbenchmark)
microbenchmark(arrangePack = arrangements::partitions(20, 5),
               partsPack = partitions::restrictedparts(20, 5, include.zero = FALSE),
               myfun2(20, 5),
               myfun1(20, 5),
               makeup(20, 5),
               mycomb(20, 5, 20), times = 3, unit = "relative")
Unit: relative
              expr          min           lq        mean       median          uq         max neval
       arrangePack     1.000000     1.000000    1.000000     1.000000    1.000000    1.000000     3
         partsPack     3.070203     2.755573    2.084231     2.553477    1.854912    1.458389     3
     myfun2(20, 5) 10005.679667  8528.784033 6636.284386  7580.133387 5852.625112 4872.050067     3
     myfun1(20, 5) 12770.400243 10574.957696 8005.844282  9164.764625 6897.696334 5610.854109     3
     makeup(20, 5) 15422.745155 12560.083171 9248.916738 10721.316721 7812.997976 6162.166646     3
 mycomb(20, 5, 20)  1854.125325  1507.150003 1120.616461  1284.278219  950.015812  760.280469     3

In fact, for the example below the other functions will error because of memory problems:

system.time(arrangements::partitions(100, 10))
 user  system elapsed 
0.068   0.031   0.099 

arrangements::npartitions(100, 10)
[1] 2977866
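
A rough back-of-the-envelope illustration of why (my own arithmetic, not taken from the answer above): the filtering approaches would have to materialise every size-10 multiset drawn from 1..100 before discarding almost all of them, whereas the partition generators only ever produce the ~3 million valid rows counted above.

# multisets gtools::combinations(100, 10, repeats.allowed = TRUE) would have to enumerate
choose(100 + 10 - 1, 10)   # ~4.26e13, versus npartitions(100, 10) = 2977866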