R - Expand Grid Without Duplicates
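I need a function similar to expand.grid, but without the combinations that contain duplicate elements.
Here is a simplified version of my problem.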
X1 = c("x","y","z")
X2 = c("A","B","C")
X3 = c("y","C","G")
d <- expand.grid(X1,X2,X3)
d
Var1 Var2 Var3
1 x A y
2 y A y
3 z A y
4 x B y
. . . .
. . . .
. . . .
23 y B G
24 z B G
25 x C G
26 y C G
27 z C G
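d has 27 rows, but 6 of them contain duplicate values that I don't need: rows 2, 5, 8, 16, 17 and 18.
Is there any way to get the other 21 rows, which do not contain any duplicates?
Note that in the real case the vectors have more than 3 elements (c("x","y","z","k","m", ...), up to 50) and there are more than 3 vectors (X4, X5, X6, ... up to 11). Because of this, the expanded object becomes very large and RAM cannot handle it.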
In RcppAlgos*, there is a function called comboGrid that does the trick:
library(RcppAlgos) ## as of v2.4.3
comboGrid(X1, X2, X3, repetition = FALSE)
# Var1 Var2 Var3
# [1,] "x" "A" "C"
# [2,] "x" "A" "G"
# [3,] "x" "A" "y"
# [4,] "x" "B" "C"
# [5,] "x" "B" "G"
# [6,] "x" "B" "y"
# [7,] "x" "C" "G"
# [8,] "x" "C" "y"
# [9,] "y" "A" "C"
# [10,] "y" "A" "G"
# [11,] "y" "B" "C"
# [12,] "y" "B" "G"
# [13,] "y" "C" "G"
# [14,] "z" "A" "C"
# [15,] "z" "A" "G"
# [16,] "z" "A" "y"
# [17,] "z" "B" "C"
# [18,] "z" "B" "G"
# [19,] "z" "B" "y"
# [20,] "z" "C" "G"
# [21,] "z" "C" "y"
Large test
set.seed(42)
rnd_lst <- lapply(1:11, function(x) {
    sort(sample(LETTERS, sample(26, 1)))
})
## Number of results that expand.grid would return if your machine
## had enough memory... over 300 billion!!!
prettyNum(prod(lengths(rnd_lst)), big.mark = ",")
# [1] "365,634,846,720"
exp_grd_test <- expand.grid(rnd_lst)
# Error: vector memory exhausted (limit reached?)
system.time(cmb_grd_test <- comboGrid(rnd_lst, repetition=FALSE))
# user system elapsed
# 9.866 0.330 10.196
dim(cmb_grd_test)
# [1] 3036012 11
head(cmb_grd_test)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
# [1,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "K"
# [2,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "L"
# [3,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "M"
# [4,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "N"
# [5,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "O"
# [6,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "P"
* I am the author of RcppAlgos
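To see ahead of time why expand.grid runs out of memory in the large test, the size of the full grid can be estimated from the vector lengths alone. The sketch below is only an approximation: grid_size is a hypothetical helper (it is not part of base R or RcppAlgos), and the default of 4 bytes per cell assumes that expand.grid stores character inputs as factors (integer codes) and ignores all other overhead.

```r
# Hypothetical helper: estimate the size of the full expand.grid() result
# without building it. bytes_per_cell = 4 is an assumption (integer codes
# of a factor column); other column types would need a different value.
grid_size <- function(lst, bytes_per_cell = 4) {
  n_rows <- prod(lengths(lst))  # number of rows in the full grid
  c(rows = n_rows,
    gib  = n_rows * length(lst) * bytes_per_cell / 2^30)
}

# Small example from the question: 3 * 3 * 3 = 27 rows
grid_size(list(c("x", "y", "z"), c("A", "B", "C"), c("y", "C", "G")))
```

Under the same assumption, the 365,634,846,720 rows and 11 columns of the large test would need roughly 16 TB, which is consistent with the memory error shown above.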
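(Sorry, I just realized that your problem is also one of size, so removing them after generation may not be feasible. For that reason this may not be the best answer, but I'll keep it for smaller related problems.)
Base R
I hard-coded the "3", but you can use ncol(d) and/or ncol(d)-1 to make it programmatic.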
d[lengths(apply(d, 1, unique)) > 2, ]
# Var1 Var2 Var3
# 1 x A y
# 3 z A y
# 4 x B y
# 6 z B y
# 7 x C y
# 9 z C y
# 10 x A C
# 11 y A C
# 12 z A C
# 13 x B C
# 14 y B C
# 15 z B C
# 19 x A G
# 20 y A G
# 21 z A G
# 22 x B G
# 23 y B G
# 24 z B G
# 25 x C G
# 26 y C G
# 27 z C G
(The row names are not reset, so you can see the gaps and verify that it is not all 27 rows.)
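One possible way to avoid hard-coding the column count at all is to test each row with anyDuplicated instead of comparing the number of unique values. This is a sketch of an alternative, not the filter used above; it also sidesteps the case where apply(d, 1, unique) returns a matrix rather than a list because every row happens to have the same number of unique values.

```r
X1 <- c("x", "y", "z")
X2 <- c("A", "B", "C")
X3 <- c("y", "C", "G")
d <- expand.grid(X1, X2, X3, stringsAsFactors = FALSE)

# anyDuplicated() returns 0 for a row with no repeated value, so apply()
# yields exactly one integer per row for any number of columns.
keep <- apply(d, 1, anyDuplicated) == 0
res <- d[keep, ]
nrow(res)  # 21
```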
For verification, here are the rows with dupes:
d[lengths(apply(d, 1, unique)) < 3, ]
# Var1 Var2 Var3
# 2 y A y
# 5 y B y
# 8 y C y
# 16 x C C
# 17 y C C
# 18 z C C
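Because filtering after the fact still requires the full grid, a base R variant for somewhat larger inputs could grow the result one vector at a time and drop rows with a repeated value after every step. The following is a minimal sketch under stated assumptions: expand_grid_distinct is a hypothetical helper, all inputs are assumed to be character vectors, and edge cases such as an empty intermediate result are not handled. It keeps every ordered row with distinct values, like the filter above; as far as I understand its documentation, comboGrid additionally treats rows that differ only in order as the same combination, so its row count can be smaller.

```r
# Hypothetical helper: build the grid column by column, removing rows with
# a repeated value at each step so the full product is never stored.
expand_grid_distinct <- function(lst) {
  out <- matrix(lst[[1]], ncol = 1)
  for (v in lst[-1]) {
    n <- nrow(out)
    # pair every surviving row with every value of the next vector
    cand <- cbind(out[rep(seq_len(n), times = length(v)), , drop = FALSE],
                  rep(v, each = n))
    # keep rows whose new value differs from all earlier columns
    last <- ncol(cand)
    ok <- rowSums(cand[, -last, drop = FALSE] == cand[, last]) == 0
    out <- cand[ok, , drop = FALSE]
  }
  out
}

res <- expand_grid_distinct(list(c("x", "y", "z"),
                                 c("A", "B", "C"),
                                 c("y", "C", "G")))
dim(res)  # 21 3
```

The intermediate matrices can still become large when few values are shared between vectors, so this only helps when the distinctness constraint actually prunes many rows.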