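Fastest way to count the occurrences of each unique column in a matrix in R
I'm new to R (and Stack Overflow) and would really appreciate your help. I want to count the number of occurrences of each unique column in a matrix. I wrote the code below, but it is very slow: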
frequencyofequalcolumnsinmatrix = function(matrixM){
  # Returns a matrix columnswithfrequencyofmtxM that contains each distinct column,
  # with the frequency of each distinct column on the last row. Hence if the last
  # row is c(3,5,3,2), then matrixM has 3+5+3+2=13 columns; there are 4 distinct
  # columns; the first distinct column appears 3 times, the second 5 times, etc.
  n = nrow(matrixM)
  columnswithfrequencyofmtxM = c()
  while (ncol(matrixM) > 0){
    # columns whose difference from the first column is all zeros
    indexzero = which(apply(matrixM - matrixM[,1], 2, function(x) identical(as.vector(x), rep(0, n))))
    indexnotzero = setdiff(seq_len(ncol(matrixM)), indexzero)
    # vector of length n+1: entries 1 to n hold the distinct column,
    # entry n+1 holds the number of times that column appears
    frequencyofgivencolumn = c(matrixM[,1], length(indexzero))
    columnswithfrequencyofmtxM = cbind(columnswithfrequencyofmtxM, frequencyofgivencolumn, deparse.level=0)
    matrixM = matrixM[,indexnotzero]
    matrixM = as.matrix(matrixM)
  }
  return(columnswithfrequencyofmtxM)
}
If we apply it to the matrix 'testmtx', we get:
> testmtx = matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
> frequencyofequalcolumnsinmatrix(testmtx)
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 2 1 2
[3,] 4 1 1
[4,] 2 3 1
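where the last row contains the number of occurrences of the column above it.
Not satisfied with my code, I browsed Stack Overflow and found the following question: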
Fastest way to count occurrences of each unique element
It turns out that the fastest way to count the occurrences of each unique element of a vector is to use the data.table package. Here is the code:
f6 <- function(x){
data.table(x)[, .N, keyby = x]
}
When we run it, we get:
> vtr = c(1,2,3,1,1,2,4,2,4)
> f6(vtr)
x N
1: 1 3
2: 2 3
3: 3 1
4: 4 2
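I have tried to modify this code so that I can use it in my case. That requires being able to create vtr as a vector in which each element is itself a vector, but I have not managed to do this (most likely because, in R, c(c(1,2),c(3,4)) is the same as c(1,2,3,4)).
Should I try to modify the function f6? If so, how?
Or should I take a completely different approach? If so, which one?
Thank you!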
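A simple approach is to paste the entries of each column together into one string, giving a character vector with one element per column, and then use that function.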
mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
vec <- apply(mat, 2, paste, collapse=" ")
f6(vec)
x N
1: 011 3
2: 121 1
3: 124 2
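One detail worth checking with the paste approach is the separator. The sketch below uses a small made-up matrix (not from the thread) to show that, once entries can have more than one digit, pasting with an empty separator may merge distinct columns into the same key, while a separator that never occurs in the data keeps them apart. The names m, keys_nosep and keys_sep are illustrative only.

```r
# Hypothetical example: columns 1 and 3 are (1, 11), column 2 is (11, 1).
m <- matrix(c(1, 11, 11, 1, 1, 11), nrow = 2)

# Without a separator, both distinct columns collapse to the same key "111".
keys_nosep <- apply(m, 2, paste, collapse = "")

# With a separator, the keys stay distinct: "1 11" and "11 1".
keys_sep <- apply(m, 2, paste, collapse = " ")

table(keys_nosep)  # a single group containing all three columns
table(keys_sep)    # two groups, as expected
```

For single-digit data such as the example matrix in the question the two choices give the same grouping, so this only matters for more general input.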
Edit
@RohitDas's answer made me think it would be worth checking performance. If I take all the functions from the question the OP linked and add
f7 <- table
and also add @DavidArenburg's suggestion as f10
f10 <- function(x){
table(unlist(data.table(x)[, lapply(.SD, paste, collapse = "")]))
}
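the results are as follows:
After adding @MaratTalipov's solution (f11), it is the clear winner. Applied directly to the matrix, it is faster than all of the vector-based solutions.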
library(microbenchmark)
set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, replace = TRUE), nrow = 1000)
microbenchmark(
f1(apply(testmx, 2, paste, collapse=" ")),
f2(apply(testmx, 2, paste, collapse=" ")),
f3(apply(testmx, 2, paste, collapse=" ")),
f4(apply(testmx, 2, paste, collapse=" ")),
f5(apply(testmx, 2, paste, collapse=" ")),
f6(apply(testmx, 2, paste, collapse=" ")),
f7(apply(testmx, 2, paste, collapse=" ")),
f8(apply(testmx, 2, paste, collapse=" ")),
f9(apply(testmx, 2, paste, collapse=" ")),
f10(testmx),
f11(testmx),
f12(testmx)
)
Unit: microseconds
expr min lq mean median uq max neval
f1(apply(testmx, 2, paste, collapse = " ")) 3311.770 3511.5620 3901.0020 3612.035 3849.3600 9569.987 100
f2(apply(testmx, 2, paste, collapse = " ")) 3044.997 3263.6515 3667.9232 3430.914 3847.2430 6721.318 100
f3(apply(testmx, 2, paste, collapse = " ")) 2032.179 2118.0245 2371.8638 2213.301 2430.4155 6631.624 100
f4(apply(testmx, 2, paste, collapse = " ")) 2119.949 2218.3050 2497.1513 2286.442 2425.0260 6258.987 100
f5(apply(testmx, 2, paste, collapse = " ")) 2131.498 2221.5775 2459.9300 2309.925 2530.3115 4222.575 100
f6(apply(testmx, 2, paste, collapse = " ")) 3121.217 3367.7815 3738.3239 3486.155 3835.1175 7979.352 100
f7(apply(testmx, 2, paste, collapse = " ")) 1766.175 1832.9650 2040.5483 1889.169 2032.1795 3784.110 100
f8(apply(testmx, 2, paste, collapse = " ")) 2085.303 2169.2240 2435.6932 2237.168 2404.2380 5002.109 100
f9(apply(testmx, 2, paste, collapse = " ")) 2802.090 2988.0230 3449.0685 3056.930 3373.1710 17640.957 100
f10(testmx) 4027.017 4251.6385 4865.7036 4399.461 4848.7035 11811.581 100
f11(testmx) 500.058 549.1395 624.9526 576.279 636.1395 1176.809 100
f12(testmx) 1827.769 1886.4740 1957.0555 1902.834 1964.4270 3600.487 100
Borrowing from @cdeterman's solution: once you have the vector of pasted column values, you can simply call table to get the counts.
table(vec)
vec
011 121 124
3 1 2
"Brute force" approach:
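f11 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq_len(nc)  # z[j] ends up as the index of the first column identical to column j
  for (i in seq_len(nc - 1)) {  # seq_len() makes the loop a no-op for a single column
    # compare column i with every later column
    dup <- sapply(seq(i + 1, nc), function(j) identical(testmtx[, i], testmtx[, j]))
    z[which(dup) + i] <- z[i]
  }
  table(z)
}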
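Its complexity should be O(N^2*M), where N and M are the number of columns and rows, respectively. The other solution, based on paste, has complexity O(N*M^2), so their relative performance should depend on N/M.
[Edit] Actually, I am not sure about the complexity of the paste-based solution; it may well be O(N^2*M^2)...
[EDIT2] A more efficient alternative to function f11(), which uses @BrodieG's approach of comparing a matrix column against the whole matrix: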
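f13 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq_len(nc)
  for (i in seq_len(nc - 1)) {
    # subtract column i from all later columns at once; a column of zeros is a duplicate
    dup <- colSums(abs(testmtx[, seq(i + 1, nc), drop = FALSE] - testmtx[, i])) == 0
    z[which(dup) + i] <- z[i]
  }
  table(z)
}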
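The column-against-matrix comparison in f13 relies on R's vector recycling. The following is a minimal sketch on a made-up 3 x 4 matrix (the names m, diffs and same_as_first are illustrative, not from the thread) showing what the subtraction and colSums step compute for a single reference column.

```r
# Hypothetical example; columns 1 and 3 are identical.
m <- matrix(c(1, 2, 4,
              0, 1, 1,
              1, 2, 4,
              1, 2, 1), nrow = 3)

# Subtracting a vector of length nrow(m) from a matrix recycles the vector
# down each column, so column 1 is subtracted from every column.
diffs <- m - m[, 1]

# A column equals column 1 exactly when all of its absolute differences are 0.
same_as_first <- colSums(abs(diffs)) == 0
same_as_first
```

Note that this is an exact comparison, so it is best suited to integer-valued data like the matrices used in this thread; floating-point columns that differ only by rounding error would be treated as distinct.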
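This should be reasonably efficient. The first objective is to use duplicated to work out which columns to count, and then to use vector recycling and colSums to count the instances of each of those columns.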
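f12 <- function(testmx) {
  singles <- !duplicated(testmx, MARGIN = 2)  # TRUE for the first occurrence of each distinct column
  uniq <- testmx[, singles, drop = FALSE]     # drop = FALSE keeps a matrix even with one distinct column
  rbind(
    uniq,
    # testmx - x recycles x down every column; a column matches x when its absolute differences sum to 0
    apply(uniq, 2, function(x) sum(colSums(abs(testmx - x)) == 0))
  )
}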
Produces:
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 2 1 2
[3,] 4 1 1
[4,] 2 3 1
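For comparison, the same layout (distinct columns with their counts in the last row) can also be assembled from pasted column keys. This is only a sketch in base R, assuming the example matrix from the question; the names keys, first, counts and result are illustrative and do not come from any of the answers.

```r
mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)

# One string key per column; the separator keeps multi-digit entries unambiguous.
keys <- apply(mat, 2, paste, collapse = " ")

# First occurrence of each distinct column, in order of appearance.
first <- !duplicated(keys)

# table() sorts its names, so index by key to restore order of appearance.
counts <- table(keys)[keys[first]]

# Distinct columns on top, their frequencies as the last row.
result <- rbind(mat[, first, drop = FALSE], as.vector(counts))
result
```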
f12 seems to be much faster than Marat's f11, but f6 + apply appears to be faster still:
set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, replace = TRUE), nrow = 3)
library(microbenchmark)
microbenchmark(
  f12(testmx),
  f11(testmx),
  f6(apply(testmx, 2, paste, collapse = "")), times = 10
)
Unit: milliseconds
expr min lq mean
f12(testmx) 36.576060 36.931514 38.18358
f11(testmx) 2095.305540 2122.316487 2145.72614
f6(apply(testmx, 2, paste, collapse = "")) 7.570614 7.601697 8.78227
Here's an f6prime for you:
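library(data.table)

f6prime = function(mat) {
  dt = as.data.table(t(mat))  # one row per original column
  dt[, .N, by = names(dt)]    # count identical rows, i.e. identical columns of mat
}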
f6prime(mat)
# V1 V2 V3 N
#1: 1 2 4 2
#2: 0 1 1 3
#3: 1 2 1 1
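If data.table is not available, a similar row-wise count can be sketched in base R with aggregate(). This is an illustrative alternative rather than part of the answer above, it makes no claim about speed, and the helper column N is added only for the example.

```r
mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)

# One row per original column; the columns are named V1, V2, V3.
df <- as.data.frame(t(mat))

# Add a constant column N = 1 and sum it within each distinct (V1, V2, V3) group.
# The row order of the result may differ from the data.table output.
counts <- aggregate(N ~ ., data = cbind(df, N = 1), FUN = sum)
counts
```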