更快的外部替代品

Faster alternatives to outer

我正在尝试以更快的速度对 outer() 进行以下调用。通过 foreach 并行化仍然非常慢,所以我想尝试使用 Rcpp 在 C++ 中调用它,但很想听到任何更快的替代方法。

给定一个矩阵 mat 和一个矩阵同名列表 col.list 我这样总结 mat

mycall <- function(mat, col.list) {
  outer(
    rownames(mat),
    col.list,
    Vectorize(function(x,y) {
      mean(mat[x,y])
    })
  )
}

例如:

set.seed(123)
mat <- matrix(rnorm(100),nrow=10)
rownames(mat) <- letters[1:10]
colnames(mat) <- LETTERS[1:10]
mat
            A          B          C           D           E           F           G          H            I          J
a -0.56047565  1.2240818 -1.0678237  0.42646422 -0.69470698  0.25331851  0.37963948 -0.4910312  0.005764186  0.9935039
b -0.23017749  0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345 -2.3091689  0.385280401  0.5483970
c  1.55870831  0.4007715 -1.0260044  0.89512566 -1.26539635 -0.04287046 -0.33320738  1.0057385 -0.370660032  0.2387317
d  0.07050839  0.1106827 -0.7288912  0.87813349  2.16895597  1.36860228 -1.01857538 -0.7092008  0.644376549 -0.6279061
e  0.12928774 -0.5558411 -0.6250393  0.82158108  1.20796200 -0.22577099 -1.07179123 -0.6880086 -0.220486562  1.3606524
f  1.71506499  1.7869131 -1.6866933  0.68864025 -1.12310858  1.51647060  0.30352864  1.0255714  0.331781964 -0.6002596
g  0.46091621  0.4978505  0.8377870  0.55391765 -0.40288484 -1.54875280  0.44820978 -0.2847730  1.096839013  2.1873330
h -1.26506123 -1.9666172  0.1533731 -0.06191171 -0.46665535  0.58461375  0.05300423 -1.2207177  0.435181491  1.5326106
i -0.68685285  0.7013559 -1.1381369 -0.30596266  0.77996512  0.12385424  0.92226747  0.1813035 -0.325931586 -0.2357004
j -0.44566197 -0.4727914  1.2538149 -0.38047100 -0.08336907  0.21594157  2.05008469 -0.1388914  1.148807618 -1.0264209

col.list <- replicate(5, sample(colnames(mat),sample(10,1)), simplify = F)
col.list
[[1]]
[1] "I" "H" "F" "C"

[[2]]
[1] "H" "C" "E" "D"

[[3]]
[1] "F" "A" "B" "C"

[[4]]
[1] "I" "G" "H" "F"

[[5]]
[1] "B" "F" "A" "D" "J"


mycall(mat, col.list)
             [,1]        [,2]        [,3]        [,4]        [,5]
 [1,] -0.32494304 -0.45677441 -0.03772476  0.03692275  0.46737855
 [2,] -0.54260254 -0.75753314 -0.02922133 -0.61368967  0.07088301
 [3,] -0.10844910 -0.09763415  0.22265121  0.06475016  0.61009334
 [4,]  0.14372171  0.40224937  0.20522554  0.07130067  0.36000416
 [5,] -0.43982636  0.17912380 -0.31934091 -0.55151435  0.30598183
 [6,]  0.29678266 -0.27389757  0.83293885  0.79433814  1.02136588
 [7,]  0.02527506  0.17601171  0.06195023 -0.07211925  0.43025291
 [8,] -0.01188734 -0.39897791 -0.62342288 -0.03697956 -0.23527315
 [9,] -0.28972770 -0.12070775 -0.24994491  0.22537340 -0.08066115
[10,]  0.61991819  0.16277087  0.13782578  0.81898563 -0.42188074

你可以试试:

sapply(col.list, function(v) rowMeans(mat[, v]))

我怀疑您的解决方案缓慢的原因是 Vectorize:这是将标量函数转换为向量化函数的好方法,但它的成本很高:因为它基于 mapply , 它会在每个元素上一个一个地调用函数。也就是说,为每个条目调用一次 mean。如果 outer 结果很大,那将是非常昂贵的。相反,使用上面的解决方案,代码至少在一个方向上被矢量化,这要归功于 rowMeans.