R: Summing up neighboring matrix elements. How to speed up?

Question

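I am working with large matrices of about 2500x2500x50 (lon x lat x time). The matrix contains only 1s and 0s. For each time step, I need to know the sum of the 24 surrounding elements. So far I have done it like this: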

xdim <- 2500
ydim <- 2500
tdim <- 50
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))

for (t in 1:tdim){
  for (x in 3:(xdim-2)){
    for (y in 3:(ydim-2)){
      res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
    }
  }
}

This works, but it is much too slow for my needs. Does anyone have advice on how to speed it up?

Answer 1

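Your current code has a lot of overhead from redundant subsetting and calculation. Clean this up if you want better speed.

With xdim <- ydim <- 20; tdim <- 5, I see a 23% speedup on my machine.
With xdim <- ydim <- 200; tdim <- 10, I see a 25% speedup.

This comes at the cost of a small amount of additional memory, as is evident from the code below.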

xdim <- ydim <- 20; tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))

library(microbenchmark)

microbenchmark(op= {
  for (t in 1:tdim){
    for (x in 3:(xdim-2)){
      for (y in 3:(ydim-2)){
        res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
      }
    }
  }
},
alex= {
  for (t in 1:tdim){
    temp <- a[,,t]
    for (x in 3:(xdim-2)){
      temp2 <- temp[(x-2):(x+2),]
      for (y in 3:(ydim-2)){
        res[x,y,t] <- sum(temp2[,(y-2):(y+2)])
      }
    }
  }
}, times = 50)

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
   op 4.855827 5.134845 5.474327 5.321681 5.626738 7.463923    50   b
 alex 3.720368 3.915756 4.213355 4.012120 4.348729 6.320481    50  a

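Further improvements:

If you write this in C++, my guess is that recognizing res[x,y,t] = res[x,y-1,t] - sum(a[...,y-3,...]) + sum(a[...,y+2,...]) will save you additional time (the window moves one column to the right, so column y-3 drops out and column y+2 comes in). In R, it made no difference in my timing tests.
This problem is also embarrassingly parallel. There is no reason you could not split over the t dimension to make better use of a multi-core architecture.

Both of these are left to the reader / OP.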
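The sliding-window idea above can be sketched in plain C++ for a single time slice. This is only an illustration under stated assumptions: the `Slice` type and the `window_sums` name are made up for this example, the slice is stored as a vector of rows, and cells within 2 of an edge are simply left at 0. For each row x it first sums rows x-2..x+2 per column, then slides a 5-column window to the right by dropping column y-3 and adding column y+2.

```cpp
#include <cstddef>
#include <vector>

// One time slice as rows of 0/1 values (an illustrative layout, not from the answers).
using Slice = std::vector<std::vector<int>>;

// 5x5 window sums (centre cell included) for interior cells.
// Cells within 2 of an edge are left at 0 in this sketch.
Slice window_sums(const Slice& a) {
  const std::size_t xdim = a.size();
  const std::size_t ydim = xdim ? a[0].size() : 0;
  Slice res(xdim, std::vector<int>(ydim, 0));
  if (xdim < 5 || ydim < 5) return res;  // no interior cells

  for (std::size_t x = 2; x + 2 < xdim; ++x) {
    // colsum[y] = sum over rows x-2..x+2 of column y
    std::vector<int> colsum(ydim, 0);
    for (std::size_t y = 0; y < ydim; ++y)
      for (std::size_t r = x - 2; r <= x + 2; ++r)
        colsum[y] += a[r][y];

    // First interior window covers columns 0..4
    int s = 0;
    for (std::size_t y = 0; y < 5; ++y) s += colsum[y];
    res[x][2] = s;

    // Slide right: drop column y-3, add column y+2
    for (std::size_t y = 3; y + 2 < ydim; ++y) {
      s += colsum[y + 2] - colsum[y - 3];
      res[x][y] = s;
    }
  }
  return res;
}
```

Each interior cell then costs two additions instead of 25. The column sums themselves could be updated with the same trick when moving from row x to row x+1, and each time slice can be handled by a separate thread because slices do not depend on each other.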

Answer 2

Here is a solution that is fast for large arrays:

res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
dim(res) <- c(xdim, ydim, tdim)

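I filter the array using rep(1,5) as the weights along each dimension (i.e. I sum the values within a neighborhood of 2 on either side). I then reset the dim attribute, since apply initially returns a matrix.

Note that this wraps the sums around the edges of the array (which may make sense since you are looking at latitude and longitude; if not, I can modify my answer).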
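To see why two one-dimensional filters give the same result as one 5x5 sum, here is a minimal C++ sketch of the same two-pass idea with wrap-around at the edges. It is not what filter() does internally; the names `Slice`, `wrap`, and `circular_box_sum` are hypothetical, and the sketch assumes both dimensions are at least 5 so that no cell is counted twice.

```cpp
#include <cstddef>
#include <vector>

// One time slice as rows of 0/1 values (an illustrative layout, not from the answer).
using Slice = std::vector<std::vector<int>>;

// Map any index onto 0..n-1 so that the window wraps around the edges.
std::size_t wrap(long i, long n) {
  return static_cast<std::size_t>(((i % n) + n) % n);
}

// Sum of the 5x5 neighbourhood (centre included) around every cell,
// computed as two 5-term passes with circular boundaries.
Slice circular_box_sum(const Slice& a) {
  const long xdim = static_cast<long>(a.size());
  const long ydim = xdim ? static_cast<long>(a[0].size()) : 0;
  Slice pass1(a.size(), std::vector<int>(static_cast<std::size_t>(ydim), 0));
  Slice res(a.size(), std::vector<int>(static_cast<std::size_t>(ydim), 0));
  if (xdim == 0 || ydim == 0) return res;

  // Pass 1: for each cell, sum the 5 cells x-2..x+2 in the same column.
  for (long x = 0; x < xdim; ++x)
    for (long y = 0; y < ydim; ++y)
      for (long d = -2; d <= 2; ++d)
        pass1[x][y] += a[wrap(x + d, xdim)][y];

  // Pass 2: sum the 5 pass-1 values y-2..y+2 in the same row.
  for (long x = 0; x < xdim; ++x)
    for (long y = 0; y < ydim; ++y)
      for (long d = -2; d <= 2; ++d)
        res[x][y] += pass1[x][wrap(y + d, ydim)];

  return res;
}
```

Each cell needs 10 additions instead of 25 because the box of ones is separable. The interior values agree with the loop from the question; only the cells within 2 of an edge differ, since they pick up values from the opposite side.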

For a concrete example:

xdim <- 500
ydim <- 500
tdim <- 15
a <- array(0:1,dim=c(xdim,ydim,tdim))

Here is what you are currently using (with NAs on the edges) and how long this example takes on my laptop:

f1 <- function(a, xdim, ydim, tdim){
  res <- array(NA_integer_,dim=c(xdim,ydim,tdim))
  for (t in 1:tdim){
    for (x in 3:(xdim-2)){
      for (y in 3:(ydim-2)){
        res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
      }
    }
  }
  return(res)
}

system.time(res1 <- f1(a, xdim, ydim, tdim))
#   user  system elapsed
# 14.813   0.005  14.819

Here is the comparison with the version I described:

f2 <- function(a, xdim, ydim, tdim){
  res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
  dim(res) <- c(xdim, ydim, tdim)
  return(res)
}

system.time(res2 <- f2(a, xdim, ydim, tdim))
#  user  system elapsed
# 1.188   0.047   1.236

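You can see there is a significant speedup (for large arrays). To check that it gives the correct result (note that I set the same edge cells to NA so that both results match, because the filter I used is circular):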

## Match NAs
res2NA <- ifelse(is.na(res1), NA, res2)

all.equal(res2NA, res1)
# [1] TRUE

I will add that your full array (2500x2500x50) took just under a minute (about 55 seconds), although it did use a lot of memory in the process, FYI.

Answer 3

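Intro

I have to say, there is a lot hidden behind just the setup of the arrays. The remainder of the problem is trivial, though. As a result, there are really two ways to go about it:

Brute force as suggested by @Alex (written in C++)
Observing the replication patterns

Brute force with OpenMP

If we want to 'brute force' it, then we can follow the suggestion given by @Alex and use OpenMP with Armadillo.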

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// Add a flag to enable OpenMP at compile time
// [[Rcpp::plugins(openmp)]]

// Protect against compilers without OpenMP
#ifdef _OPENMP
  #include <omp.h>
#endif

// [[Rcpp::export]]
arma::cube cube_parallel(arma::cube a, arma::cube res, int cores = 1) {

  // Extract the different dimensions
  unsigned int tdim = res.n_slices;

  unsigned int xdim = res.n_rows;

  unsigned int ydim = res.n_cols;

  // Same calculation loop
  #pragma omp parallel for num_threads(cores)
  for (unsigned int t = 0; t < tdim; t++){
    // pop the T
    arma::mat temp_mat = a.slice(t);

    // Subset the rows
    for (unsigned int x = 2; x < xdim-2; x++){

      arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);

      // Iterate over the columns with unit accumulative sum
      for (unsigned int y = 2; y <  ydim-2; y++){
        res(x,y,t) = accu(temp_row_sub.cols(y-2,y+2));
      }
    }
  }

  return res;
}

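Replication Patterns

However, the smarter approach is to understand how array(0:1, dims) is constructed.

Most notably:

Case 1: If xdim is even, then only the rows of a matrix alternate: every column is identical, and so is every time slice.
Case 2: If xdim is odd and ydim is odd, then the rows alternate and the matrices alternate as well: each slice is a checkerboard, and consecutive slices are swapped.
Case 3: If xdim is odd and ydim is even, then only the rows alternate: each slice is a checkerboard, but all slices are identical.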
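All three cases follow from one fact: R fills an array in column-major order (first index fastest) and recycles 0:1, so the stored value is just the parity of the linear index. A small sketch of that rule in C++, using 0-based indices and a hypothetical helper name `fill_value`:

```cpp
#include <cstddef>

// Value that array(0:1, dim = c(xdim, ydim, tdim)) holds at the 0-based
// position (x, y, t). The linear index is x + xdim * (y + ydim * t), and
// recycling 0:1 means the value is that index modulo 2.
// (fill_value is an illustrative name, not part of the answer's code.)
int fill_value(std::size_t x, std::size_t y, std::size_t t,
               std::size_t xdim, std::size_t ydim) {
  const std::size_t linear = x + xdim * (y + ydim * t);
  return static_cast<int>(linear % 2);
}
```

When xdim is even, the terms containing xdim drop out modulo 2 and the value depends on x alone (Case 1). When xdim is odd the value is the parity of x + y + ydim * t, which depends on t only if ydim is odd as well (Case 2 versus Case 3).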

Examples

Let's look at the cases in action to observe the patterns.

Case 1:

xdim <- 2
ydim <- 3
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))

Output:

, , 1

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    1    1

, , 2

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    1    1

Case 2:

xdim <- 3
ydim <- 3
tdim <- 3
a <- array(0:1,dim=c(xdim,ydim,tdim))

Output:

, , 1

     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    1
[3,]    0    1    0

, , 2

     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    0    1    0
[3,]    1    0    1

, , 3

     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    1
[3,]    0    1    0

Case 3:

xdim <- 3
ydim <- 4
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))

Output:

, , 1

     [,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

, , 2

     [,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

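Pattern Hacking

Okay, based on the discussion above, we opt to write a bit of code that exploits this unique pattern.

Creating Alternating Vectors

An alternating vector in this case switches between two different values.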

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// ------- Make Alternating Vectors

arma::vec odd_vec(unsigned int xdim){

  // make a temporary vector to create alternating 0-1 effect by row.
  arma::vec temp_vec(xdim);

  // Alternating vector (anyone have a better solution? )
  for (unsigned int i = 0; i < xdim; i++) {
    temp_vec(i) = (i % 2 ? 0 : 1);
  }

  return temp_vec;
}

arma::vec even_vec(unsigned int xdim){

  // make a temporary vector to create alternating 0-1 effect by row.
  arma::vec temp_vec(xdim);

  // Alternating vector (anyone have a better solution? )
  for (unsigned int i = 0; i < xdim; i++) {
    temp_vec(i) = (i % 2 ? 1 : 0); // changed
  }

  return temp_vec;
}

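Creating the three cases of matrix

As mentioned above, there are three cases of matrix: the even case, the first odd case, and the second odd case.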

// --- Handle the different cases 

// [[Rcpp::export]]
arma::mat make_even_matrix(unsigned int xdim, unsigned int ydim){

  arma::mat temp_mat(xdim,ydim);

  temp_mat.each_col() = even_vec(xdim);

  return temp_mat;
}

// xdim is odd and ydim is even
// [[Rcpp::export]]
arma::mat make_odd_matrix_case1(unsigned int xdim, unsigned int ydim){

  arma::mat temp_mat(xdim,ydim);

  arma::vec e_vec = even_vec(xdim);
  arma::vec o_vec = odd_vec(xdim);

  // Alternating column 
  for (unsigned int i = 0; i < ydim; i++) {
    temp_mat.col(i) = (i % 2 ? o_vec : e_vec);
  }

  return temp_mat;
}

// xdim is odd and ydim is odd    
// [[Rcpp::export]]
arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){

  arma::mat temp_mat(xdim,ydim);

  arma::vec e_vec = even_vec(xdim);
  arma::vec o_vec = odd_vec(xdim);

  // Alternating column 
  for (unsigned int i = 0; i < ydim; i++) {
    temp_mat.col(i) = (i % 2 ? e_vec : o_vec); // slight change
  }

  return temp_mat;
}

Compute engine

Same as the previous solution, just without t, since we no longer need to repeat the computation for every slice.

// --- Calculation engine

// [[Rcpp::export]]
arma::mat calc_matrix(arma::mat temp_mat){

  unsigned int xdim = temp_mat.n_rows;

  unsigned int ydim = temp_mat.n_cols;

  arma::mat res = temp_mat;

  // Subset the rows
  for (unsigned int x = 2; x < xdim-2; x++){

    arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);

    // Iterate over the columns with unit accumulative sum
    for (unsigned int y = 2; y <  ydim-2; y++){
      res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
    }
  }

  return res;
}

Call main function

Here is the core function that pieces everything together. It gives us the desired array of neighborhood sums.

// --- Main Engine

// Create the desired cube information
// [[Rcpp::export]]
arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {

  // Initialize values in A
  arma::cube res(xdim,ydim,tdim);

  if(xdim % 2 == 0){
    res.each_slice() = calc_matrix(make_even_matrix(xdim, ydim));
  }else{

    if(ydim % 2 == 0){

      res.each_slice() = calc_matrix(make_odd_matrix_case1(xdim, ydim));

    }else{

      arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case1(xdim, ydim));

      arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));

      for(unsigned int t = 0; t < tdim; t++){
        res.slice(t) = (t % 2 ? sec_odd_mat : first_odd_mat);
      }

    }

  }

  return res;
}

Timings

Now, the real test is how well it performs:

Unit: microseconds
       expr      min        lq       mean    median        uq       max neval
    r_1core 3538.022 3825.8105 4301.84107 3957.3765 4043.0085 16856.865   100
 alex_1core 2790.515 2984.7180 3461.11021 3076.9265 3189.7890 15371.406   100
  cpp_1core  174.508  180.7190  197.29728  194.1480  204.8875   338.510   100
  cpp_2core  111.960  116.0040  126.34508  122.7375  136.2285   162.279   100
  cpp_3core   81.619   88.4485  104.54602   94.8735  108.5515   204.979   100
  cpp_cache   40.637   44.3440   55.08915   52.1030   60.2290   302.306   100

The script used for timing:

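# cube_r() and alex() are not defined in this answer; they are presumably
# function wrappers around the OP's loop and the loop from Answer 1.
xdim <- 20
ydim <- 20
tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))

# Compute each variant once and check that all of them agree
cpp_parallel = cube_parallel(a,res, 1)
alex_1core = alex(a,res,xdim,ydim,tdim)
cpp_cache = dim_to_cube(xdim,ydim,tdim)
op_answer = cube_r(a,res,xdim,ydim,tdim)

all.equal(cpp_parallel, op_answer)
all.equal(cpp_cache, op_answer)
all.equal(alex_1core, op_answer)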


ga = microbenchmark::microbenchmark(r_1core = cube_r(a,res,xdim,ydim,tdim),
                                    alex_1core = alex(a,res,xdim,ydim,tdim),
                                    cpp_1core = cube_parallel(a,res, 1), 
                                    cpp_2core = cube_parallel(a,res, 2), 
                                    cpp_3core = cube_parallel(a,res, 3),
                                    cpp_cache = dim_to_cube(xdim,ydim,tdim))
