Rcpp version of tabulate is slower; where is this from, how to understand

Question

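While creating some sampling functions for already aggregated data, I found that table was rather slow on the size of data I am working with. I tried two improvements, the first being the following Rcpp function: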

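#include <Rcpp.h>
using namespace Rcpp;

// Count how often each value 1..m occurs in x (NA is not handled).
// [[Rcpp::export]]
IntegerVector getcts(NumericVector x, int m) {
  IntegerVector cts(m);  // counts, initialised to 0
  int t;
  for (int i = 0; i < x.length(); i++) {
    t = x[i] - 1;  // zero-based bin index (double truncated to int)
    if (0 <= t && t < m)
      cts[t]++;
  }
  return cts;
}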

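Then, while trying to understand why table was rather slow, I found that it is based on tabulate. tabulate works well for me and is faster than the Rcpp version. The code for tabulate is at: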

https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204

with the key lines being:

for(R_xlen_t i = 0 ; i < n ; i++)
  if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;

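Now the key parts of tabulate and my Rcpp version look very similar (I have not bothered dealing with NA).

Q1: Why is my Rcpp version 3x slower?

Q2: How can I find out where this time goes?

I would very much appreciate knowing where the time went, but even better would be a good way to profile the code. My C++ skills are only so-so, but this seems simple enough that I should (fingers crossed) have been able to avoid anything silly that would triple my time.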

My timing code:

max_x <- 100
xs <- sample(seq(max_x), size = 50000000, replace = TRUE)
system.time(getcts(xs, max_x))
system.time(tabulate(xs))

That gives 0.318 seconds for getcts and 0.126 seconds for tabulate.

Answer 1

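Your function calls the length method on every loop iteration, and the compiler does not seem to cache the result. To fix this, store the size of the vector in a separate variable or use a range-based for loop. Also note that an explicit missing-value check is not really needed here: R stores NA_integer_ as the smallest int (INT_MIN), so the x[i] > 0 test already rejects it, and for doubles every ordered comparison (<, <=, >, >=) involving NaN evaluates to false in C++.

Let's compare performance: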
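To see why the explicit NA check can be dropped, here is a minimal standalone sketch in plain C++ (no Rcpp). It assumes, as R does, that NA_integer_ is stored as INT_MIN; the names kNaInteger, counted_int, counted_double and check_na_is_skipped are made up for this illustration and are not part of R or Rcpp.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Assumption: mirrors R's NA_INTEGER, which R defines as INT_MIN.
const int kNaInteger = INT_MIN;

// Range test as used by the integer versions of the loop.
// NA (INT_MIN) fails v > 0, so it is never counted.
bool counted_int(int v, unsigned max) {
    return v > 0 && static_cast<unsigned>(v) <= max;
}

// Same test for a double input; NaN fails both ordered comparisons.
bool counted_double(double v, unsigned max) {
    return v > 0 && v <= max;
}

// Small self-check; call it from main() to run it.
void check_na_is_skipped() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    assert(!counted_int(kNaInteger, 10));  // NA_integer_ is rejected
    assert(counted_int(10, 10));           // upper bound is inclusive
    assert(!counted_double(nan, 10));      // NaN is rejected
}
```

The checks rely on default floating-point semantics; with options such as -ffast-math the NaN comparison is no longer guaranteed to behave this way.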

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector tabulate1(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (std::size_t i = 0; i < x.size(); i++) {  // size() is evaluated on every iteration
        if (x[i] > 0 && x[i] <= max)
            counts[x[i] - 1]++;
    }
    return counts;
}

// [[Rcpp::export]]
IntegerVector tabulate2(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    std::size_t n = x.size();  // read the length once, before the loop
    for (std::size_t i = 0; i < n; i++) {
        if (x[i] > 0 && x[i] <= max)
            counts[x[i] - 1]++;
    }
    return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate3(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (auto& now : x) {
        if (now > 0 && now <= max)
            counts[now - 1]++;
    }
    return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (auto it = x.begin(); it != x.end(); it++) {  // x.end() is evaluated on every iteration
        if (*it > 0 && *it <= max)
            counts[*it - 1]++;
    }
    return counts;
}

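/***R
library(microbenchmark)
# 1e5 values in 1..10, no missing values
x <- sample(10, 1e5, replace = TRUE)
microbenchmark(
    tabulate(x, 10), tabulate1(x, 10),
    tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
# Introduce NAs. Note: the indices go up to 10e5 (= 1e6), beyond length(x) = 1e5,
# so this assignment also extends x to roughly 1e6 elements, most of them NA.
x[sample(10e5, 10e3)] <- NA
microbenchmark(
    tabulate(x, 10), tabulate1(x, 10),
    tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
*/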

tabulate1 is the original version.

Benchmark results:

Without NA:

Unit: microseconds
            expr     min       lq     mean   median      uq     max neval
 tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370   100
tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767   100
tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647   100
tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023   100
tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862   100

With NA:

Unit: microseconds
            expr      min        lq     mean   median       uq      max neval
 tabulate(x, 10)  943.555 1089.5200 1614.804 1333.806 2042.320 3986.836   100
tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029   100
tabulate2(x, 10)  765.102  931.9935 1361.747 1113.550 1679.024 3436.356   100
tabulate3(x, 10)  773.358  914.4980 1350.164 1140.018 1642.354 3633.429   100
tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838   100

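The tabulate4 function, which uses an iterator, is also slower than tabulate. We can improve it by calling x.begin() and x.end() only once, before the loop: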

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    auto start = x.begin();  // fetch both iterators once, outside the loop
    auto end = x.end();
    for (auto it = start; it != end; it++) {
        if (*(it) > 0 && *(it) <= max)
            counts[*(it) - 1]++;
    }
    return counts;
}
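As for Q2, one low-tech way to see where the time goes is to time each variant of the loop in isolation. Below is a minimal sketch in plain C++, assuming a std::vector<int> as a stand-in for the R vector; tabulate_vec, time_ms and demo are hypothetical names for this illustration, not part of Rcpp. A sampling profiler would give finer detail, but this is often enough to confirm a suspicion such as the uncached length call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tabulate2: counts values in 1..max, with the
// loop bound read once before the loop instead of on every iteration.
std::vector<int> tabulate_vec(const std::vector<int>& x, unsigned max) {
    std::vector<int> counts(max, 0);
    const std::size_t n = x.size();  // hoisted loop bound
    for (std::size_t i = 0; i < n; i++) {
        if (x[i] > 0 && static_cast<unsigned>(x[i]) <= max)
            counts[x[i] - 1]++;
    }
    return counts;
}

// Hypothetical helper: runs f once and returns the elapsed wall time in ms.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Example driver (call from main): times one run and sanity-checks the counts.
double demo() {
    const std::vector<int> x = {1, 2, 2, 10, 0, 11, -3};
    std::vector<int> counts;
    const double ms = time_ms([&] { counts = tabulate_vec(x, 10); });
    assert(counts[1] == 2);  // the value 2 occurs twice
    return ms;
}
```

A single run on a tiny vector is dominated by noise, so in practice the call would be repeated many times on a large input, much as microbenchmark does on the R side.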


r

rcpp