`range(which(..))` 的更快替代品

Question

设R中为TRUE和FALSE的序列

v = c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)

我想得到第一个和最后一个TRUE的位置。实现这一目标的一种方法是

range(which(v)) # 7 25

但是这个解决方案相对较慢，因为它必须检查向量的每个元素以获得每个 TRUE 的位置，然后遍历所有位置，在每个位置评估两个 if 语句（我认为）为了获得最大值和最小值。从头开始搜索第一个 TRUE，从头开始搜索第一个，然后只搜索 return 个位置，这将更具战略意义。

是否有比 range(which(..)) 更快的替代方案？

Answer 1

match 很快，因为它在找到搜索的值时停止：

c(match(T,v),length(v)-match(T,rev(v))+1)
[1]  7 25

但是你必须测试速度。

更新：

range_find <- function(v) {
i <- 1
j <- length(v)
while(!v[i]) {
  i <- i+1
}
while(!v[j]) {
  j <- j-1
}
c(i,j)
}

基准

v <- rep(v, 5e4)
microbenchmark(
  rangeWhich = rangeWhich(v),
  range_find = range_find(v),
  richwhich = {w <- which(v)
               w[c(1L, length(w))]},
  match = c(match(T,v),length(v)-match(T,rev(v))+1)
)
Unit: microseconds
       expr       min         lq        mean    median         uq        max neval
 rangeWhich     1.284     3.2090    16.50914    20.211    26.7875     29.836   100
 range_find     9.945    21.4945    32.02652    26.948    34.1660    144.042   100
  richwhich  2941.756  3022.5975  3243.02081  3130.227  3247.6405   5403.911   100
      match 45696.329 46771.8175 50662.45708 47359.526 48718.6055 131439.661   100

此方法符合您提出的策略：

"It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions."

Answer 2

我能想到的不涉及搜索整个向量的最简单方法是 Rcpp 解决方案：

library(Rcpp)
cppFunction(
"NumericVector rangeWhich(LogicalVector x) {
  NumericVector ret(2, NumericVector::get_na());
  int n = x.size();
  for (int idx=0; idx < n; ++idx) {
    if (x[idx]) {
      ret[0] = idx+1;  // 1-indexed for R
      break;
    }
  }
  if (R_IsNA(ret[0]))  return ret;  // No true values
  for (int idx=n-1; idx >= 0; --idx) {
    if (x[idx]) {
      ret[1] = idx + 1;  // 1-indexed for R
      break;
    }
  }
  return ret;
}")
rangeWhich(v)
# [1]  7 25

我们可以在具有随机条目的相当长的向量（长度为 100 万）上进行基准测试。我们希望通过不使用 which:

搜索整个内容来获得相当大的效率提升

set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
# range_find from @PierreLafortune
range_find <- function(v) {
i <- 1
while(!v[i]) {
  i <- i +1
}
j <- length(v)
while(!v[j]) {
  j <- j-1
}
c(i,j)
}
# shortCircuit from @JoshuaUlrich
shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
microbenchmark(rangeWhich(bigv), range_find(bigv), shortCircuit(bigv), range(which(bigv)))
# Unit: microseconds
#                expr      min        lq        mean     median         uq       max neval
#    rangeWhich(bigv)    1.476    2.4655     9.45051     9.0640    13.7585    46.286   100
#    range_find(bigv)    1.445    2.2930     8.06993     7.2055    11.8980    26.893   100
#  shortCircuit(bigv)    1.114    1.6920     7.30925     7.0440    10.2210    30.758   100
#  range(which(bigv)) 6821.180 9389.1465 13991.84613 10007.9045 16698.2230 58112.490   100

Rcpp 解决方案比 max(which(v)) 快很多（快 500 倍以上），因为它不需要使用 which 遍历整个向量。对于这个例子，它的运行时间几乎与来自@PierreLafortune 的 range_find 和来自@JoshuaUlrich 的 shortCircuit 几乎相同（事实上，稍微慢一点）。

使用 Joshua 的一些最坏情况行为的优秀示例，其中真值位于向量的正中间（我正在用所有建议的函数重复他的实验，以便我们可以看到全貌），我们看到非常不同的情况：

bigv2 <- rep(FALSE, 1e6)
bigv2[5e5-1] <- TRUE
bigv2[5e5+1] <- TRUE
microbenchmark(rangeWhich(bigv2), range_find(bigv2), shortCircuit(bigv2), range(which(bigv2)))
# Unit: microseconds
#                 expr        min          lq        mean      median         uq        max neval
#    rangeWhich(bigv2)    546.206    555.3820    593.1385    575.3790    599.055    979.924   100
#    range_find(bigv2) 400057.083 406449.0075 434515.1142 411881.4145 427487.041 697529.163   100
#  shortCircuit(bigv2)  74942.612  75663.7835  79095.3795  76761.5325  79703.265 125054.360   100
#  range(which(bigv2))    632.086    679.0955    761.9610    700.1365    746.509   3924.941   100

对于这个向量，循环基础 R 解决方案比原始解决方案慢得多（慢 100-600 倍），而 Rcpp 解决方案仅比 range(which(bigv2)) 快（这是有道理的，因为它们都在循环通过整个向量一次）。

像往常一样，这需要附带一个免责声明——您需要编译您的 Rcpp 函数，这也需要时间，所以这只有在您有非常大的向量或多次重复此操作时才有用.从对你的问题的评论来看，你确实有大量的大向量，所以这对你来说可能是个不错的选择。

Answer 3

纯属娱乐。我能想到的最简单的方法不涉及搜索整个向量 or Rcpp :P

shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
microbenchmark(rangeWhich(bigv), shortCircuit(bigv))
# Unit: microseconds
#                expr   min     lq median     uq   max neval
#    rangeWhich(bigv) 1.722 1.8875 1.9995 2.1400 6.850   100
#  shortCircuit(bigv) 1.053 1.1905 1.3245 1.4545 9.207   100

哇哦，我赢了！哦，等等......让我们在最坏的情况下比较两者。

v <- rep(FALSE, 1e6)
v[5e5-1] <- TRUE
v[5e5+1] <- TRUE
library(microbenchmark)
microbenchmark(rangeWhich(v), shortCircuit(v))
# Unit: microseconds
#             expr       min         lq    median        uq       max neval
#    rangeWhich(v)   751.252   884.8805  1109.527  1115.995  1163.135   100
#  shortCircuit(v) 60712.586 61004.2760 61396.715 61994.517 72382.216   100

哦不...我输得很惨。哦，好吧，至少我玩得很开心。 :)

`range(which(..))` 的更快替代品

Faster alternative to `range(which(..))`

binary

performance

boolean

r