如何有效地找到排序数组中值的索引?

How can I efficiently find the index of a value in a sorted array?

我有一个排序的值数组和一个值,如下所示:

x <- c(1.0, 3.45, 5.23, 7.3, 12.5, 23.45)
v <- 6.45

我可以在保持排序顺序的同时找到将 v 插入 x 之后的值的索引:

max(which(x <= v))
[1] 3

这是很好且紧凑的代码,但我有直觉,在幕后这确实效率很低:因为 which() 不知道数组已排序,所以必须检查所有值。

有没有更好的方法找到这个索引值?

注意:我对实际将 v 合并到 x 不感兴趣。我只想要索引值。

如果您需要更快的版本并且不需要检查您的输入,您可以编写一个简单的 C++ 函数:

Rcpp::cppFunction(
  "int foo(double x, const Rcpp::NumericVector& v)
  {
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;
      }
      else
      {
        min = idx;
      }
    }
    return min + 1;
  }"
)

有需要的可以自行查看if (x < v[0])(不知道你想看这种情况)。您可以使用包 microbenchmark:

对其进行测试
library(microbenchmark)

n = 1e6
v = sort(rnorm(n, 0, 15))
x = runif(1, -15, 15)
microbenchmark(max(which(v <= x)), sum(v <= x), findInterval(x, v), foo(x, v))

结果:

您可以使用 findInterval,它利用了二进制搜索。

findInterval(v, x)
#[1] 3

或使用 C++ upper_boundRcpp

Rcpp::cppFunction(
  "int upper_bound(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

upper_bound(v, x)
#[1] 3

或者如果您还有一个位置向量,例如 findInterval

Rcpp::cppFunction("
Rcpp::IntegerVector upper_bound2(const Rcpp::NumericVector& v
                               , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")

v <- c(3, 6.45)
upper_bound2(v, x)
#[1] 1 3
findInterval(v, x)
#[1] 1 3

基准基于

# Functions:
Rcpp::cppFunction(
  "int Erop(double x, const Rcpp::NumericVector& v)
  {
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;
      }
      else
      {
        min = idx;
      }
    }
    return min + 1;
  }"
)

Rcpp::cppFunction(
  "int GKi(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

Rcpp::cppFunction("
Rcpp::IntegerVector GKi2(const Rcpp::NumericVector& v 
                       , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")
# Data:
set.seed(42)
x <- sort(rnorm(1e6))
v <- sort(c(sample(x, 15), rnorm(15)))
# Result:
bench::mark(whichMax= sapply(v, \(v) max(which(x <= v)))
          , sum = sapply(v, \(v) sum(x<=v))
          , findInterval = findInterval(v, x)
          , Erop = sapply(v, \(v) Erop(v, x))
          , GKi = sapply(v, \(v) GKi(v, x))
          , GKi2 = GKi2(v, x)
)
#  expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 whichMax      92.03ms 102.32ms      9.15        NA    102.      5    56
#2 sum           74.91ms  77.84ms     12.0         NA     37.9     6    19
#3 findInterval 680.41µs 755.61µs   1263.          NA      0     632     0
#4 Erop          57.19µs  62.13µs  12868.          NA     24.0  6432    12
#5 GKi           54.53µs   60.4µs  13316.          NA     24.0  6657    12
#6 GKi2           2.02µs   2.38µs 386027.          NA      0   10000     0