Calculate vector whose length is not known beforehand - should I "grow" it?

I need to compute the entries of a vector whose length I do not know beforehand. How can I do this efficiently?

An obvious solution is to "grow" it: start with a small or empty vector and successively append new entries until the stopping criterion is reached. For example:

foo <- numeric(0)
while ( sum(foo) < 100 ) foo <- c(foo,runif(1))
length(foo)
# 195

However, "growing" vectors is frowned upon in R for performance reasons.

Of course, I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.
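
For concreteness, an untested sketch of what I have in mind (64, 100 and runif(1) are placeholders for my actual sizes and computation):

foo <- numeric(64)                     ## pre-allocate a "good-sized" vector
n <- 0                                 ## number of entries filled so far
repeat {
  if (n == length(foo))                ## vector is full:
    length(foo) <- 2 * length(foo)     ## double its length (pads with NA)
  n <- n + 1
  foo[n] <- runif(1)                   ## actual computation goes here
  if (sum(foo[1:n]) >= 100) break      ## stopping criterion
  }
foo <- foo[1:n]                        ## finally cut it down to size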

Is there a better or canonical way to do this? (In my actual application, the computation and the stopping criterion are of course somewhat more complicated.)


Replies to some helpful comments

Even if you don't know the length beforehand, do you know the maximum possible length it can theoretically have? In such cases I tend to initialize the vector with that length and after the loop cut the NAs or remove the unused entries based on the latest index value.

No, the maximum length is not known beforehand.
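
(For readers whose maximum length is known, the commenter's pattern might look like the following sketch, with max_len as a hypothetical bound:)

max_len <- 1000                        ## hypothetical known upper bound
foo <- rep(NA_real_, max_len)          ## initialize with that length
i <- 0
s <- 0
while (s < 100) {
  i <- i + 1
  foo[i] <- runif(1)
  s <- s + foo[i]
  }
foo <- foo[seq_len(i)]                 ## cut the NAs after the loop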

Do you need to keep all values as the vector grows?

Yes, I need to keep all values.

What about something like rand_num <- runif(300); rand_num[cumsum(rand_num) < 100], where you choose a vector large enough that the condition will be met with high probability? You can of course check it and use an even bigger number if it's not met. I've tested up to runif(10000) and it's still faster than "growing".

My actual use case involves a dynamic calculation that I cannot simply vectorize (otherwise I would not be asking).

Specifically, to approximate a convolution of negative binomial random variables, I need to compute the probability masses of the integer random variable $K$ from Theorem 2 in Furman, 2007 up to a high cumulative probability. These masses $pr_k$ involve some complicated recursive sums.
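
Schematically, the loop looks like the sketch below, where dnbinom merely stands in for the actual recursive masses (size and prob are made-up parameters; the real $pr_k$ are too involved to reproduce here):

pr <- numeric(0)                       ## masses pr_k for k = 0, 1, 2, ...
cum_pr <- 0
k <- 0
while (cum_pr < 1 - 1e-10) {           ## up to a high cumulative probability
  pr_k <- dnbinom(k, size = 5, prob = 0.3)  ## stand-in for the real pr_k
  pr <- c(pr, pr_k)                    ## the "growing" I would like to avoid
  cum_pr <- cum_pr + pr_k
  k <- k + 1
  }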

I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.

It sounds like you are referring to the accepted answer of Collecting an unknown number of results in a loop. Have you coded it up and given it a try? The idea of doubling the length is more than sufficient (see the end of this answer), since the length grows geometrically. I will demonstrate my approach below.


For testing purposes, wrap your code in a function. Note how I avoid computing sum(z) for every while test.

ref <- function (stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- numeric(0)
  sum_z <- 0
  while ( sum_z < stop_sum ) {
    z_i <- runif(1)
    z <- c(z, z_i)
    sum_z <- sum_z + z_i
    }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(z)                            ## return result
    }
  }

Chunking is necessary in order to reduce the operational cost of concatenation.

template <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")                    ## store all segments in a list
  sum_z <- 0                             ## cumulative sum
  while ( sum_z < stop_sum ) {
    segmt <- numeric(chunk_size)         ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)                    ## call a function & get a value
      sum_z <- sum_z + z_i               ## update cumulative sum
      segmt[i] <- z_i                    ## fill in the segment
      if (sum_z >= stop_sum) break       ## ready to break at any time
      i <- i + 1
      }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
    }
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(unlist(z))                    ## return result
    }
  }

Let's first check correctness.

z <- ref(1e+4, FALSE)
z1 <- template(5, 1e+4, FALSE)
z2 <- template(1000, 1e+4, FALSE)

range(z - z1)
#[1] 0 0

range(z - z2)
#[1] 0 0

Let's then compare speed.

## reference implementation
t0 <- ref(1e+4, TRUE)

## unrolling implementation
trial_chunk_size <- seq(5, 1000, by = 5)
tm <- sapply(trial_chunk_size, template, stop_sum = 1e+4, timing = TRUE)

## visualize timing statistics
plot(trial_chunk_size, tm, type = "l", ylim = c(0, t0), col = 2, bty = "l")
abline(h = t0, lwd = 2)

It looks like chunk_size = 200 is already good enough, and the speedup factor is

t0 / tm[trial_chunk_size == 200]
#[1] 16.90598

Finally, let's profile to see how much time is spent on growing the vector with c.

Rprof("a.out")
z0 <- ref(1e+4, FALSE)
Rprof(NULL)
summaryRprof("a.out")$by.self
#        self.time self.pct total.time total.pct
#"c"          1.68    90.32       1.68     90.32
#"runif"      0.12     6.45       0.12      6.45
#"ref"        0.06     3.23       1.86    100.00

Rprof("b.out")
z1 <- template(200, 1e+4, FALSE)
Rprof(NULL)
summaryRprof("b.out")$by.self
#        self.time self.pct total.time total.pct
#"runif"      0.10    83.33       0.10     83.33
#"c"          0.02    16.67       0.02     16.67

Adaptive chunk_size with linear growth

ref has O(N * N) operational complexity, where N is the length of the final vector. template in principle has O(M * M) complexity, where M = N / chunk_size. To attain linear complexity O(N), chunk_size needs to grow with N, but a linear growth turns out to be sufficient: chunk_size <- chunk_size + 1. The reason is that if the chunk size is increased by 1 after every chunk, collecting N values takes only M = O(sqrt(N)) chunks, so the O(M * M) cost of growing the list reduces to O(N).

template1 <- function (chunk_size, stop_sum, timing = TRUE) {
  set.seed(0)                            ## fix a seed to compare performance
  if (timing) t1 <- proc.time()[[3]]
  z <- vector("list")                    ## store all segments in a list
  sum_z <- 0                             ## cumulative sum
  while ( sum_z < stop_sum ) {
    segmt <- numeric(chunk_size)         ## initialize a segment
    i <- 1
    while (i <= chunk_size) {
      z_i <- runif(1)                    ## call a function & get a value
      sum_z <- sum_z + z_i               ## update cumulative sum
      segmt[i] <- z_i                    ## fill in the segment
      if (sum_z >= stop_sum) break       ## ready to break at any time
      i <- i + 1
      }
    ## grow the list
    if (sum_z < stop_sum) z <- c(z, list(segmt))
    else z <- c(z, list(segmt[1:i]))
    ## increase chunk_size
    chunk_size <- chunk_size + 1
    }
  ## remove this line if you want
  cat(sprintf("final chunk size = %d\n", chunk_size))
  if (timing) {
    t2 <- proc.time()[[3]]
    return(t2 - t1)                      ## return execution time
    } else {
    return(unlist(z))                    ## return result
    }
  }

A quick test verifies that we have attained linear complexity: the execution time grows by roughly a factor of 10 whenever stop_sum does.

template1(200, 1e+4)
#final chunk size = 283
#[1] 0.103

template1(200, 1e+5)
#final chunk size = 664
#[1] 1.076

template1(200, 1e+6)
#final chunk size = 2012
#[1] 10.848

template1(200, 1e+7)
#final chunk size = 6330
#[1] 108.183