为什么当我将计算量减半时，我在时间上只得到了微小的改进？

Question

我想编写一个程序来打印和计算 1 到 k 之间的所有素数。这个想法是我们取 1 和 k 之间的每个 odd 数字 i 并检查 i 的奇数因子直到 sqrt(i)。我的计算机花了 1 分 47 秒来检查 k = 1 000 000 以内的素数。

然后我尝试改进我的代码：我没有检查从 1 到 k 的所有奇数，而是从这些奇数中删除了所有 3、5 和 7 的倍数，所以基本上我删除了大约一半的数字必须检查。我再次运行程序，并且 k = 1 000 000 我得到了 1 分 40 秒的时间，只有 7 秒的改进。

这是我的代码：

   #The program has some weird quirks 
#(it doesn't print 3 and 5 because of some errors I was getting) but other than that 
#it seems to work fine

#k is the number that we check up to
k <- 1000000

#n is the number of primes, initialized at 2
n <- 2

#We take only the odd numbers between 5 and k
y <- seq(5, k, 2)

#We take each member of i and check it for primality
for (i in y) {
  #i is assumed to be prime until proved otherwise
  prime <- TRUE

  #We check the remainder when i is divided by every odd number less than its square root
  for (j in c(2, seq(3, ceiling(sqrt(i)), 2))) {
    if (i %% j == 0) {

      #If it's found that some number divides i, we set prime to false and move on 
      #to the next i
      prime <- FALSE
      break
    }
  }
  #If no number has divided i, we haven't broken the loop so R will get to this point
  #We shouldn't need the if statement, but for some reason the n counter 
  #gets too big if I remove the statement
  if (prime) {
    print(i)
    n <- n + 1
  }
}

#Print out the number of primes
print(paste("There are", n, "prime numbers between", 1, "and", k))

在我还删除了 3、5 和 7 的倍数的版本中，我只是将 y 定义为

y <- setdiff((setdiff((setdiff(seq(5, k, 2), seq(6, k, 3))), seq(10, k, 5))), seq(14, k, 7))

抱歉，它很乱，但它有效。

Answer 1

分析您的代码以找出大部分时间在哪里。作为一般规则，不要使用必须确定其方法的泛型。将 seq.int 换成 seq。尤其是在 R 中，您不能只计算您认为某事会进行的计算次数并减少计算次数。其实看看东西怎么叫。

以下是分析代码的方法：

调用Rprof(tmp <- tempfile())，然后执行你的算法，然后调用Rprof();summaryRprof(tmp)。查看高 total.pct 的调用，并在那里进行调整。

如果您想要一个快速函数来快速确定 R 中某个数以内的所有素数，这里有一个适合您。我不认为它是我写的，但我的文件没有告诉我它是从哪里得到的。它是 Eratosthenes 筛法的一个实现。

sieve <- function(n){
  n <- as.integer(n)
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  fsqr <- floor(sqrt(n))
  while (last.prime <= fsqr){
    primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
    sel <- which(primes[(last.prime+1):(fsqr+1)])
    if(any(sel)){
      last.prime <- last.prime + min(sel)
    }else last.prime <- fsqr+1
  }
  which(primes)
}

编辑：这肯定不是我写的，虽然我的记忆让我觉得我可能已经从其他语言在 R 中实现了它，但事实并非如此。对代码的快速 google 搜索显示，几年前我还是一个潜伏者时，我直接从 post 那里得到了它。第一个筛选答案是由@GeorgeDontas post编辑的：

编辑 2：现在回顾我的文件，我发现我的最小调整可以节省更多时间，并且可能不可靠地避免了罕见的 10 倍增加：

prime_logic <- function(n){
  n <- as.integer(n)
  primes <- rep_len(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  fsqr <- floor(sqrt(n))
  while (last.prime <= fsqr){
    primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
    sel <- which(primes[seq.int(last.prime+1, fsqr+1)])
    if(any(sel)){
      last.prime <- last.prime + min(sel)
    }else last.prime <- fsqr+1
  }
  primes
}

Unit: milliseconds
               expr      min       lq     mean   median       uq       max neval cld
       sieve(1e+06) 29.80531 36.41955 50.24195 38.53144 41.86716 889.47422   100   a
 prime_logic(1e+06) 22.19734 33.05792 39.35390 35.52268 41.02838  71.68033   100   a

Answer 2

我知道这有点离题，但我认为这是一个有用的注释。

当数字越来越大时，你根本无法使用这种方法来寻找质数。您需要使用概率方法。 Miller–Rabin primality test which is used for example in RSA's prime numbers.

最著名的检查数字是否为素数的方法之一。

Answer 3

这是@DanHall 答案的另一种选择function，它使用了您的逻辑（如您的素数测试方法）：

primetest <- function(i, k){
  if(i >= 5 && i < k){
    vtest <- i %% c(2, seq(3, ceiling(sqrt(i)), 2))
    condition <- !is.element(0, vtest)
    if(condition){return(i)}
  }
}
V1 <- unlist(sapply(y, primetest, k = k))
length(V1)
# number of primes between 5 and k
[1] 78495
# here are the first 100 primes yielded, i.e. >= 5
V1[1:100]
 [1]   5   7  11  13  17  19  23  29  31  37  41  43  47  53  59  61  67  71  73  79  83  89
 [23]  97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181 191 193 197 199
 [45] 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307 311 313 317 331 337
 [67] 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
 [89] 467 479 487 491 499 503 509 521 523 541 547 557

绝对比使用 loop 快，但我必须承认我不知道其他答案是否更快。

为什么当我将计算量减半时，我在时间上只得到了微小的改进？

Why am I getting only marginal improvements in time when I cut the computations in half?

algorithm

performance

primes

r

computation