eratosthenes 的平行筛根据线程数产生错误的输出

Parallel sieve of eratosthenes produces wrong output based on number of threads

这是我用于并行化的方法:

  1. p = 进程 ID,N = 进程总数,n = 输入大小
  2. 每个进程都分配了 n/N 个编号,范围从索引 p*n/N(p+1)*n/N-1
  3. 当进程 0 找到一个质数时,它的所有倍数被并行标记。
  4. 最后,每个进程计算其指定范围内的素数并将其添加到全局计数中。

使用 C++/OpenMP 进行并行化。在调用该函数之前,我将最大线程数设置为 omp_set_num_threads().

// p is thread ID, N is total number of threads, n is the array size
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : (p * n / N)
#define UPPER_BOUND(p, N, n) p == (N - 1) ? (n - 1) : (p + 1) * (n / N) - 1

size_t parallel_soe(std::vector<std::atomic<bool>>& A) {
    size_t n = A.size();
    size_t sqrt_n = sqrt(n);
    for (size_t i = 2; i <= sqrt_n; i++) {
        if (A[i] == true) continue;
        #pragma omp parallel
        {
            uint16_t p = omp_get_thread_num();
            uint16_t N = omp_get_num_threads();
            size_t lower_bound = LOWER_BOUND(p, N, n);
            size_t upper_bound = UPPER_BOUND(p, N, n);
            size_t smallest_multiple = std::max(i * i, lower_bound);
            size_t remainder = smallest_multiple % i;
            if (remainder) smallest_multiple += i - remainder;
            for (size_t j = smallest_multiple; j <= upper_bound; j += i)
                A[j] = true;
        }
    }
    // Count number of primes
    size_t prime_count = 0;
    #pragma omp parallel
    {
        uint16_t p = omp_get_thread_num();
        uint16_t N = omp_get_num_threads();
        size_t lower_bound = LOWER_BOUND(p, N, n);
        size_t upper_bound = UPPER_BOUND(p, N, n);
        size_t count = 0;
        for (size_t i = lower_bound; i <= upper_bound; i++)
            if (A[i] == false)
                count++;
        #pragma omp atomic
        prime_count += count;
    }
    return prime_count;
}

当最大线程数设置为 2、4、8、10 和 16 时函数返回的值是正确的,但对于 6、12 和 14 则不正确。我 运行 它在一个 4 核 Intel i5。

这是我的输出日志:

Finding primes under: 10000
================================
[2-parallel]  Found 1229 primes in 542 microseconds
[4-parallel]  Found 1229 primes in 3173 microseconds
[6-parallel]  Found 1228 primes in 2353 microseconds
[8-parallel]  Found 1229 primes in 3600 microseconds
[10-parallel] Found 1229 primes in 3227 microseconds
[12-parallel] Found 1227 primes in 2778 microseconds
[14-parallel] Found 1226 primes in 2248 microseconds
[16-parallel] Found 1229 primes in 2320 microseconds

Finding primes under: 100000
================================
[2-parallel]  Found 9592 primes in 5186 microseconds
[4-parallel]  Found 9592 primes in 10351 microseconds
[6-parallel]  Found 9591 primes in 9859 microseconds
[8-parallel]  Found 9592 primes in 8500 microseconds
[10-parallel] Found 9592 primes in 12294 microseconds
[12-parallel] Found 9591 primes in 8300 microseconds
[14-parallel] Found 9582 primes in 9252 microseconds
[16-parallel] Found 9592 primes in 8557 microseconds

Finding primes under: 1000000
================================
[2-parallel]  Found 78498 primes in 36091 microseconds
[4-parallel]  Found 78498 primes in 43570 microseconds
[6-parallel]  Found 78498 primes in 48176 microseconds
[8-parallel]  Found 78498 primes in 44201 microseconds
[10-parallel] Found 78498 primes in 43645 microseconds
[12-parallel] Found 78498 primes in 49175 microseconds
[14-parallel] Found 78494 primes in 47411 microseconds
[16-parallel] Found 78498 primes in 53602 microseconds

我明白了。这与并行执行无关,我只是在计算边界时计算错误。

这是计算下限的正确公式:

#define LOWER_BOUND(p, N, n) p == 0 ? 2 : p * (n / N)

注意 n/n 周围的括号。