eratosthenes 的平行筛根据线程数产生错误的输出
Parallel sieve of eratosthenes produces wrong output based on number of threads
这是我用于并行化的方法:
- p = 进程 ID,N = 进程总数,n = 输入大小
- 每个进程都分配了
n/N
个编号,范围从索引 p*n/N
到 (p+1)*n/N-1
。
- 当进程 0 找到一个质数时,它的所有倍数被并行标记。
- 最后,每个进程计算其指定范围内的素数并将其添加到全局计数中。
使用 C++/OpenMP 进行并行化。在调用该函数之前,我将最大线程数设置为 omp_set_num_threads()
.
// p is thread ID, N is total number of threads, n is the array size
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : (p * n / N)
#define UPPER_BOUND(p, N, n) p == (N - 1) ? (n - 1) : (p + 1) * (n / N) - 1
size_t parallel_soe(std::vector<std::atomic<bool>>& A) {
size_t n = A.size();
size_t sqrt_n = sqrt(n);
for (size_t i = 2; i <= sqrt_n; i++) {
if (A[i] == true) continue;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t smallest_multiple = std::max(i * i, lower_bound);
size_t remainder = smallest_multiple % i;
if (remainder) smallest_multiple += i - remainder;
for (size_t j = smallest_multiple; j <= upper_bound; j += i)
A[j] = true;
}
}
// Count number of primes
size_t prime_count = 0;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t count = 0;
for (size_t i = lower_bound; i <= upper_bound; i++)
if (A[i] == false)
count++;
#pragma omp atomic
prime_count += count;
}
return prime_count;
}
当最大线程数设置为 2、4、8、10 和 16 时函数返回的值是正确的,但对于 6、12 和 14 则不正确。我 运行 它在一个 4 核 Intel i5。
这是我的输出日志:
Finding primes under: 10000
================================
[2-parallel] Found 1229 primes in 542 microseconds
[4-parallel] Found 1229 primes in 3173 microseconds
[6-parallel] Found 1228 primes in 2353 microseconds
[8-parallel] Found 1229 primes in 3600 microseconds
[10-parallel] Found 1229 primes in 3227 microseconds
[12-parallel] Found 1227 primes in 2778 microseconds
[14-parallel] Found 1226 primes in 2248 microseconds
[16-parallel] Found 1229 primes in 2320 microseconds
Finding primes under: 100000
================================
[2-parallel] Found 9592 primes in 5186 microseconds
[4-parallel] Found 9592 primes in 10351 microseconds
[6-parallel] Found 9591 primes in 9859 microseconds
[8-parallel] Found 9592 primes in 8500 microseconds
[10-parallel] Found 9592 primes in 12294 microseconds
[12-parallel] Found 9591 primes in 8300 microseconds
[14-parallel] Found 9582 primes in 9252 microseconds
[16-parallel] Found 9592 primes in 8557 microseconds
Finding primes under: 1000000
================================
[2-parallel] Found 78498 primes in 36091 microseconds
[4-parallel] Found 78498 primes in 43570 microseconds
[6-parallel] Found 78498 primes in 48176 microseconds
[8-parallel] Found 78498 primes in 44201 microseconds
[10-parallel] Found 78498 primes in 43645 microseconds
[12-parallel] Found 78498 primes in 49175 microseconds
[14-parallel] Found 78494 primes in 47411 microseconds
[16-parallel] Found 78498 primes in 53602 microseconds
我明白了。这与并行执行无关,我只是在计算边界时计算错误。
这是计算下限的正确公式:
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : p * (n / N)
注意 n/n
周围的括号。
这是我用于并行化的方法:
- p = 进程 ID,N = 进程总数,n = 输入大小
- 每个进程都分配了
n/N
个编号,范围从索引p*n/N
到(p+1)*n/N-1
。 - 当进程 0 找到一个质数时,它的所有倍数被并行标记。
- 最后,每个进程计算其指定范围内的素数并将其添加到全局计数中。
使用 C++/OpenMP 进行并行化。在调用该函数之前,我将最大线程数设置为 omp_set_num_threads()
.
// p is thread ID, N is total number of threads, n is the array size
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : (p * n / N)
#define UPPER_BOUND(p, N, n) p == (N - 1) ? (n - 1) : (p + 1) * (n / N) - 1
size_t parallel_soe(std::vector<std::atomic<bool>>& A) {
size_t n = A.size();
size_t sqrt_n = sqrt(n);
for (size_t i = 2; i <= sqrt_n; i++) {
if (A[i] == true) continue;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t smallest_multiple = std::max(i * i, lower_bound);
size_t remainder = smallest_multiple % i;
if (remainder) smallest_multiple += i - remainder;
for (size_t j = smallest_multiple; j <= upper_bound; j += i)
A[j] = true;
}
}
// Count number of primes
size_t prime_count = 0;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t count = 0;
for (size_t i = lower_bound; i <= upper_bound; i++)
if (A[i] == false)
count++;
#pragma omp atomic
prime_count += count;
}
return prime_count;
}
当最大线程数设置为 2、4、8、10 和 16 时函数返回的值是正确的,但对于 6、12 和 14 则不正确。我 运行 它在一个 4 核 Intel i5。
这是我的输出日志:
Finding primes under: 10000
================================
[2-parallel] Found 1229 primes in 542 microseconds
[4-parallel] Found 1229 primes in 3173 microseconds
[6-parallel] Found 1228 primes in 2353 microseconds
[8-parallel] Found 1229 primes in 3600 microseconds
[10-parallel] Found 1229 primes in 3227 microseconds
[12-parallel] Found 1227 primes in 2778 microseconds
[14-parallel] Found 1226 primes in 2248 microseconds
[16-parallel] Found 1229 primes in 2320 microseconds
Finding primes under: 100000
================================
[2-parallel] Found 9592 primes in 5186 microseconds
[4-parallel] Found 9592 primes in 10351 microseconds
[6-parallel] Found 9591 primes in 9859 microseconds
[8-parallel] Found 9592 primes in 8500 microseconds
[10-parallel] Found 9592 primes in 12294 microseconds
[12-parallel] Found 9591 primes in 8300 microseconds
[14-parallel] Found 9582 primes in 9252 microseconds
[16-parallel] Found 9592 primes in 8557 microseconds
Finding primes under: 1000000
================================
[2-parallel] Found 78498 primes in 36091 microseconds
[4-parallel] Found 78498 primes in 43570 microseconds
[6-parallel] Found 78498 primes in 48176 microseconds
[8-parallel] Found 78498 primes in 44201 microseconds
[10-parallel] Found 78498 primes in 43645 microseconds
[12-parallel] Found 78498 primes in 49175 microseconds
[14-parallel] Found 78494 primes in 47411 microseconds
[16-parallel] Found 78498 primes in 53602 microseconds
我明白了。这与并行执行无关,我只是在计算边界时计算错误。
这是计算下限的正确公式:
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : p * (n / N)
注意 n/n
周围的括号。