OpenMP basics and different outputs

Question

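I have just started learning OpenMP and, like most newcomers to parallel programming, I am writing a simple program to evaluate an integral in parallel. Here is the code I wrote: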

        num_chunks = 1000;
        int threads_assigned = 0;
        int num_threads = 10;// omp_get_num_threads();
        sum_arr = (double *) malloc(sizeof(double) * num_threads);
        memset(sum_arr, 0, num_threads);
        int sums_each_thread = num_chunks / num_threads;
        printf("num of threads = %d\n", num_threads);
        sum = 0.0;
        omp_set_num_threads(num_threads);
        chunk_size = 1.0/(double) num_chunks;
        double start_time = omp_get_wtime();
        #pragma omp parallel
        {
                int idx = omp_get_thread_num();
                if (idx == 0) threads_assigned = omp_get_num_threads();
                int sums_count;
                int loop_size = idx * sums_each_thread;
                for (sums_count = loop_size; sums_count < loop_size + sums_each_thread; sums_count++) {
                        printf("starting thread no %d\n", idx);
                        // double x = (idx * chunk_size) + (chunk_size / 2.0); simplified version in the next line
                        double x = chunk_size * (sums_count + 0.5);
                        double y = 4.0 / (1 + x*x);
                        sum_arr[idx - 1] += (y * chunk_size);
                        sum += y;
                }
                printf("ending thread no %d\n", idx);
        }
        double end_time = omp_get_wtime();
        printf("sum = %f, in time %f(s)\n", sum * chunk_size, end_time - start_time);
        int new_idx = 0;
        sum = 0.0;
        for(new_idx = 0; new_idx < num_threads; new_idx++) {
                sum += sum_arr[new_idx];
        }
        printf("arr sum = %f, with total assigned threads = %d\n", sum, threads_assigned);

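I compute the area in two ways:

1) Summing the areas of the individual rectangles (the sum_arr array).

2) Summing all the heights (the scalar sum) and then multiplying by chunk_size.

Mathematically, both methods should give the same result, but with the code above, method 2 always shows the correct result while method 1 fails.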
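Written out, the claimed equivalence looks as follows. The notation is introduced here for illustration only: $\Delta x$ stands for `chunk_size`, $N$ for `num_chunks`, and $x_i$ for the midpoint of chunk $i$.

```latex
\[
\underbrace{\sum_{i=0}^{N-1} \frac{4}{1+x_i^{2}}\,\Delta x}_{\text{method 1}}
\;=\;
\underbrace{\Delta x \sum_{i=0}^{N-1} \frac{4}{1+x_i^{2}}}_{\text{method 2}}
\;\approx\; \int_{0}^{1} \frac{4}{1+x^{2}}\,dx \;=\; \pi,
\qquad x_i = \Delta x\,(i + 0.5),\quad \Delta x = \frac{1}{N}.
\]
```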

Can anyone explain why?

Answer 1

There are two bugs in your code.

The first is an access to an invalid entry of sum_arr: the master thread (idx == 0) accesses element -1. Replace

    sum_arr[idx - 1] += (y * chunk_size);

with

    sum_arr[idx] += (y * chunk_size);

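The other problem is that sum is modified without synchronization. Fetching the previous value of sum and storing the updated value can interleave with the same operations in other threads, which corrupts the result. This effect should become observable once printf("starting thread no %d\n", idx); is removed from the inner loop. To fix it, just use an atomic update:

    #pragma omp atomic
    sum += y;

Now both ways of computing the sum give the correct result.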
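Both fixes can be put together in a minimal sketch. It is not the asker's exact program: the helper name `integrate_fixed` is hypothetical, the iterations are divided with `omp for` instead of by hand, and serial fallbacks are defined so the code also builds without `-fopenmp`. It uses `calloc` because the question's `memset(sum_arr, 0, num_threads)` clears only `num_threads` bytes rather than `num_threads` doubles.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: lets the sketch build without -fopenmp). */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical helper, not from the question: returns the scalar-sum result
   (method 2) and stores the per-thread-array result (method 1) in
   *arr_result. Both are set to -1.0 if the allocation fails. */
double integrate_fixed(int num_chunks, double *arr_result)
{
    double chunk_size = 1.0 / (double) num_chunks;
    int max_threads = omp_get_max_threads();
    /* calloc zeroes the whole array, one slot per possible thread. */
    double *sum_arr = calloc((size_t) max_threads, sizeof *sum_arr);
    double sum = 0.0;

    if (sum_arr == NULL) {
        *arr_result = -1.0;
        return -1.0;
    }

    #pragma omp parallel
    {
        int idx = omp_get_thread_num();

        /* omp for splits the iterations among the threads of the team. */
        #pragma omp for
        for (int i = 0; i < num_chunks; i++) {
            double x = chunk_size * (i + 0.5);
            double y = 4.0 / (1.0 + x * x);

            sum_arr[idx] += y * chunk_size;  /* fix 1: idx, not idx - 1 */

            #pragma omp atomic               /* fix 2: synchronized update */
            sum += y;
        }
    }

    /* Combine the per-thread partial areas serially. */
    double arr_sum = 0.0;
    for (int t = 0; t < max_threads; t++)
        arr_sum += sum_arr[t];
    free(sum_arr);

    *arr_result = arr_sum;
    return sum * chunk_size;
}
```

Calling `integrate_fixed(1000, &arr)` should give two values that agree up to floating-point rounding, since the partial sums are only added in a different order.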

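However, the program scales poorly. One reason is "false sharing" on sum_arr: when threads access the same cache line, the resulting CPU-level synchronization is expensive.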
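One common mitigation for false sharing, shown here only as a sketch, is to pad each thread's accumulator so that neighbouring slots do not land on the same cache line. The 64-byte line size, the `padded_sum` type and the `integrate_padded` helper are assumptions made for illustration; the actual line size depends on the CPU.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: lets the sketch build without -fopenmp). */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Assumption: 64-byte cache lines; the real size depends on the CPU. */
#define CACHE_LINE_BYTES 64

/* One accumulator per thread, padded so that the value fields of
   neighbouring slots are a full cache line apart. */
typedef struct {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
} padded_sum;

/* Hypothetical helper: midpoint-rule area with padded per-thread sums.
   Returns -1.0 if the allocation fails. */
double integrate_padded(int num_chunks)
{
    double chunk_size = 1.0 / (double) num_chunks;
    int max_threads = omp_get_max_threads();
    padded_sum *sums = calloc((size_t) max_threads, sizeof *sums);
    double total = 0.0;

    if (sums == NULL)
        return -1.0;

    #pragma omp parallel
    {
        int idx = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < num_chunks; i++) {
            double x = chunk_size * (i + 0.5);
            double y = 4.0 / (1.0 + x * x);
            /* Each thread writes only to its own padded slot. */
            sums[idx].value += y * chunk_size;
        }
    }

    /* Combine the per-thread partial areas serially. */
    for (int t = 0; t < max_threads; t++)
        total += sums[t].value;
    free(sums);
    return total;
}
```

An alternative that needs no padding is to accumulate into a local variable inside the parallel region and store it into the shared array once, after the loop.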

Another issue is the atomic instruction on the critical sum += y update, which suffers from the same problem.

What you probably want is OpenMP's reduction feature:

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_chunks; ++i) {
        double x = chunk_size * (i + 0.5);
        double y = 4.0 / (1 + x*x);
        sum += y;
    }

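Now it scales perfectly and is many (~100) times faster than the corrected loop from the OP's question. It is so fast that num_chunks > 10e7 is needed for the loop to take a significant part of the program's execution time.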
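To check the timing on your own machine, the reduction loop can be wrapped in a small function that also reports the elapsed time. This is a sketch under assumptions: the helper name `integrate_reduction` and its `elapsed` parameter are hypothetical, and without `-fopenmp` it falls back to `clock()`, which measures CPU time rather than wall-clock time.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Serial fallback timer (assumption: CPU time is good enough here). */
static double omp_get_wtime(void) { return (double) clock() / CLOCKS_PER_SEC; }
#endif

/* Hypothetical helper: midpoint-rule estimate using the reduction clause.
   If elapsed is not NULL, the loop's duration in seconds is stored there. */
double integrate_reduction(long num_chunks, double *elapsed)
{
    double chunk_size = 1.0 / (double) num_chunks;
    double sum = 0.0;
    double start_time = omp_get_wtime();

    /* Each thread gets a private copy of sum; OpenMP adds the copies
       together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_chunks; ++i) {
        double x = chunk_size * (i + 0.5);
        double y = 4.0 / (1 + x * x);
        sum += y;
    }

    if (elapsed != NULL)
        *elapsed = omp_get_wtime() - start_time;
    return sum * chunk_size;
}
```

Calling it with a large `num_chunks` and printing both the result and `elapsed` makes it possible to compare against the atomic version; the measured speedup will depend on the machine and the number of threads.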

c

parallel-processing

openmp