OpenMP gets significantly slower when Array reaches certain size

I am trying to work through a homework example in parallel computing. Here is the given code snippet:

for (int i=0; i < n-1; i++) {
    x[i] = (y[i] + x[i+1]) / 7;
}

We have to analyze the snippet with regard to its dependencies, then try to parallelize it and benchmark it with an appropriate sample size. My first attempt was simply to copy x and remove the anti-dependency here:

double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to prevent i + 1 in loop we start x[1] and copy n - 1 elements
#pragma omp parallel for 
for (int i = 0; i < n-1; i++) {
    x[i] = (y[i] + x_old[i]) / 7; 
}

This code works; I compared the serial and parallel checksums of the array and they are identical. Not counting the memcpy(), the speedup is between 1.6 and 2 for 100_000_000 - 400_000_000 array elements. However, if I increase the array size further (around 470_000_000 elements), the speedup drops sharply, and at 500_000_000 array elements it is only 0.375. This is independent of how many threads I use.

I benchmarked it on 3 different machines with the same results (in terms of speedup):

Lenovo IdeaPad 5

a desktop PC

and on one of our cluster node PCs

Why does the speedup drop, and why does it drop so quickly? Why does it happen at this particular array size?

Here is a minimal example program:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include "omp.h"

void init(double *arr, int size);
double sum(double *arr, int size);

int main(int argc, char **argv){
    if(argc != 3){
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);
    omp_set_num_threads(threads);

    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    init(x, n);
    init(y, n);
    double start = omp_get_wtime();
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x[i+1]) / 7;
    }
    double end = omp_get_wtime();
    double elapsed_s = end - start;
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);
    printf("single checksum: %f\n", checksum_s);
    init(x, n); //reinit
    init(y, n); //reinit
    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to prevent i + 1 in loop we start x[1] and copy n - 1 elements
    start = omp_get_wtime();
    #pragma omp parallel for 
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x_old[i]) / 7; 
    }
    end = omp_get_wtime();
    double elapsed_p = end - start;
    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);
    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);
    printf("%s\n", fabs((checksum_p - checksum_s)) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
}

void init(double *arr, int size){
    for(int i = 0; i < size; ++i){
        arr[i] = 1.0;
    }
}
double sum(double *arr, int size){
    double sum = 0.0;
    for(int i = 0; i < size; ++i){
        sum += arr[i];
    }
    return sum;
}

Here is a sample benchmark:

The problem most likely comes from swap memory. Indeed, the code allocates 3 arrays of 500_000_000 items, resulting in 4 GiB of RAM per array and 12 GiB overall. The thing is, two of the benchmarked machines have only 16 GiB of usable RAM, and the OS and running applications typically already need 1-4 GiB of RAM. When the amount of available physical memory becomes small, the OS starts moving memory pages to a space called swap memory, which is generally mapped onto a storage device that is far slower than regular RAM. Note that some operating systems (like Windows) additionally use ZRAM to compress data when the amount of available memory becomes small. Mainstream OSes (Windows, Linux and macOS) also need some extra space for buffers (e.g. storage-device buffers, networking, etc.). Allocating memory pages also becomes slower when the amount of free memory is small. Also note that, on mainstream systems, pages are only allocated in RAM when they are first touched. For more information about this, please read this article.
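For reference, a quick back-of-the-envelope check of the footprint described above (a sketch only; 500_000_000 is the element count from the question, and GB is used loosely, as in the figures above):

#include <stdio.h>

int main(void) {
    long long n = 500000000LL;                              /* array size from the question */
    long long per_array = n * (long long)sizeof(double);    /* 8 bytes per double           */
    long long total = 3 * per_array;                        /* x, y and x_old               */
    printf("per array: %.1f GB, total: %.1f GB\n",
           per_array / 1e9, total / 1e9);                   /* ~4.0 GB each, ~12.0 GB total */
    return 0;
}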

Allocating the huge x_old array is not efficient (in either time or space). Indeed, the memcpy takes a significant amount of time to copy data in memory, and this operation is bound by the RAM throughput, which is generally quite low. The operation can be made faster by working in chunks. The idea is to copy a small block of data (say, 16 KiB) in each thread so the operation can be done in each core's L1 cache. This means the cost of the memcpy nearly vanishes, because CPU caches are much faster than RAM (in both latency and throughput). In fact, the L1 cache of a notebook/desktop CPU should deliver > 100 GiB/s per core, for a cumulative throughput of > 600 GiB/s, whereas the RAM throughput should not exceed 40 GiB/s in practice (roughly 15x slower).
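As a rough illustration of that chunked approach (a sketch under my own assumptions, not code from this answer: CHUNK, the function name, and the boundary-saving step are mine), each thread copies only the small slice of x it needs into a stack buffer that fits in L1 before overwriting it:

#include <stdlib.h>
#include <string.h>

#define CHUNK 2048   /* 2048 doubles = 16 KiB, assumed to fit comfortably in L1 */

void update_chunked(double *x, const double *y, int n)
{
    int m = n - 1;                               /* number of elements to update */
    int nchunks = (m + CHUNK - 1) / CHUNK;
    double *bound = malloc(nchunks * sizeof(*bound));

    /* Save the single value each chunk reads from the next chunk's range,
       so no thread reads an element another thread may already have written. */
    for (int c = 0; c < nchunks; c++) {
        int last = (c + 1) * CHUNK;
        bound[c] = x[last < m ? last : m];
    }

    #pragma omp parallel for schedule(static)
    for (int c = 0; c < nchunks; c++) {
        int base = c * CHUNK;
        int len = (base + CHUNK < m) ? CHUNK : m - base;
        double buf[CHUNK];                       /* per-thread copy, stays in L1 cache */

        memcpy(buf, &x[base + 1], (len - 1) * sizeof(double));
        buf[len - 1] = bound[c];                 /* boundary element saved beforehand */

        for (int i = 0; i < len; i++)
            x[base + i] = (y[base + i] + buf[i]) / 7;
    }
    free(bound);
}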

For the cluster node it is a bit more complicated, because it is affected by NUMA. NUMA architectures are pretty complex. For more information, I advise you to read this article first and then follow the related links. In a nutshell, you need to care about where pages are allocated (the allocation policy). The key point is to control the first touch of the pages (on most platforms).
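As a minimal sketch of what first touch means in practice (my illustration, not code from the answer): initialize the arrays with the same OpenMP schedule as the compute loop, so each page is first touched, and therefore physically placed, on the NUMA node of the thread that will later work on it:

void init_first_touch(double *arr, int size)
{
    /* Same static schedule as the compute loop: the thread that will later
       update arr[i] is the one that touches (and thus places) its pages here. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < size; ++i)
        arr[i] = 1.0;
}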

A few issues ...

  1. Using x_old in the parallel version means using 3 arrays (i.e. x, y, x_old) instead of 2 (i.e. x, y), which puts extra pressure on the memory controller and the caches, so the comparison is not valid. The parallel version is at a distinct disadvantage.
  2. You only show figures for a single thread and for 8 threads. Your program is memory-bus [and cache] bound rather than computation bound.
  3. From Atomically increment two integers with CAS, follow the video link to a talk: https://www.youtube.com/watch?v=1obZeHnAwz4&t=1251 The speaker benchmarks something similar there. His conclusion was that anything more than 4 threads swamps the memory bus (i.e. it actually reduces performance). The extra threads just add overhead from thread switching and cache misses, and they reduce locality of reference.
  4. Other things on the system can affect the numbers. Other [unrelated] threads/processes can steal CPU time and memory, as can kernel overhead for handling devices and interrupts. To minimize this effect, each test should be run multiple times (e.g. 10) and we should use the minimum time of each setup.
  5. For tests with a short duration (e.g. under one second), the overhead of starting/stopping threads can be a factor.

For example, I just ran your program with: 1000000 8 and got:

single elapsed: 0.008146s
single checksum: 285715.000000
omp threads 8 Thread(s) elapsed: 0.008198s
omp threads 8 Thread(s) single checksum: 285715.000000
OK
Speedup: 0.994

When I ran it with: 1000000 4, I got:

single elapsed: 0.008502s
single checksum: 285715.000000
omp threads 4 Thread(s) elapsed: 0.002677s
omp threads 4 Thread(s) single checksum: 285715.000000
OK
Speedup: 3.176

And when I ran it another time with: 1000000 8, I got:

single elapsed: 0.008450s
single checksum: 285715.000000
omp threads 8 Thread(s) elapsed: 0.005630s
omp threads 8 Thread(s) single checksum: 285715.000000
OK
Speedup: 1.501

So, 4 threads is faster than 8. And the two runs with 8 produced [wildly] different results.


UPDATE:

I refactored the program to run the multiple tests as mentioned above:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include "omp.h"

void init(double *arr, int size);
double sum(double *arr, int size);

double *x;
double *y;

typedef void (*fnc_p)(double *x,double *y,int n);

void
dosingle(double *x,double *y,int n)
{

    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x[i + 1]) / 7;
    }
}

void
doparallel(double *x,double *y,int n)
{

#pragma omp parallel for
    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x[i + 1]) / 7;
    }
}

double
timeone(double *x,double *y,int n,fnc_p fnc)
{

    init(x, n);
    init(y, n);
    double start = omp_get_wtime();

    fnc(x,y,n);

    double end = omp_get_wtime();
    double elapsed = end - start;

    printf(" elapsed: %fs", elapsed);

    double checksum = sum(x, n);
    printf(" checksum: %f\n", checksum);

    return elapsed;
}

double
timeall(int threads,double *x,double *y,int n,fnc_p fnc,const char *who)
{

    omp_set_num_threads(threads);
    printf("\n");
    printf("threads: %d\n",threads);

    double best = 1e9;

    for (int irun = 1;  irun <= 10;  ++irun) {
        printf("%s/%d:",who,irun);
        double elapsed = timeone(x,y,n,fnc);
        if (elapsed < best)
            best = elapsed;
    }

    return best;
}

int
main(int argc, char **argv)
{
    if (argc != 3) {
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);

    x = malloc(sizeof(*x) * n);
    y = malloc(sizeof(*y) * n);

    double elapsed_s = timeall(1,x,y,n,dosingle,"single");

    double best_elapsed = 1e9;
    int best_nthread = 0;
    for (int ithr = 2;  ithr <= threads;  ++ithr) {
        double elapsed_p = timeall(ithr,x,y,n,doparallel,"parallel");
        printf("Speedup: %.3f\n", elapsed_s / elapsed_p);

        if (elapsed_p < best_elapsed) {
            best_elapsed = elapsed_p;
            best_nthread = ithr;
        }
    }

    printf("\n");
    printf("FINAL %d is best with %fs and speedup %.3f\n",
        best_nthread,best_elapsed,elapsed_s / best_elapsed);

#if 0
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);

    printf("single checksum: %f\n", checksum_s);
    init(x, n);                         // reinit
    init(y, n);                         // reinit
    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x

    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to prevent i + 1 in loop we start x[1] and copy n - 1 elements
    start = omp_get_wtime();
#pragma omp parallel for
    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x_old[i]) / 7;
    }
    end = omp_get_wtime();
    double elapsed_p = end - start;

    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);

    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);
    printf("%s\n", fabs((checksum_p - checksum_s)) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
#endif

    return 0;
}

void
init(double *arr, int size)
{
    for (int i = 0; i < size; ++i) {
        arr[i] = 1.0;
    }
}

double
sum(double *arr, int size)
{
    double sum = 0.0;

    for (int i = 0; i < size; ++i) {
        sum += arr[i];
    }
    return sum;
}

Here is the test output for: 1000000 16:


threads: 1
single/1: elapsed: 0.008467s checksum: 285715.000000
single/2: elapsed: 0.008178s checksum: 285715.000000
single/3: elapsed: 0.008460s checksum: 285715.000000
single/4: elapsed: 0.008126s checksum: 285715.000000
single/5: elapsed: 0.008250s checksum: 285715.000000
single/6: elapsed: 0.008206s checksum: 285715.000000
single/7: elapsed: 0.007933s checksum: 285715.000000
single/8: elapsed: 0.008440s checksum: 285715.000000
single/9: elapsed: 0.008267s checksum: 285715.000000
single/10: elapsed: 0.008671s checksum: 285715.000000

threads: 2
parallel/1: elapsed: 0.004318s checksum: 285714.897959
parallel/2: elapsed: 0.004216s checksum: 285714.897959
parallel/3: elapsed: 0.004224s checksum: 285714.897959
parallel/4: elapsed: 0.004223s checksum: 285714.897959
parallel/5: elapsed: 0.004201s checksum: 285714.897959
parallel/6: elapsed: 0.004235s checksum: 285714.897959
parallel/7: elapsed: 0.004276s checksum: 285714.897959
parallel/8: elapsed: 0.004161s checksum: 285714.897959
parallel/9: elapsed: 0.004163s checksum: 285714.897959
parallel/10: elapsed: 0.004188s checksum: 285714.897959
Speedup: 1.906

threads: 3
parallel/1: elapsed: 0.002930s checksum: 285714.795918
parallel/2: elapsed: 0.002824s checksum: 285714.795918
parallel/3: elapsed: 0.002788s checksum: 285714.795918
parallel/4: elapsed: 0.002855s checksum: 285714.795918
parallel/5: elapsed: 0.002816s checksum: 285714.795918
parallel/6: elapsed: 0.002800s checksum: 285714.795918
parallel/7: elapsed: 0.002854s checksum: 285714.795918
parallel/8: elapsed: 0.002794s checksum: 285714.795918
parallel/9: elapsed: 0.002807s checksum: 285714.795918
parallel/10: elapsed: 0.002788s checksum: 285714.795918
Speedup: 2.845

threads: 4
parallel/1: elapsed: 0.002296s checksum: 285714.693877
parallel/2: elapsed: 0.002152s checksum: 285714.693877
parallel/3: elapsed: 0.002153s checksum: 285714.693877
parallel/4: elapsed: 0.002122s checksum: 285714.693877
parallel/5: elapsed: 0.002147s checksum: 285714.693877
parallel/6: elapsed: 0.002135s checksum: 285714.693877
parallel/7: elapsed: 0.002135s checksum: 285714.693877
parallel/8: elapsed: 0.002126s checksum: 285714.693877
parallel/9: elapsed: 0.002159s checksum: 285714.693877
parallel/10: elapsed: 0.002156s checksum: 285714.693877
Speedup: 3.738

threads: 5
parallel/1: elapsed: 0.003495s checksum: 285714.591836
parallel/2: elapsed: 0.003311s checksum: 285714.591836
parallel/3: elapsed: 0.003315s checksum: 285714.591836
parallel/4: elapsed: 0.003933s checksum: 285714.591836
parallel/5: elapsed: 0.003397s checksum: 285714.591836
parallel/6: elapsed: 0.003303s checksum: 285714.591836
parallel/7: elapsed: 0.003306s checksum: 285714.591836
parallel/8: elapsed: 0.003371s checksum: 285714.591836
parallel/9: elapsed: 0.003239s checksum: 285714.591836
parallel/10: elapsed: 0.003326s checksum: 285714.591836
Speedup: 2.449

threads: 6
parallel/1: elapsed: 0.002845s checksum: 285714.489795
parallel/2: elapsed: 0.002766s checksum: 285714.489795
parallel/3: elapsed: 0.002774s checksum: 285714.489795
parallel/4: elapsed: 0.002768s checksum: 285714.489795
parallel/5: elapsed: 0.002771s checksum: 285714.489795
parallel/6: elapsed: 0.002779s checksum: 285714.489795
parallel/7: elapsed: 0.002773s checksum: 285714.489795
parallel/8: elapsed: 0.002778s checksum: 285714.489795
parallel/9: elapsed: 0.002767s checksum: 285714.489795
parallel/10: elapsed: 0.002776s checksum: 285714.489795
Speedup: 2.868

threads: 7
parallel/1: elapsed: 0.002559s checksum: 285714.387755
parallel/2: elapsed: 0.002402s checksum: 285714.387755
parallel/3: elapsed: 0.002395s checksum: 285714.387755
parallel/4: elapsed: 0.002392s checksum: 285714.387755
parallel/5: elapsed: 0.002375s checksum: 285714.387755
parallel/6: elapsed: 0.002434s checksum: 285714.387755
parallel/7: elapsed: 0.002373s checksum: 285714.387755
parallel/8: elapsed: 0.002392s checksum: 285714.387755
parallel/9: elapsed: 0.002423s checksum: 285714.387755
parallel/10: elapsed: 0.002387s checksum: 285714.387755
Speedup: 3.342

threads: 8
parallel/1: elapsed: 0.002231s checksum: 285714.285714
parallel/2: elapsed: 0.002112s checksum: 285714.285714
parallel/3: elapsed: 0.002141s checksum: 285714.285714
parallel/4: elapsed: 0.002151s checksum: 285714.285714
parallel/5: elapsed: 0.002149s checksum: 285714.285714
parallel/6: elapsed: 0.002130s checksum: 285714.285714
parallel/7: elapsed: 0.002132s checksum: 285714.285714
parallel/8: elapsed: 0.002136s checksum: 285714.285714
parallel/9: elapsed: 0.002117s checksum: 285714.285714
parallel/10: elapsed: 0.002130s checksum: 285714.285714
Speedup: 3.756

threads: 9
parallel/1: elapsed: 0.003113s checksum: 285714.183673
parallel/2: elapsed: 0.002705s checksum: 285714.183673
parallel/3: elapsed: 0.002863s checksum: 285714.183673
parallel/4: elapsed: 0.002834s checksum: 285714.285714
parallel/5: elapsed: 0.002743s checksum: 285714.183673
parallel/6: elapsed: 0.002759s checksum: 285714.183673
parallel/7: elapsed: 0.002725s checksum: 285714.285714
parallel/8: elapsed: 0.002851s checksum: 285714.183673
parallel/9: elapsed: 0.002587s checksum: 285714.183673
parallel/10: elapsed: 0.002873s checksum: 285714.183673
Speedup: 3.067

threads: 10
parallel/1: elapsed: 0.002834s checksum: 285714.081632
parallel/2: elapsed: 0.003325s checksum: 285714.183673
parallel/3: elapsed: 0.002662s checksum: 285714.183673
parallel/4: elapsed: 0.002531s checksum: 285714.081632
parallel/5: elapsed: 0.003410s checksum: 285714.081632
parallel/6: elapsed: 0.003395s checksum: 285714.081632
parallel/7: elapsed: 0.003235s checksum: 285714.183673
parallel/8: elapsed: 0.002927s checksum: 285714.081632
parallel/9: elapsed: 0.002515s checksum: 285714.081632
parallel/10: elapsed: 0.002818s checksum: 285714.081632
Speedup: 3.154

threads: 11
parallel/1: elapsed: 0.002993s checksum: 285713.979591
parallel/2: elapsed: 0.002404s checksum: 285714.081632
parallel/3: elapsed: 0.002373s checksum: 285714.081632
parallel/4: elapsed: 0.002457s checksum: 285714.183673
parallel/5: elapsed: 0.003038s checksum: 285714.285714
parallel/6: elapsed: 0.002367s checksum: 285714.081632
parallel/7: elapsed: 0.002365s checksum: 285714.081632
parallel/8: elapsed: 0.002999s checksum: 285714.081632
parallel/9: elapsed: 0.003064s checksum: 285714.183673
parallel/10: elapsed: 0.002399s checksum: 285713.979591
Speedup: 3.354

threads: 12
parallel/1: elapsed: 0.002355s checksum: 285714.081632
parallel/2: elapsed: 0.002515s checksum: 285713.877551
parallel/3: elapsed: 0.002774s checksum: 285713.979591
parallel/4: elapsed: 0.002742s checksum: 285713.877551
parallel/5: elapsed: 0.002773s checksum: 285713.979591
parallel/6: elapsed: 0.002722s checksum: 285713.979591
parallel/7: elapsed: 0.002909s checksum: 285714.183673
parallel/8: elapsed: 0.002785s checksum: 285713.877551
parallel/9: elapsed: 0.002741s checksum: 285713.979591
parallel/10: elapsed: 0.002825s checksum: 285714.081632
Speedup: 3.369

threads: 13
parallel/1: elapsed: 0.002908s checksum: 285713.877551
parallel/2: elapsed: 0.002379s checksum: 285713.979591
parallel/3: elapsed: 0.002554s checksum: 285714.081632
parallel/4: elapsed: 0.002653s checksum: 285713.775510
parallel/5: elapsed: 0.002663s checksum: 285713.979591
parallel/6: elapsed: 0.002602s checksum: 285713.877551
parallel/7: elapsed: 0.002527s checksum: 285713.877551
parallel/8: elapsed: 0.002665s checksum: 285713.979591
parallel/9: elapsed: 0.002766s checksum: 285713.979591
parallel/10: elapsed: 0.002620s checksum: 285713.877551
Speedup: 3.335

threads: 14
parallel/1: elapsed: 0.002617s checksum: 285713.877551
parallel/2: elapsed: 0.002438s checksum: 285713.877551
parallel/3: elapsed: 0.002457s checksum: 285713.979591
parallel/4: elapsed: 0.002503s checksum: 285713.979591
parallel/5: elapsed: 0.002350s checksum: 285713.775510
parallel/6: elapsed: 0.002567s checksum: 285713.877551
parallel/7: elapsed: 0.002492s checksum: 285713.979591
parallel/8: elapsed: 0.002458s checksum: 285713.877551
parallel/9: elapsed: 0.002496s checksum: 285713.979591
parallel/10: elapsed: 0.002547s checksum: 285713.979591
Speedup: 3.376

threads: 15
parallel/1: elapsed: 0.002530s checksum: 285713.979591
parallel/2: elapsed: 0.002396s checksum: 285713.877551
parallel/3: elapsed: 0.002267s checksum: 285713.775510
parallel/4: elapsed: 0.002397s checksum: 285713.775510
parallel/5: elapsed: 0.002280s checksum: 285713.571428
parallel/6: elapsed: 0.002415s checksum: 285713.775510
parallel/7: elapsed: 0.002320s checksum: 285713.775510
parallel/8: elapsed: 0.002266s checksum: 285713.979591
parallel/9: elapsed: 0.002356s checksum: 285713.877551
parallel/10: elapsed: 0.002302s checksum: 285713.979591
Speedup: 3.501

threads: 16
parallel/1: elapsed: 0.002679s checksum: 285713.775510
parallel/2: elapsed: 0.002178s checksum: 285713.775510
parallel/3: elapsed: 0.002171s checksum: 285713.571428
parallel/4: elapsed: 0.002446s checksum: 285713.673469
parallel/5: elapsed: 0.002394s checksum: 285713.673469
parallel/6: elapsed: 0.002475s checksum: 285713.571428
parallel/7: elapsed: 0.002165s checksum: 285713.979591
parallel/8: elapsed: 0.002162s checksum: 285713.673469
parallel/9: elapsed: 0.002222s checksum: 285713.673469
parallel/10: elapsed: 0.002186s checksum: 285713.571428
Speedup: 3.670

FINAL 8 is best with 0.002112s and speedup 3.756