Why does a 'for loop' inside a 'parallel for loop' take longer than the same 'for loop' in a serial region?

Question

I am testing the performance of a cluster using 64 threads. I wrote a simple piece of code:

        unsigned int m(67000);
        double start_time_i(0.),end_time_i(0.),start_time_init(0.),end_time_init(0.),diff_time_i(0.),start_time_j(0.),end_time_j(0.),diff_time_j(0.),total_time(0.);

        cout<<"omp_get_max_threads : "<<omp_get_max_threads()<<endl;
        cout<<"omp_get_num_procs : "<<omp_get_num_procs()<<endl;
        omp_set_num_threads(omp_get_max_threads());
        unsigned int dim_i=omp_get_max_threads();
        unsigned int dim_j=dim_i*m;

        std::vector<std::vector<unsigned int>> vector;
        vector.resize(dim_i, std::vector<unsigned int>(dim_j, 0));


        start_time_init = omp_get_wtime();
        for (unsigned int j=0;j<dim_j;j++){
                        vector[0][j]=j;
        }
        end_time_init = omp_get_wtime();

        start_time_i = omp_get_wtime();
        #pragma omp parallel for
        for (unsigned int i=0;i<dim_i;i++){
                        start_time_j = omp_get_wtime();
                        for (unsigned int j=0;j<dim_j;j++) vector[i][j]=i+j;
                        end_time_j = omp_get_wtime();
                        cout<<"i "<<i<<" thread "<<omp_get_thread_num()<<" int_time = "<<(end_time_j-start_time_j)*1000<<endl;

        }
        end_time_i = omp_get_wtime();


        cout<<"time_final = "<<(end_time_i-start_time_i)*1000<<endl;
        cout<<"initial non parallel region "<< " time = "<<(end_time_init-start_time_init)*1000<<endl;

        return 0;

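I don't understand why "(end_time_j-start_time_j)*1000" (about 50) is so much larger than the time needed to run the same loop over j outside the parallel region, i.e. "end_time_init-start_time_init" (about 1). Both omp_get_max_threads() and omp_get_num_procs() are equal to 64.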

Answer 1

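In your loop you are just filling a block of memory with a lot of values. This task is not computationally expensive; it is limited by the speed of memory writes. One thread can write at a certain rate, but when you use N threads simultaneously, the total memory bandwidth stays the same on shared-memory multicore systems (i.e. most PCs and laptops) and increases on distributed-memory multicore systems (high-end servers). For more details, please read this.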
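As a rough back-of-the-envelope check of this argument, the sketch below computes the effective write bandwidth implied by the figures in the question. It assumes a 4-byte `unsigned int`, takes the "about 1" and "about 50" millisecond timings at face value, and further assumes that the whole parallel region lasts about as long as a single thread's inner loop (the question does not report `time_final`). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Effective write bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes`
// written in `ms` milliseconds.
constexpr double bandwidth_gb_per_s(double bytes, double ms) {
    return bytes / (ms * 1e-3) / 1e9;
}

// Sizes from the question: 64 rows, each with 64 * 67000 elements.
constexpr std::size_t kThreads  = 64;
constexpr std::size_t kRowElems = kThreads * 67000;
// Assumption: unsigned int occupies 4 bytes.
constexpr std::size_t kRowBytes = kRowElems * 4;

// Serial loop: one row written in roughly 1 ms.
double serial_estimate() {
    return bandwidth_gb_per_s(static_cast<double>(kRowBytes), 1.0);
}

// Parallel region: 64 rows written concurrently in roughly 50 ms
// (assumed duration of the whole region, see the note above).
double parallel_estimate() {
    return bandwidth_gb_per_s(static_cast<double>(kThreads * kRowBytes), 50.0);
}
```

Under these assumptions one row is about 17 MB, the serial loop reaches roughly 17 GB/s, and the 64 concurrent loops together reach roughly 22 GB/s. In other words, the aggregate write rate would be nearly unchanged while each thread sees only a small share of it, which is what a bandwidth-bound loop would look like.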

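Therefore, depending on the system, the memory write speed either stays the same or decreases when several loops run concurrently. A 50x difference seems a bit large to me. I got the following results on Compiler Explorer (which means it must be a distributed-memory multicore system):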

omp_get_max_threads : 4
omp_get_num_procs : 2
i 2 thread 2 int_time = 0.095537
i 0 thread 0 int_time = 0.084061
i 1 thread 1 int_time = 0.099578
i 3 thread 3 int_time = 0.10519
time_final = 0.868523
initial non parallel region  time = 0.090862

On my laptop I got the following (so it is a shared-memory multicore system):

omp_get_max_threads : 8
omp_get_num_procs : 8
i 7 thread 7 int_time = 0.7518
i 5 thread 5 int_time = 1.0555
i 1 thread 1 int_time = 1.2755
i 6 thread 6 int_time = 1.3093
i 2 thread 2 int_time = 1.3093
i 3 thread 3 int_time = 1.3093
i 4 thread 4 int_time = 1.3093
i 0 thread 0 int_time = 1.3093
time_final = 1.915
initial non parallel region  time = 0.1578

In short, it depends on the system you are using...
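To see how a particular machine behaves, a minimal measurement sketch along the following lines may help. It is not the code from the question: `time_row_fills` and `FillResult` are hypothetical names, and timing is done with `std::chrono` instead of `omp_get_wtime()` so that the block also compiles without OpenMP (the pragma is then ignored and the rows are filled serially). In the question's code, `start_time_j` and `end_time_j` are declared outside the parallel region and are therefore shared between threads by default; here the timers live inside the loop body, so every iteration has its own.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

struct FillResult {
    std::vector<double> elapsed_ms;  // time spent filling each row
    unsigned long long checksum;     // sum of all elements, keeps the stores observable
};

// Fills `rows` rows of `cols` elements each inside a parallel loop and
// records how long every row took. Build with -fopenmp (GCC/Clang) to
// run the outer loop in parallel.
FillResult time_row_fills(unsigned int rows, unsigned int cols) {
    assert(rows > 0 && cols > 0);
    std::vector<std::vector<unsigned int>> data(rows, std::vector<unsigned int>(cols, 0));
    FillResult result{std::vector<double>(rows, 0.0), 0};

    #pragma omp parallel for
    for (unsigned int i = 0; i < rows; ++i) {
        // Local timers: one pair per iteration, nothing shared between threads.
        const auto start = std::chrono::steady_clock::now();
        for (unsigned int j = 0; j < cols; ++j) data[i][j] = i + j;
        const auto stop = std::chrono::steady_clock::now();
        // Each iteration writes to its own slot of the result vector.
        result.elapsed_ms[i] =
            std::chrono::duration<double, std::milli>(stop - start).count();
    }

    // Read everything back outside the timed region.
    for (const auto& row : data)
        for (unsigned int v : row) result.checksum += v;
    return result;
}
```

Calling `time_row_fills(1, cols)` and then `time_row_fills(n_threads, cols)` with the same `cols` gives a per-row time for one loop running alone and for several loops running at once. If the per-row time grows roughly in proportion to the number of threads, the loop is most likely limited by memory bandwidth rather than by computation.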


parallel-processing

performance

openmp

performance-testing