Why would omp_set_num_threads( omp_get_num_threads() ) change anything?

Question

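I've run across something odd. I am testing an MPI + OMP parallel code on a small local machine with only a single, humble 4-core i3. One of my loops, it turns out, is very slow with more than 1 OMP thread per process in this environment (more threads than cores).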

#pragma omp parallel for
for ( int i = 0; i < HEIGHT; ++i ) 
{
    for ( int j = 0; j < WIDTH; ++j ) 
    {
        double a = 
           ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );

        buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
    }
}

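If I run this code with the defaults (without setting OMP_NUM_THREADS or calling omp_set_num_threads), it takes about 1 second. If, however, I explicitly set the number of threads with either method (export OMP_NUM_THREADS=1 or omp_set_num_threads(1)), it takes about 0.005 seconds (200X faster).

But it seems that omp_get_num_threads() returns 1 regardless. In fact, if I just do omp_set_num_threads( omp_get_num_threads() ); it takes about 0.005 seconds, whereas with that line commented out it takes 1 second.

Any idea what is going on here? Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?

Some context,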

cpu:             Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
g++ --version:   g++ (GCC) 10.2.0
compiler flags:  mpic++ -std=c++11 -O3 -fpic -fopenmp ...
running program: mpirun -np 4 ./a.out

Answer 1

I've run across something odd. I am testing an MPI + OMP parallel code on a small local machine with only a single, humble 4 core I3. One of my loops, it turns out, is very slow with more than 1 OMP thread per process in this environment (more threads than cores).

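First, without any explicit binding of OpenMP threads (within MPI processes) to cores, one cannot be sure on which cores those threads will end up. Naturally, having multiple threads running on the same logical core will, more often than not, increase the overall execution time of the application being parallelized. You can solve this problem by either: 1) disabling the binding with the MPI flag --bind-to none, so that threads can be assigned to different cores; or 2) performing the binding of threads accordingly. See the related thread on how to map threads to cores in hybrid parallelizations such as MPI + OpenMP.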
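One way to see where the threads actually land is to have each thread report the CPU it is running on. What follows is a minimal sketch, assuming Linux with glibc (sched_getcpu is a glibc extension). The helper name cpu_of_each_thread is made up for illustration, and the serial fallbacks are only there so the sketch still builds without -fopenmp.

```cpp
#include <cstddef>
#include <vector>
#ifdef __linux__
#include <sched.h>   // sched_getcpu(), a glibc extension (assumption: Linux)
#endif
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still builds without -fopenmp.
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num()  { return 0; }
#endif

// Hypothetical helper: for each thread of one parallel region, the logical
// CPU it was running on at the time of the call (-1 where unavailable).
std::vector<int> cpu_of_each_thread() {
    std::vector<int> cpu;
    #pragma omp parallel
    {
        // One thread sizes the vector; the implicit barrier at the end of
        // "single" guarantees it is sized before any thread writes to it.
        #pragma omp single
        cpu.assign(static_cast<std::size_t>(omp_get_num_threads()), -1);
#ifdef __linux__
        cpu[static_cast<std::size_t>(omp_get_thread_num())] = sched_getcpu();
#endif
    }
    return cpu;
}
```

Calling this from every MPI rank and printing the result next to the rank number shows whether threads of different ranks share a core. The values are only a snapshot: threads that are not bound may migrate afterwards.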

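Nonetheless, even if one had (let us say) each process mapped to one core with 4 threads per core, and assuming each core has two logical cores (i.e., hyper-threading), the overall execution time of the application would most likely be slower than running 4 processes x 1 thread. In the current scenario, one could expect performance gains (at best) with 4 processes x 2 threads.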
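The arithmetic behind these scenarios can be sketched as a single ratio of software threads to hardware threads. The function name is illustrative, and the sketch assumes every rank really starts the given number of OpenMP threads, which depends on the binding and on the OpenMP runtime defaults.

```cpp
// Software threads per hardware thread for a hybrid MPI + OpenMP run.
// A value above 1.0 means the machine is oversubscribed.
// hw_threads is the number of logical CPUs available on the node.
double threads_per_hw_thread(int mpi_ranks, int omp_threads_per_rank,
                             int hw_threads) {
    return static_cast<double>(mpi_ranks) * omp_threads_per_rank / hw_threads;
}
```

For 4 ranks x 4 threads on 4 logical CPUs the ratio is 4.0, while 4 ranks x 1 thread gives 1.0. On a hypothetical 4-core machine with hyper-threading (8 logical CPUs), 4 ranks x 2 threads also gives 1.0.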

But it seems that omp_get_num_threads() returns 1 regardless. And in fact, if I just do this omp_set_num_threads( omp_get_num_threads() );

From the source one can read:

2.15 omp_get_num_threads – Size of the active team

Description: Returns the number of threads in the current team. In a sequential section of the program omp_get_num_threads returns 1.

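Informally, if omp_get_num_threads() is called outside a parallel region, one gets 1 as the number of threads, namely the initial thread. Consequently, when called from sequential code, omp_set_num_threads( omp_get_num_threads() ) is the same as omp_set_num_threads( 1 ).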
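This behaviour can be checked with a small sketch. The helper names are invented for illustration, and the serial fallbacks exist only so the code also builds without -fopenmp. It compares the team size seen from sequential code with the one seen inside a parallel region, and then makes the call from the question.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still builds without -fopenmp.
static int  omp_get_num_threads() { return 1; }
static int  omp_get_max_threads() { return 1; }
static void omp_set_num_threads(int) {}
#endif

// Team size as seen from sequential code: only the initial thread, so 1.
int team_size_in_serial_code() {
    return omp_get_num_threads();
}

// Team size as seen from inside a parallel region.
int team_size_in_parallel_region() {
    int n = 1;
    #pragma omp parallel
    {
        #pragma omp single
        n = omp_get_num_threads();
    }
    return n;
}

// The call from the question, made from sequential code. Because
// omp_get_num_threads() is 1 there, this requests teams of one thread.
int max_threads_after_question_call() {
    omp_set_num_threads(omp_get_num_threads());
    return omp_get_max_threads();
}
```

To query how many threads a later parallel region would get by default, omp_get_max_threads() is the function to call from sequential code, not omp_get_num_threads().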

Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?

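The root cause of the problem is not the call to omp_set_num_threads( omp_get_num_threads() ) per se, but rather the threads fighting for resources. By explicitly setting the number of threads per process to 1, you ensure that the application runs with 1 thread per core, so no threads compete for resources within the same core.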
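If the goal is to cap the team size for this one loop rather than for the whole program, the num_threads clause is an alternative. What follows is a minimal sketch of the loop from the question wrapped in a function. The name normalize_to_bytes and the threads parameter are illustrative, the sIdx * S_SZ offset is dropped for brevity, and threads >= 1 and dMax != dMin are assumed.

```cpp
#include <cstddef>
#include <vector>

// Scales `data` (row-major, height x width) to the range 0..255, like the
// loop in the question. `threads` is the team size requested for this loop
// only; it does not change the setting for other parallel regions.
std::vector<unsigned char> normalize_to_bytes(const std::vector<double>& data,
                                              int width, int height,
                                              double dMin, double dMax,
                                              int threads) {
    std::vector<unsigned char> out(static_cast<std::size_t>(width) * height);
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            const std::size_t k = static_cast<std::size_t>(i) * width + j;
            const double a = (data[k] - dMin) / (dMax - dMin);
            out[k] = static_cast<unsigned char>(255.0 * a);  // truncates
        }
    }
    return out;
}
```

The result does not depend on the value of threads, so the parameter can be tuned per machine (for example 1 when running 4 ranks on 4 cores) without touching the rest of the program.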

c++

parallel-processing

multithreading

mpi

openmp