MPI size and number of OpenMP threads

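I'm trying to write a hybrid OpenMP/MPI program, so I want to understand how the number of OpenMP threads relates to the number of MPI processes. To that end, I wrote a small test program: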

#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>

int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs) 
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id 
        << " out of " << nthreads 
        << " on MPI process nr. " << rank 
        << " out of " << nprocs 
        << ", while hardware_concurrency reports " << cxx_procs 
        << " processors\n";
        std::cout << omp_stream.str();
    }

    MPI_Finalize();
    return 0;
}

I compiled it with gcc-9.3.1 and OpenMPI 3 using

mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp

When I run it directly as ./omp_mpi on an i7-6700 (4 cores / 8 hardware threads), I get the following output:

I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors

i.e., as expected.
When I run it with mpirun -n 1 omp_mpi, I expect the same, but I get

I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors

Where are the other threads? When I run it on two MPI processes instead, I get

I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors

i.e., still only two OpenMP threads per process. But when I run it on four MPI processes, I get

I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors

Now I suddenly get eight OpenMP threads per MPI process. Where does this change come from?

The man page of mpirun states:

If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:

  % mpirun [ -np X ] [ --hostfile <filename> ]  <program>

This will run X copies of <program> in your current run-time environment (...)

Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:

  Bind to core:     when the number of processes is <= 2
  Bind to socket:   when the number of processes is > 2
  Bind to none:     when oversubscribed

If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.

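So if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which results in 2 threads per MPI process (one per hardware thread of the bound core). If you specify 4 MPI processes, however, mpirun defaults to --bind-to socket, and you get 8 threads per process because your machine has a single socket. I tested this on a laptop (1s/2c/4t) and on a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (run without the np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.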

You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime, libgomp.

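First, the number of threads in OpenMP is controlled by the nthreads-var ICV (internal control variable), which is set either by calling omp_set_num_threads() or by setting OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is never called, the runtime is free to pick whatever default it considers reasonable. For libgomp, the manual says: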

OMP_NUM_THREADS

Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.

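What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that (if you like reading code, the Linux version is in the libgomp source). If the process is bound to a single logical CPU, you get only one thread: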

$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors

If you bind it to several logical CPUs, their count is used:

$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors

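This libgomp-specific behaviour interacts with another, Open MPI-specific behaviour. Back in 2013, Open MPI changed its default binding policy. The reasons were a mix of technical and political ones; you can read more on Jeff Squyres' blog (Jeff is a core Open MPI developer).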

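The moral of the story is:

Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, environment variables are passed to the ranks with -x: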

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi   
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

Note that I have hyperthreading enabled, so --bind-to core and --bind-to hwthread produce different results when OMP_NUM_THREADS is not set explicitly:

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors

$ mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

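--map-by node:PE=3 gives each MPI rank three processing elements (PEs) on a node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a hardware thread, so one should use --map-by node:PE=#cores*#threads, i.e., --map-by node:PE=6 in my case.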

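Whether the OpenMP runtime respects the affinity mask set by MPI, whether it maps its own thread affinity onto it, and what to do if it doesn't, is an entirely different story.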