"omp parallel for" does not work inside "omp parallel"

I expect to get the following output:

My rank is: 0 num is: 0
My rank is: 1 num is: 1
My rank is: 2 num is: 2
My rank is: 3 num is: 3

from the following code:

#pragma omp parallel
{
   int my_rank = omp_get_thread_num();

   #pragma omp parallel for num_threads(4)
   for(int i = 0; i < 4; i++){       
       printf("My rank is: %d num is: %d\n",my_rank, i);       
   }
}

But instead it gives the following output:

My rank is: 0 num is: 0
My rank is: 0 num is: 1
My rank is: 0 num is: 2
My rank is: 0 num is: 3
My rank is: 2 num is: 0
My rank is: 2 num is: 1
My rank is: 2 num is: 2
My rank is: 2 num is: 3
My rank is: 3 num is: 0
My rank is: 3 num is: 1
My rank is: 3 num is: 2
My rank is: 3 num is: 3
My rank is: 1 num is: 0
My rank is: 1 num is: 1
My rank is: 1 num is: 2
My rank is: 1 num is: 3

What is the problem?

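You should not repeat parallel. You are already inside a parallel block, so all you need is pragma omp for: each thread executing the parallel block automatically takes a share of the loop iterations when you specify pragma omp for. If you want to set the number of threads, write pragma omp parallel num_threads(4) first and then pragma omp for. In any case, for such a simple piece of code you can remove the whole outer block, which does not seem to be needed.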

Here is the corrected version:

#pragma omp parallel num_threads(4)
{
  int my_rank = omp_get_thread_num();

  #pragma omp for
  for(int i = 0; i < 4; i++){       
      printf("My rank is: %d num is: %d\n", my_rank, i);       
  }
}

Or, more simply:

#pragma omp parallel for num_threads(4)
for(int i = 0; i < 4; i++){       
    printf("My rank is: %d num is: %d\n", omp_get_thread_num(), i);       
}

The existing answer is accurate; I just want to expand on it with more information about what is happening behind the scenes.

By default, nested parallelism is disabled. Nonetheless, it can be enabled explicitly, either by calling

   omp_set_nested(1);

or by setting the OMP_NESTED environment variable to true.

Also, from the OpenMP standard we know that:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region. The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the master thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.

From another source you can read the following:

OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.

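This explains why, once you add the second parallel region (#pragma omp parallel for num_threads(4)), only one thread per team executes the enclosed code (i.e., the for loop). In other words, the first parallel region creates 4 threads; each of them, on encountering the second parallel region, creates a new team and becomes the master of that team (i.e., it has ID=0 within the newly created team). However, because you did not explicitly enable nested parallelism, each of those teams consists of a single thread. Hence there are 4 teams of one thread each, and every team executes the whole for loop. Consequently, the statement:

   printf("My rank is: %d num is: %d\n",my_rank, i); 

is printed 4 x 4 = 16 times (i.e., the total number of loop iterations multiplied by the total number of threads across the 4 teams). That is why you get the following output: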

My rank is: 0 num is: 0
My rank is: 0 num is: 1
My rank is: 0 num is: 2
My rank is: 0 num is: 3
My rank is: 2 num is: 0
My rank is: 2 num is: 1
My rank is: 2 num is: 2
My rank is: 2 num is: 3
My rank is: 3 num is: 0
My rank is: 3 num is: 1
My rank is: 3 num is: 2
My rank is: 3 num is: 3
My rank is: 1 num is: 0
My rank is: 1 num is: 1
My rank is: 1 num is: 2
My rank is: 1 num is: 3

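The following image provides a visualization of that flow:

Bear in mind that in the image above I assumed a certain static distribution of the loop iterations among threads; I am not implying that loop iterations will always be divided like this in every implementation of the OpenMP standard.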

I expect to get the following output:

My rank is: 0 num is: 0
My rank is: 1 num is: 1
My rank is: 2 num is: 2
My rank is: 3 num is: 3

So clearly what you are looking for is:

 #pragma omp parallel for num_threads(4)
 for(int i = 0; i < 4; i++){       
     printf("My rank is: %d num is: %d\n", omp_get_thread_num(), i);       
 }

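The directive:

#pragma omp parallel for

will create a parallel region (as described above) and distribute the iterations of the loop it encloses among the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule may differ among concrete implementations of the OpenMP standard.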

In the OpenMP 5.1 specification you can find a more formal description:

The worksharing-loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks. The iterations are distributed across threads that already exist in the team that is executing the parallel region to which the worksharing-loop region binds.

Moreover,

The parallel loop construct is a shortcut for specifying a parallel construct containing a loop construct with one or more associated loops and no other statements.

Or, informally, #pragma omp parallel for is a combination of the constructs #pragma omp parallel and #pragma omp for.