对 "region cannot be closely nested inside 'parallel for' region" 的澄清

Clarification on "region cannot be closely nested inside 'parallel for' region"

我想了解减少在 OpenMP 中的工作原理。 我有这个涉及减少的简单代码。

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int N = 100;
int M = 200;
int O = 300;

double r2() {
    return ((double) rand() / (double) RAND_MAX);
}

int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
    #pragma omp parallel for collapse(2)
    {
      for (int m = 0; m < M; m++) {
         for (int n = 0; n < N; n++) {
            #pragma omp for reduction(+:S)
            for (int o = 0; o < O; o++) {
                S += r2() - 0.25;
            }
            K[m * N + n] = S;
         }
      }
   }
}

我收到此错误消息

Blockquote test.cc:30:1: error: region cannot be closely nested inside 'parallel for' region; perhaps you forget to enclose 'omp for' directive into a parallel region? #pragma omp for reduction(+:S) ^

如果我这样做就符合

#pragma omp parallel for reduction(+:S)

这是执行嵌套循环的正确方法吗?


编辑: 对原始问题进行更改。我希望并行和顺序代码具有相同的结果。

#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}

重要 TL;DR rand is not thread safe:

来自 rand 手册页:

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.

多线程代码使用(例如)rand_r

I am trying to understand how reduction works in OpenMP.

为了论证,让我们假设 r2() 将始终产生相同的值。

当自己的代码有多个线程并发修改某个变量时,代码如下:

   double S = 0;
   #pragma omp parallel 
   for (int o = 0; o < O; o++) {
        S += r2() - 0.25;
    }

其中一个变量 S 的更新存在 竞争条件 。要解决它,可以使用 OpenMP reduction 子句,从 OpenMP standard 可以读到:

The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.

在这种情况下,代码将如下所示:

    #pragma omp for reduction(+:S)
    for (int o = 0; o < O; o++) {
        S += r2() - 0.25;
    }

但是,在您的完整代码中

#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}

您首先使用 #pragma omp for collapse(2) 划分两个外部循环的迭代,然后尝试使用不同的子句 #pragma omp for 再次划分最内层循环的迭代,这不是允许。

Is this the right way to do a nested loop?

您可以进行以下并行化:

#pragma omp parallel for collapse(2) firstprivate (S)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}

没有竞争条件 因为变量S 是私有的。此外,在这种情况下,由于两个最外层循环的迭代在线程之间分配,每个线程都有一对唯一的 mn 迭代,因此每个线程将访问数组的唯一位置K 访问期间 K[m * N + n].

但问题在于,并行化两个外部循环的版本不会产生与其顺序版本相同的结果。之所以如此,是因为

        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;

在三个循环的所有迭代中添加隐式依赖。

S 的值明确取决于执行迭代 mno 的顺序。因此,如果在线程之间划分这些循环的迭代,则给定 mnS 的值在顺序或并行执行代码时将不相同。尽管如此,这可以通过仅并行化最内层循环并减少变量 S:

来解决
for (int m = 0; m < M; m++) {
   for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}

如果您关心 S 的值,那么所有这些(当然)都很重要,因为有人可能会争辩说,既然您使用的是产生随机值的函数,那么保持值的顺序S不是最重要的。

带有线程安全随机生成器的版本

版本 1

#pragma omp parallel
{
    unsigned int myseed = omp_get_thread_num();
    #pragma omp for collapse(2)
    for (int m = 0; m < M; m++) {
      for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
            S += r - 0.25;
        }
     K[m * N + n] = S;
     }
   }

}

版本 2

double *K = (double*) calloc(M * N, sizeof(double));
for (int m = 0; m < M; m++) {
   for (int n = 0; n < N; n++) {
        #pragma omp parallel
        {
            unsigned int myseed = omp_get_thread_num();
            #pragma omp for reduction(+:S)
            for (int o = 0; o < O; o++) {
                    double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
                    S += r - 0.25;
            }
        }
        K[m * N + n] = S;
    }
}

编辑:

Making a change in the original question. I want the parallel and sequential code to have the same result.

而不是:

#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}

做:

    for (int m = 0; m < M; m++) {
      for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
      }
   }