Is it normal that each thread in OpenMP does the same amount of work?
For the following code, I have measured the execution time of each thread, and it seems odd to me that in every run, whether I use the static or the dynamic schedule, all threads report almost exactly the same time. Is this expected in OpenMP? Do we ever have a situation where one or more threads perform more work than the others?
Another thing I do not understand is that the execution time is the same for the static and the dynamic schedule. I am afraid I may be measuring time in a wrong manner.
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>

using namespace std;
using namespace chrono;

int main()
{
    const int N = 100000;
    ofstream result{"Result.txt"};
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0, nextafter(1.0, DBL_MAX));
    c.reserve(N);
    for (int i = 0; i < N; i++) {
        const int size = pow(10, i % 4);
        vector<double> a;
        a.reserve(size);
        for (int j = 0; j < size; j++) {
            const double number = d(g);
            a.push_back(number);
        }
        c.push_back(std::move(a));
    }
    double sum = 0.0;
    vector<double> b(N);
    const int total_threads = 4;
    double time_taken_by_threads[total_threads];
    auto t1 = high_resolution_clock::now();
    #pragma omp parallel num_threads(total_threads) firstprivate(N) shared(b, c, sum)
    {
        int threadID = omp_get_thread_num();
        double start = omp_get_wtime();
        #pragma omp for reduction(+:sum) schedule(dynamic)
        for (int i = 0; i < N; i++) {
            double sumLocal = 0.0;
            for (size_t j = 0; j < c[i].size(); j++) {
                sumLocal += pow(c[i][j], 2);
            }
            const double n = sqrt(sumLocal);
            b[i] = n;
            sum += sumLocal;
        }
        double end = omp_get_wtime();
        time_taken_by_threads[threadID] = end - start;
    }
    auto t2 = high_resolution_clock::now();
    auto diff = duration_cast<milliseconds>(t2 - t1);
    cout << "The total job has taken: " << diff.count() << " ms" << endl;
    for (int i = 0; i < total_threads; i++) {
        cout << "Thread work " << time_taken_by_threads[i] << endl;
    }
}
TL;DR: there is an implicit barrier at the end of the #pragma omp for reduction(+:sum) loop.
I'm afraid that maybe, I'm calculating time in a wrong manner.
Indeed, it will always give similar results, because the #pragma omp for:
double start = omp_get_wtime();
#pragma omp for reduction(+:sum) schedule(dynamic)
for (int i = 0; i < N ; i++) {
// ....
}
// <--- threads will wait here for one another.
double end = omp_get_wtime();
time_taken_by_threads[threadID] = end - start;
introduces an implicit barrier after the loop. So the threads that finish first still wait for the threads that have not finished yet. To remove that implicit barrier, you can use the nowait clause:
#pragma omp for reduction(+:sum) schedule(dynamic) nowait
Although it is not a problem in this code, one needs to be careful when removing the implicit barrier, since doing so can lead to race conditions. So for future use, you can apply the following pattern to measure the time taken per thread while still avoiding potential race conditions:
double start = omp_get_wtime();
// The parallel loop with nowait
double end = omp_get_wtime();
#pragma omp barrier
time_taken_by_threads[threadID] = end - start;
However, even with that change, the time taken per thread should be roughly the same. I will explain below why that is the case.
For the following code, I have calculated time execution for each
thread and it is odd to me that with all runs I get using the static
or dynamic schedule, each thread has nearly exact time invocation. Is
this something expected in OpenMP?
It is expected: when using the static schedule, OpenMP tries to divide the loop iterations among the threads as evenly as possible.
From the OpenMP 5.1 standard, one can read the following about the for schedule clause:
When kind is static, iterations are divided into chunks of size
chunk_size, and the chunks are assigned to the threads in the team in
a round-robin fashion in the order of the thread number. Each chunk
contains chunk_size iterations, except for the chunk that contains the
sequentially last iteration, which may have fewer iterations. When no
chunk_size is specified, the iteration space is divided into chunks
that are approximately equal in size, and at most one chunk is
distributed to each thread. The size of the chunks is unspecified in
this case.
In your case, when using the static distribution with the default chunk size, each of the 4 threads computes 25000 iterations (i.e., 100000/4).
If we analyse the parallel loop:
#pragma omp for reduction(+:sum) schedule(static)
for (int i = 0; i < N; i++) {
    double sumLocal = 0.0;
    for (size_t j = 0; j < c[i].size(); j++) {
        sumLocal += pow(c[i][j], 2);
    }
    const double n = sqrt(sumLocal);
    b[i] = n;
    sum += sumLocal;
}
we can see that every block of four consecutive iterations performs the same amount of computation (the inner-vector sizes cycle through 1, 10, 100, 1000), and that the computation is mostly CPU-bound, so it is expected that each thread will take roughly the same amount of time.
Regarding the dynamic schedule, in the OpenMP 5.1 standard one can read:
When kind is dynamic, the iterations are distributed to threads in the
team in chunks. Each thread executes a chunk of iterations, then
requests another chunk, until no chunks remain to be distributed. Each
chunk contains chunk_size iterations, except for the chunk that
contains the sequentially last iteration, which may have fewer
iterations.
When no chunk_size is specified, it defaults to 1.
Because the chunk size defaults to 1, and the per-iteration work averages out over the many chunks each thread grabs, it is expected that the threads will also take about the same time.
Do we ever have the situation that one or more threads perform more
jobs?
Sure, you just need to create a situation that causes load imbalance, for example:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    for (int k = 0; k < i; k++) {
        // some computation
    }
}
If you look carefully, you can see that the work of the inner loop grows in a triangular shape (N = SIZE):
*k/i 0 1 2 3 4 5 ... N-1
* 0 - x x x x x ... x
* 1 - - x x x x ... x
* 2 - - - x x x ... x
* 3 - - - - x x ... x
* 4 - - - - - x ... x
* 5 - - - - - - ... x
* . - - - - - - ... x
* . - - - - - - ... x
*N-1 - - - - - - ... -
* N - - - - - - ... -
So for 4 threads and an N such that N % 4 = 0, the first thread is assigned the first N/4 iterations of the loop, the second thread the next N/4, and so on. Hence, the first thread computes outermost iterations with fewer innermost iterations, which leads to load imbalance and, eventually, to a big difference in the time the threads take to finish their parallel work.
You could simulate that scenario in your code as follows:
#pragma omp for reduction(+:sum) schedule(static) nowait
for (int i = 0; i < N; i++) {
    double sumLocal = 0.0;
    for (size_t j = i; j < c[i].size(); j++) {
        sumLocal += pow(c[i][j], 2);
    }
    const double n = sqrt(sumLocal);
    b[i] = n;
    sum += sumLocal;
}
Another thing which I do not understand is that time execution for
both using static and dynamic schedule is the same.
As we have already explained, this is to be expected given the nature of the parallel task assigned to each thread.