Manual synchronization in OpenMP while loop
I recently started working with OpenMP to do some 'research' for a university project. I have a rectangular, equally spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for loops (one each in the x and y direction of the grid) wrapped by a while loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was a spatial parallelization of the for loops.
Works fine, too.
The approach I'm having problems with is a trickier idea. Each thread computes all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished, the thread continues with the next row (y=1), and so on. At the same time thread #2 can already start at y=0, since all the necessary information is already available. I just need a kind of manual synchronization between the threads so they can't overtake each other.
For that I use an array called check. It contains the ID of the thread that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (the value in check[j] is not correct), the thread spins in an empty while loop until it is.
Things become clearer with an MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main()
{
    // initialize variables
    int iter = 0;           // iteration step counter
    int check[100] = { 0 }; // initialize all rows for thread #0

    #pragma omp parallel num_threads(2)
    {
        int ID, num_threads, nextID;
        double u[100 * 300] = { 0 };

        // get parallelization info
        ID = omp_get_thread_num();
        num_threads = omp_get_num_threads();

        // determine next valid id
        if (ID == num_threads - 1) nextID = 0;
        else nextID = ID + 1;

        // iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
        while (iter < 1000)
        {
            // rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
            for (int j = 1; j < (100 - 1); j++)
            {
                // manual synchronization: wait until previous thread completed enough rows
                while (check[j + 1] != ID)
                {
                    //printf("Thread #%d is waiting!\n", ID);
                }

                // gridpoints in row j
                for (int i = 1; i < (300 - 1); i++)
                {
                    // solve PDE on gridpoint
                    // replaced by random operation to consume time
                    double ignore = pow(8.39804, 10.02938) - pow(12.72036, 5.00983);
                }

                // update of check array in atomic to avoid race condition
                #pragma omp atomic write
                check[j] = nextID;
            } // for j

            #pragma omp atomic write
            check[100 - 1] = nextID;

            #pragma omp atomic
            iter++;

            #pragma omp single
            {
                printf("Iteration step: %d\n\n", iter);
            }
        } // while
    } // omp parallel
} // main
The thing is, this MWE actually works on my machine. But when I copy it into my project, it doesn't. Moreover, the outcome is different every time: it stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while loop, it works! The output then contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I've somehow created a race condition, but I don't know where.
Does anyone have an idea what the problem is? Or a hint on how to realize this kind of synchronization?
I think you are mixing up atomicity and memory consistency. The OpenMP standard actually describes this very nicely in
1.4 Memory Model (emphasis mine):
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
1.4.3 Flush Operation
The memory model has relaxed-consistency because a thread’s temporary
view of memory is not required to be consistent with memory at all
times. A value written to a variable can remain in the thread’s
temporary view until it is forced to memory at a later time. Likewise,
a read from a variable may retrieve the value from the thread’s
temporary view, unless it is forced to read from memory. The OpenMP
flush operation enforces consistency between the temporary view and
memory.
To avoid this, you should also read check[] atomically and specify the seq_cst clause on your atomic constructs. This clause forces an implicit flush on the operation. (This is called a sequentially consistent atomic construct.)
int c;
// manual synchronization: wait until previous thread completed enough rows
do
{
    #pragma omp atomic read seq_cst
    c = check[j + 1];
} while (c != ID);
Disclaimer: I can't really try the code right now.
Additional remarks:
I think the iter stopping criterion is bogus the way you use it, but I suppose that doesn't matter since it's not your actual criterion.
I expect this variant to perform worse than the spatial decomposition. You lose a lot of data locality, especially on NUMA systems. But of course, feel free to measure and compare.
There also seems to be a discrepancy between your code (using check[j + 1]) and your description, "At the same time thread #2 can already start at y=0".