How to resolve Loop carried dependency preventing loop vectorization while using -fast in PGI compiler in OpenACC
I want to parallelize a particle-method-based fluid flow code in C using OpenACC. I am new to OpenACC and am currently applying it to the code on a multicore machine while trying to understand the basics; later I will try to offload it to a GPU. I have added some #pragmas to the for loops in the code. For one section of the code, it compiles without any problem when I don't use -fast, and the outer loop is parallelized; however, when I include -fast while compiling, I get some data-dependency messages and the inner loop(s) are not vectorized. After reading the available literature I have tried many things, including using restrict on the pointer declarations and using the atomic and routine directives, but so far nothing seems to work. An abridged version of the relevant part of the code is here:
// the code intends to compute the total number of neighbour particles of "iParticle" in
// particle.numberOfNeighborParticles[iParticle] and saves the list of these neighbour particles in
// particle.neighborTable[iParticle][Neigh]
int iX, iY;
#pragma acc parallel loop private(iX, iY) //line 98
for (iParticle = 0; iParticle < particle.totalNumber; iParticle++)
{
BUCKET_findBucketWhereParticleIsStored(&iX, &iY, iParticle, position);
#pragma acc loop seq // line 133
for (jX = iX - 1; jX <= iX + 1; jX++)
{
.....
#pragma acc loop seq // line 179
for (jY = iY - 1; jY <= iY + 1; jY++)
{
......
#pragma acc loop // line 186
for (iStoredParticle = 0; iStoredParticle < domain.bucket[jX][jY].count; iStoredParticle++)
{
jParticle = domain.bucket[jX][jY].list[iStoredParticle];
xij = (position[XDIM][jParticle] - position[XDIM][iParticle]);
distanceIJ_squared = xij * xij;
yij = (position[YDIM][jParticle] - position[YDIM][iParticle]);
distanceIJ_squared += yij * yij;
if (distanceIJ_squared > parameter.maxRadius_squared)
continue;
NEIGH_addParticleInNeighborTable(iParticle, jParticle, particle.numberOfNeighborParticles, particle.neighborTable);
}
}
}
}
// The NEIGH_addParticleInNeighborTable() function is as follows:
void
NEIGH_addParticleInNeighborTable(
int iParticle
,int jParticle
,int *restrict numberOfNeighborParticles
,int **restrict neighborTable
){
int iNeigh;
iNeigh = numberOfNeighborParticles[iParticle];
neighborTable[iParticle][iNeigh] = jParticle;
#pragma acc atomic
numberOfNeighborParticles[iParticle]++;
}
Edit:
To elaborate the issue, I am adding below a pseudo-code that is quite similar to my problem:
//This pseudo code intends to find the contiguous states from a given list for each state of US
count=0;
//state[] is a list of all the states of US
#pragma acc parallel loop gang
for(i=0;i<no_of_states_in_US;i++)
{
iState=state[i];
#pragma acc loop vector
for (j = 0; j < no_of_states_to_check_from_for[iState]; j++){ //no_of_states_to_check_from_for[iState] may be 5
jState = names_of_states_to_check_for_iState[j]; // for KS the given states to check from may be CO, NE, CA, UT and OK
// some logic to check whether jState is contiguous to iState
if(jState is_NOT_adjacent_to_iState) continue;
//race condition occurs below if inner loop is vectorized, but no race condition if outer loop is parallelized only
// Any suggestions / work around to vectorize the inner loop here and to avoid race condition would be helpful
contiguous_state[iState][count]=jState;
#pragma acc atomic //?? does not seem to work
count++;
}
}
I am interested in vectorizing the inner loop, since this part of the code belongs to the computationally intensive portion and is executed many times in the code.
I am using PGI 19.4 Community Edition on Windows 10. Any help in this regard would be appreciated. Thanks in advance.
Note that this is not an OpenACC problem; the compiler is simply telling you that it cannot vectorize the loops (vectorization is enabled by -fast or -O2) because of potential loop-carried dependencies on particle.numberOfNeighborParticles and particle.neighborTable. This should not affect your results or the OpenACC parallelization of the loops; you just won't get the extra performance benefit of vectorization.
You can try adding the flag "-Msafeptr", which asserts to the compiler that no pointer aliasing occurs; aliasing is the usual cause of these kinds of issues. The caveat is that if you do have aliasing, the code may produce incorrect results.
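For example, a multicore build might look something like the line below (a sketch only; the source file name is a placeholder, and -Minfo=accel,vect just asks the compiler to report its parallelization and vectorization decisions so you can see whether the dependency messages go away):

pgcc -acc -ta=multicore -fast -Msafeptr -Minfo=accel,vect particle_code.c -o particle_code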
For the second, edited question, as long as the order in which count gets updated doesn't matter, you can use an atomic capture instead. This captures the value of count into a local variable, so you don't need to worry about it changing. Something like:
int cnt;
#pragma acc atomic capture
{
   cnt = count;
   count++;
}
contiguous_state[iState][cnt]=jState;
If the order of count does matter, then the loop is not parallelizable.
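Applying the same atomic-capture pattern to the NEIGH_addParticleInNeighborTable() function from the first code section could look roughly like the sketch below. This assumes the order in which neighbors land in neighborTable[iParticle] doesn't matter; the capture reserves a unique slot index per insertion, removing the read/increment race:

void
NEIGH_addParticleInNeighborTable(
    int iParticle
    ,int jParticle
    ,int *restrict numberOfNeighborParticles
    ,int **restrict neighborTable
){
    int iNeigh;
    // Atomically read the current slot count and increment it in one step,
    // so no two workers can claim the same iNeigh for this iParticle.
    #pragma acc atomic capture
    {
        iNeigh = numberOfNeighborParticles[iParticle];
        numberOfNeighborParticles[iParticle]++;
    }
    neighborTable[iParticle][iNeigh] = jParticle;
}

If you later offload to the GPU and keep this as a separate function, you would also need to mark it with "#pragma acc routine seq" so it is compiled for the device.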