How to resolve Loop carried dependency preventing loop vectorization while using -fast in PGI compiler in OpenACC

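I want to parallelize a particle-method fluid flow code written in C using OpenACC. I am new to OpenACC and am currently applying it to the code on a multicore machine while trying to understand the basics. Later I will try to offload it to a GPU. I have added some #pragmas to the for loops in my code. In one part of the code, when I compile without -fast, it compiles without any problem but only the outer loop is parallelized. When I include -fast, however, I get some data-dependency messages and the inner loop(s) are not parallelized. After reading the available literature I have tried many things, including declaring pointers with restrict and using the atomic and routine directives, but nothing has worked so far. A trimmed-down version of that part of the code is here: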

// the code intends to compute the total number of neighbour particles of "iParticle" in 
// particle.numberOfNeighborParticles[iParticle] and saves the list of these neighbour particles in
// particle.neighborTable[iParticle][Neigh]

int iX, iY;
#pragma acc parallel loop private(iX, iY)                                       //line 98
for (iParticle = 0; iParticle < particle.totalNumber; iParticle++)
{
    BUCKET_findBucketWhereParticleIsStored(&iX, &iY, iParticle, position);

#pragma acc loop seq                                                           // line 133
    for (jX = iX - 1; jX <= iX + 1; jX++)
    {
       .....
#pragma acc loop seq                                                           // line 179
        for (jY = iY - 1; jY <= iY + 1; jY++)
        {
     ......
#pragma acc loop                                                              // line 186
            for (iStoredParticle = 0; iStoredParticle < domain.bucket[jX][jY].count; iStoredParticle++)
            {
                jParticle = domain.bucket[jX][jY].list[iStoredParticle];

                xij = (position[XDIM][jParticle] - position[XDIM][iParticle]);

                distanceIJ_squared = xij * xij;

                yij = (position[YDIM][jParticle] - position[YDIM][iParticle]);
                distanceIJ_squared += yij * yij;
                if (distanceIJ_squared > parameter.maxRadius_squared)
                    continue;

                NEIGH_addParticleInNeighborTable(iParticle, jParticle, particle.numberOfNeighborParticles, particle.neighborTable);
            }
        }
    }
}

//The *NEIGH_addParticleInNeighborTable()* function is as under:

void
NEIGH_addParticleInNeighborTable( 
                                 int iParticle
                                 ,int jParticle
                                 ,int *restrict numberOfNeighborParticles
                                 ,int **restrict neighborTable
                                 ){

  int iNeigh;
    iNeigh = numberOfNeighborParticles[iParticle];
    neighborTable[iParticle][iNeigh] = jParticle;
    #pragma acc atomic
    numberOfNeighborParticles[iParticle]++; 
}

Edit:

To elaborate on the problem, I have added a pseudocode below that is very similar to my case:

//This pseudo code intends to find the contiguous states from a given list for each state of US
count=0;
//state[] is a list of all the states of US
#pragma acc parallel loop gang
for(i=0;i<no_of_states_in_US;i++)
{
    iState=state[i];
    #pragma acc loop vector
    for (j = 0; j < no_of_states_to_check_from_for[iState]; j++){    //no_of_states_to_check_from_for[iState] may be 5
        jState = names_of_states_to_check_for_iState[j];   // for KS the given states to check from may be CO, NE, CA, UT and OK
    // some logic to check whether jState is contiguous to iState
        if(jState is_NOT_adjacent_to_iState) continue;
        //race condition occurs below if inner loop is vectorized, but no race condition if outer loop is parallelized only
        // Any suggestions / work around to vectorize the inner loop here and to avoid race condition would be helpful
        contiguous_state[iState][count]=jState;  
    #pragma acc atomic //?? does not seem to work   
count++;
    }
}

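I am interested in vectorizing the inner loop because this part of the code is compute-intensive and is repeated several times in the code. I am using PGI 19.4 Community Edition on Windows 10. Any help would be appreciated. Thanks in advance.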

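Note that this is not an OpenACC issue. The compiler is just telling you that it cannot vectorize the loop (vectorization is enabled with -fast or -O2) because of a potential loop-carried dependency on particle.numberOfNeighborParticles and particle.neighborTable. This should not affect your results or the OpenACC parallelization of the loop; you just won't get the additional performance benefit of vectorization.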

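You can try adding the flag "-Msafeptr", which asserts to the compiler that there is no pointer aliasing, a common cause of this kind of problem. The caveat is that if you do have aliasing, the code may produce incorrect results.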
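To see why the compiler is cautious about aliasing, here is a minimal sketch in plain C. It is not taken from your code; the names add_one and fill_with_alias are hypothetical. Nothing tells the compiler that dst and src are distinct, and in this call they do overlap, so each iteration reads a value written by the previous one:

```c
/* Hypothetical helper: dst[i] = src[i] + 1 for i in [0, n).
 * Without restrict, the compiler must assume dst and src may overlap. */
void add_one(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* Calls add_one with overlapping arguments: dst is src shifted by one
 * element, so iteration i reads what iteration i - 1 just wrote.
 * That is a genuine loop-carried dependency. */
void fill_with_alias(int *a, int n)
{
    a[0] = 0;
    add_one(a + 1, a, n - 1);
}
```

Executed in order, fill_with_alias leaves a[i] == i. If the loop were vectorized on the assumption that the pointers never overlap, several iterations would read stale values, which is the kind of incorrect result the caveat above refers to.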

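For the second, edited question: as long as the order in which count is updated does not matter, you can use atomic capture instead. This captures the value of count into a local variable, so you don't need to worry about it changing. Something like: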

int cnt;
#pragma acc atomic capture 
{
    cnt = count; 
    count++;
}
contiguous_state[iState][cnt]=jState;  

If the order of the counts matters, then the loop is not parallelizable.
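For reference, a fuller sketch of the atomic capture pattern inside a gang/vector loop nest might look like the following. It rests on assumptions that are not in your code: the function name build_lists, the fixed sizes, the "(i + j) is even" filter standing in for the adjacency test, and one counter per row rather than a single shared count.

```c
#define N_ROWS 4
#define N_CANDIDATES 8

/* Hypothetical example: for each row i, collect every candidate j that
 * passes a stand-in filter into list[i][...], and store the number of
 * accepted candidates in count[i]. */
void build_lists(int list[N_ROWS][N_CANDIDATES], int count[N_ROWS])
{
    for (int i = 0; i < N_ROWS; i++)
        count[i] = 0;

    #pragma acc parallel loop gang copy(list[0:N_ROWS][0:N_CANDIDATES], count[0:N_ROWS])
    for (int i = 0; i < N_ROWS; i++) {
        #pragma acc loop vector
        for (int j = 0; j < N_CANDIDATES; j++) {
            /* Stand-in for the real adjacency or distance test. */
            if ((i + j) % 2 != 0)
                continue;

            int slot;
            /* Read the current counter and increment it as one
             * indivisible step, so every accepted j gets its own slot. */
            #pragma acc atomic capture
            {
                slot = count[i];
                count[i]++;
            }
            list[i][slot] = j;
        }
    }
}
```

Each row ends up with the same set of entries whichever way the inner loop is scheduled, but their order within list[i] is not deterministic. A compiler without OpenACC support ignores the pragmas and runs the loops sequentially.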