算法的 Cuda In-Situ 内存竞争问题，例如 morphologicam 膨胀的卷积

Question

我在 CUDA 中编写了一个扩张内核，当我的输入和输出图像是不同的缓冲区时它运行良好，但是当我在原位调用我的内核时，我面临着我所理解的内存竞争问题情况，即输入和输出缓冲区指向相同的内存位置。

我试过了：

一个。使用合作小组，

b。使用互斥锁和原子加法，但正如 this paper 和网络上的多个来源所建议的那样，

c。使用无锁块间同步，this same paper.

中提出的同步

我所有的尝试都失败了，因为：

一个。没有工作，因为我的输入缓冲区是一个 const 指针，当我必须将它转换为 void* 参数（这是有道理的）时出现编译错误，所以我无法进一步。

b。没有工作，因为我遇到了一个奇怪的行为：我有 16x16 块，每个块有 32x32 线程。同步块应该将互斥锁增加到 256，但程序在 48 个原子添加后阻塞。

c。没有用，因为它似乎没有块间同步，尽管我直接从论文中使用的代码对我来说似乎不错。我可以通过添加一些 __syncthreads()

来改善一点比赛效果

这是膨胀函数；

template <typename T>
__global__ void GenericDilate2dImg_knl(const ImageSizeInfo imgSizeInfo,
                                       volatile int* syncArrayIn, volatile int* syncArrayOut, 
                                       const unsigned long localSizeX, const unsigned long localSizeY,
                                       const int borderPolicyType, const T outOfImageValue,
                                       const struct StructuringElementInfo seInfo,
                                       const T* pInBuf, T* pOutBuf)
{
    // Extract sizeX, sizeY, etc. from imgSizeInfo
    SPLIT_SIZES_FROM_STRUCT(imgSizeInfo)

    // Declare the shared buffer pSharedBuf
    extern __shared__ char pSharedMem[];
    T* pSharedBuf = reinterpret_cast<T*>(pSharedMem);

    const unsigned long x = blockDim.x * blockIdx.x + threadIdx.x;
    const unsigned long y = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned long planIdx = blockDim.z * blockIdx.z + threadIdx.z;

    const unsigned long nbPlans = sizeZ * sizeC * sizeT;

    const unsigned long idx = x + y * sizeX + planIdx * sizeX*sizeY;

    // Copy the input image data into shared memory
    if (x < blockDim.x * gridDim.x && y < blockDim.y * gridDim.y && planIdx < blockDim.z * gridDim.z) {
        copyDataToSharedMemory2d(pInBuf, sizeX, sizeY, planIdx,
                                 localSizeX, localSizeY, 
                                 seInfo._paddingX, seInfo._paddingY,
                                 borderPolicyType, outOfImageValue,
                                 pSharedBuf);
    }

    // Wait to ensure that the copy is terminated
    if (pInBuf == pOutBuf) {
        // Grid synchronization for in-situ case
        //__gpu_sync(gridDim.x * gridDim.y);        // Use a mutex
        __gpu_sync2(1, syncArrayIn, syncArrayOut);  // Use a lock-free barrier
    }
    else
        // The input and ouput buffers point to different data 
        // -> we simply need to synchronize the threads inside the block
        __syncthreads();

    // Compute the convolution for pixels inside the image
    if (x < sizeX && y < sizeY && planIdx < nbPlans) {
        T vMax = 0;
        for (unsigned int curCoefIdx = 0; curCoefIdx < seInfo._nbOffsets; ++curCoefIdx) {
            const unsigned int sx = threadIdx.x + seInfo._paddingX + seInfo._pOffsetsX[curCoefIdx];
            const unsigned int sy = threadIdx.y + seInfo._paddingY + seInfo._pOffsetsY[curCoefIdx];
            const unsigned long sidx = sx + sy * localSizeX;
            const T curVal = pSharedBuf[sidx];
            vMax = (vMax > curVal ? vMax : curVal);
        }

        // Round the result
        pOutBuf[idx] = vMax;
    }
}

我从全局复制到共享内存的函数是：

template <typename T>
__device__ void copyDataToSharedMemory2d(const T* pInBuf,
                                         const unsigned long sizeX, const unsigned long sizeY, const unsigned long planIdx,
                                         const unsigned long localSizeX, const unsigned long localSizeY,
                                         const int paddingX, const int paddingY,
                                         const int borderPolicyType, const T outOfImageValue,
                                         T* pSharedBuf)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int localX = threadIdx.x;
    const int localY = threadIdx.y;

    // Fill the shared buffer tile by tile
    // A tile is related to the group size
    const unsigned int groupSizeX = blockDim.x;
    const unsigned int groupSizeY = blockDim.y;

    // For each tile
    for (int offsetY = 0; offsetY < localSizeY; offsetY += groupSizeY) {
        int curLocalY = localY + offsetY;
        int curGlobalY = y + offsetY - paddingY;
        for (int offsetX = 0; offsetX < localSizeX; offsetX += groupSizeX) {
            int curLocalX = localX + offsetX;
            int curGlobalX = x + offsetX - paddingX;

            // If the current coordinate is inside the shared sub-image
            if (curLocalX < localSizeX && curLocalY < localSizeY) {
                const int idx = curLocalX + curLocalY * localSizeX;
                pSharedBuf[idx] = getPixel2d(pInBuf, sizeX, sizeY, curGlobalX, curGlobalY, planIdx, borderPolicyType, outOfImageValue);
            }
        }
    }
}

其中 getPixel2d 允许我管理图像外的数据：


template <typename T>
__device__
T getPixel2d(const T* pInBuf,
             const unsigned long sizeX, const unsigned long sizeY,
             const int x, const int y, const int z,
             const int borderPolicyType, const T outOfImageValue)
{
    int x_inside = x;
    if (x < 0 || x >= sizeX) {
        switch (borderPolicyType) {
        case 0://outside the image, there is a constant value
            return outOfImageValue;
        case 1://outside the image, we propagate the data at the image borders
            if (x < 0)
                x_inside = 0;
            else // x >= sizeX
                x_inside = sizeX - 1;
            break;
        case 2://Miror effect
            if (x < 0)
                x_inside = -(x + 1);
            else // x >= sizeX
                x_inside = sizeX - ((x - sizeX) + 1);
            break;
        }
    }

    // y-coordinate inside the image
    int y_inside = y;
    if (y < 0 || y >= sizeY) {
        switch (borderPolicyType) {
        case 0://outside the image, there is a constant value
            return outOfImageValue;
        case 1://outside the image, we propagate the data at the image borders
            if (y < 0)
                y_inside = 0;
            else // y >= sizeY
                y_inside = sizeY - 1;
            break;
        case 2://Miror effect
            if (y < 0)
                y_inside = -(y + 1);
            else // y >= sizeY
                y_inside = sizeY - ((y - sizeY) + 1);
            break;
        default: break;
        }
    }

    return pInBuf[x_inside + y_inside * sizeX + z * sizeX * sizeY];
}

现在，这是我的块间同步函数：

// Using a mutex
__device__ volatile int g_mutex;
__device__ void __gpu_sync(int goalVal) {
    //thread ID in a block
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;
    // only thread 0 is used for synchronization
    if (tid_in_block == 0) {
        atomicAdd((int*)&g_mutex, 1);
        printf("[%d] %d Vs %d\n", blockIdx.x * gridDim.y + blockIdx.y, g_mutex, goalVal);
        //only when all blocks add 1 to g_mutex
        //will g_mutex equal to goalVal
        while (g_mutex </*!=*/ goalVal) {
            ;//Do nothing here
        }
    }
    __syncthreads();
}

// Lock-free barrier
__device__ void __gpu_sync2(int goalVal, volatile int* Arrayin, volatile int* Arrayout) {
    // thread ID in a block
    int tid_in_blk = threadIdx.x * blockDim.y + threadIdx.y;
    int nBlockNum = gridDim.x * gridDim.y;
    int bid = blockIdx.x * gridDim.y + blockIdx.y;
    // only thread 0 is used for synchronization
    if (tid_in_blk == 0) {
        Arrayin[bid] = goalVal;
    }
    if (bid == 1) {
        if (tid_in_blk < nBlockNum) {
            while (Arrayin[tid_in_blk] != goalVal) {
                ;//Do nothing here
            }
        }
        __syncthreads();
        if (tid_in_blk < nBlockNum) {
            Arrayout[tid_in_blk] = goalVal;
        }
    }
    if (tid_in_blk == 0) {
        while (Arrayout[bid] != goalVal) {
            ;//Do nothing here
        }
    }
    __syncthreads();
}

我原位计算得到的图像是：

我使用了 11x15 结构化元素，共享缓冲区的大小是 (nbThreadsPerBlock+2*paddindX) * (nbThreadsPerBlock+2*paddindY)。错误的结果（如箭头所示）出现在某些块的顶部，但始终位于相同的位置并具有相同的值。我希望内存竞争效应的结果更随机...

是否有更好的方法来管理现场计算或任何会阻止网格同步工作的原因？

编辑我使用的图像大小是 510x509，我运行我的代码在 NVidia Quadro RTX 5000 上。

Answer 1

对于这样的问题，我通常会建议最小的可重现示例，以及您正在使用的 GPU 的指示运行，但我们可能可以在没有这些的情况下继续。简而言之，正如您已经发现的那样，您尝试做的事情不会可靠地工作。

您已选择在网格中为每个输出点分配一个线程的线程策略：

    pOutBuf[idx] = vMax;

这很明智也很好。我基于此想象：

I have 16x16 blocks, each with 32x32 threads.

您的输入图像为 512x512（每个方向 16x32 个线程，每个输出点一个线程）。

正如您已经说过的，您的网格中有 256 个块（每个块有 1024 个线程）。此外，对于原地情况，我们可以将您的内核简化为以下伪代码：

__global__ void GenericDilate2dImg_knl(...){
  read_in_image();
  grid_wide_sync();
  write_out_image();
}

为了使这种方法起作用，read_in_image() 步骤必须能够在任何写入发生之前读取整个图像。但是，您的方法不适用于 general 情况，显然也不适用于您的特定 GPU。为了按照上面的方式读入整个图像，我们必须 网格中的每个线程块 同时驻留在 GPU 中的 SM 上。所有 256 个区块都需要存入，并且运行ning 在一个 SM 上。但是 GPU 不提供这种事情的内在保证。例如，如果您的 GPU 中有 24 个 SM，每个 SM 最多可容纳 2048 个线程，那么您的 GPU 将具有 24*2048 个线程的“运行ning”或“瞬时”容量，或 48你的线程块。没有足够的空间让所有 256 个线程块运行ning。不仅您的算法依赖于此，而且您的所有 3 种网格同步方法也都依赖于该概念。

您的第二个网格同步方法在 48 次“原子加法”后停止这一事实向我暗示了上面的示例数字。对于为什么该方法可能以这种方式失败，这是一个似是而非的近端解释：您的 GPU 只允许 48 个线程块驻留，而其他 208 个线程块正在等待，尚未存放在任何 SM 上，因此不允许任何他们的线程到运行。这 208 个线程块中的那些线程需要运行获取相关的输入数据，以及满足网格范围同步的要求。但他们不是运行ning，因为他们正在等待在 SM 上打开空间。并且空间永远不会在 SM 上打开，因为完整的 SM 具有在网格同步点等待的线程块。所以你有死锁。

这个问题在一般情况下是不容易解决的。任何网格同步机制，包括协作组，都有一个内在要求，即所有线程块实际上都可以在您的特定 GPU 上同时调度。因此，在一般情况下，如果我们不知道数据集大小或我们将运行使用的 GPU，问题就非常困难。

一种可能的方法是将输入数据集划分为多个区域，然后让内核一次处理一个区域。这可能需要多个网格同步，一个用于处理每个区域中的 in/out 划分，另一个用于处理内核遍历区域时的进程。您还必须小心处理区域边缘。

另一种可能的方法，如果你知道数据集大小的细节和你运行正在使用的 GPU，只是确保你运行正在使用“足够大”的 GPU " 来处理数据集的大小。例如，A100 GPU 可能同时驻留多达 216 个块，因此对于这种情况，您可以处理较小的图像尺寸，也许是 14x32=448 高和 448 宽尺寸。

鉴于针对此特定示例的就地或现场工作的这些方法需要相当大的复杂性，我个人非常愿意使用输出不同于输入的方法。这种方法也可能运行明显更快。从性能角度来看，网格范围同步不是“免费”构造。

算法的 Cuda In-Situ 内存竞争问题，例如 morphologicam 膨胀的卷积

Cuda In-Situ memory race issue for algorithms such as convolution of morphologicam dilation

cuda