Strange behavior of CUDA kernel

Question

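I am trying to write a simple CUDA application that creates the integral image of a given matrix. One of the steps is to compute the integral image of each row. For that, I want to assign one thread per row. The function that should do this: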

__global__ void IntegrateRows(const uchar* img, uchar* res)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= Width || y >= Height)
        return;

    int sum = 0;
    int row = y * Width;
    for (int i = 0; i < Width - x; ++i)
    {
        res[row + i + x] = sum + img[row + i + x];
        sum += img[row + i + x];
    }
}

I test with a 3840x2160 matrix filled with ones (cv::Mat::ones(Size(Width, Height), CV_8UC1)). When I print the contents of the result, it always shows repeating sequences of the numbers from 1 to 255.

The execution configuration is:

dim3 threadsPerBlock(1, 256);
dim3 numBlocks(1, 16);
IntegrateRows<<<numBlocks, threadsPerBlock >>>(img, res);

My GPU is an Nvidia RTX 3090.

Answer 1

tl;dr: Use a larger element type for your output matrix.

If you integrate (prefix-sum) the sequence

1, 1, 1, 1, ...

you get:

0, 1, 2, 3, ...

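When you reach the maximum value of the element type, this sequence wraps around to 0. In your case the element type is uchar, i.e. unsigned char, whose maximum value is 255. Add 1 to that and you get 0. So: 0, 1, 2, 3, ... 253, 254, 255, 0, 1, ... and so on.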
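The wraparound can be reproduced on the CPU without any CUDA involved. The following is only a rough sketch: the function name integrate_row_u8 is made up for this illustration, and it mirrors just the inner loop of the kernel for a single row, with the running sum held in an int but each result stored into an 8-bit element, as with the uchar* res buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): inclusive running sum of one
// row, stored in 8-bit elements like the `uchar* res` buffer in the kernel.
std::vector<unsigned char> integrate_row_u8(const std::vector<unsigned char>& row)
{
    std::vector<unsigned char> out(row.size());
    int sum = 0;  // the accumulator itself is wide enough
    for (std::size_t i = 0; i < row.size(); ++i) {
        sum += row[i];
        // Storing into unsigned char keeps only the low 8 bits (sum mod 256).
        out[i] = static_cast<unsigned char>(sum);
    }
    return out;
}
```

For a row of 3840 ones, out[254] is 255 and out[255] is already 0 again. Because this loop, like the kernel, computes an inclusive sum, the first stored value is 1 rather than 0.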

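If you change the output matrix element type to unsigned short (or perhaps just unsigned int), you will not get the wraparound behavior. Of course, if you were summing 255s rather than 1s, and/or your matrix were larger, the representable range of that type might again be too small: a single row of 3840 elements of value 255 already sums to 979,200, which exceeds the unsigned short maximum of 65,535.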
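To pick an output type, it helps to compare the worst-case row sum against the type's maximum. This is a CPU-side sketch under my own assumptions, not the asker's code: integrate_row and row_sum_fits are hypothetical names, and the input is assumed to be 8-bit as in the question. In the kernel, the corresponding change would be to the element type of res, together with a matching output buffer on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: inclusive running sum of one row with a
// caller-chosen output element type.
template <typename Out>
std::vector<Out> integrate_row(const std::vector<unsigned char>& row)
{
    std::vector<Out> out(row.size());
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < row.size(); ++i) {
        sum += row[i];
        out[i] = static_cast<Out>(sum);  // still truncates if Out is too narrow
    }
    return out;
}

// Hypothetical helper: true if a row of `width` elements, each at most
// `max_pixel`, can be summed into an Out without wrapping.
template <typename Out>
constexpr bool row_sum_fits(std::uint64_t width, std::uint64_t max_pixel)
{
    return width * max_pixel
           <= static_cast<std::uint64_t>(std::numeric_limits<Out>::max());
}
```

With these, integrate_row<unsigned short> on a row of 3840 ones ends at 3840 with no wrap. row_sum_fits<unsigned short>(3840, 255) is false, while row_sum_fits<unsigned int>(3840, 255) is true.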

c++

opencv

cuda