CUDA

Question

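I have a question about the following task:

"Given a two-dimensional array "a[N][M]", i.e. N rows of length M. Each element of the array contains a random integer value between 0 and 16. Write a kernel "compact(int *a, int *listM, int *listN)" that consists of only one block of N threads, in which each thread counts, for one row of the array, how many elements have the value 16.

The threads write these numbers into an array "num" of length N in shared memory, and then (after a barrier) one of the threads executes the prefix code "PrefixSum(int *num, int N)" listed below (in the code below I explain what this code does). Finally (again after a barrier), each thread "Idx" writes the N and M values, i.e. the positions, of the elements in its row that have the value 16 into the two arrays "listM" and "listN" in global memory, starting at position "num[Idx]" in these arrays. The prefix code mentioned above exists to make this last task easier to implement."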

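I have written a kernel and a suitable main program to test it. However, there is still one problem I cannot solve.

The two arrays "listeM" and "listeN" should store the position of every 16 that occurs in the array "a[N][M]". Their size must therefore equal the total number of occurrences of 16, which can vary.

Since the exact number of elements with the value 16 is not known in advance, you only find out while the kernel is running how much memory the two arrays "listeM" and "listeN" need. Of course you could simply allocate enough memory for the maximum possible number, N times M, at the start of the program, but that would be very inefficient. Is it possible to write the kernel so that every single thread dynamically enlarges the two arrays "listeM" and "listeN" after counting the number of elements with the value 16 in its row (just this number)?

This is my kernel: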

__global__ void compact(int* a, int* listM, int* listN)
{
    int Idx = threadIdx.x;
    int elements, i;

    i = elements = 0;

    __shared__ int num[N];

    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)    // "a" is a flat int*, so index it row-major
        {
            elements++;
        }
    }
    num[Idx] = elements;

        //Here at this point, the thread knows the number of elements with the value 16 of its line and would 
        //need to allocate just as much extra memory in "listeM" and "listeN". Is that possible ?

    __syncthreads();

    if (Idx == 0)
    {
        // PrefixSum replaces each num[k] with the total number of 16s counted in all
        // rows before row k (an exclusive prefix sum).
        // Example: Input: num{2,4,3,1} Output: num{0,2,6,9}
        PrefixSum(num, N);
    }

    __syncthreads();

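    // The output of PrefixSum(num, N) is now used for the last task: putting the "coordinates" of
    // each 16 into the two arrays "listM" and "listN". Each thread starts at position num[Idx],
    // which is the number of 16s counted in all earlier rows.
    int pos = num[Idx];    // next free slot for this row; advances once per 16 found
    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)    // "a" is a flat int*, so index it row-major
        {
            listM[pos] = Idx;    // row index
            listN[pos] = i;      // column index
            pos++;
        }
    }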
}

Answer 1

Is it possible to write the kernel so that every single thread dynamically enlarges the two arrays "listeM" and "listeN" after counting the number of elements with the value 16 in its row (just this number)?

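CUDA device code cannot enlarge an existing allocation that was created with host-side cudaMalloc, cudaMallocManaged, cudaHostAlloc, or similar.

CUDA device code can create new allocations using in-kernel new or malloc, but the data in such allocations cannot be transferred directly back to the host. Transferring it back requires a host-side allocation into which the data can be copied, which brings you back to the original problem.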

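So there is really no convenient way to do this. Your choices are:

1. (Over)allocate the needed space, based on the maximum possible returned size.
2. Create an algorithm that runs the kernel once to determine the needed size and returns that size to the host. The host then allocates that size and passes it to the kernel for use in a second invocation of the algorithm, which does the actual work.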
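The second option could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the asker's or the answerer's code: the values of N and M, the kernel names countKernel and fillKernel, the host helper compactTwoPass, and the serial scan standing in for PrefixSum are all hypothetical, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 4;  // rows; value assumed for illustration
constexpr int M = 8;  // columns; value assumed for illustration

// Pass 1: count the 16s per row, then let thread 0 turn the counts into
// exclusive offsets and publish the grand total.
__global__ void countKernel(const int* a, int* offsets, int* total)
{
    __shared__ int num[N];
    int row = threadIdx.x;
    int count = 0;
    for (int i = 0; i < M; i++)
        if (a[row * M + i] == 16) count++;
    num[row] = count;
    __syncthreads();

    if (row == 0) {
        int running = 0;  // serial exclusive scan, standing in for PrefixSum
        for (int r = 0; r < N; r++) {
            int c = num[r];
            num[r] = running;
            running += c;
        }
        *total = running;
    }
    __syncthreads();
    offsets[row] = num[row];
}

// Pass 2: each thread writes the coordinates of its 16s, starting at its offset.
__global__ void fillKernel(const int* a, const int* offsets, int* listM, int* listN)
{
    int row = threadIdx.x;
    int pos = offsets[row];
    for (int i = 0; i < M; i++) {
        if (a[row * M + i] == 16) {
            listM[pos] = row;  // row index
            listN[pos] = i;    // column index
            pos++;
        }
    }
}

// Hypothetical host driver: returns the number of 16s and fills rows/cols
// with exactly that many entries.
int compactTwoPass(const std::vector<int>& a, std::vector<int>& rows, std::vector<int>& cols)
{
    int *dA, *dOffsets, *dTotal;
    cudaMalloc(&dA, N * M * sizeof(int));
    cudaMalloc(&dOffsets, N * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dA, a.data(), N * M * sizeof(int), cudaMemcpyHostToDevice);

    countKernel<<<1, N>>>(dA, dOffsets, dTotal);
    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);

    rows.assign(total, 0);
    cols.assign(total, 0);
    if (total > 0) {
        // Only now is the exact size known, so the lists are allocated here.
        int *dListM, *dListN;
        cudaMalloc(&dListM, total * sizeof(int));
        cudaMalloc(&dListN, total * sizeof(int));
        fillKernel<<<1, N>>>(dA, dOffsets, dListM, dListN);
        cudaMemcpy(rows.data(), dListM, total * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(cols.data(), dListN, total * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dListM);
        cudaFree(dListN);
    }
    cudaFree(dA);
    cudaFree(dOffsets);
    cudaFree(dTotal);
    return total;
}
```

The price of the exact-size allocation is that every row is scanned twice and the host has to wait for the first kernel before it can size the second.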
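
A "possible" third approach would be:

1. Run the algorithm once, letting the kernel allocate in-kernel to provide the extra space it needs. This space is not accessible to host-side operations. The kernel would also return the sizes and/or arrangement of these allocations.
2. Based on the returned sizes/arrangement of the device-side allocations, the host would allocate new memory of the needed size.
3. The host would then launch a new "copy kernel" that copies the data from the device-side allocations of step 1 into the host-provided allocation of step 2.
4. The host would then copy the data from the host-provided allocation of step 2 to host memory.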
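To make the amount of machinery in these four steps concrete, here is a hedged sketch. Everything named here is an assumption for illustration (the values of N and M, findKernel, copyKernel, compactViaDeviceHeap, and the choice to store only column indices per row); it relies on in-kernel malloc and free, whose allocations stay valid across kernel launches until freed. Error checking of the host-side CUDA calls is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 4;  // rows; value assumed for illustration
constexpr int M = 8;  // columns; value assumed for illustration

// Step 1: each thread allocates exactly the space its row needs from the device heap.
// counts[row] is the number of 16s found, or -1 if the in-kernel malloc failed.
__global__ void findKernel(const int* a, int** rowCols, int* counts)
{
    int row = threadIdx.x;
    int count = 0;
    for (int i = 0; i < M; i++)
        if (a[row * M + i] == 16) count++;

    int* cols = nullptr;
    if (count > 0) {
        cols = static_cast<int*>(malloc(count * sizeof(int)));  // device-heap allocation
        if (cols == nullptr) {
            count = -1;  // report the failure to the host
        } else {
            int k = 0;
            for (int i = 0; i < M; i++)
                if (a[row * M + i] == 16) cols[k++] = i;
        }
    }
    rowCols[row] = cols;
    counts[row] = count;
}

// Step 3: copy from the device-heap allocations into the buffers the host
// allocated, then release the device-heap memory.
__global__ void copyKernel(int** rowCols, const int* counts, const int* offsets,
                           int* listM, int* listN)
{
    int row = threadIdx.x;
    for (int k = 0; k < counts[row]; k++) {
        listM[offsets[row] + k] = row;
        listN[offsets[row] + k] = rowCols[row][k];
    }
    free(rowCols[row]);  // in-kernel free; a null pointer is ignored
}

// Hypothetical host driver. Returns the number of 16s, or -1 if a device malloc failed.
int compactViaDeviceHeap(const std::vector<int>& a, std::vector<int>& rows,
                         std::vector<int>& cols)
{
    int *dA, *dCounts, *dOffsets, *dListM, *dListN;
    int** dRowCols;
    cudaMalloc(&dA, N * M * sizeof(int));
    cudaMalloc(&dCounts, N * sizeof(int));
    cudaMalloc(&dOffsets, N * sizeof(int));
    cudaMalloc(&dRowCols, N * sizeof(int*));
    cudaMemcpy(dA, a.data(), N * M * sizeof(int), cudaMemcpyHostToDevice);
    findKernel<<<1, N>>>(dA, dRowCols, dCounts);

    // Step 2: read back the per-row sizes, build offsets, allocate the exact total.
    int counts[N], offsets[N], total = 0;
    bool failed = false;
    cudaMemcpy(counts, dCounts, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int r = 0; r < N; r++) {
        offsets[r] = total;
        if (counts[r] < 0) failed = true; else total += counts[r];
    }
    cudaMemcpy(dOffsets, offsets, N * sizeof(int), cudaMemcpyHostToDevice);
    size_t bytes = (total > 0 ? total : 1) * sizeof(int);  // avoid a zero-byte request
    cudaMalloc(&dListM, bytes);
    cudaMalloc(&dListN, bytes);

    copyKernel<<<1, N>>>(dRowCols, dCounts, dOffsets, dListM, dListN);

    // Step 4: bring the compacted lists back to host memory.
    rows.assign(total, 0);
    cols.assign(total, 0);
    if (total > 0) {
        cudaMemcpy(rows.data(), dListM, total * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(cols.data(), dListN, total * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dA); cudaFree(dCounts); cudaFree(dOffsets);
    cudaFree(dRowCols); cudaFree(dListM); cudaFree(dListN);
    return failed ? -1 : total;
}
```

Note that the host still ends up allocating the output buffers itself, so the in-kernel allocation only adds an extra copy on top of what a plain two-pass version would do.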

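That is an extreme level of complexity for a problem as trivial as the one you have outlined, where the obvious solution is simply to overallocate the needed space and be done with it.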
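For comparison, the over-allocation route could look roughly like this. It is a sketch under assumptions, not the asker's kernel: the values of N and M, the name compactWithTotal, the extra "total" output parameter (which the task's "compact" signature does not have), the helper compactOverallocated, and the serial scan standing in for PrefixSum are all hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 4;  // rows; value assumed for illustration
constexpr int M = 8;  // columns; value assumed for illustration

// One block of N threads: count, scan, then write coordinates. The extra
// "total" output tells the host how many list entries are actually valid.
__global__ void compactWithTotal(const int* a, int* listM, int* listN, int* total)
{
    __shared__ int num[N];
    int row = threadIdx.x;
    int count = 0;
    for (int i = 0; i < M; i++)
        if (a[row * M + i] == 16) count++;
    num[row] = count;
    __syncthreads();

    if (row == 0) {
        int running = 0;  // serial exclusive scan, standing in for PrefixSum
        for (int r = 0; r < N; r++) {
            int c = num[r];
            num[r] = running;
            running += c;
        }
        *total = running;
    }
    __syncthreads();

    int pos = num[row];  // first output slot belonging to this row
    for (int i = 0; i < M; i++) {
        if (a[row * M + i] == 16) {
            listM[pos] = row;  // row index
            listN[pos] = i;    // column index
            pos++;
        }
    }
}

// Hypothetical host driver: allocates the worst case (N * M entries) on the
// device, but copies back only the entries that were actually written.
int compactOverallocated(const std::vector<int>& a, std::vector<int>& rows,
                         std::vector<int>& cols)
{
    int *dA, *dListM, *dListN, *dTotal;
    cudaMalloc(&dA, N * M * sizeof(int));
    cudaMalloc(&dListM, N * M * sizeof(int));  // worst case: every element is 16
    cudaMalloc(&dListN, N * M * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dA, a.data(), N * M * sizeof(int), cudaMemcpyHostToDevice);

    compactWithTotal<<<1, N>>>(dA, dListM, dListN, dTotal);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    rows.assign(total, 0);
    cols.assign(total, 0);
    if (total > 0) {
        cudaMemcpy(rows.data(), dListM, total * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(cols.data(), dListN, total * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dA); cudaFree(dListM); cudaFree(dListN); cudaFree(dTotal);
    return total;
}
```

The device buffers are larger than strictly necessary, but the host-side result and the device-to-host transfer are sized to the actual number of hits.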

CUDA - dynamically reallocate more global memory in Kernel

kernel

nvidia

dynamic-memory-allocation

memory-reallocation