CUDA: the number of data elements cannot be divided evenly among the CUDA threads

Question

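For example, say I have two groups of 4 threads but 5 data elements. Elements 0-3 map onto the first 4 threads, but what about the rest? It only says this might cause a runtime error, but how do I actually solve it?

I think I approached this question from the wrong direction, so let me restate it. Suppose I have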

performwork<<<2,2>>>();

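Now the dataIndex computed by my pseudocode is always less than the number of data elements (N=5), so what happens to the last element (5 - 2x2 = 1)? If I use another block for it, I run into the same kind of problem: a <<<2, 2>>> block will create a larger dataIndex.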

Answer 1

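There are two canonical approaches here.

1. Make the grid size (the total number of threads launched) greater than or equal to the data set size, and use a "thread check" so that the extra, unneeded threads do no work.
2. Use a grid-stride loop, which lets you choose the grid size independently of the data set size (if you wish) while still producing correct results.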

Here is an example vector-add kernel for each approach. First, the thread-check version:

__global__ void vectorAdd(float *x, float *y, float *z, int size){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < size) // thread check
    z[idx] = x[idx] + y[idx];
}

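The kernel above does not use a grid-stride loop. It requires you to size the grid to be greater than or equal to the data set size so that every element is processed. The sizing code could look like this: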

  int size = MY_DATA_SET_SIZE;
  dim3 block(256); // this is threads per block, the choice here is not critical for correctness, but must be 1 or larger and less than or equal to 1024;
  dim3 grid((size+block.x-1)/block.x);
  vectorAdd<<<grid,block>>>(...);
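
To tie this back to the numbers in the question, here is a minimal end-to-end sketch of the thread-check approach with N = 5 and 2 threads per block. The helper name blocksFor, the input values, and the host-side setup are assumptions made for illustration; they are not part of the original code. The ceiling division gives 3 blocks, so 6 threads are launched and the thread check keeps the sixth one idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread-check kernel: one thread per element, extra threads do nothing.
__global__ void vectorAdd(const float *x, const float *y, float *z, int size) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < size)  // thread check
    z[idx] = x[idx] + y[idx];
}

// Hypothetical helper: blocks needed to cover `size` elements (ceiling division).
int blocksFor(int size, int threadsPerBlock) {
  return (size + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
  const int N = 5;                // data set size from the question
  const int threadsPerBlock = 2;  // block size from the question
  const int blocks = blocksFor(N, threadsPerBlock);  // (5 + 2 - 1) / 2 = 3

  // Illustrative input data: x[i] = i, y[i] = 10 * i.
  float hx[N], hy[N], hz[N];
  for (int i = 0; i < N; ++i) { hx[i] = (float)i; hy[i] = 10.0f * i; }

  float *x, *y, *z;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));
  cudaMalloc(&z, N * sizeof(float));
  cudaMemcpy(x, hx, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(y, hy, N * sizeof(float), cudaMemcpyHostToDevice);

  // 3 blocks * 2 threads = 6 threads; the thread with idx == 5 fails the check.
  vectorAdd<<<blocks, threadsPerBlock>>>(x, y, z, N);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaMemcpy(hz, z, N * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; ++i)
    printf("z[%d] = %.1f\n", i, hz[i]);  // each z[i] should equal 11 * i

  cudaFree(x); cudaFree(y); cudaFree(z);
  return 0;
}
```

Saved as, say, a .cu file and built with nvcc, this should process all five elements even though 5 is not a multiple of the block size.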

A kernel that implements a grid-stride loop to do the same thing could look like this:

__global__ void vectorAdd(float *x, float *y, float *z, int size){

  for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < size; idx += blockDim.x*gridDim.x)
    z[idx] = x[idx] + y[idx];
}

In this case the grid size can be arbitrary (1 or larger) and the kernel will still produce correct results.
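
As a rough illustration of why this works for the <<<2,2>>> launch in the question, the sketch below runs a grid-stride kernel on N = 5 elements and records which global thread handled each one. The owner array, the kernel name vectorAddStride, and the helper ownerOf are hypothetical additions for demonstration only. With 4 threads in total the stride is 4, so thread 0 should handle elements 0 and 4 while threads 1-3 handle one element each.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride vector add that also records (for illustration only)
// which global thread processed each element.
__global__ void vectorAddStride(const float *x, const float *y, float *z,
                                int *owner, int size) {
  int tid = threadIdx.x + blockDim.x * blockIdx.x;
  int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
  for (int idx = tid; idx < size; idx += stride) {
    z[idx] = x[idx] + y[idx];
    owner[idx] = tid;
  }
}

// Hypothetical host-side helper: the thread expected to handle element idx.
int ownerOf(int idx, int totalThreads) { return idx % totalThreads; }

int main() {
  const int N = 5;                            // data set size from the question
  const int blocks = 2, threadsPerBlock = 2;  // the <<<2,2>>> launch

  // Illustrative input data: x[i] = i, y[i] = 10 * i.
  float hx[N], hy[N], hz[N];
  int hOwner[N];
  for (int i = 0; i < N; ++i) { hx[i] = (float)i; hy[i] = 10.0f * i; }

  float *x, *y, *z;
  int *owner;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));
  cudaMalloc(&z, N * sizeof(float));
  cudaMalloc(&owner, N * sizeof(int));
  cudaMemcpy(x, hx, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(y, hy, N * sizeof(float), cudaMemcpyHostToDevice);

  vectorAddStride<<<blocks, threadsPerBlock>>>(x, y, z, owner, N);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaMemcpy(hz, z, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(hOwner, owner, N * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; ++i)
    printf("z[%d] = %.1f handled by thread %d (expected thread %d)\n",
           i, hz[i], hOwner[i], ownerOf(i, blocks * threadsPerBlock));

  cudaFree(x); cudaFree(y); cudaFree(z); cudaFree(owner);
  return 0;
}
```

The same kernel would also be correct with <<<1,1>>> or <<<3,2>>>; only the distribution of elements across threads changes.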

cuda