如何理解此 CUDA 矩阵复制代码中的合并访问？

Question

__global__ void Matrixcopy(float *odata, const float *idata)
{
  // threadblock size = (TILE_DIM, BLOCK_ROWS) = (32, 8)
  // each block copies a 32 * 32 tile
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
}

我对 multi-dim 数组的联合访问概念感到很困惑。联合全局内存访问的定义是

Sequential memory access is adjacent

根据文献Learn CUDA Programming。对于 1-dim 数组，很容易理解线程被索引为 threadIdx.x + blockDim.x * blockIdx.x，它可以很容易地映射到真正的 1-dim 数组：warp 中的相邻线程访问 1-dim 的相邻物理地址数组。

然而，对于 2-dim 数组或矩阵和 2-dim threadblock，如上面的代码，我不确定我是否理解正确：warp 中的相邻线程位于同一行，即相同的y值，不同的x值。如果 y = 0，则连续线程为 x = [1, 2, 3, 4, 5, 6...]，并且它们访问连续地址 [1, 2, 3, 4, 5, 6...] 如果j = 0。因此此代码具有合并访问权限。我理解正确吗？而这只是一个简单的cuda代码，如果我们有一个复杂的cuda内核，我们如何快速判断一个访问是否合并？

Answer 1

So this code is with coalesced access. Am I correctly understood?

是的，差不多。我会说线程是 x = [0, 1, 2, 3, 4, 5, 6...]，它们访问连续的地址 [0 , 1, 2, 3, 4, 5, 6...] 但基本上我们是一致的。

if we have a complex cuda kernel, how can we quickly determine whether an access is coalesced or not?

您可以查看任何索引构造并使用以下测试：如果 threadIdx.x 变量作为加法因子包含在索引中，并且其上没有乘法因子，则访问将合并在典型用法中（你有方形线程块）。任何可以表示为：

的指标

idx = f + threadIdx.x

其中 f 是任意的，但不包括 threadIdx.x 将导致合并访问。 “in x”中的相邻线程将访问内存中的相邻位置。对于“非方形”线程块，您可以使用 threadIdx.y 制定类似的规则。例如，尺寸为 (1,32) 的线程块将要求 threadIdx.y 作为仅添加因子包含在内。

如何理解此 CUDA 矩阵复制代码中的合并访问？

How to understand the coalesced access in this CUDA matrix copy code?

c++

memory

arrays

cuda