Spatial locality in CUDA loops

Question

I'm reading An Even Easier Introduction to CUDA, and I'm wondering about examples like this:

__global__
void add(int n, float *x, float *y)
{
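  int index = threadIdx.x;   // this thread's first element
  int stride = blockDim.x;   // number of threads in the block
  // Each thread handles elements index, index + stride, index + 2*stride, ...
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];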
}

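where each thread strides through the array. In ordinary CPU computing, one would rather split the array into contiguous sub-arrays and give one to each thread, so that each thread can make better use of spatial locality.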

Does this concept also apply to CUDA's unified memory? I would like to understand what the most efficient approach is in this situation.

Answer 1

The reason a grid-stride loop is beneficial for memory access is that it promotes "coalesced" access to global memory. In a nutshell, coalesced access means that adjacent threads in the warp are accessing adjacent locations in memory on any given read or write cycle/operation, considered warp-wide.

The grid-stride loop arranges the indices across the warp to promote this pattern.
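To make the difference concrete, here is a minimal sketch that puts the strided kernel from the question next to a CPU-style kernel in which each thread owns one contiguous chunk. The names add_chunked and max_error_after, the single block of 256 threads, and the use of cudaMallocManaged are assumptions made for illustration; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Strided version from the question: on iteration k, thread t touches
// element t + k * blockDim.x, so the 32 threads of a warp touch
// 32 adjacent floats at the same time (coalesced).
__global__ void add_strided(int n, const float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical CPU-style version: each thread owns one contiguous chunk.
// On any given iteration, neighbouring threads are `chunk` elements apart,
// so the accesses of a warp are spread over many separate memory segments.
__global__ void add_chunked(int n, const float *x, float *y)
{
    int chunk = (n + blockDim.x - 1) / blockDim.x;  // ceil(n / threads)
    int begin = threadIdx.x * chunk;
    int end = min(begin + chunk, n);
    for (int i = begin; i < end; ++i)
        y[i] = x[i] + y[i];
}

// Runs one kernel on fresh inputs (x = 1, y = 2) and returns the largest
// deviation of the result from the expected value 3.
float max_error_after(void (*kernel)(int, const float *, float *), int n)
{
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    kernel<<<1, 256>>>(n, x, y);
    cudaDeviceSynchronize();

    float err = 0.0f;
    for (int i = 0; i < n; ++i)
        err = fmaxf(err, fabsf(y[i] - 3.0f));
    cudaFree(x);
    cudaFree(y);
    return err;
}

int main()
{
    const int n = 1 << 20;
    printf("strided max error: %f\n", max_error_after(add_strided, n));
    printf("chunked max error: %f\n", max_error_after(add_chunked, n));
    return 0;
}
```

Both kernels compute the same result; only the order in which the threads of a warp touch memory differs. Timing the two with a profiler is one way to see the effect of that ordering on a particular GPU.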

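This is orthogonal to whether the memory was allocated with an "ordinary" device allocator (e.g. cudaMalloc) or a "unified" allocator (e.g. cudaMallocManaged). In either case, the best way for device code to access such an allocation is with coalesced accesses.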
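As a rough illustration of that point, the sketch below launches the same kernel on buffers obtained from cudaMalloc and from cudaMallocManaged. The kernel body, and therefore its access pattern, is identical in both cases; only the host-side setup changes. The helper names (all_three, run_with_device_alloc, run_with_managed_alloc) and the launch configuration are assumptions for illustration, and error checking is left out.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add(int n, const float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// True if every element equals 3.0f (that is, 1.0f + 2.0f).
static bool all_three(const float *y, int n)
{
    for (int i = 0; i < n; ++i)
        if (y[i] != 3.0f) return false;
    return true;
}

// "Ordinary" device memory: cudaMalloc plus explicit copies.
bool run_with_device_alloc(int n)
{
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    size_t bytes = n * sizeof(float);
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    add<<<1, 256>>>(n, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return all_three(hy.data(), n);
}

// Unified memory: one pointer usable from both host and device.
bool run_with_managed_alloc(int n)
{
    size_t bytes = n * sizeof(float);
    float *x, *y;
    cudaMallocManaged(&x, bytes);
    cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<1, 256>>>(n, x, y);
    cudaDeviceSynchronize();  // required before the host reads y

    bool ok = all_three(y, n);
    cudaFree(x);
    cudaFree(y);
    return ok;
}

int main()
{
    const int n = 1 << 20;
    printf("cudaMalloc:        %s\n", run_with_device_alloc(n) ? "ok" : "mismatch");
    printf("cudaMallocManaged: %s\n", run_with_managed_alloc(n) ? "ok" : "mismatch");
    return 0;
}
```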

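You didn't ask about it, but CUDA shared memory also has, as one of its "optimal access patterns", adjacent threads in the warp accessing adjacent locations in (shared) memory.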
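A small sketch of that shared-memory pattern, under stated assumptions: the kernel reverse_segments, the block size kBlock, and the helper reverse_ok are hypothetical names chosen for illustration, and n is assumed to be a multiple of kBlock. Each thread writes tile[threadIdx.x] and later reads the mirrored slot, so in both steps neighbouring threads touch neighbouring shared-memory words.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // assumed block size for this sketch

// Reverses each kBlock-sized segment of `in` into `out` via shared memory.
// Assumes n is a multiple of kBlock and the launch uses kBlock threads.
__global__ void reverse_segments(int n, const float *in, float *out)
{
    __shared__ float tile[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Adjacent threads write adjacent shared-memory locations.
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();

    // Adjacent threads read adjacent locations, in descending order.
    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) out[i] = tile[j];
}

// Runs the kernel on in[i] = i and checks every reversed segment.
bool reverse_ok(int n)
{
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    reverse_segments<<<n / kBlock, kBlock>>>(n, in, out);
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        int segment_start = (i / kBlock) * kBlock;
        int mirrored = segment_start + (kBlock - 1 - i % kBlock);
        if (out[i] != in[mirrored]) ok = false;
    }
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main()
{
    printf("reverse_segments: %s\n", reverse_ok(4 * kBlock) ? "ok" : "mismatch");
    return 0;
}
```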

c++

performance

cuda

unified-memory