CUDA coalescing and global memory

Question

I was told in my CUDA course that if the elements of my "a" array are 4, 8 or 16 bytes in size, the following (global memory) access is coalesced:

int i = blockIdx.x*blockDim.x + threadIdx.x;
a[i];

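The two conditions for coalescing are: (1) the threads of a warp must access a chunk of 32, 64 or 128 bytes; (2) the first thread of the warp must access an address that is a multiple of 32, 64 or 128.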

But in this example (first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

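If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access chunks of 20 (4x5) bytes, not 32, even though the elements of my "a" array have a size of 4, 8 or 16 bytes. So, is the very first claim about coalescing false?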

Thank you for your answers.

Answer 1

But in this example (first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

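Because of thread ordering, each full warp is guaranteed to access 128 contiguous bytes (32 threads x 4 bytes) when the elements are 4 bytes wide. This is a necessary condition for coalesced memory access.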
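
The arithmetic behind that guarantee can be checked on the host. What follows is only a rough sketch in plain C++ (not a kernel): it models the index expression i = blockIdx.x*blockDim.x + threadIdx.x for one full warp. The helper names (warp_first_byte, warp_bytes, warp_starts_on_128) are made up for illustration, and the alignment check assumes the array itself starts on a 128-byte boundary.

```cpp
#include <cstddef>

constexpr int kWarpSize = 32;  // threads per warp

// Byte offset, measured from the start of a, of the element read by the
// first thread of warp `warp_in_block` in block `block_idx`.
// Models i = blockIdx.x*blockDim.x + threadIdx.x with threadIdx.x = warp_in_block*32.
constexpr std::size_t warp_first_byte(int block_idx, int block_dim,
                                      int warp_in_block, std::size_t elem_size) {
    return (static_cast<std::size_t>(block_idx) * static_cast<std::size_t>(block_dim)
            + static_cast<std::size_t>(warp_in_block) * kWarpSize) * elem_size;
}

// Contiguous bytes touched by one full warp: consecutive threads read
// consecutive elements, so the range has no gaps.
constexpr std::size_t warp_bytes(std::size_t elem_size) {
    return kWarpSize * elem_size;
}

// True when the warp's range starts on a 128-byte boundary.
// Assumption: a itself is 128-byte aligned.
constexpr bool warp_starts_on_128(int block_idx, int block_dim,
                                  int warp_in_block, std::size_t elem_size) {
    return warp_first_byte(block_idx, block_dim, warp_in_block, elem_size) % 128 == 0;
}
```

For example, warp_bytes(4) is 128, and with blockDim.x = 128 the second warp of block 3 starts at byte (3*128 + 32)*4 = 1664, which is a multiple of 128. With 8-byte or 16-byte elements the same warp covers 256 or 512 contiguous bytes, which would presumably be served by more than one transaction.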

If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access chunks of 20 (4x5) bytes even though elements of my "a" array have a size of 4,8 or 16 bytes, and not 32.

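A warp always has 32 threads. If you define blockDim.x as 5, each block consists of a single warp with 5 active threads and 27 inactive ones. The coalescing rules still apply and the transactions will be coalesced, but you are wasting 27/32 of the potential compute power and memory bandwidth.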
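
The 27/32 figure follows from rounding the block size up to whole warps. Here is a small host-side sketch in plain C++, again with made-up helper names (warps_per_block, idle_lanes, useful_bytes_per_block), that reproduces the numbers for blockDim.x = 5 and float elements:

```cpp
#include <cstddef>

constexpr int kWarpSize = 32;  // threads per warp

// A block is split into whole warps, so the thread count is rounded up
// to the next multiple of 32.
constexpr int warps_per_block(int block_dim) {
    return (block_dim + kWarpSize - 1) / kWarpSize;
}

// Lanes that exist in the block's warps but have no thread to run.
constexpr int idle_lanes(int block_dim) {
    return warps_per_block(block_dim) * kWarpSize - block_dim;
}

// Bytes of a that the block's active threads actually read, one element each.
constexpr std::size_t useful_bytes_per_block(int block_dim, std::size_t elem_size) {
    return static_cast<std::size_t>(block_dim) * elem_size;
}
```

With block_dim = 5 this gives 1 warp, 27 idle lanes and 20 useful bytes per block, matching the 20 (4x5) bytes from the question. A block size that is a multiple of 32, such as 128, leaves no idle lanes.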

So, is the very first claim about coalescing false?

No.

memory

cuda

gpgpu

coalescing