Why are CUDA memory allocations aligned to 256 bytes?

According to "cuda alignment 256bytes seriously?", CUDA memory allocations are guaranteed to be aligned to at least 256 bytes.
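For anyone who wants to check the guarantee on a given pointer, here is a minimal sketch. The helper name is_aligned is hypothetical, not part of the CUDA API. On a real system the pointer would come from cudaMalloc; the test is plain pointer arithmetic, so it works the same on any address.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a CUDA API): returns true if `ptr` is a multiple
// of `alignment` bytes. `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

Calling is_aligned(ptr, 256) on a pointer returned by cudaMalloc should return true if the guarantee quoted above holds.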

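Why is that the case? 256 bytes is much larger than any numeric data type. It could be the size of a vector, but GPUs do not require loads/stores to be aligned to the size of the whole vector. In fact they even support gather/scatter, which can place each individual element at any memory address that is a multiple of the element size.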

What is the benefit of 256-byte alignment?

Why is that the case? 256 bytes is much larger than any numeric data type.

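Well, I'm sure there are multiple reasons (e.g., it is easier to manage fewer, larger allocations), but regarding your specific point: don't think about a single value of a numeric data type. Think about a full warp's worth of values. If sizeof(float) is 4, then a warp's worth of floats is 32 * 4 = 128 bytes. If the type is double or a 64-bit integer (e.g. long int on LP64 platforms), you get 32 * 8 = 256 bytes.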
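The arithmetic above can be written down as a small sketch. It assumes 32 threads per warp (true of current NVIDIA GPUs); the names kWarpSize and warp_footprint_bytes are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 32 threads per warp.
constexpr std::size_t kWarpSize = 32;

// Bytes touched when every thread of a warp accesses one element of
// `element_size` bytes and the elements are contiguous in memory.
constexpr std::size_t warp_footprint_bytes(std::size_t element_size) {
    return kWarpSize * element_size;
}

// Compile-time checks of the figures quoted in the answer.
static_assert(warp_footprint_bytes(sizeof(float)) == 128, "32 * 4 bytes");
static_assert(warp_footprint_bytes(sizeof(double)) == 256, "32 * 8 bytes");
static_assert(warp_footprint_bytes(sizeof(std::int64_t)) == 256, "32 * 8 bytes");
```

So a 256-byte-aligned allocation lets a warp's worth of 8-byte elements start exactly on the allocation's base address.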

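Note: a warp is not required to read multiple values from memory in a coalesced way. A single thread can read a single unaligned byte, and that will work. However, if the read pattern does not coalesce into reads of contiguous, aligned chunks (typically 128 bytes or 32 bytes), performance will suffer; see also: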

In CUDA, what is memory coalescing, and how is it achieved?
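To make the cost of misalignment concrete, here is a rough sketch that counts how many aligned chunks a warp touches when its threads read consecutive elements. It is a simplified model, not a description of what any particular GPU does: it assumes 32 threads per warp and takes the chunk size as a parameter (128 or 32 bytes, the sizes mentioned above). The name segments_touched is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 32 threads per warp.
constexpr std::size_t kThreadsPerWarp = 32;

// Number of aligned chunks of `segment_bytes` covered when the threads of a
// warp read consecutive elements of `element_size` bytes, starting at byte
// address `base`. More chunks means more memory transactions in this model.
constexpr std::size_t segments_touched(std::uint64_t base,
                                       std::size_t element_size,
                                       std::size_t segment_bytes) {
    const std::uint64_t first = base / segment_bytes;
    const std::uint64_t last =
        (base + kThreadsPerWarp * element_size - 1) / segment_bytes;
    return static_cast<std::size_t>(last - first + 1);
}
```

With 128-byte chunks, a warp of floats starting at an aligned address covers 1 chunk, while the same read shifted by 4 bytes covers 2. A warp of doubles starting at a 256-byte-aligned address covers 2 chunks, and 3 once it is shifted by 4 bytes.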

Tags: cuda, gpu, gpgpu, memory-alignment