CUDA coalesced access of FP64 data

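I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

A warp always consists of 32 threads regardless if these threads are doing FP32 or FP64 calculations. Right?
I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Now my question is:

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

PS: I am mostly interested in Compute Capability 2.0+ architectures.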

A warp always consists of 32 threads regardless if these threads are doing FP32 or FP64 calculations. Right?

Correct.

I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

Not exactly. There is also a 32-byte transaction size.

So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Correct.

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

Yes. The compiler will emit a 64-bit load instruction, and when the memory accesses can be coalesced, each warp will be serviced by two 128-byte transactions.
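To make the arithmetic concrete, here is a minimal host-side sketch in C++ that counts how many aligned segments a warp touches when its 32 threads read consecutive elements. It is a simplified model and not the hardware's actual coalescing logic; the function name `segments_touched` and its parameters are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Simplified model (an assumption, not the exact hardware algorithm):
// each of the 32 threads in a warp accesses one element of `elem_bytes`
// at consecutive addresses starting at `base_addr`. We count how many
// distinct aligned segments of `segment_bytes` those accesses touch.
constexpr int kWarpSize = 32;

std::size_t segments_touched(std::uint64_t base_addr,
                             std::size_t elem_bytes,
                             std::size_t segment_bytes) {
    assert(elem_bytes > 0 && segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // First and last byte address read by this lane.
        std::uint64_t first = base_addr + lane * elem_bytes;
        std::uint64_t last = first + elem_bytes - 1;
        // A misaligned element could straddle a segment boundary,
        // so record every segment index between first and last byte.
        for (std::uint64_t s = first / segment_bytes;
             s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, `segments_touched(0, 4, 128)` gives 1 (32 floats fit in one 128-byte segment) and `segments_touched(0, 8, 128)` gives 2 (32 doubles span 256 bytes). With the 32-byte transaction size, `segments_touched(0, 8, 32)` gives 8. If the base address is not aligned to the segment size, for example `segments_touched(64, 8, 128)`, the same 256 bytes touch 3 segments.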

Tags: double, cuda, gpgpu, gpu-warp