CUDA coalesced access of FP64 data

I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

Now my questions are:

PS: I am mostly interested in Compute Capability 2.0+ architectures.

A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?

Correct.

I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

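Not exactly. There is also a 32-byte transaction size. On Compute Capability 2.x, accesses cached in both L1 and L2 are serviced with 128-byte transactions, while accesses cached only in L2 are serviced with 32-byte transactions.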

So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Correct.

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

Yes. The compiler will emit a 64-bit load instruction, and when the memory accesses can be coalesced, each warp will be serviced by two 128-byte transactions.