Why does a thread accessing two contiguous elements cause a "bank conflict"?
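As the red box in the figure above shows, I don't understand why a thread accessing two contiguous array elements causes a bank conflict, while the access pattern in the figure below does not.
Thanks for your answers!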
https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/
Shared memory bank conflicts
To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.
However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
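Suppose your parallel reduction example has 8 memory banks, each 4 bytes wide. Element i is then served by bank i % 8.
In the first example, banks 0, 2, 4, and 6 each have to serve two requests.
In the second example, each bank only has to serve one request.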
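
The bank arithmetic in this answer can be checked with a small host-side simulation. This is only a sketch: the figures from the question are not reproduced here, so the two access patterns are assumptions (8 threads, where thread t reads element 2*t in the first case and element t in the second), and the names `bank_of`, `requests_per_bank`, and `max_requests_per_bank` are made up for illustration. The 8-bank layout is the toy value from the answer, not a real hardware configuration.

```python
from collections import Counter

NUM_BANKS = 8  # toy value from the answer above, not a real GPU configuration


def bank_of(element_index, num_banks=NUM_BANKS):
    """Bank serving a 4-byte element: element i maps to bank i % num_banks."""
    return element_index % num_banks


def requests_per_bank(indices, num_banks=NUM_BANKS):
    """Count how many distinct elements each bank must serve in one access."""
    # Threads reading the same element are served by a broadcast,
    # so only distinct element indices are counted.
    return Counter(bank_of(i, num_banks) for i in set(indices))


def max_requests_per_bank(indices, num_banks=NUM_BANKS):
    """Worst case over all banks; a value above 1 means the access is serialized."""
    return max(requests_per_bank(indices, num_banks).values())


threads = range(8)
interleaved = [2 * t for t in threads]  # assumed first pattern: thread t reads element 2*t
sequential = [t for t in threads]       # assumed second pattern: thread t reads element t

print(sorted(requests_per_bank(interleaved).items()))  # [(0, 2), (2, 2), (4, 2), (6, 2)]
print(max_requests_per_bank(interleaved))              # 2
print(max_requests_per_bank(sequential))               # 1
```

Under these assumptions, the stride-2 pattern sends two requests to each of banks 0, 2, 4, and 6 and leaves the odd banks idle, while the stride-1 pattern spreads one request across every bank.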