Effective Bandwidth in CUDA

When calculating effective bandwidth in CUDA, do I count the reads and writes to shared memory? Sample code is given below.

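#define NUM 256  // assumed block size; the question does not define NUM

__global__ void kernel(const float *a, const float *b, float *c, int num){
    int i = threadIdx.x + blockIdx.x*blockDim.x;  // global element index
    int t = threadIdx.x;  // shared arrays are per block, so index by thread (needs blockDim.x <= NUM)
    __shared__ float a_shared[NUM];
    __shared__ float b_shared[NUM];
    if (i < num){
       a_shared[t] = a[i];   // global read, shared write
       b_shared[t] = b[i];   // global read, shared write
       c[i] = a_shared[t] + b_shared[t];   // two shared reads, one global write
    }
}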

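Regarding the section of the Best Practices Guide that you pointed to in the comments above, I believe the answer is no: shared memory traffic should not be included.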

How do we know?

  1. The main purpose of calculating effective bandwidth is to compare it with theoretical bandwidth:

To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it.

However, the theoretical bandwidth calculation covers only global memory traffic to DRAM:

Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/s:

That number is the DRAM bandwidth. It does not include shared memory bandwidth.

  2. The profiler metrics the guide refers to all concern global memory traffic, not shared memory:

Requested Global Load Throughput

Requested Global Store Throughput

Global Load Throughput

Global Store Throughput

DRAM Read Throughput

DRAM Write Throughput

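  3. No method for calculating theoretical shared memory bandwidth is documented anywhere in the official CUDA documentation that I know of, so it cannot be included in the theoretical bandwidth calculation. A measurement that includes shared memory bandwidth is therefore meaningless for comparison purposes.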