CUDA 的 nvvp 报告内存访问模式不理想，但带宽几乎达到峰值

Question

编辑：新的最小工作示例来说明问题并更好地解释 nvvp 的结果（遵循评论中给出的建议）。

所以，我制作了一个 "minimal" 工作示例，如下所示：

#include <cuComplex.h>
#include <iostream>

int const n = 512 * 100;

typedef float real;

template < class T >
struct my_complex {
   T x;
   T y;
};

__global__ void set( my_complex< real > * a )
{
   my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d = { 1.0f, 0.0f };
}

__global__ void duplicate_whole( my_complex< real > * a )
{
   my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d = { 2.0f * d.x, 2.0f * d.y };
}

__global__ void duplicate_half( real * a )
{
   real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
   d *= 2.0f;
}

int main()
{
   my_complex< real > * a;
   cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );

   set<<< n, 1024 >>>( a );
   cudaDeviceSynchronize();
   duplicate_whole<<< n, 1024 >>>( a );
   cudaDeviceSynchronize();
   duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
   cudaDeviceSynchronize();

   my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
   cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );

   std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )"  << std::endl;

   return 0;
}

当我编译和运行上述代码时，内核 duplicate_whole 和 duplicate_half 花费几乎相同的时间运行。

但是，当我使用 nvvp 分析内核时，我在以下意义上得到了每个内核的不同报告。对于内核 duplicate_whole，nvvp 警告我在第 23 行（d = { 2.0f * d.x, 2.0f * d.y };）内核正在执行

Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4

我同意我正在加载 8 字节的字。我不明白的是为什么 4 个字节是理想的字长。特别是，内核之间没有性能差异。

我想在某些情况下，这种全局存储访问模式可能会导致性能下降。这些是什么？

为什么我的性能没有受到影响？

希望这次编辑澄清了一些不清楚的地方。

+++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++

我将从一些内核代码开始举例说明我的问题，下面将跟进

template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
                                        std::complex< data_t > * data,
                                        int M,
                                        int row_length,
                                        int b,
                                        int i_0
                                        )
{
#ifndef CUGALE_MUL_SHUFFLE
    // Output array length:
    int plane_area = row_length * M;
    // Process element:
    int i = blockIdx.x * row_length + threadIdx.x + i_0;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
    my_complex< data_t > datum;
    my_complex< data_t > datum_new;

    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
        datum = ref_datum;
        datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
        datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
        ref_datum = datum_new;
    }
#else
    // Output array length:
    int plane_area = row_length * M;
    // Element to process:
    int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );

    // Real and imaginary part of datum (not respectively for odd threads):
    data_t datum_a;
    data_t datum_b;

    // Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
    int parity = ( threadIdx.x % 2 );
    int shuffle_dir = 1 - 2 * parity;
    int inwarp_tid = threadIdx.x % warpSize;

    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        int data_idx = i_b * plane_area + i;
        datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
        datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );

        // Even TIDs compute real part, odd TIDs compute imaginary part:
        reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
    }
#endif // #ifndef CUGALE_MUL_SHUFFLE
}

让我们考虑一下 data_t 是浮点数的情况，这是内存带宽受限的情况。从上面可以看出，内核有两个版本，一个是每个线程 reads/writes 8 个字节（一个完整的复数），另一个是每个线程 reads/writes 4 个字节，然后将结果打乱复积计算正确。

我使用 shuffle 编写版本的原因是因为 nvvp 坚持每个线程读取 8 个字节不是最好的主意，因为这种内存访问模式效率低下。即使在测试的两个系统（GTX 1050 和 GTX Titan Xp）中内存带宽非常接近理论最大值，情况也是如此。

我当然知道不会有任何改进，事实确实如此：两个内核花费的时间几乎相同运行。所以，我的问题如下：

为什么 nvvp 报告每个线程读取 8 个字节的效率低于读取 4 个字节的效率？在什么情况下会出现这种情况？

附带说明一下，单精度对我来说更重要，但双精度在某些情况下也很有用。有趣的是，在 data_t 为 double 的情况下，两个内核版本之间也没有执行时间差异，即使在这种情况下内核受计算限制并且随机版本比原始版本执行更多的触发器.

注意：内核应用于 row_length * M * b 数据集（b 图像 with row_length 列和 M 行）并且 chirp_factor 数组是 row_length * M。两个内核运行都很好（如果您对此有疑问，我可以编辑问题以向您展示对两个版本的调用）。

Answer 1

这里的问题与编译器如何处理您的代码有关。 nvvp 只是尽职尽责地报告当您运行您的代码时发生的事情。

如果您在可执行文件上使用 cuobjdump -sass 工具，您会发现 duplicate_whole 例程正在执行两个 4 字节加载和两个 4 字节存储。这不是最优的，部分原因是每次加载和存储都有一个跨度（每次加载和存储都涉及内存中的交替元素）。

原因是编译器不知道您的 my_complex 结构的对齐方式。在阻止编译器生成（合法的）8 字节加载的情况下，您的结构将是合法的。正如讨论的那样 here 我们可以通过通知编译器我们只打算在 CUDA 8 字节加载合法的对齐场景中使用结构来解决这个问题（即它是 "naturally aligned"）。对结构的修改如下所示：

template < class T >
struct  __align__(8) my_complex {
   T x;
   T y;
};

通过对代码的更改，编译器会为 duplicate_whole 内核生成 8 字节的负载，您应该会从分析器中看到不同的报告。仅当您了解其含义并愿意与编译器签订合同以确保情况如此时，才应使用这种装饰。如果你做了一些不寻常的事情，比如不寻常的指针转换，你可能会违反约定并产生机器故障。

您看不到太多性能差异的原因几乎肯定与 CUDA load/store 行为和 GPU 缓存

有关

当你进行跨步加载时，GPU 无论如何都会加载整个缓存行，即使（在这种情况下）你只需要一半的元素（实际元素）来进行特定的加载操作。但是，无论如何您都需要另一半元素（虚构元素）；它们将在下一条指令上加载，并且由于之前的加载，这条指令很可能命中缓存。

在这种情况下，在跨步存储中，在一条指令中写入跨步元素，并在下一条指令中写入替代元素，最终将使用其中一个缓存作为 "coalescing buffer"。这不是 CUDA 术语中使用的典型意义上的合并；这种合并仅适用于单个指令。然而，缓存 "coalescing buffer" 行为允许它 "accumulate" 多次写入已驻留的行，然后该行被写出或逐出。这大约相当于 "write-back" 缓存行为。

CUDA 的 nvvp 报告内存访问模式不理想，但带宽几乎达到峰值

CUDA's nvvp reports non-ideal memory access pattern, but bandwidth is almost peaking

cuda

nvvp