为什么 cuFFT 性能会因输入重叠而受到影响？

Question

我正在尝试使用 cuFFT 的回调功能来动态执行输入格式转换（例如，计算 8 位整数输入数据的 FFT，而无需首先将输入缓冲区显式转换为 float).在我的许多应用程序中，我需要在输入缓冲区 as described in this previous SO question 上计算重叠 FFT。通常，相邻的 FFT 可能重叠 FFT 长度的 1/4 到 1/8。

cuFFT 具有类似 FFTW 的接口，明确支持此 via the idist parameter of the cufftPlanMany() function。具体来说，如果我想计算大小为 32768 且连续输入之间有 4096 个样本重叠的 FFT，我会设置 idist = 32768 - 4096。确实在产生正确输出的意义上正常工作。

但是，以这种方式使用 cuFFT 时，我发现性能出现奇怪的下降。我设计了一个测试，以两种不同的方式实现这种格式转换和重叠：

明确告诉 cuFFT 输入的重叠性质：设置 idist = nfft - overlap 如上所述。安装一个加载回调函数，根据需要在提供给回调的缓冲区索引上执行从 int8_t 到 float 的转换。
不要告诉 cuFFT 输入的重叠性质；骗它一个 dset idist = nfft。然后，让回调函数通过计算应该为每个 FFT 输入读取的正确索引来处理重叠。

A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist。为了简洁起见，我没有在这里全部重现。该程序计算了一批 1024 个 32768 点的 FFT，由 4096 个样本重叠；输入数据类型是 8 位整数。当我运行它在我的机器上时（使用 Geforce GTX 660 GPU，在 Ubuntu 16.04 上使用 CUDA 8.0 RC），我得到以下结果：

executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec

方法 2 明显更快，这出乎我的意料。查看回调函数的实现：

方法一：

template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index, 
    void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}

方法二：

template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf, 
    size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking 
    // the overlap into account. Convert it to the appropriate sample 
    // index, considering the overlap structure. First, grab the FFT 
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer 
    // arithmetic here for speed, knowing that this would cause a problem 
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap" 
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}

方法 2 有一个显着更复杂的回调函数，一个甚至涉及除以非编译时间值的整数！我希望这比方法 1 慢得多，但我看到的恰恰相反。对此有一个很好的解释吗？当输入重叠时，cuFFT 的处理结构是否有可能大不相同，从而导致性能下降？

如果可以从回调中删除索引计算（但这需要重叠指定给 cuFFT）。

编辑： 运行在 nvvp 下运行我的测试程序后，我可以看到 cuFFT 的计算结构肯定有所不同。很难理解内核符号名称，但内核调用分解如下：

方法一：

__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE：3.72 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>：7.71 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>：12.75 毫秒（是的，它被调用了两次）
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_：7.49 毫秒

方法二：

spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>：5.15 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 毫秒
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_：7.51 毫秒

有趣的是，看起来 cuFFT 使用方法 1（当 cuFFT 知道重叠时）调用两个内核来实际计算 FFT，但是使用方法 2（它不知道 FFT 重叠）时，它只需一个即可完成工作。对于在这两种情况下使用的内核，它似乎确实在方法 1 和方法 2 之间使用了相同的网格参数。

我不明白为什么它必须在这里使用不同的实现，特别是因为输入步幅 istride == 1。在转换输入处获取数据时，它应该只使用不同的基地址；我认为算法的其余部分应该完全相同。

编辑 2： 我看到了一些更奇怪的行为。我偶然意识到，如果我未能适当地破坏 cuFFT 句柄，我会看到测量性能的差异。例如，我修改了测试程序以跳过 cuFFT 句柄的破坏，然后以不同的顺序执行测试：方法 1、方法 2，然后是方法 2，然后再次执行方法 1。我得到以下结果：

executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec

因此，在为测试用例创建计划时，性能似乎会发生变化，具体取决于是否存在其他 cuFFT 计划！使用探查器，我看到内核启动的结构在这两种情况下没有改变；内核似乎都执行得更快。这种效果我也没有合理的解释。

Answer 1

如果您指定非标准步幅（如果 batch/transform 则无关紧要）cuFFT 在内部使用不同的路径。

广告编辑 2：这可能是 GPU Boost 调整了 GPU 上的时钟。 cuFFT 计划相互之间没有影响

获得更稳定结果的方法：

运行预热内核（任何能让 GPU 充分工作的东西都很好）然后是你的问题
增加批量大小
运行测试几次取平均值
GPU 的锁定时钟（在 GeForce 上不太可能 - Tesla 可以做到）

Answer 2

在@llukas 的建议下，我就此问题向 NVIDIA 提交了错误报告（https://partners.nvidia.com/bug/viewbug/1821802 如果您已注册为开发人员）。他们承认计划重叠的表现较差。他们实际上表示在这两种情况下使用的内核配置都不是最佳的，他们计划最终改进它。没有给出 ETA，但很可能不会出现在下一个版本中（8.0 上周刚刚发布）。最后，他们表示，从 CUDA 8.0 开始，没有解决方法可以使 cuFFT 使用更有效的跨步输入方法。

为什么 cuFFT 性能会因输入重叠而受到影响？

Why does cuFFT performance suffer with overlapping inputs?

cuda

fft

cufft