CUDA cuFFT 2D example

Question

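I am currently working on a program that has to implement a 2D FFT (for cross-correlation). I wrote a 1D FFT with CUDA that gave me the correct results, and I am now trying to implement a 2D version. With so few examples and so little documentation online, I find it hard to work out where the error is.

So far I have only been using the cuFFT manual.

Anyway, I have created two 5x5 arrays and filled them with 1's. I copied them to GPU memory, ran the forward FFT on each, multiplied them, and then ran the inverse FFT on the result. This gives me a 5x5 array with values 650. I would expect to get a DC signal with the value 25 in only one slot of the 5x5 array. Instead I get 650 in the entire array.

Furthermore, I am not allowed to print out the value of the signal after it has been copied to GPU memory. Writing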

cout << d_signal[1].x << endl;

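gives me an access violation. I have done the same thing in other CUDA programs, where it was not a problem. Does it have something to do with how the complex variable works, or is it human error?

If anyone has any pointers as to what is going wrong, I would greatly appreciate it. Here is the code: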

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <helper_functions.h>
#include <helper_cuda.h>

#include <ctime>
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <cufft.h>
#include <fstream>

using namespace std;
typedef float2 Complex;





__global__ void ComplexMUL(Complex *a, Complex *b)
{
    int i = threadIdx.x;
    a[i].x = a[i].x * b[i].x - a[i].y*b[i].y;
    a[i].y = a[i].x * b[i].y + a[i].y*b[i].x;
}


int main()
{


    int N = 5;
    int SIZE = N*N;


    Complex *fg = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fg[i].x = 1; 
        fg[i].y = 0;
    }
    Complex *fig = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fig[i].x = 1; // 
        fig[i].y = 0;
    }
    for (int i = 0; i < 24; i=i+5)
    {
        cout << fg[i].x << " " << fg[i + 1].x << " " << fg[i + 2].x << " " << fg[i + 3].x << " " << fg[i + 4].x << endl;
    }
    cout << "----------------" << endl;
    for (int i = 0; i < 24; i = i + 5)
    {
        cout << fig[i].x << " " << fig[i + 1].x << " " << fig[i + 2].x << " " << fig[i + 3].x << " " << fig[i + 4].x << endl;
    }
    cout << "----------------" << endl;

    int mem_size = sizeof(Complex)* SIZE;


    cufftComplex *d_signal;
    checkCudaErrors(cudaMalloc((void **) &d_signal, mem_size)); 
    checkCudaErrors(cudaMemcpy(d_signal, fg, mem_size, cudaMemcpyHostToDevice));

    cufftComplex *d_filter_kernel;
    checkCudaErrors(cudaMalloc((void **)&d_filter_kernel, mem_size));
    checkCudaErrors(cudaMemcpy(d_filter_kernel, fig, mem_size, cudaMemcpyHostToDevice));

    // cout << d_signal[1].x << endl;
    // CUFFT plan
    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);

    // Transform signal and filter
    printf("Transforming signal cufftExecR2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
    cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

    printf("Launching Complex multiplication<<< >>>\n");
    ComplexMUL <<< 32, 256 >> >(d_signal, d_filter_kernel);

    // Transform signal back
    printf("Transforming signal back cufftExecC2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);

    Complex *result = new Complex[SIZE];
    cudaMemcpy(result, d_signal, sizeof(Complex)*SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i=i+5)
    {
        cout << result[i].x << " " << result[i + 1].x << " " << result[i + 2].x << " " << result[i + 3].x << " " << result[i + 4].x << endl;
    }

    delete result, fg, fig;
    cufftDestroy(plan);
    //cufftDestroy(plan2);
    cudaFree(d_signal);
    cudaFree(d_filter_kernel);

}

The above code gives the following terminal output:

1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
Transforming signal cufftExecR2C
Launching Complex multiplication<<< >>>
Transforming signal back cufftExecC2C

625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625

Answer 1

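"This gives me a 5x5 array with values 650": it reads 625, which is 5*5*5*5. The convolution algorithm you are using requires an additional division by N*N. Indeed, cuFFT applies no normalization coefficient to its transforms. Hence your convolution cannot be just the multiplication of the two fields in the frequency domain; the result still has to be rescaled. (Some would call it the mathematician's DFT rather than the physicist's DFT.)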

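"Furthermore I am not allowed to print out the value of the signal after it has been copied to GPU memory": this is standard CUDA behavior. When you allocate memory on the device, the data lives in the device memory address space and cannot be accessed by the CPU without additional effort. Copy it back to the host first, or search for managed memory or zero-copy to make the data accessible from both sides of the PCI Express bus (this has been discussed in many other posts).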

Answer 2

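There are several problems here:

1. You are launching far too many threads for the size of the input arrays in the multiplication kernel, so it should be failing with out-of-bounds memory errors. I am surprised you are not getting any sort of runtime error.
2. I believe your expected solution for the fft/fft - dot product - ifft sequence is incorrect. The correct solution would be a 5x5 matrix with 25 in each entry.
3. As clearly described in the cuFFT documentation, the library performs un-normalized FFTs: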

cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.

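So by my reckoning, the correct output for your code should be a 5x5 matrix with 625 in each entry, which would be normalized to a 5x5 matrix with 25 in each entry, i.e. the expected result. I don't understand how the problem at (1) isn't producing different results, as the multiplication kernel should be failing.

TL;DR: nothing to see here, move along...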

Answer 3

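Just as an addition to the other things already mentioned: I think your complex multiplication kernel is not doing the right thing. You overwrite a[i].x in the first line and then use the new value of a[i].x in the second line to compute a[i].y. I think you need to back up a[i].x before you overwrite it, something like: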

float aReal_bk = a[i].x;
a[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
a[i].y = aReal_bk * b[i].y + a[i].y * b[i].x;


c++

cuda

cufft