cufft store callback called more than once

I am using a cuFFT store callback in a complex-to-complex, out-of-place, batched 1D FFT (that is, I am doing many 1D FFTs of the same size). Based on Section 2.9.4 of the documentation, I expect this callback to be called exactly once for each output. See in particular this quote, taken verbatim from the link:

cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once.

Nevertheless, I seem to have an example that contradicts this. In the code below, I expect to see each of the numbers 0-19 appear exactly once, corresponding to the store callback being called exactly once for each output sample. However, when I do 504 1D FFTs of size 32, the store callback is called twice for each output!

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include <cuda.h>    
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>


// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {

    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}
__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;


int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;            // Size of each FFT
    int num_ffts = 504;         // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
//    int fft_sz = 1024;         // Size of each FFT
//    int num_ffts = 20;         // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // Plan num_ffts of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate save callback with plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, -1);

    // Sync the device to flush the output
    cudaDeviceSynchronize();

    return 0;
}

Example output for fft_sz=32, num_ffts=504:

$ stor_cb_tst 
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19

By contrast, if I do 20 FFTs of size 1024, then I get the expected behavior: the store callback is called exactly once for each output. Example output for fft_sz=1024, num_ffts=20:

$ stor_cb_tst 
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19

Am I misunderstanding something? Do I have a bug, or is this a problem with cuFFT?

I am running with CUDA V8.0.61 and g++ 5.4.0 on Linux Mint, on a GeForce GTX 1080:

$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8114 MBytes (8507752448 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1848 MHz (1.85 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS

Here is how I compile:

$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu 
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst 
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19

I was able to reproduce the observation on CUDA 8, but not on CUDA 9. However, I don't think anything is actually wrong, even with CUDA 8. Let's start by looking at the documentation more closely:

From the cuFFT doc, section 2.9.4:

Similarly it will call the store callback routine, for each point in the output, once and only once.

You are assuming that each point in the output has a single, unique value of index passed to the store callback routine, but we will shortly see that this is not the case.

it will only call the store callback routine from the last phase kernel(s).

So we see that the store callback routine may be called from more than one independent kernel in the last phase of the transform (note the use of "kernel(s)").

For some configurations, threads may load or store inputs or outputs in any order, and cuFFT does not guarantee that the inputs or outputs handled by a given thread will be contiguous. These characteristics may vary with transform size, transform type (e.g. C2C vs C2R), number of dimensions, and GPU architecture. These variations may also change from one library version to the next.

This gives some additional hints that we should not expect all output data to be handled in a nice contiguous fashion in every case. And the indicated variability may depend on the exact transform parameters, as well as on the cuFFT library version.

So let's get down to the main question. Is cuFFT calling the store callback more than once per output point? It is not. To demonstrate this, let's modify your store callback as follows:

static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {

    // Print the index. Each index should appear exactly once.
    //if (index < 20) printf("%8llu, %p, \n", index, a);
    cufftComplex temp = ((cufftComplex *)a)[index];
    temp.x++;
    ((cufftComplex *)a)[index] = temp;
    // Do the store
    //((cufftComplex *)a)[index] = z;
    if (index < 20) printf("%8llu, %p, %f\n", index, a, temp.x);

}

This store callback does not write the expected output; it only increments the given output point by 1. Furthermore, rather than just printing out the index value and possibly drawing incorrect conclusions, we print out the index, plus the base address a, plus the actual value we incremented. For all of this to work, we need to pre-initialize the whole output array to zeros:

cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
cudaMemset(out_buf_d, 0, fft_sz*num_ffts*sizeof(cufftComplex));  // add this

When I compile and run the modified code on CUDA 8 on Linux with a cc3.5 device (Tesla K20x), this is the output:

$ nvcc -arch=sm_35 -o t20 t20.cu -rdc=true -lcufft_static -lcudadevrt -lculibos
$ ./t20
       0, 0x2305b5f800, 1.000000
       1, 0x2305b5f800, 1.000000
       2, 0x2305b5f800, 1.000000
       3, 0x2305b5f800, 1.000000
       4, 0x2305b5f800, 1.000000
       5, 0x2305b5f800, 1.000000
       6, 0x2305b5f800, 1.000000
       7, 0x2305b5f800, 1.000000
       8, 0x2305b5f800, 1.000000
       9, 0x2305b5f800, 1.000000
      10, 0x2305b5f800, 1.000000
      11, 0x2305b5f800, 1.000000
      12, 0x2305b5f800, 1.000000
      13, 0x2305b5f800, 1.000000
      14, 0x2305b5f800, 1.000000
      15, 0x2305b5f800, 1.000000
      16, 0x2305b5f800, 1.000000
      17, 0x2305b5f800, 1.000000
      18, 0x2305b5f800, 1.000000
      19, 0x2305b5f800, 1.000000
       0, 0x2305b7d800, 1.000000
       1, 0x2305b7d800, 1.000000
       2, 0x2305b7d800, 1.000000
       3, 0x2305b7d800, 1.000000
       4, 0x2305b7d800, 1.000000
       5, 0x2305b7d800, 1.000000
       6, 0x2305b7d800, 1.000000
       7, 0x2305b7d800, 1.000000
       8, 0x2305b7d800, 1.000000
       9, 0x2305b7d800, 1.000000
      10, 0x2305b7d800, 1.000000
      11, 0x2305b7d800, 1.000000
      12, 0x2305b7d800, 1.000000
      13, 0x2305b7d800, 1.000000
      14, 0x2305b7d800, 1.000000
      15, 0x2305b7d800, 1.000000
      16, 0x2305b7d800, 1.000000
      17, 0x2305b7d800, 1.000000
      18, 0x2305b7d800, 1.000000
      19, 0x2305b7d800, 1.000000
$

What we see is:

  1. Yes, the index values are repeated, but the base address (the pointer) is different for each of the repeated cases. Therefore, even though the index values are repeated, each output point is only written once.
  2. As further confirmation, if we were writing any output point more than once, then for this particular callback we would expect to see that output incremented to 2.000000. But we only ever see 1.000000 in the output. So none of these output points is written more than once. (A host-side check of this is sketched right after this list.)
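
As an optional cross-check, one could also copy the output buffer back to the host after the cudaDeviceSynchronize() call and confirm that every element's real part is exactly 1.0f; a value of 2.0f anywhere would indicate a point written twice, and 0.0f a point never written. This is just a sketch under that assumption, reusing out_buf_d, fft_sz and num_ffts from the modified test program:

// Optional host-side check (not in the original test code): after the
// transform, every output point should have been incremented exactly once
// by the store callback, so every real part should be exactly 1.0f.
cufftComplex *out_buf_h = new cufftComplex[fft_sz*num_ffts];
cudaMemcpy(out_buf_h, out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex),
           cudaMemcpyDeviceToHost);
int bad = 0;
for (int i = 0; i < fft_sz*num_ffts; i++)
    if (out_buf_h[i].x != 1.0f) bad++;   // 2.0f => written twice, 0.0f => never written
printf("output points not written exactly once: %d\n", bad);
delete [] out_buf_h;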

I think it is likely that this particular output pattern arises from 2 separate kernel calls in the last phase of the transform. Some further evidence of this could probably be obtained from the profiler.

As I mentioned at the start, when I use CUDA 9 instead of CUDA 8 for this test case, I see different behavior (only a single set of output indices from 0 to 19 is printed). However, this possibility (behavior varying from one library version to the next) is also accounted for in the documentation, as noted earlier.

Anticipating a follow-up question:

But if the index value is not unique, and I want to apply some transformation to the output that varies based on index, what should I do?

I believe the assumption here is that any transformation you intend to apply to the output of a batched transform should depend only on the index position within the batch. Under that assumption, my expectations are:

  1. The multi-kernel duplication of indices will always occur on batch boundaries.

  2. The appropriate transformation can be applied by performing a modulo-batch-size operation on the index value passed to the callback routine (a sketch follows this list).
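
For illustration, here is a rough sketch of what item 2 might look like: a store callback that scales each output point by a window value that depends only on the position within the batch. The window array, the use of cb_info, and the fixed FFT_SZ are my own hypothetical additions, not taken from the original code or the cuFFT documentation:

#define FFT_SZ 32   // hypothetical: must match the FFT size used in the plan

// Hypothetical store callback: applies a per-position scale factor that
// depends only on the position within each batch. Even if an index value
// is duplicated across the last-phase kernels, index % FFT_SZ selects the
// same scale factor for the same position in the batch.
static __device__ void stor_cb_scaled(void *a, size_t index, cufftComplex z,
                                      void *cb_info, void *sharedmem) {
    const float *window = (const float *)cb_info;  // FFT_SZ scale factors in device memory
    float w = window[index % FFT_SZ];
    z.x *= w;
    z.y *= w;
    ((cufftComplex *)a)[index] = z;
}
__device__ cufftCallbackStoreC stor_cb_scaled_ptr_d = stor_cb_scaled;

The device pointer to the window array would be supplied via the caller-info (last) argument of cufftXtSetCallback, where the original code passes 0.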

I offer these expectations without proof, and without having tried to confirm them from the documentation, but they describe the only implementation that makes sense to me given the observations already covered. One take-away is that if you have a transformation you wish to apply that varies from batch to batch, this (i.e., via the callback) may not be the way to do it. However, as I already mentioned, things seem to have changed in CUDA 9. If any of this is a concern to you, feel free to file an RFE (bug report) with the desired/expected behavior (and/or a documentation update request) at http://developer.nvidia.com, keeping in mind that your expected behavior may already be implemented in CUDA 9.