Sum reduction with CUB

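According to this article, sum reduction with the CUB library should be one of the fastest ways to perform a parallel reduction. As you can see in the code snippet below, the measured execution time excludes the first call to cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum()); I assume that call is connected with memory preparation, and that when we reduce the same data several times it isn't necessary to repeat it every time. But when I have many different arrays with the same number of elements and the same data type, do I have to make that call every time? If the answer is yes, using the CUB library becomes pointless.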

  size_t temp_storage_bytes;
  int* temp_storage=NULL;
  // First call: temp_storage is NULL, so CUB only reports the required size in temp_storage_bytes
  cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
  cudaMalloc(&temp_storage,temp_storage_bytes);

  cudaDeviceSynchronize();
  cudaCheckError();
  cudaEventRecord(start);

  // Timed region: only the actual reductions, reusing the temporary storage allocated above
  for(int i=0;i<REPEAT;i++) {
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
  }
  cudaEventRecord(stop);
  cudaDeviceSynchronize();

I assume that it's something connected with memory preparation, and when we reduce the same data several times it isn't necessary to call it every time

That's correct.

but when I've got many different arrays with the same number of elements and type of data do I have to do it every time?

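No, you don't need to do it every time. The sole purpose of the "first" call to cub::DeviceReduce::Reduce (i.e. when temp_storage=NULL) is to report the number of bytes of temporary storage that CUB requires. If the type and size of your data do not change, there is no need to re-run either this step or the subsequent cudaMalloc operation: the existing temp_storage allocation can be reused for as long as the size and type of the data stay the same.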
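As a rough illustration of that reuse, here is a minimal sketch that queries and allocates the temporary storage once and then reduces several different arrays of the same length and type. The helper name reduce_many and its host-side vector interface are my own assumptions, not part of the question or of CUB. It calls cub::DeviceReduce::Sum rather than Reduce with cub::Sum(), because as far as I know the Reduce overload in newer CUB releases also takes an initial value, so the exact call from the question may not compile everywhere. Error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not a CUB API): sums each equally sized, non-empty int
// array on the GPU, querying and allocating CUB's temporary storage only once.
std::vector<int> reduce_many(const std::vector<std::vector<int>>& arrays) {
  std::vector<int> sums;
  if (arrays.empty()) return sums;
  // Assumption: every array has the same, non-zero length as the first one.
  const int n = static_cast<int>(arrays[0].size());

  int* d_in = nullptr;
  int* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, sizeof(int));

  // Size query: with a null temp-storage pointer, CUB does no reduction work
  // and only writes the required byte count into temp_bytes.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
  cudaMalloc(&d_temp, temp_bytes);

  // Actual reductions: new data each time, same temporary storage.
  for (const auto& a : arrays) {
    cudaMemcpy(d_in, a.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    sums.push_back(result);
  }

  cudaFree(d_temp);
  cudaFree(d_in);
  cudaFree(d_out);
  return sums;
}
```

The sketch copies every array into the same device buffer only to keep it short. Passing different input and output pointers on each call should work just as well with the same temp_storage, provided the element count and type match those used for the size query.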