Thrust is very slow for array reduction

Question

I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:

#include<chrono>
#include<iostream>

#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>


int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0);
    int sum;
    
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end-start);

    std::cout<<duration.count()<<" ";
    std::cout<<sum;
}

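The problem is that thrust::reduce alone takes about 4 ms to run on my RTX 3070 Laptop GPU. That is considerably slower than the code I wrote based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?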

EDIT 1: Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2 ms to run. The updated code is as follows:

#include<chrono>
#include<iostream>

#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>


int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0);
    int sum;
    
    auto start = std::chrono::steady_clock::now();
    
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double,std::ratio<1,1000>>(end-start);

    std::cout<<duration.count()<<" ";
    std::cout<<sum;
}

Additional information:
I am running CUDA C++ on Ubuntu in WSL2.
CUDA version: 11.4
I am compiling with the nvcc compiler:

nvcc -o reduction reduction.cu

To run:

./reduction

Answer 1

Am I doing something wrong here?

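I would not say you are doing anything wrong here, although that may be a matter of opinion. Let's unpack it a bit using a profiler. My setup is not exactly the same as yours (I am using a different GPU, a Tesla V100, on Linux with CUDA 11.4). In my case the measurement printed by the code is ~0.5ms, not 2ms.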

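The profiler tells me that thrust::reduce is accomplished under the hood by a call to cub::DeviceReduceKernel followed by a call to cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that together these two calls account for ~40us of the ~500us overall time. Assuming you are timing the kernels only, this is the time most comparable to the measurement you made of your implementation of Mark Harris' reduction code. If we multiply by 4 to account for the overall performance ratio (your 2ms versus my ~0.5ms), we get ~160us, which is pretty close to your 150us measurement.
The profiler tells me that the major contributors to the ~500us reported time in my case are a call to cudaMalloc (~200us) and a call to cudaFree (~200us). This is not surprising: if you study the cub::DeviceReduce methodology that thrust is evidently using, it requires an initial call to size a temporary allocation. Since thrust provides a self-contained call for thrust::reduce, it has to make that sizing call itself and also perform a cudaMalloc and a cudaFree for the indicated temporary storage.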

So is there anything that can be done?

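The thrust designers were aware of this situation. To get a (closer to) apples-to-apples comparison between measuring only the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.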

One way to do this would be to switch from thrust to cub.

The thrust way to do it is to use a thrust custom allocator.

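There may be a few other differences in methodology that are affecting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach, which you have not shown. According to my profiler measurement, though, that copy only accounts for a few microseconds.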

c++

arrays

cuda

reduction

thrust