About thrust::execution_policy when copying data from device to host

Question

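I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU holds an equally sized partition of the data. Using OpenMP, I call the function once per device. My current system has 4 GPUs.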

#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i) 
{
    const int device = devices[i];
    thrust::copy(thrust::device, // execution policy
                 device_buffers->At(device)->begin(), // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size)); // thrust::host_vector
}

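After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed.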

When copying data from device to host, both a device iterator and a host iterator are passed as function arguments.
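For reference, here is a minimal single-GPU sketch of the policy-free form of the call, where Thrust has to decide from the iterator types alone. The vector names and the element count are made up for illustration and are not taken from the multi-GPU code above.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <cstdio>

int main()
{
    const int n = 1 << 20;                  // hypothetical element count

    thrust::device_vector<int> d(n);        // lives in GPU memory
    thrust::sequence(d.begin(), d.end());   // fill with 0, 1, 2, ...

    thrust::host_vector<int> h(n);          // ordinary (pageable) host memory

    // No execution policy is passed: Thrust dispatches on the iterator
    // types (device source, host destination).
    thrust::copy(d.begin(), d.end(), h.begin());

    std::printf("first = %d, last = %d\n", h[0], h[n - 1]);
    return 0;
}
```

Compiled with nvcc, this is the variant to compare against the explicit thrust::device call when benchmarking or profiling.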

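1. Which execution policy is chosen here by default, thrust::host or thrust::device?
After doing some benchmarks, I observed that passing thrust::device explicitly improves performance compared to not passing an explicit policy.
2. What could be the reason for the performance improvement? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?
However, nvprof no longer shows the [CUDA memcpy DtoH] category. Instead it shows void thrust::cuda_cub::core [...] __parallel_for::ParallelForAgent [...], which even shows an increase in Time (s). This does not make sense because, as I said, I observed a consistent performance improvement (smaller total execution time) when using thrust::device.
3. Is this just nvprof + Thrust-specific behavior that causes the profiling numbers not to correlate with the actual execution time? I observed something similar for cudaFree: it seems that cudaFree returns control to the host code very quickly, which results in a small execution time, while nvprof shows much higher numbers because the actual deallocation probably happens lazily.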

Answer 1

The Thrust documentation on thrust::device states the following:

Raw pointers allocated by host APIs should not be mixed with a thrust::device algorithm invocation when the device backend is CUDA

As I understand it, this means that a host-device copy with the thrust::device execution policy is invalid in the first place, unless the host memory is pinned.

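Assuming that your host allocation is not pinned: one possibility is that on POWER9 with NVLink you may be lucky, in that any host-allocated memory is addressable from within the GPU. Thanks to that, a host-device copy with thrust::device works, although it should not.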

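On a regular system, host memory is addressable from within a GPU only if it was allocated with cudaMallocHost (pinned). So the question is whether your POWER system automatically upgrades all allocations to pinned. Is the observed performance bonus due to the implicitly pinned memory, or would you get an additional speedup if the allocations were also done explicitly with cudaMallocHost?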
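One way to probe the first half of that question is to ask the CUDA runtime whether each GPU can access pageable host memory directly. The sketch below is only a diagnostic, and it assumes a CUDA toolkit recent enough to define both attributes (I believe 9.2 or later). Note that a device being able to address pageable memory is not the same thing as that memory being pinned, so a value of 1 here would support the "addressable from within the GPU" explanation rather than prove implicit pinning.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    for (int dev = 0; dev < device_count; ++dev) {
        int pageable = 0;       // 1 if the GPU can access pageable host memory
        int host_tables = 0;    // 1 if it does so through the host's page tables

        cudaDeviceGetAttribute(&pageable,
                               cudaDevAttrPageableMemoryAccess, dev);
        cudaDeviceGetAttribute(&host_tables,
                               cudaDevAttrPageableMemoryAccessUsesHostPageTables,
                               dev);

        std::printf("device %d: pageable access = %d, via host page tables = %d\n",
                    dev, pageable, host_tables);
    }
    return 0;
}
```

To answer the second half, the same benchmark could be repeated with the destination buffer allocated through cudaMallocHost and the timings compared.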

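Another piece of evidence, based on Thrust's design, is that the thrust::device policy has par.on(stream) support, while thrust::host does not. This matches well with the fact that asynchronous host-device copies are only possible with pinned memory.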
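To illustrate how the stream-aware policy and pinned memory fit together, here is a small sketch under my own assumptions: the buffer size, the fill value and the variable names are invented, and the device-to-host transfer uses plain cudaMemcpyAsync rather than thrust::copy so that it stays within what the documentation quoted above allows.

```cuda
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t n = 1 << 20;               // hypothetical element count

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned (page-locked) host buffer, needed for a truly asynchronous copy.
    int* h_pinned = nullptr;
    if (cudaMallocHost(reinterpret_cast<void**>(&h_pinned),
                       n * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }

    thrust::device_vector<int> d(n);

    // Device-side algorithm issued on the chosen stream.
    thrust::fill(thrust::cuda::par.on(stream), d.begin(), d.end(), 42);

    // Device-to-host copy queued on the same stream, so it runs after the fill.
    cudaMemcpyAsync(h_pinned, thrust::raw_pointer_cast(d.data()),
                    n * sizeof(int), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);          // wait for fill + copy to finish
    std::printf("h_pinned[0] = %d\n", h_pinned[0]);

    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```

With a pageable destination the same cudaMemcpyAsync call would still be accepted, but it would not overlap with other work in the same way, which is the point being made about pinned memory.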
