如何防止 thrust::reduce_by_key 写入可分页内存?

How do you prevent thrust::reduce_by_key from writing to pageable memory?

我正在编写一个使用多个并发 CUDA 流的应用程序。当我的 thrust::reduce_by_key 调用似乎要写入可分页内存时,我的其他流正在阻塞。我认为 returned 值是问题所在。

如何防止将 return 值写入可分页内存?

我将包含演示我尝试的解决方案的代码。


#include <thrust/system/cuda/vector.h>
#include <thrust/host_vector.h>
#include <thrust/pair.h>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/random.h>

int main(void)
{
  int N = 20;
  thrust::default_random_engine rng;
  thrust::uniform_int_distribution<int> dist(10, 99);

  // initialize data
  thrust::device_vector<int> array(N);
  for (size_t i = 0; i < array.size(); i++)
    array[i] = dist(rng);

  // allocate storage for sums and indices
  thrust::device_vector<int> sums(N);
  thrust::device_vector<int> indices(N);

  // make a pinned memory location for the returned pair of iterators
  typedef thrust::device_vector<int>::iterator  dIter;
  thrust::pair<dIter, dIter>*  new_end;

  const unsigned int bytes =  sizeof(thrust::pair<dIter, dIter>);
  cudaMallocHost((void**)&new_end, bytes);

  for(int i = 0 ; i< 20; i++){   // you can see in the profiler each operator writes 4 bytes to pageable memory

        *new_end = thrust::reduce_by_key
            (thrust::make_counting_iterator(0),
             thrust::make_counting_iterator(N),
             array.begin(),
             indices.begin(),
             sums.begin(),
             thrust::equal_to<int>(),
             thrust::plus<int>());
  }
  std::cout << "done \n";
  return 0;
}

这是我的探查器的图片,显示了从设备到主机可分页内存的副本

I am writing an application which uses several concurrent CUDA streams. My other streams are blocking when my thrust::reduce_by_key appears to write to pageable memory

这种阻塞行为不是由“写入可分页内存”引起的。这是由 explicit synchronization call. In general, as of the CUDA 10.1 (Thrust 1.9.4) release, all normal synchronous algorithms are blocking. You can confirm this yourself by examining an API trace with the profiler. However, you might be able to at least limit the scope of the blocking by launching the call into a stream 引起的,尽管我懒得测试这是否以有用的方式修改了 cuda_cub::synchronize 的行为。

How do you prevent the return value from being written to pageable memory?

并不是说这与您的问题有任何关系,但您不能。重要的是要记住,与您最初的问题断言相反,thrust::reduce_by_key 不是内核,它是执行一系列操作的主机代码,包括 copying the return value from device memory to a host stack variable。程序员无法控制内部结构,显然您尝试使用自己的固定内存值来接受按值传递的结果是荒谬的,并且不会产生任何效果。

正如评论中所建议的那样,如果您需要您的问题所建议的操作内部控制的粒度级别,那么 thrust 是错误的选择。使用 cub::device::reduce_by_key——这与 thrust 使用的算法实现相同,但您可以明确控制临时内存、同步、流以及如何访问调用结果。但是,这不适合初学者。