CUDA equivalent of pragma omp task

Question

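I am working on a problem where the amount of work can vary drastically between threads. For example, one thread may handle 1,000,000 elements this time, while another handles only 1 or 2. I came across an answer that solved an unbalanced workload on the CPU by using OpenMP tasks, so my question is: can I achieve the same thing in CUDA?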

In case you want more context: the problem I am trying to solve is that I have n tuples, each with a start point, an end point, and a value.

(0, 3, 1), (3, 6, 2), (6, 10, 3), ...

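For each tuple, I want to write its value to every position from the start point up to (but not including) the end point in another, initially empty array.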

1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...

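The start-end ranges are guaranteed not to overlap.

My current approach is one thread per tuple, but the distance between start and end can vary a lot, so the unbalanced workload between threads may become a bottleneck for the program. This is rare, but it can well happen.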
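
For reference, the one-thread-per-tuple approach described above might be sketched roughly as follows. This is only an illustration of where the imbalance comes from: the `Range` struct and the names `fill_per_tuple` and `fill_per_tuple_host` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical layout for one (start, end, value) tuple.
struct Range { int start; int end; int value; };

// One thread per tuple: thread i writes its value into out[start, end).
// The loop length differs per thread, so a single long range can keep
// one thread running long after its neighbours have finished.
__global__ void fill_per_tuple(const Range* ranges, int n, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Range r = ranges[i];
    for (int p = r.start; p < r.end; ++p)
        out[p] = r.value;
}

// Host wrapper (assumption: every range lies inside [0, out_size)).
std::vector<int> fill_per_tuple_host(const std::vector<Range>& ranges, int out_size)
{
    if (out_size <= 0) return {};
    std::vector<int> out(out_size, 0);
    int n = static_cast<int>(ranges.size());
    if (n == 0) return out;

    Range* d_ranges = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_ranges, n * sizeof(Range));
    cudaMalloc(&d_out, out_size * sizeof(int));
    cudaMemcpy(d_ranges, ranges.data(), n * sizeof(Range), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, out_size * sizeof(int));  // positions not covered stay 0

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fill_per_tuple<<<blocks, threads>>>(d_ranges, n, d_out);

    cudaMemcpy(out.data(), d_out, out_size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ranges);
    cudaFree(d_out);
    return out;
}
```

With the three sample tuples and an output size of 10, each thread loops 3, 3 and 4 times respectively; the concern is the case where one of those counts is in the millions.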

Answer 1

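The most common CUDA thread strategy I can think of is to assign one thread per output point, and then have each thread do whatever work is necessary to populate its output point.

For your stated objective (having each thread do roughly the same amount of work), this is a useful strategy.

I would suggest using Thrust for this. The basic idea is to: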

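1. Determine the necessary size of the output from the input.
2. Launch a set of threads equal in number to the output size, where each thread determines its "insert index" (the index of the tuple that covers its output position) by using a vectorized binary search on the input.
3. Using the insert index, write the appropriate value into the output array.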

I have used your data; the only change is that I changed the insert values from 1,2,3 to 5,2,7:

$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/tuple.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator

using namespace thrust::placeholders;

typedef thrust::tuple<int,int,int> mt;

// returns selected item from tuple
struct my_cpy_functor1
{
  __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};
struct my_cpy_functor2
{
  __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};

int main(){

  mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
  int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
  int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
  thrust::device_vector<mt> d_data(my_data, my_data+ds);  // transfer data to device
  thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
  thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
  thrust::device_vector<int> d_ins(os);  // create array to hold insertion points
  thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
  thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 -1)), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 -1)), d_ins.begin(), my_cpy_functor2()); // insert
  thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}


$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$
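
For comparison, the same one-thread-per-output idea can also be sketched as a hand-written kernel instead of Thrust calls. This is an illustrative sketch rather than part of the code above: the names `find_tuple`, `fill_per_output` and `fill_ranges` are hypothetical, error checking is omitted, and it assumes (as the Thrust version does) that the ranges are contiguous, sorted, and start at position 0, so only the end points and values are needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Index of the first tuple whose end point is greater than pos,
// i.e. the tuple whose [start, end) range contains pos.
// ends[] must be sorted in ascending order.
__device__ int find_tuple(const int* ends, int n, int pos)
{
    int lo = 0, hi = n;                      // search window [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (ends[mid] <= pos) lo = mid + 1;  // pos lies past tuple mid
        else                  hi = mid;
    }
    return lo;
}

// One thread per output element: each thread does O(log n) work,
// regardless of how long the individual ranges are.
__global__ void fill_per_output(const int* ends, const int* values,
                                int n, int* out, int out_size)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= out_size) return;
    out[pos] = values[find_tuple(ends, n, pos)];
}

// Host wrapper: the output size is taken to be the last end point.
std::vector<int> fill_ranges(const std::vector<int>& ends,
                             const std::vector<int>& values)
{
    if (ends.empty() || ends.back() <= 0) return {};
    int n = static_cast<int>(ends.size());
    int out_size = ends.back();

    int *d_ends = nullptr, *d_values = nullptr, *d_out = nullptr;
    cudaMalloc(&d_ends, n * sizeof(int));
    cudaMalloc(&d_values, n * sizeof(int));
    cudaMalloc(&d_out, out_size * sizeof(int));
    cudaMemcpy(d_ends, ends.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_values, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (out_size + threads - 1) / threads;
    fill_per_output<<<blocks, threads>>>(d_ends, d_values, n, d_out, out_size);

    std::vector<int> out(out_size);
    cudaMemcpy(out.data(), d_out, out_size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ends);
    cudaFree(d_values);
    cudaFree(d_out);
    return out;
}
```

Calling `fill_ranges({3, 6, 10}, {5, 2, 7})` corresponds to the sample data used above. If the ranges had gaps or did not start at 0, the search would also need the start points, which this sketch does not handle.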

