reduce_by_key() 的输出作为两个简化向量的函数

Question

我正在通过从 AoS 转换为 SoA 方法来重构推力代码以利用内存合并。为此，我有两个向量，它们由一个公共键减少，然后用于计算输出向量的值。原始代码使用单个仿函数完成此操作，我想效仿它。

本质上：

Oᵢ = Rᵢ / Sᵢ，其中Rᵢ和Sᵢ是同一个key约简的向量，Oᵢ是对应的输出向量。

下面的代码举例说明了我正在尝试做的事情：

typedef tuple<int,int> Tuple;

struct BinaryTupleOp : public thrust::binary_function<Tuple const &, Tuple const &, int>
{
  __host__ __device__
  int operator()(Tuple const & lhs, Tuple const & rhs) const {
    // get<0> = vals, get<1> = other_vals                                                                                           

    return (get<0>(lhs) + get<0>(rhs)) / (get<1>(lhs) + get<1>(rhs));
  }

};


int main(int argc, char ** argv)
{
  const int N = 7;

  device_vector<int> keys(N);

  keys[0] = 1; // represents sorted keys                                                                                            
  keys[1] = 1;
  keys[2] = 2;
  keys[3] = 2;
  keys[4] = 3;
  keys[5] = 3;
  keys[6] = 3;

  device_vector<int> vals(N);

  vals[0] = 6; // just some random numbers                                                                                          
  vals[1] = 3;
  vals[2] = 9;
  vals[3] = 4;
  vals[4] = 6;
  vals[5] = 1;
  vals[6] = 5;

  device_vector<int> other_vals(N);

  other_vals[0] = 4; // more randomness                                                                                             
  other_vals[1] = 1;
  other_vals[2] = 3;
  other_vals[3] = 6;
  other_vals[4] = 2;
  other_vals[5] = 5;
  other_vals[6] = 7;


  device_vector<int> new_keys(N);
  device_vector<int> output(N);

  typedef device_vector<int>::iterator Iterator;
  thrust::pair<Iterator, Iterator> new_end;

  thrust::equal_to<int> binary_pred;

  new_end = thrust::reduce_by_key(keys.begin(), keys.end(),
                                  make_zip_iterator(make_tuple(vals.begin(), other_vals.begin())),
                                  new_keys.begin(),
                                  output.begin(),
                                  binary_pred,
                                  BinaryTupleOp() );

  Iterator i = new_keys.begin();
  Iterator j = output.begin();

  for (;
       i != new_end.first;
       i++, j++ ) {
    std::cout << "key " << *i << " sum " << *j << endl;
  }

  return 0;
}

不幸的是，这会产生 error: no operator "=" matches these operands、error: no suitable conversion function from "InputValueType" to "TemporaryType" exists 和 error: no suitable conversion function from "const thrust::tuple<int, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>" to "int" exists 等错误。我试过仿函数中的参数类型变体，因为我认为这是问题的最终根源，但无济于事。

作为解决方法，我可能会分别分解这两个归约，然后使用转换来创建输出向量。（这可能建议将各种 transform_reduce() 调用链接在一起，但似乎我想要相反的东西，类似于 reduce_transform()，据我所知，它不存在。）

同时，我做错了什么？

Answer 1

Meanwhile, what am I doing wrong?

thrust::reduce（或thrust::reduce_by_key）将执行并行缩减。这种并行归约需要一个可以成对应用的归约算子。举一个非常简单的例子，假设我们要减少 3 个元素（E1、E2 和 E3）并且我们有一个二元运算（bOp），我们将用于定义归约操作。 Thrust 可能会做这样的事情：

E1       E2      E3
 \        /
    bOp
     \           /
          bOp
           |
         result

也就是说，二元运算将用于将"reduce"元素E1和E2组合成一个临时的部分结果，这个结果将反馈到二元运算与元素 E3 结合产生最终的 result.

这意味着二元运算的输出（因此值输出迭代器的输出类型）必须与其输入类型匹配（因此值输入迭代器的输入类型）。

但是您的二元运算不满足此要求，您为值输入和值输出传递的迭代器类型也不满足此要求：

  new_end = thrust::reduce_by_key(keys.begin(), keys.end(),
                              make_zip_iterator(make_tuple(vals.begin(), other_vals.begin())),
     /* dereferencing the above iterator produces an <int, int> tuple */
                              new_keys.begin(),
                              output.begin(),
     /* dereferencing the above iterator produces an int  */
                              binary_pred,
                              BinaryTupleOp() );

我相信上面对二元运算的一般要求（即对于值输入和输出迭代器类型）在 thrust docs for this function 中表示为：

InputIterator2's value_type is convertible to OutputIterator2's value_type.

我可以想到两种方法来解决您在此处指出的问题：

Oᵢ = Rᵢ / Sᵢ, where Rᵢ and Sᵢ are vectors reduced by the same key, and Oᵢ is the corresponding output vector.

我想你已经提到了第一个：

仅对压缩在一起的两个值序列执行 reduce_by_key。这可以在单个 reduce_by_key 调用中完成。然后将压缩的结果序列传递给 thrust::transform 调用以执行元素划分。
如果你急于在一次推力调用中完成所有事情，那么@m.s 完成的聪明的输出迭代器工作。可能是一种可能的方法，用于对值输出执行转换操作。（编辑：经过一些研究，我不确定这种方法是否可以减少使用）

如果您不喜欢上述任何一个建议，则可以在对 reduce_by_key 的单次调用中实现您想要的减少，代价是一些额外的存储空间（并且可以说是一些浪费的操作）。基本思想是围绕 3 元组而不是 2 元组进行组织。在每个缩减步骤中，我们将组合（求和）lhs 和 rhs 元组的相应（第一和第二）分量，将这些结果存储在输出元组的第一和第二位置。此外，我们将计算输出元组的第 3 个位置的值作为输出元组的第一个和第二个位置相除的结果。您在这里进行 int 操作，包括除法，因此我选择稍微修改您的输入数据以便于结果验证。这是一个基于您的代码的有效示例：

$ cat t1143.cu
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <iostream>


using namespace std;
using namespace thrust;
typedef tuple<int,int,int> Tuple;

struct BinaryTupleOp : public thrust::binary_function<const Tuple &, const Tuple &, Tuple>
{
  __host__ __device__
  Tuple operator()(const Tuple & lhs, const Tuple & rhs) const {
    Tuple temp;
    get<0>(temp) = get<0>(lhs)+get<0>(rhs);
    get<1>(temp) = get<1>(lhs)+get<1>(rhs);
    get<2>(temp) = get<0>(temp)/get<1>(temp);                                   

    return temp;
  }

};


int main(int argc, char ** argv)
{
  const int N = 7;

  device_vector<int> keys(N);

  keys[0] = 1; // represents sorted keys                                        
  keys[1] = 1;
  keys[2] = 2;
  keys[3] = 2;
  keys[4] = 3;
  keys[5] = 3;
  keys[6] = 3;

  device_vector<int> vals(N);

  vals[0] = 6; // just some random numbers                                      
  vals[1] = 3;
  vals[2] = 8;
  vals[3] = 4;
  vals[4] = 5;
  vals[5] = 5;
  vals[6] = 5;

  device_vector<int> other_vals(N);

  other_vals[0] = 1; // more randomness                                         
  other_vals[1] = 2;
  other_vals[2] = 1;
  other_vals[3] = 2;
  other_vals[4] = 1;
  other_vals[5] = 1;
  other_vals[6] = 1;


  device_vector<int> new_keys(N);
  device_vector<int> output(N);

  device_vector<int> v1(N);
  device_vector<int> v2(N);

  thrust::equal_to<int> binary_pred;

  int rsize = thrust::get<0>(thrust::reduce_by_key(keys.begin(), keys.end(),
                                  make_zip_iterator(make_tuple(vals.begin(), other_vals.begin(), thrust::constant_iterator<int>(0))),
                                  new_keys.begin(),
                                  make_zip_iterator(make_tuple(v1.begin(), v2.begin(), output.begin())),
                                  binary_pred,
                                  BinaryTupleOp() )) - new_keys.begin();

  for (int i = 0; i < rsize; i++){
    int key = new_keys[i];
    int val = output[i];
    std::cout << "key " << key << " sum " << val << endl;
  }

  return 0;
}
$ nvcc -o t1143 t1143.cu
$ ./t1143
key 1 sum 3
key 2 sum 4
key 3 sum 5
$

其他人可能对如何制作您想要的结果有更好的想法。

reduce_by_key() 的输出作为两个简化向量的函数

Output from reduce_by_key() as a function of two reduced vectors

cuda

thrust