使用 Thrust 按键组合两个列表

Combining two lists by key using Thrust

给定两个键值列表,我试图通过匹配键并在键匹配时将函数应用于两个值来组合两侧。在我的例子中,我想乘以这些值。一个小例子让它更清楚:

Left keys:   { 1, 2, 4, 5, 6 }
Left values: { 3, 4, 1, 2, 1 }

Right keys:   { 1, 3, 4, 5, 6, 7 };
Right values: { 2, 1, 1, 4, 1, 2 };

Expected output keys:   { 1, 4, 5, 6 }
Expected output values: { 6, 1, 8, 1 }

我已经能够使用 C++ 在 CPU 上使用下一个代码实现此功能:

int main() {
    int leftKeys[5] =   { 1, 2, 4, 5, 6 };
    int leftValues[5] = { 3, 4, 1, 2, 1 };
    int rightKeys[6] =   { 1, 3, 4, 5, 6, 7 };
    int rightValues[6] = { 2, 1, 1, 4, 1, 2 };

    int leftIndex = 0, rightIndex = 0;
    std::vector<std::tuple<int, int>> result;
    while (leftIndex < 5 && rightIndex < 6) {
        if (leftKeys[leftIndex] < rightKeys[rightIndex]) {
            leftIndex++;
        }
        if (leftKeys[leftIndex] > rightKeys[rightIndex]) {
            rightIndex++;
        }
        result.push_back(std::make_tuple(leftKeys[leftIndex], leftValues[leftIndex] * rightValues[rightIndex]));
        leftIndex++;
        rightIndex++;
    }

    // Print results
    for (int i = 0; i < result.size(); i++) {
        std::cout << "Key: " << std::get<0>(result[i]) << "; Value: " << std::get<1>(result[i]) << "\n";
    }
}

但是,我在 Thrust 的 device_vector 中有输入键和值,我也需要 GPU 上的结果。因此,如果我不需要将所有输入复制到主机并将所有输出复制回设备,效率会更高。

问题是我找不到可用于使用一组键组合两个列表(并将函数应用于两个值)的 Thrust 函数。是否存在这样的功能,或者是否有一种简单的方法可以自己实现,我应该只在主机上执行吗?

更新:

可以对输入做出以下假设:

更新二:

在@Robert 的回答中实施第二种方法时,我陷入了转型。到目前为止我的代码如下:

struct multiply_transformation : public thrust::binary_function<std::tuple<int, int>, std::tuple<int, int>, std::tuple<int, int>>
{
    __host__ __device__
        thrust::tuple<int, int> operator()(thrust::tuple<int, int> d_left, thrust::tuple<int, int> d_right)
    {
        if (thrust::get<0>(d_left) == thrust::get<0>(d_right)) {
            return thrust::make_tuple(thrust::get<0>(d_left), thrust::get<1>(d_left) * thrust::get<1>(d_right));
        }
        return thrust::make_tuple(-1, -1);
    }
};


thrust::device_vector<int> d_mergedKeys(h_leftCount + h_rightCount);
thrust::device_vector<int> d_mergedValues(h_leftCount + h_rightCount);
thrust::merge_by_key(d_leftCountKeys.begin(), d_leftCountKeys.begin() + h_leftCount,
    d_rightCountKeys.begin(), d_rightCountKeys.begin() + h_rightCount,
    d_leftCounts.begin(), d_rightCounts.begin(), d_mergedKeys.begin(), d_mergedValues.begin());

typedef thrust::tuple<int, int> IntTuple;
thrust::zip_iterator<IntTuple> d_zippedCounts(thrust::make_tuple(d_mergedKeys.begin(), d_mergedValues.begin()));
thrust::zip_iterator<IntTuple> d_zippedCountsOffset(d_zippedCounts + 1);

multiply_transformation transformOperator;
thrust::device_vector<IntTuple> d_transformedResult(h_leftCount + h_rightCount);
thrust::transform(d_zippedCounts, d_zippedCounts + h_leftCount + h_rightCount - 1, d_zippedCountsOffset, d_transformedResult.begin(), transformOperator);

但是,我收到错误提示,没有重载函数 thrust::transform 与参数列表匹配。上面代码中h_leftCounth_rightCount是左右输入的大小。 d_leftCountKeysd_rightCountKeysd_leftCountsd_rightCountsthrust::device_vector<int>

好吧,我不确定这是最好的方法(@m.s。通常会提出比我更好的方法),但一种可能的方法是(方法 1):

  1. set_intersection_by_key(左,右)
  2. set_intersection_by_key(右,左)
  3. 获取步骤 1 和步骤 2 的输出值,并对它们执行 transform 以将值结果相乘(或者您希望对相应值结果执行的任何其他数学运算步骤 1 和步骤 2)。

我不知道您的 thrust 技能水平如何,但如果需要,我可以提供上述内容的简单示例。

另一种可能的方法(方法 2):

  1. merge_by_key两个列表一起
  2. 使用步骤 1 的结果列表的两个版本执行转换:第一个由 [first, last-1) 组成,第二个由 [first+1, last) 组成。这将需要一个特殊的仿函数,它采用键和值的压缩版本,并比较所显示的两个键。如果匹配,则对两个呈现值输出所需的数学运算;如果不匹配,则输出某种标记或已知的非法值。 (如果无法构造这样的非法值,我们可以根据需要压缩第三个标记数组。)
  3. 对步骤 2 的输出执行 remove_if,将结果压缩为所需结果,删除所有非法值条目,或者删除标记数组指示的所有值条目.

我的感觉是第二种方法可能会更快,但我没有仔细考虑过。无论如何,最好对测试用例进行基准测试而不是(我的)直觉。

根据下面的评论,这里描述了从方法 2 的第二步开始发生的情况,使用您的示例数据集:

步骤 1(merge_by_key 操作)的输出如下所示:

keys:   { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };

让我们构造两个版本,第一个是"the item",第二个是"the next item to the right":

keys1:   { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6 };
values1: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1 };

keys2:   { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values2: { 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };

实际的 "construction" 是微不足道的。 keys1 就是 [keys.begin(), keys.end()-1), key2 就是 [keys.begin()+1, keys.end())。同样对于 values1 和 values2.

我们将 keys1 和 values1 压缩在一起,我们将 keys2 和 values2 压缩在一起。然后我们将这两个压缩实体传递给一个转换操作,该操作有一个特殊的函子,它将执行以下操作:

如果 keys1 == keys2,对 values1 和 values2 数量进行所需的数学运算,并在标记数组中放置一个 1。如果不是,则在标记数组中放置一个 0。此操作的输出将是:

 keys:   { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
 values: { 6, 4, 1, 1, 1, 8, 4, 1, 1, 2 };
 marker: { 1, 0, 0, 1, 0, 1, 0, 1, 0, 0 };

现在将上面的 3 个向量压缩在一起,并将其传递给 remove_if。 remove_if 仿函数将指示删除任何标记 == 0 的项目,留下:

 keys:   { 1, 4, 5, 6 };
 values: { 6, 1, 8, 1 };
 marker: { 1, 1, 1, 1 };

这是一个完整的示例,演示了这两种方法:

$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#include <assert.h>

struct mark_mpy_func
{
  template <typename T1, typename T2>
  __host__ __device__
  int operator()(T1 &z1, T2 &z2){
    int res = 0;
    if (thrust::get<0>(z1) == thrust::get<0>(z2)){
      res = thrust::get<1>(z1) * thrust::get<1>(z2);
      thrust::get<2>(z1) = 1;}
    return res;
    }
};

struct mtest_func
{
  __host__ __device__
  bool operator()(int t){
    if (t == 0) return true;
    return false;
    }
};

int main(){

  int Lkeys[] =   { 1, 2, 4, 5, 6 };
  int Lvals[] =   { 3, 4, 1, 2, 1 };
  int Rkeys[] =   { 1, 3, 4, 5, 6, 7 };
  int Rvals[] =   { 2, 1, 1, 4, 1, 2 };

  size_t Lsize = sizeof(Lkeys)/sizeof(int);
  size_t Rsize = sizeof(Rkeys)/sizeof(int);

  thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
  thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
  thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
  thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);

  // method 1

  thrust::device_vector<int> Lkeysvo(Lsize);
  thrust::device_vector<int> Lvalsvo(Lsize);
  thrust::device_vector<int> Rkeysvo(Rsize);
  thrust::device_vector<int> Rvalsvo(Rsize);

  size_t Lsizeo = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Lkeysvo.begin(), Lvalsvo.begin()).first - Lkeysvo.begin();
  size_t Rsizeo = thrust::set_intersection_by_key(Rkeysv.begin(), Rkeysv.end(), Lkeysv.begin(), Lkeysv.end(), Rvalsv.begin(), Rkeysvo.begin(), Rvalsvo.begin()).first - Rkeysvo.begin();

  assert(Lsizeo == Rsizeo);
  thrust::device_vector<int> res1(Lsizeo);
  thrust::transform(Lvalsvo.begin(), Lvalsvo.begin()+Lsizeo, Rvalsvo.begin(), res1.begin(), thrust::multiplies<int>());

  std::cout << "Method 1 result:" << std::endl << "keys: ";
  thrust::copy_n(Lkeysvo.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "vals: ";
  thrust::copy_n(res1.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;

  // method 2

  thrust::device_vector<int> Mkeysv(Lsize + Rsize);
  thrust::device_vector<int> Mvalsv(Lsize + Rsize);

  thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
  thrust::device_vector<int> marker(Lsize + Rsize - 1);
  thrust::device_vector<int> res2(Lsize + Rsize - 1);
  thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin(), marker.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1, marker.end())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
  size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), marker.begin(), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
  std::cout << "Method 2 result:" << std::endl << "keys: ";
  thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "vals: ";
  thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;


  return 0;
}

$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 1 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$

如果可以接受在输出数据中使用标记值(比如 -1)来通知 remove_if 操作,则可以省去单独的标记数组。这是以这种方式工作的方法 2 的修改版本:

$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>

#define MARK_VAL -1

struct mark_mpy_func
{
  template <typename T1, typename T2>
  __host__ __device__
  int operator()(T1 &z1, T2 &z2){
    int res = MARK_VAL;
    if (thrust::get<0>(z1) == thrust::get<0>(z2)){
      res = thrust::get<1>(z1) * thrust::get<1>(z2);}
    return res;
    }
};

struct mtest_func
{
  template <typename T>
  __host__ __device__
  bool operator()(T t){
    if (thrust::get<1>(t) == MARK_VAL) return true;
    return false;
    }
};

int main(){

  int Lkeys[] =   { 1, 2, 4, 5, 6 };
  int Lvals[] =   { 3, 4, 1, 2, 1 };
  int Rkeys[] =   { 1, 3, 4, 5, 6, 7 };
  int Rvals[] =   { 2, 1, 1, 4, 1, 2 };

  size_t Lsize = sizeof(Lkeys)/sizeof(int);
  size_t Rsize = sizeof(Rkeys)/sizeof(int);

  thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
  thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
  thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
  thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);

  // method 2

  thrust::device_vector<int> Mkeysv(Lsize + Rsize);
  thrust::device_vector<int> Mvalsv(Lsize + Rsize);

  thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
  thrust::device_vector<int> res2(Lsize + Rsize - 1);
  thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1)), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
  size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
  std::cout << "Method 2 result:" << std::endl << "keys: ";
  thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "vals: ";
  thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;


  return 0;
}

$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$

实际上,您可以使用 one thrust::set_intersection_by_key 调用来完成所有您想做的事情。 但是,需要满足一些先决条件:

首先,简单的:

您需要 zip LvalsvRvalsv 到一个 thrust::zip_iterator 并将其作为值传递给 thrust::set_intersection_by_key .

你已经可以运行这个:

std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_values_left(min_size);
thrust::device_vector<int> result_values_right(min_size);

auto zipped_input_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));
auto zipped_output_values = thrust::make_zip_iterator(thrust::make_tuple(result_values_left.begin(), result_values_right.begin()));

auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_input_values, result_keys.begin(), zipped_output_values);

这将产生两个结果向量,您需要逐元素相乘以获得最终结果。

但是等等,如果您可以避免必须将这两个向量存储为结果,然后再次读取每个元素以将它们相乘,然后将最终结果存储在第三个向量中,那不是很好吗?

让我们开始吧。我改编的概念来自heretransform_output_iterator 是一个迭代器,它是另一个 OutputIterator 的包装器。写入 transform_output_iterator 时,将 UnaryFunction 应用于要写入的值,然后将结果写入包装的 OutputIterator.

这允许我们通过 Multiplier 函子传递 thrust::set_intersection_by_key 的结果,然后将其存储在单个 result_values 向量中的结果中。

下面的代码实现了这个想法:

#include <thrust/iterator/iterator_traits.h>
#include <thrust/iterator/iterator_facade.h>
#include <thrust/iterator/iterator_adaptor.h>

#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/set_operations.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>  
#include <iostream>
#include <cstdint>

#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl;
}


template <typename OutputIterator, typename UnaryFunction>
class Proxy
{
    UnaryFunction& fun;
    OutputIterator& out;

public:
    __host__ __device__
    Proxy(UnaryFunction& fun, OutputIterator& out) : fun(fun), out(out) {}

    template <typename T>
    __host__ __device__
    Proxy operator=(const T& x) const
    {
        *out = fun(x);
        return *this;
    }
};


// This iterator is a wrapper around another OutputIterator which
// applies a UnaryFunction before writing to the OutputIterator.
template <typename OutputIterator, typename UnaryFunction>
class transform_output_iterator : public thrust::iterator_adaptor<
                                    transform_output_iterator<OutputIterator, UnaryFunction>
                                                                      , OutputIterator
                                      , thrust::use_default
                                      , thrust::use_default
                                      , thrust::use_default
                                                                      , Proxy<const OutputIterator, const UnaryFunction> >
{
    UnaryFunction fun;

public:

    friend class thrust::iterator_core_access;

    // shorthand for the name of the iterator_adaptor we're deriving from
    typedef thrust::iterator_adaptor<
      transform_output_iterator<OutputIterator, UnaryFunction>,
      OutputIterator, thrust::use_default, thrust::use_default, thrust::use_default, Proxy<const OutputIterator, const UnaryFunction>
    > super_t;

    __host__ __device__
    transform_output_iterator(OutputIterator out, UnaryFunction fun) : super_t(out), fun(fun)
    {
    }


private:
    __host__ __device__
    typename super_t::reference dereference() const
    {
        return Proxy<const OutputIterator, const UnaryFunction>(fun, this->base_reference());
    }
};


struct Multiplier
{
    template<typename Tuple>
    __host__ __device__
    auto operator()(Tuple t) const -> decltype(thrust::get<0>(t) * thrust::get<1>(t))
    {
        return thrust::get<0>(t) * thrust::get<1>(t);
    }
};


template <typename OutputIterator, typename UnaryFunction>
transform_output_iterator<OutputIterator, UnaryFunction>
__host__ __device__
make_transform_output_iterator(OutputIterator out, UnaryFunction fun)
{
    return transform_output_iterator<OutputIterator, UnaryFunction>(out, fun);
}

int main()
{
  int Lkeys[] =   { 1, 2, 4, 5, 6 };
  int Lvals[] =   { 3, 4, 1, 2, 1 };
  int Rkeys[] =   { 1, 3, 4, 5, 6, 7 };
  int Rvals[] =   { 2, 1, 1, 4, 1, 2 };

  size_t Lsize = sizeof(Lkeys)/sizeof(int);
  size_t Rsize = sizeof(Rkeys)/sizeof(int);

  thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
  thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
  thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
  thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);

  std::size_t min_size = std::min(Lsize, Rsize);

  thrust::device_vector<int> result_keys(min_size);
  thrust::device_vector<int> result_values(min_size);

  auto zipped_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));

  auto output_it = make_transform_output_iterator(result_values.begin(), Multiplier());

  auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_values, result_keys.begin(), output_it);

  std::size_t new_size = result_pair.first - result_keys.begin();

  result_keys.resize(new_size);
  result_values.resize(new_size);
  PRINTER(result_keys);
  PRINTER(result_values);
}

输出

$ nvcc -std=c++11 main.cu && ./a.out

result_keys:    1   4   5   6   
result_values:  6   1   8   1   

我认为需要两个交集,如第一个答案中所建议的。其他解决方案将不起作用,它们产生正确结果只是输入数据的巧合。例如,如果从左侧集合中删除第二个(键,值)对,则计算结果将有所不同,但它不应该这是代码:

$ cat inner_join.cu
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <iostream>

int main()
{
  int _Lkeys[] = {1, 4, 5, 6};
  int _Lvals[] = {3, 1, 2, 1};
  int _Rkeys[] = {1, 3, 4, 5, 6, 7};
  int _Rvals[] = {2, 1, 1, 4, 1, 2};

  size_t Lsize = sizeof(_Lkeys) / sizeof(int);
  size_t Rsize = sizeof(_Rkeys) / sizeof(int);

  thrust::device_vector<int> Lkeys(_Lkeys, _Lkeys + Lsize);
  thrust::device_vector<int> Lvals(_Lvals, _Lvals + Lsize);
  thrust::device_vector<int> Rkeys(_Rkeys, _Rkeys + Rsize);
  thrust::device_vector<int> Rvals(_Rvals, _Rvals + Rsize);

  std::size_t min_size = std::min(Lsize, Rsize);

  thrust::device_vector<int> result_keys(min_size);
  thrust::device_vector<int> result_Rvals(min_size);
  thrust::device_vector<int> result_Lvals(min_size);

  // set intersection keys, and  left set values
  size_t intersection_size =
      thrust::set_intersection_by_key(Lkeys.begin(), Lkeys.end(), Rkeys.begin(),
                                      Rkeys.end(), Lvals.begin(),
                                      result_keys.begin(), result_Lvals.begin())
          .first -
      result_keys.begin();

  // set intersection keys, and  right set values
  thrust::set_intersection_by_key(Rkeys.begin(), Rkeys.end(), Lkeys.begin(),
                                  Lkeys.end(), Rvals.begin(),
                                  result_keys.begin(), result_Rvals.begin());

  result_Lvals.resize(intersection_size);
  result_keys.resize(intersection_size);

  thrust::device_vector<int> result_values(intersection_size);

  // join left and right intersection values
  thrust::transform(result_Lvals.begin(), result_Lvals.end(),
                    result_Rvals.begin(), result_values.begin(),
                    thrust::multiplies<int>());

  std::cout << "keys: ";
  thrust::copy_n(result_keys.begin(), intersection_size,
                 std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "vals: ";
  thrust::copy_n(result_values.begin(), intersection_size,
                 std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}

输出

$ nvcc inner_join.cu -run
keys: 1,4,5,6,
vals: 6,1,8,1,