使用 Thrust 按键组合两个列表
Combining two lists by key using Thrust
给定两个键值列表,我试图通过匹配键并在键匹配时将函数应用于两个值来组合两侧。在我的例子中,我想乘以这些值。一个小例子让它更清楚:
Left keys: { 1, 2, 4, 5, 6 }
Left values: { 3, 4, 1, 2, 1 }
Right keys: { 1, 3, 4, 5, 6, 7 };
Right values: { 2, 1, 1, 4, 1, 2 };
Expected output keys: { 1, 4, 5, 6 }
Expected output values: { 6, 1, 8, 1 }
我已经能够使用 C++ 在 CPU 上使用下一个代码实现此功能:
int main() {
int leftKeys[5] = { 1, 2, 4, 5, 6 };
int leftValues[5] = { 3, 4, 1, 2, 1 };
int rightKeys[6] = { 1, 3, 4, 5, 6, 7 };
int rightValues[6] = { 2, 1, 1, 4, 1, 2 };
int leftIndex = 0, rightIndex = 0;
std::vector<std::tuple<int, int>> result;
while (leftIndex < 5 && rightIndex < 6) {
if (leftKeys[leftIndex] < rightKeys[rightIndex]) {
leftIndex++;
}
if (leftKeys[leftIndex] > rightKeys[rightIndex]) {
rightIndex++;
}
result.push_back(std::make_tuple(leftKeys[leftIndex], leftValues[leftIndex] * rightValues[rightIndex]));
leftIndex++;
rightIndex++;
}
// Print results
for (int i = 0; i < result.size(); i++) {
std::cout << "Key: " << std::get<0>(result[i]) << "; Value: " << std::get<1>(result[i]) << "\n";
}
}
但是,我在 Thrust 的 device_vector
中有输入键和值,我也需要 GPU 上的结果。因此,如果我不需要将所有输入复制到主机并将所有输出复制回设备,效率会更高。
问题是我找不到可用于使用一组键组合两个列表(并将函数应用于两个值)的 Thrust 函数。是否存在这样的功能,或者是否有一种简单的方法可以自己实现,我应该只在主机上执行吗?
更新:
可以对输入做出以下假设:
- 键总是排序的。
- 单个列表中不存在重复键(列表之间当然存在重复键,否则结果为空)。
更新二:
在@Robert 的回答中实施第二种方法时,我陷入了转型。到目前为止我的代码如下:
struct multiply_transformation : public thrust::binary_function<std::tuple<int, int>, std::tuple<int, int>, std::tuple<int, int>>
{
__host__ __device__
thrust::tuple<int, int> operator()(thrust::tuple<int, int> d_left, thrust::tuple<int, int> d_right)
{
if (thrust::get<0>(d_left) == thrust::get<0>(d_right)) {
return thrust::make_tuple(thrust::get<0>(d_left), thrust::get<1>(d_left) * thrust::get<1>(d_right));
}
return thrust::make_tuple(-1, -1);
}
};
thrust::device_vector<int> d_mergedKeys(h_leftCount + h_rightCount);
thrust::device_vector<int> d_mergedValues(h_leftCount + h_rightCount);
thrust::merge_by_key(d_leftCountKeys.begin(), d_leftCountKeys.begin() + h_leftCount,
d_rightCountKeys.begin(), d_rightCountKeys.begin() + h_rightCount,
d_leftCounts.begin(), d_rightCounts.begin(), d_mergedKeys.begin(), d_mergedValues.begin());
typedef thrust::tuple<int, int> IntTuple;
thrust::zip_iterator<IntTuple> d_zippedCounts(thrust::make_tuple(d_mergedKeys.begin(), d_mergedValues.begin()));
thrust::zip_iterator<IntTuple> d_zippedCountsOffset(d_zippedCounts + 1);
multiply_transformation transformOperator;
thrust::device_vector<IntTuple> d_transformedResult(h_leftCount + h_rightCount);
thrust::transform(d_zippedCounts, d_zippedCounts + h_leftCount + h_rightCount - 1, d_zippedCountsOffset, d_transformedResult.begin(), transformOperator);
但是,我收到错误提示,没有重载函数 thrust::transform
与参数列表匹配。上面代码中h_leftCount
和h_rightCount
是左右输入的大小。 d_leftCountKeys
、d_rightCountKeys
、d_leftCounts
和 d_rightCounts
是 thrust::device_vector<int>
。
好吧,我不确定这是最好的方法(@m.s。通常会提出比我更好的方法),但一种可能的方法是(方法 1):
- set_intersection_by_key(左,右)
- set_intersection_by_key(右,左)
- 获取步骤 1 和步骤 2 的输出值,并对它们执行 transform 以将值结果相乘(或者您希望对相应值结果执行的任何其他数学运算步骤 1 和步骤 2)。
我不知道您的 thrust 技能水平如何,但如果需要,我可以提供上述内容的简单示例。
另一种可能的方法(方法 2):
- merge_by_key两个列表一起
- 使用步骤 1 的结果列表的两个版本执行转换:第一个由 [first, last-1) 组成,第二个由 [first+1, last) 组成。这将需要一个特殊的仿函数,它采用键和值的压缩版本,并比较所显示的两个键。如果匹配,则对两个呈现值输出所需的数学运算;如果不匹配,则输出某种标记或已知的非法值。 (如果无法构造这样的非法值,我们可以根据需要压缩第三个标记数组。)
- 对步骤 2 的输出执行 remove_if,将结果压缩为所需结果,删除所有非法值条目,或者删除标记数组指示的所有值条目.
我的感觉是第二种方法可能会更快,但我没有仔细考虑过。无论如何,最好对测试用例进行基准测试而不是(我的)直觉。
根据下面的评论,这里描述了从方法 2 的第二步开始发生的情况,使用您的示例数据集:
步骤 1(merge_by_key 操作)的输出如下所示:
keys: { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };
让我们构造两个版本,第一个是"the item",第二个是"the next item to the right":
keys1: { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6 };
values1: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1 };
keys2: { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values2: { 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };
实际的 "construction" 是微不足道的。 keys1 就是 [keys.begin(), keys.end()-1), key2 就是 [keys.begin()+1, keys.end())。同样对于 values1 和 values2.
我们将 keys1 和 values1 压缩在一起,我们将 keys2 和 values2 压缩在一起。然后我们将这两个压缩实体传递给一个转换操作,该操作有一个特殊的函子,它将执行以下操作:
如果 keys1 == keys2,对 values1 和 values2 数量进行所需的数学运算,并在标记数组中放置一个 1。如果不是,则在标记数组中放置一个 0。此操作的输出将是:
keys: { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values: { 6, 4, 1, 1, 1, 8, 4, 1, 1, 2 };
marker: { 1, 0, 0, 1, 0, 1, 0, 1, 0, 0 };
现在将上面的 3 个向量压缩在一起,并将其传递给 remove_if。 remove_if 仿函数将指示删除任何标记 == 0 的项目,留下:
keys: { 1, 4, 5, 6 };
values: { 6, 1, 8, 1 };
marker: { 1, 1, 1, 1 };
这是一个完整的示例,演示了这两种方法:
$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#include <assert.h>
struct mark_mpy_func
{
template <typename T1, typename T2>
__host__ __device__
int operator()(T1 &z1, T2 &z2){
int res = 0;
if (thrust::get<0>(z1) == thrust::get<0>(z2)){
res = thrust::get<1>(z1) * thrust::get<1>(z2);
thrust::get<2>(z1) = 1;}
return res;
}
};
struct mtest_func
{
__host__ __device__
bool operator()(int t){
if (t == 0) return true;
return false;
}
};
int main(){
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
// method 1
thrust::device_vector<int> Lkeysvo(Lsize);
thrust::device_vector<int> Lvalsvo(Lsize);
thrust::device_vector<int> Rkeysvo(Rsize);
thrust::device_vector<int> Rvalsvo(Rsize);
size_t Lsizeo = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Lkeysvo.begin(), Lvalsvo.begin()).first - Lkeysvo.begin();
size_t Rsizeo = thrust::set_intersection_by_key(Rkeysv.begin(), Rkeysv.end(), Lkeysv.begin(), Lkeysv.end(), Rvalsv.begin(), Rkeysvo.begin(), Rvalsvo.begin()).first - Rkeysvo.begin();
assert(Lsizeo == Rsizeo);
thrust::device_vector<int> res1(Lsizeo);
thrust::transform(Lvalsvo.begin(), Lvalsvo.begin()+Lsizeo, Rvalsvo.begin(), res1.begin(), thrust::multiplies<int>());
std::cout << "Method 1 result:" << std::endl << "keys: ";
thrust::copy_n(Lkeysvo.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res1.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
// method 2
thrust::device_vector<int> Mkeysv(Lsize + Rsize);
thrust::device_vector<int> Mvalsv(Lsize + Rsize);
thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
thrust::device_vector<int> marker(Lsize + Rsize - 1);
thrust::device_vector<int> res2(Lsize + Rsize - 1);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin(), marker.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1, marker.end())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), marker.begin(), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
std::cout << "Method 2 result:" << std::endl << "keys: ";
thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 1 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$
如果可以接受在输出数据中使用标记值(比如 -1)来通知 remove_if 操作,则可以省去单独的标记数组。这是以这种方式工作的方法 2 的修改版本:
$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#define MARK_VAL -1
struct mark_mpy_func
{
template <typename T1, typename T2>
__host__ __device__
int operator()(T1 &z1, T2 &z2){
int res = MARK_VAL;
if (thrust::get<0>(z1) == thrust::get<0>(z2)){
res = thrust::get<1>(z1) * thrust::get<1>(z2);}
return res;
}
};
struct mtest_func
{
template <typename T>
__host__ __device__
bool operator()(T t){
if (thrust::get<1>(t) == MARK_VAL) return true;
return false;
}
};
int main(){
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
// method 2
thrust::device_vector<int> Mkeysv(Lsize + Rsize);
thrust::device_vector<int> Mvalsv(Lsize + Rsize);
thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
thrust::device_vector<int> res2(Lsize + Rsize - 1);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1)), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
std::cout << "Method 2 result:" << std::endl << "keys: ";
thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$
实际上,您可以使用 one thrust::set_intersection_by_key
调用来完成所有您想做的事情。
但是,需要满足一些先决条件:
首先,简单的:
您需要 zip Lvalsv
和 Rvalsv
到一个 thrust::zip_iterator
并将其作为值传递给 thrust::set_intersection_by_key
.
你已经可以运行这个:
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_values_left(min_size);
thrust::device_vector<int> result_values_right(min_size);
auto zipped_input_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));
auto zipped_output_values = thrust::make_zip_iterator(thrust::make_tuple(result_values_left.begin(), result_values_right.begin()));
auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_input_values, result_keys.begin(), zipped_output_values);
这将产生两个结果向量,您需要逐元素相乘以获得最终结果。
但是等等,如果您可以避免必须将这两个向量存储为结果,然后再次读取每个元素以将它们相乘,然后将最终结果存储在第三个向量中,那不是很好吗?
让我们开始吧。我改编的概念来自here。
transform_output_iterator
是一个迭代器,它是另一个 OutputIterator
的包装器。写入 transform_output_iterator
时,将 UnaryFunction
应用于要写入的值,然后将结果写入包装的 OutputIterator
.
这允许我们通过 Multiplier
函子传递 thrust::set_intersection_by_key
的结果,然后将其存储在单个 result_values
向量中的结果中。
下面的代码实现了这个想法:
#include <thrust/iterator/iterator_traits.h>
#include <thrust/iterator/iterator_facade.h>
#include <thrust/iterator/iterator_adaptor.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/set_operations.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <cstdint>
#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl;
}
template <typename OutputIterator, typename UnaryFunction>
class Proxy
{
UnaryFunction& fun;
OutputIterator& out;
public:
__host__ __device__
Proxy(UnaryFunction& fun, OutputIterator& out) : fun(fun), out(out) {}
template <typename T>
__host__ __device__
Proxy operator=(const T& x) const
{
*out = fun(x);
return *this;
}
};
// This iterator is a wrapper around another OutputIterator which
// applies a UnaryFunction before writing to the OutputIterator.
template <typename OutputIterator, typename UnaryFunction>
class transform_output_iterator : public thrust::iterator_adaptor<
transform_output_iterator<OutputIterator, UnaryFunction>
, OutputIterator
, thrust::use_default
, thrust::use_default
, thrust::use_default
, Proxy<const OutputIterator, const UnaryFunction> >
{
UnaryFunction fun;
public:
friend class thrust::iterator_core_access;
// shorthand for the name of the iterator_adaptor we're deriving from
typedef thrust::iterator_adaptor<
transform_output_iterator<OutputIterator, UnaryFunction>,
OutputIterator, thrust::use_default, thrust::use_default, thrust::use_default, Proxy<const OutputIterator, const UnaryFunction>
> super_t;
__host__ __device__
transform_output_iterator(OutputIterator out, UnaryFunction fun) : super_t(out), fun(fun)
{
}
private:
__host__ __device__
typename super_t::reference dereference() const
{
return Proxy<const OutputIterator, const UnaryFunction>(fun, this->base_reference());
}
};
struct Multiplier
{
template<typename Tuple>
__host__ __device__
auto operator()(Tuple t) const -> decltype(thrust::get<0>(t) * thrust::get<1>(t))
{
return thrust::get<0>(t) * thrust::get<1>(t);
}
};
template <typename OutputIterator, typename UnaryFunction>
transform_output_iterator<OutputIterator, UnaryFunction>
__host__ __device__
make_transform_output_iterator(OutputIterator out, UnaryFunction fun)
{
return transform_output_iterator<OutputIterator, UnaryFunction>(out, fun);
}
int main()
{
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_values(min_size);
auto zipped_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));
auto output_it = make_transform_output_iterator(result_values.begin(), Multiplier());
auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_values, result_keys.begin(), output_it);
std::size_t new_size = result_pair.first - result_keys.begin();
result_keys.resize(new_size);
result_values.resize(new_size);
PRINTER(result_keys);
PRINTER(result_values);
}
输出
$ nvcc -std=c++11 main.cu && ./a.out
result_keys: 1 4 5 6
result_values: 6 1 8 1
我认为需要两个交集,如第一个答案中所建议的。其他解决方案将不起作用,它们产生正确结果只是输入数据的巧合。例如,如果从左侧集合中删除第二个(键,值)对,则计算结果将有所不同,但它不应该这是代码:
$ cat inner_join.cu
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <iostream>
int main()
{
int _Lkeys[] = {1, 4, 5, 6};
int _Lvals[] = {3, 1, 2, 1};
int _Rkeys[] = {1, 3, 4, 5, 6, 7};
int _Rvals[] = {2, 1, 1, 4, 1, 2};
size_t Lsize = sizeof(_Lkeys) / sizeof(int);
size_t Rsize = sizeof(_Rkeys) / sizeof(int);
thrust::device_vector<int> Lkeys(_Lkeys, _Lkeys + Lsize);
thrust::device_vector<int> Lvals(_Lvals, _Lvals + Lsize);
thrust::device_vector<int> Rkeys(_Rkeys, _Rkeys + Rsize);
thrust::device_vector<int> Rvals(_Rvals, _Rvals + Rsize);
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_Rvals(min_size);
thrust::device_vector<int> result_Lvals(min_size);
// set intersection keys, and left set values
size_t intersection_size =
thrust::set_intersection_by_key(Lkeys.begin(), Lkeys.end(), Rkeys.begin(),
Rkeys.end(), Lvals.begin(),
result_keys.begin(), result_Lvals.begin())
.first -
result_keys.begin();
// set intersection keys, and right set values
thrust::set_intersection_by_key(Rkeys.begin(), Rkeys.end(), Lkeys.begin(),
Lkeys.end(), Rvals.begin(),
result_keys.begin(), result_Rvals.begin());
result_Lvals.resize(intersection_size);
result_keys.resize(intersection_size);
thrust::device_vector<int> result_values(intersection_size);
// join left and right intersection values
thrust::transform(result_Lvals.begin(), result_Lvals.end(),
result_Rvals.begin(), result_values.begin(),
thrust::multiplies<int>());
std::cout << "keys: ";
thrust::copy_n(result_keys.begin(), intersection_size,
std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(result_values.begin(), intersection_size,
std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
输出
$ nvcc inner_join.cu -run
keys: 1,4,5,6,
vals: 6,1,8,1,
给定两个键值列表,我试图通过匹配键并在键匹配时将函数应用于两个值来组合两侧。在我的例子中,我想乘以这些值。一个小例子让它更清楚:
Left keys: { 1, 2, 4, 5, 6 }
Left values: { 3, 4, 1, 2, 1 }
Right keys: { 1, 3, 4, 5, 6, 7 };
Right values: { 2, 1, 1, 4, 1, 2 };
Expected output keys: { 1, 4, 5, 6 }
Expected output values: { 6, 1, 8, 1 }
我已经能够使用 C++ 在 CPU 上使用下一个代码实现此功能:
int main() {
int leftKeys[5] = { 1, 2, 4, 5, 6 };
int leftValues[5] = { 3, 4, 1, 2, 1 };
int rightKeys[6] = { 1, 3, 4, 5, 6, 7 };
int rightValues[6] = { 2, 1, 1, 4, 1, 2 };
int leftIndex = 0, rightIndex = 0;
std::vector<std::tuple<int, int>> result;
while (leftIndex < 5 && rightIndex < 6) {
if (leftKeys[leftIndex] < rightKeys[rightIndex]) {
leftIndex++;
}
if (leftKeys[leftIndex] > rightKeys[rightIndex]) {
rightIndex++;
}
result.push_back(std::make_tuple(leftKeys[leftIndex], leftValues[leftIndex] * rightValues[rightIndex]));
leftIndex++;
rightIndex++;
}
// Print results
for (int i = 0; i < result.size(); i++) {
std::cout << "Key: " << std::get<0>(result[i]) << "; Value: " << std::get<1>(result[i]) << "\n";
}
}
但是,我在 Thrust 的 device_vector
中有输入键和值,我也需要 GPU 上的结果。因此,如果我不需要将所有输入复制到主机并将所有输出复制回设备,效率会更高。
问题是我找不到可用于使用一组键组合两个列表(并将函数应用于两个值)的 Thrust 函数。是否存在这样的功能,或者是否有一种简单的方法可以自己实现,我应该只在主机上执行吗?
更新:
可以对输入做出以下假设:
- 键总是排序的。
- 单个列表中不存在重复键(列表之间当然存在重复键,否则结果为空)。
更新二:
在@Robert 的回答中实施第二种方法时,我陷入了转型。到目前为止我的代码如下:
struct multiply_transformation : public thrust::binary_function<std::tuple<int, int>, std::tuple<int, int>, std::tuple<int, int>>
{
__host__ __device__
thrust::tuple<int, int> operator()(thrust::tuple<int, int> d_left, thrust::tuple<int, int> d_right)
{
if (thrust::get<0>(d_left) == thrust::get<0>(d_right)) {
return thrust::make_tuple(thrust::get<0>(d_left), thrust::get<1>(d_left) * thrust::get<1>(d_right));
}
return thrust::make_tuple(-1, -1);
}
};
thrust::device_vector<int> d_mergedKeys(h_leftCount + h_rightCount);
thrust::device_vector<int> d_mergedValues(h_leftCount + h_rightCount);
thrust::merge_by_key(d_leftCountKeys.begin(), d_leftCountKeys.begin() + h_leftCount,
d_rightCountKeys.begin(), d_rightCountKeys.begin() + h_rightCount,
d_leftCounts.begin(), d_rightCounts.begin(), d_mergedKeys.begin(), d_mergedValues.begin());
typedef thrust::tuple<int, int> IntTuple;
thrust::zip_iterator<IntTuple> d_zippedCounts(thrust::make_tuple(d_mergedKeys.begin(), d_mergedValues.begin()));
thrust::zip_iterator<IntTuple> d_zippedCountsOffset(d_zippedCounts + 1);
multiply_transformation transformOperator;
thrust::device_vector<IntTuple> d_transformedResult(h_leftCount + h_rightCount);
thrust::transform(d_zippedCounts, d_zippedCounts + h_leftCount + h_rightCount - 1, d_zippedCountsOffset, d_transformedResult.begin(), transformOperator);
但是,我收到错误提示,没有重载函数 thrust::transform
与参数列表匹配。上面代码中h_leftCount
和h_rightCount
是左右输入的大小。 d_leftCountKeys
、d_rightCountKeys
、d_leftCounts
和 d_rightCounts
是 thrust::device_vector<int>
。
好吧,我不确定这是最好的方法(@m.s。通常会提出比我更好的方法),但一种可能的方法是(方法 1):
- set_intersection_by_key(左,右)
- set_intersection_by_key(右,左)
- 获取步骤 1 和步骤 2 的输出值,并对它们执行 transform 以将值结果相乘(或者您希望对相应值结果执行的任何其他数学运算步骤 1 和步骤 2)。
我不知道您的 thrust 技能水平如何,但如果需要,我可以提供上述内容的简单示例。
另一种可能的方法(方法 2):
- merge_by_key两个列表一起
- 使用步骤 1 的结果列表的两个版本执行转换:第一个由 [first, last-1) 组成,第二个由 [first+1, last) 组成。这将需要一个特殊的仿函数,它采用键和值的压缩版本,并比较所显示的两个键。如果匹配,则对两个呈现值输出所需的数学运算;如果不匹配,则输出某种标记或已知的非法值。 (如果无法构造这样的非法值,我们可以根据需要压缩第三个标记数组。)
- 对步骤 2 的输出执行 remove_if,将结果压缩为所需结果,删除所有非法值条目,或者删除标记数组指示的所有值条目.
我的感觉是第二种方法可能会更快,但我没有仔细考虑过。无论如何,最好对测试用例进行基准测试而不是(我的)直觉。
根据下面的评论,这里描述了从方法 2 的第二步开始发生的情况,使用您的示例数据集:
步骤 1(merge_by_key 操作)的输出如下所示:
keys: { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };
让我们构造两个版本,第一个是"the item",第二个是"the next item to the right":
keys1: { 1, 1, 2, 3, 4, 4, 5, 5, 6, 6 };
values1: { 3, 2, 4, 1, 1, 1, 2, 4, 1, 1 };
keys2: { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values2: { 2, 4, 1, 1, 1, 2, 4, 1, 1, 2 };
实际的 "construction" 是微不足道的。 keys1 就是 [keys.begin(), keys.end()-1), key2 就是 [keys.begin()+1, keys.end())。同样对于 values1 和 values2.
我们将 keys1 和 values1 压缩在一起,我们将 keys2 和 values2 压缩在一起。然后我们将这两个压缩实体传递给一个转换操作,该操作有一个特殊的函子,它将执行以下操作:
如果 keys1 == keys2,对 values1 和 values2 数量进行所需的数学运算,并在标记数组中放置一个 1。如果不是,则在标记数组中放置一个 0。此操作的输出将是:
keys: { 1, 2, 3, 4, 4, 5, 5, 6, 6, 7 };
values: { 6, 4, 1, 1, 1, 8, 4, 1, 1, 2 };
marker: { 1, 0, 0, 1, 0, 1, 0, 1, 0, 0 };
现在将上面的 3 个向量压缩在一起,并将其传递给 remove_if。 remove_if 仿函数将指示删除任何标记 == 0 的项目,留下:
keys: { 1, 4, 5, 6 };
values: { 6, 1, 8, 1 };
marker: { 1, 1, 1, 1 };
这是一个完整的示例,演示了这两种方法:
$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#include <assert.h>
struct mark_mpy_func
{
template <typename T1, typename T2>
__host__ __device__
int operator()(T1 &z1, T2 &z2){
int res = 0;
if (thrust::get<0>(z1) == thrust::get<0>(z2)){
res = thrust::get<1>(z1) * thrust::get<1>(z2);
thrust::get<2>(z1) = 1;}
return res;
}
};
struct mtest_func
{
__host__ __device__
bool operator()(int t){
if (t == 0) return true;
return false;
}
};
int main(){
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
// method 1
thrust::device_vector<int> Lkeysvo(Lsize);
thrust::device_vector<int> Lvalsvo(Lsize);
thrust::device_vector<int> Rkeysvo(Rsize);
thrust::device_vector<int> Rvalsvo(Rsize);
size_t Lsizeo = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Lkeysvo.begin(), Lvalsvo.begin()).first - Lkeysvo.begin();
size_t Rsizeo = thrust::set_intersection_by_key(Rkeysv.begin(), Rkeysv.end(), Lkeysv.begin(), Lkeysv.end(), Rvalsv.begin(), Rkeysvo.begin(), Rvalsvo.begin()).first - Rkeysvo.begin();
assert(Lsizeo == Rsizeo);
thrust::device_vector<int> res1(Lsizeo);
thrust::transform(Lvalsvo.begin(), Lvalsvo.begin()+Lsizeo, Rvalsvo.begin(), res1.begin(), thrust::multiplies<int>());
std::cout << "Method 1 result:" << std::endl << "keys: ";
thrust::copy_n(Lkeysvo.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res1.begin(), Lsizeo, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
// method 2
thrust::device_vector<int> Mkeysv(Lsize + Rsize);
thrust::device_vector<int> Mvalsv(Lsize + Rsize);
thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
thrust::device_vector<int> marker(Lsize + Rsize - 1);
thrust::device_vector<int> res2(Lsize + Rsize - 1);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin(), marker.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1, marker.end())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), marker.begin(), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
std::cout << "Method 2 result:" << std::endl << "keys: ";
thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 1 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$
如果可以接受在输出数据中使用标记值(比如 -1)来通知 remove_if 操作,则可以省去单独的标记数组。这是以这种方式工作的方法 2 的修改版本:
$ cat t1007.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#define MARK_VAL -1
struct mark_mpy_func
{
template <typename T1, typename T2>
__host__ __device__
int operator()(T1 &z1, T2 &z2){
int res = MARK_VAL;
if (thrust::get<0>(z1) == thrust::get<0>(z2)){
res = thrust::get<1>(z1) * thrust::get<1>(z2);}
return res;
}
};
struct mtest_func
{
template <typename T>
__host__ __device__
bool operator()(T t){
if (thrust::get<1>(t) == MARK_VAL) return true;
return false;
}
};
int main(){
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
// method 2
thrust::device_vector<int> Mkeysv(Lsize + Rsize);
thrust::device_vector<int> Mvalsv(Lsize + Rsize);
thrust::merge_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), Lvalsv.begin(), Rvalsv.begin(), Mkeysv.begin(), Mvalsv.begin());
thrust::device_vector<int> res2(Lsize + Rsize - 1);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), Mvalsv.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, Mvalsv.end()-1)), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin()+1, Mvalsv.begin()+1)), res2.begin(), mark_mpy_func());
size_t rsize2 = thrust::remove_if(thrust::make_zip_iterator(thrust::make_tuple( Mkeysv.begin(), res2.begin())), thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.end()-1, res2.end())), mtest_func()) - thrust::make_zip_iterator(thrust::make_tuple(Mkeysv.begin(), res2.begin()));
std::cout << "Method 2 result:" << std::endl << "keys: ";
thrust::copy_n(Mkeysv.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(res2.begin(), rsize2, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1007 t1007.cu
$ ./t1007
Method 2 result:
keys: 1,4,5,6,
vals: 6,1,8,1,
$
实际上,您可以使用 one thrust::set_intersection_by_key
调用来完成所有您想做的事情。
但是,需要满足一些先决条件:
首先,简单的:
您需要 zip Lvalsv
和 Rvalsv
到一个 thrust::zip_iterator
并将其作为值传递给 thrust::set_intersection_by_key
.
你已经可以运行这个:
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_values_left(min_size);
thrust::device_vector<int> result_values_right(min_size);
auto zipped_input_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));
auto zipped_output_values = thrust::make_zip_iterator(thrust::make_tuple(result_values_left.begin(), result_values_right.begin()));
auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_input_values, result_keys.begin(), zipped_output_values);
这将产生两个结果向量,您需要逐元素相乘以获得最终结果。
但是等等,如果您可以避免必须将这两个向量存储为结果,然后再次读取每个元素以将它们相乘,然后将最终结果存储在第三个向量中,那不是很好吗?
让我们开始吧。我改编的概念来自here。
transform_output_iterator
是一个迭代器,它是另一个 OutputIterator
的包装器。写入 transform_output_iterator
时,将 UnaryFunction
应用于要写入的值,然后将结果写入包装的 OutputIterator
.
这允许我们通过 Multiplier
函子传递 thrust::set_intersection_by_key
的结果,然后将其存储在单个 result_values
向量中的结果中。
下面的代码实现了这个想法:
#include <thrust/iterator/iterator_traits.h>
#include <thrust/iterator/iterator_facade.h>
#include <thrust/iterator/iterator_adaptor.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/set_operations.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <cstdint>
#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl;
}
template <typename OutputIterator, typename UnaryFunction>
class Proxy
{
UnaryFunction& fun;
OutputIterator& out;
public:
__host__ __device__
Proxy(UnaryFunction& fun, OutputIterator& out) : fun(fun), out(out) {}
template <typename T>
__host__ __device__
Proxy operator=(const T& x) const
{
*out = fun(x);
return *this;
}
};
// This iterator is a wrapper around another OutputIterator which
// applies a UnaryFunction before writing to the OutputIterator.
template <typename OutputIterator, typename UnaryFunction>
class transform_output_iterator : public thrust::iterator_adaptor<
transform_output_iterator<OutputIterator, UnaryFunction>
, OutputIterator
, thrust::use_default
, thrust::use_default
, thrust::use_default
, Proxy<const OutputIterator, const UnaryFunction> >
{
UnaryFunction fun;
public:
friend class thrust::iterator_core_access;
// shorthand for the name of the iterator_adaptor we're deriving from
typedef thrust::iterator_adaptor<
transform_output_iterator<OutputIterator, UnaryFunction>,
OutputIterator, thrust::use_default, thrust::use_default, thrust::use_default, Proxy<const OutputIterator, const UnaryFunction>
> super_t;
__host__ __device__
transform_output_iterator(OutputIterator out, UnaryFunction fun) : super_t(out), fun(fun)
{
}
private:
__host__ __device__
typename super_t::reference dereference() const
{
return Proxy<const OutputIterator, const UnaryFunction>(fun, this->base_reference());
}
};
struct Multiplier
{
template<typename Tuple>
__host__ __device__
auto operator()(Tuple t) const -> decltype(thrust::get<0>(t) * thrust::get<1>(t))
{
return thrust::get<0>(t) * thrust::get<1>(t);
}
};
template <typename OutputIterator, typename UnaryFunction>
transform_output_iterator<OutputIterator, UnaryFunction>
__host__ __device__
make_transform_output_iterator(OutputIterator out, UnaryFunction fun)
{
return transform_output_iterator<OutputIterator, UnaryFunction>(out, fun);
}
int main()
{
int Lkeys[] = { 1, 2, 4, 5, 6 };
int Lvals[] = { 3, 4, 1, 2, 1 };
int Rkeys[] = { 1, 3, 4, 5, 6, 7 };
int Rvals[] = { 2, 1, 1, 4, 1, 2 };
size_t Lsize = sizeof(Lkeys)/sizeof(int);
size_t Rsize = sizeof(Rkeys)/sizeof(int);
thrust::device_vector<int> Lkeysv(Lkeys, Lkeys+Lsize);
thrust::device_vector<int> Lvalsv(Lvals, Lvals+Lsize);
thrust::device_vector<int> Rkeysv(Rkeys, Rkeys+Rsize);
thrust::device_vector<int> Rvalsv(Rvals, Rvals+Rsize);
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_values(min_size);
auto zipped_values = thrust::make_zip_iterator(thrust::make_tuple(Lvalsv.begin(), Rvalsv.begin()));
auto output_it = make_transform_output_iterator(result_values.begin(), Multiplier());
auto result_pair = thrust::set_intersection_by_key(Lkeysv.begin(), Lkeysv.end(), Rkeysv.begin(), Rkeysv.end(), zipped_values, result_keys.begin(), output_it);
std::size_t new_size = result_pair.first - result_keys.begin();
result_keys.resize(new_size);
result_values.resize(new_size);
PRINTER(result_keys);
PRINTER(result_values);
}
输出
$ nvcc -std=c++11 main.cu && ./a.out
result_keys: 1 4 5 6
result_values: 6 1 8 1
我认为需要两个交集,如第一个答案中所建议的。其他解决方案将不起作用,它们产生正确结果只是输入数据的巧合。例如,如果从左侧集合中删除第二个(键,值)对,则计算结果将有所不同,但它不应该这是代码:
$ cat inner_join.cu
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <iostream>
int main()
{
int _Lkeys[] = {1, 4, 5, 6};
int _Lvals[] = {3, 1, 2, 1};
int _Rkeys[] = {1, 3, 4, 5, 6, 7};
int _Rvals[] = {2, 1, 1, 4, 1, 2};
size_t Lsize = sizeof(_Lkeys) / sizeof(int);
size_t Rsize = sizeof(_Rkeys) / sizeof(int);
thrust::device_vector<int> Lkeys(_Lkeys, _Lkeys + Lsize);
thrust::device_vector<int> Lvals(_Lvals, _Lvals + Lsize);
thrust::device_vector<int> Rkeys(_Rkeys, _Rkeys + Rsize);
thrust::device_vector<int> Rvals(_Rvals, _Rvals + Rsize);
std::size_t min_size = std::min(Lsize, Rsize);
thrust::device_vector<int> result_keys(min_size);
thrust::device_vector<int> result_Rvals(min_size);
thrust::device_vector<int> result_Lvals(min_size);
// set intersection keys, and left set values
size_t intersection_size =
thrust::set_intersection_by_key(Lkeys.begin(), Lkeys.end(), Rkeys.begin(),
Rkeys.end(), Lvals.begin(),
result_keys.begin(), result_Lvals.begin())
.first -
result_keys.begin();
// set intersection keys, and right set values
thrust::set_intersection_by_key(Rkeys.begin(), Rkeys.end(), Lkeys.begin(),
Lkeys.end(), Rvals.begin(),
result_keys.begin(), result_Rvals.begin());
result_Lvals.resize(intersection_size);
result_keys.resize(intersection_size);
thrust::device_vector<int> result_values(intersection_size);
// join left and right intersection values
thrust::transform(result_Lvals.begin(), result_Lvals.end(),
result_Rvals.begin(), result_values.begin(),
thrust::multiplies<int>());
std::cout << "keys: ";
thrust::copy_n(result_keys.begin(), intersection_size,
std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "vals: ";
thrust::copy_n(result_values.begin(), intersection_size,
std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
输出
$ nvcc inner_join.cu -run
keys: 1,4,5,6,
vals: 6,1,8,1,