CUDA: Sort an array according to the order defined by another array using Thrust

Question

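I have 10 arrays that I want to sort. Because their elements behave the same way, I would like to save computation by sorting only one of them and reordering the others according to the sorted array. I'm using Thrust. Is there an optimal way to do this? Thanks in advance.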

Answer 1

A few ways to do this (regardless of Thrust):

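Option 1:
  1.1. Initialize an array of indices, indices, to 0, 1, 2, 3, ... etc.
  1.2. Sort indices, with a comparison function that accesses the elements of one of the arrays (the one for which comparisons are cheapest) and compares those. The result is the sorted index array used in the next step.
  1.3. For each of the 10 arrays arr, apply a Gather operation using the sorted indices, with arr as the data to gather from, i.e. sorted_arr[i] = arr[indices[i]] for all i.

Option 2:
  2.1. Adapt one of the sort implementations to also do "index log-keeping", i.e. whenever you swap or position data in the "real" array, also set an index in an index array.
  2.2. Run the index-keeping sort on one of the 10 arrays (the one that is cheapest to sort).
  2.3. Apply the Gather from 1.3 to the other 9 arrays.

Option 3: let cheap be the array that is cheapest to sort (or whose elements are cheapest to compare).
  3.1. Create an array of pairs of the appropriate type, pairs[i] = { i, cheap[i] }.
  3.2. Make the comparison of these pairs use only the second element of each pair.
  3.3. Sort pairs.
  3.4. Project pairs onto its first element: indices[i] = pairs[i].first
  3.5. Project pairs onto its second element: sorted_cheap[i] = pairs[i].second
  3.6. Apply the Gather from 1.3 to the other nine arrays.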

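The second option should be the fastest, but it takes more effort, and with Thrust it could be quite difficult. The first or the third should be the easiest; Thrust accepts custom comparators, right? If not, you may have to define a wrapper class with the appropriate comparator.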

Answer 2

Based on the comments, my suggestion would be:

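Use thrust::sort_by_key on the first data set (array), passing the first data set as the keys and an index sequence (0, 1, 2, ...) as the values. Then use the rearranged index sequence in a Thrust gather or scatter operation to rearrange the remaining arrays.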

As requested, here is a worked example:

$ cat t282.cu
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <iostream>
#include <iterator>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>

const size_t ds = 5;
typedef float ft;

int main(){
  ft a1[ds] = {0.0f, -3.0f, 4.0f, 2.0f, 1.0f};
// data setup
  thrust::device_vector<ft> d_a1(a1, a1+ds);
  thrust::device_vector<ft> d_a2(ds);
  thrust::device_vector<ft> d_a3(ds);
  thrust::device_vector<ft> d_a2r(ds);
  thrust::device_vector<ft> d_a3r(ds);
  thrust::device_vector<size_t> d_i(ds);
  thrust::sequence(d_i.begin(), d_i.end());
  thrust::sequence(d_a2.begin(), d_a2.end());
  thrust::sequence(d_a3.begin(), d_a3.end());
// sort
  thrust::sort_by_key(d_a1.begin(), d_a1.end(), d_i.begin());
// copy, using sorted indices
  thrust::copy_n(
    thrust::make_permutation_iterator(
      thrust::make_zip_iterator(thrust::make_tuple(d_a2.begin(), d_a3.begin())),
      d_i.begin()),
    ds,
    thrust::make_zip_iterator(thrust::make_tuple(d_a2r.begin(), d_a3r.begin())));
// output results
  thrust::host_vector<ft> h_a1 = d_a1;
  thrust::host_vector<ft> h_a2 = d_a2r;
  thrust::host_vector<ft> h_a3 = d_a3r;
  std::cout << "a1: " ;
  thrust::copy_n(h_a1.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl << "a2: " ;
  thrust::copy_n(h_a2.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl << "a3: " ;
  thrust::copy_n(h_a3.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t282 t282.cu
$ cuda-memcheck ./t282
========= CUDA-MEMCHECK
a1: -3,0,1,2,4,
a2: 1,0,4,3,2,
a3: 1,0,4,3,2,
========= ERROR SUMMARY: 0 errors
$

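Here, instead of a thrust::gather or thrust::scatter operation, I am simply using thrust::copy_n with a thrust::permutation_iterator to do the reordering. I combine the remaining arrays to be reordered with a thrust::zip_iterator, but this is not the only way to do it.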

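Note that I am doing this for 3 arrays rather than 10, but it should illustrate the method. Extending it to 10 arrays should be purely mechanical. Note, however, that for more than 10-11 arrays the method would have to be modified somewhat, because thrust::tuple is limited to 10 items. As a modification, you could simply call thrust::copy_n in a loop, once for each array to be reordered, instead of using zip_iterator. I don't think this should make a large difference in efficiency.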
