使用 CUDA/Thrust 置换矩阵（和相应向量）的行

Question

我想置换存储为交错数组（即由行优先 C 样式格式的向量支持）的矩阵的行，并将相同的置换应用于相应向量的元素。

假设矩阵维度为RxC，对应的向量有R个元素。

我目前的想法是生成 R 索引的排列，然后使用 thrust::stable_sort_by_key 排列向量，如图所示 here。

然后我可以创建另一个置换向量，它重复我之前创建的向量的每个元素 C 次。

因此，如果 R = 4，C = 3 并且原始置换索引向量为 [4, 2, 3, 1]，则矩阵的置换向量将为 [4, 4, 4, 2, 2, 2 , 3, 3, 3, 1, 1, 1]。通过使用稳定排序，矩阵的一行中的元素不应该被置换。

我的问题是，是否有 better/more 有效的方法，使用 Thrust 或普通 CUDA。

示例：

原始矩阵：

[ 1 1 1 1 ]
[ 2 2 2 2 ]
[ 3 3 3 3 ]
[ 4 4 4 4 ]
[ 5 5 5 5 ]

原始矢量：

[1 2 3 4 5]

排列顺序：

[5 3 1 2 4]

置换矩阵：

[ 5 5 5 5 ]
[ 3 3 3 3 ]
[ 1 1 1 1 ]
[ 2 2 2 2 ]
[ 4 4 4 4 ]

置换向量：

[5 3 1 2 4]

我的用例是每个示例都有一个特征矩阵和一个对应标签的向量。我想置换矩阵并对向量应用相同的置换，作为 SGD 迭代之前的洗牌步骤。我想要连续的行并遍历它们的原因是我计划使用 cuBLAS gemv 来执行矩阵向量操作，它假设矩阵在内存中以类似的方式布局（尽管以列为主格式意味着我需要像 this)

这样称呼它

Answer 1

My question is then if there is a better/more efficient way to do this, using Thrust

我相信有。置换向量为您提供了直接将输入矩阵的内容复制到置换矩阵所需的所有信息，无需排序。

一个有用的 thrust 功能是 permutation_iterator。排列迭代器允许我们即时重新排序我们的 selection 输入元素以用于任何操作。如果我们提供适当的索引计算函子，我们可以将线性索引（通过 counting_iterator）传递给索引函子，以创建（通过 transform_iterator）适当的排列输入索引中的任何元素复制操作。

这是一个有效的例子：

$ cat t1061.cu
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <assert.h>

typedef int mytype;

struct copy_idx_func : public thrust::unary_function<unsigned, unsigned>
{
  size_t c;
  unsigned *p;
  copy_idx_func(const size_t _c, unsigned *_p) : c(_c),p(_p) {};
  __host__ __device__
  unsigned operator()(unsigned idx){
    unsigned myrow = idx/c;
    unsigned newrow = p[myrow]-1;
    unsigned mycol = idx%c;
    return newrow*c+mycol;
  }
};


int main(){

  const mytype mat[]   = {1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5};
  const mytype vec[]   = {1,2,3,4,5};
  const unsigned per[] = {5,3,1,2,4};

  const size_t msize = sizeof(mat)/sizeof(mytype);
  const size_t vsize = sizeof(vec)/sizeof(mytype);
  const size_t psize = sizeof(per)/sizeof(unsigned);
  const size_t cols  = msize/vsize;
  // const size_t rows  = vsize;
  assert(msize%vsize == 0);
  assert(vsize == psize);

  thrust::device_vector<mytype>   d_m(mat, mat+msize);
  thrust::device_vector<mytype>   d_v(vec, vec+vsize);
  thrust::device_vector<unsigned> d_p(per, per+psize);
  thrust::device_vector<mytype>   d_rm(msize);
  thrust::device_vector<mytype>   d_rv(vsize);
  std::cout << "Initial Matrix:" << std::endl;
  thrust::copy_n(d_m.begin(), msize, std::ostream_iterator<mytype>(std::cout, ","));

  // permute the matrix
  thrust::copy_n(thrust::make_permutation_iterator(d_m.begin(), thrust::make_transform_iterator(thrust::counting_iterator<unsigned>(0), copy_idx_func(cols,thrust::raw_pointer_cast(d_p.data())))), msize, d_rm.begin());

  std::cout << std::endl << "Permuted Matrix:" << std::endl;
  thrust::copy_n(d_rm.begin(), msize, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl << "Initial Vector:" << std::endl;
  thrust::copy_n(d_v.begin(), vsize, std::ostream_iterator<mytype>(std::cout, ","));

  // permute the vector
  thrust::copy_n(thrust::make_permutation_iterator(d_v.begin(), thrust::make_transform_iterator(thrust::counting_iterator<unsigned>(0),  copy_idx_func(1,thrust::raw_pointer_cast(d_p.data())))), vsize, d_rv.begin());

  std::cout << std::endl << "Permuted Vector:" << std::endl;
  thrust::copy_n(d_rv.begin(), vsize, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;
}

$ nvcc -o t1061 t1061.cu
$ ./t1061
Initial Matrix:
1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,
Permuted Matrix:
5,5,5,5,5,3,3,3,3,3,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,
Initial Vector:
1,2,3,4,5,
Permuted Vector:
5,3,1,2,4,
$

备注：

在操作上置换向量与置换矩阵相同。我们简单地将向量视为一列矩阵。
正如评论中所讨论的那样，如果用例完全在推力之内，则可能根本不需要复制元素。 permutation_iterator 允许我们以任何排列顺序从原始矩阵中获取 select 个元素，我们可以简单地将此构造传递给任何需要原始矩阵排列顺序的推力操作。

使用 CUDA/Thrust 置换矩阵（和相应向量）的行

Permuting the rows of a matrix (and corresponding vector) using CUDA/Thrust

cuda

permutation

matrix

thrust