如何在推力中减少二维数据的一维

How to do a reduction over one dimension of 2D data in Thrust

我是 CUDA 和推力库的新手。我正在学习并尝试实现一个函数,该函数将具有执行推力函数的 for 循环。有没有办法将这个循环转换成另一个推力函数?还是应该使用 CUDA 内核来实现?

我想出了这样的代码

// thrust functor
struct GreaterthanX
{
    const float _x;
    GreaterthanX(float x) : _x(x) {}

    __host__ __device__ bool operator()(const float &a) const
    {
        return a > _x;
    }
};

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int row = 3;
    int col = 4;
    thrust::device_vector<int> vec(row * col);
    thrust::device_vector<int> count(row);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec
    for (int i = 0; i < row; i++)
    {
        count[i] = thrust::count_if(vec.begin(), vec.begin() + i * col, GreaterthanX(2));
    }

    thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
    int max_val = *result;
    unsigned int position = result - count.begin();

    printf("result = %d at position %d\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

我的目标是找到具有最多大于 2 的元素的行。我正在努力研究如何在没有循环的情况下执行此操作。任何建议将不胜感激。谢谢

使用推力的解决方案

这是一个使用 thrust::reduce_by_key 结合多个智能迭代器的实现。

为了优雅和可读性,我还自由地加入了一些 constauto 和 lambda。由于 lambda,您需要为 nvcc.

使用 -extended-lambda 标志

thrust::distance 是减去 Thrust 迭代器的规范方法。

#include <cassert>
#include <cstdio>

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int const row = 3;
    int const col = 4;
    thrust::device_vector<int> vec(row * col);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;
    thrust::device_vector<int> count(row);

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec

    // counting iterator avoids read from global memory, gives index into vec
    auto keys_in_begin = thrust::make_counting_iterator(0);
    auto keys_in_end = thrust::make_counting_iterator(row * col);
    
    // transform vec on the fly
    auto vals_in_begin = thrust::make_transform_iterator(
        vec.cbegin(), 
        [] __device__ (int val) { return val > 2 ? 1 : 0; });
    
    // discard to avoid write to global memory
    auto keys_out_begin = thrust::make_discard_iterator();
    
    auto vals_out_begin = count.begin();
    
    // transform keys (indices) into row indices and then compare
    // the divisions are one reason one might rather
    // use MatX for higher dimensional data
    auto binary_predicate = [col] __device__ (int i, int j){
        return i / col == j / col;
    };
    
    // this function returns a new end for count 
    // b/c the final number of elements is often not known beforehand
    auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
                                         vals_in_begin,
                                         keys_out_begin,
                                         vals_out_begin,
                                         binary_predicate);
    // make sure that we didn't provide too small of an output vector
    assert(thrust::get<1>(new_ends) == count.end());

    auto const result = thrust::max_element(count.begin(), count.end());
    int const max_val = *result;
    auto const position = thrust::distance(count.begin(), result);

    std::printf("result = %d at position %d\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

使用 MatX 的奖励解决方案

如评论中所述,NVIDIA 发布了一个名为 MatX which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries like CUFFT, CUSOLVER and CUTLASS in one python-/matlab-like interface. At the point of this writing (v0.2.2) the library is still in initial development and therefore probably doesn't guarantee a stable API. Due to this, the performance not being as optimized as with the more mature Thrust library and the documentation/samples not being quite exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug which was instantly fixed. So this code will only work on the main branch and not with the current release v0.2.2 and some used features might not appear in the documentation 的新 high-level、C++17 库。

使用 MatX 的解决方案如下所示:

#include <iostream>
#include <matx.h>

int main(void)
{
    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    // tensor.Print(0,0); // print full tensor

    auto count = matx::make_tensor<int, 1>({row});
    // count.Print(0); // print full count

    // Goal: For each row, count the number of elements greater than 2.
    // And then find the row with the max count

    // the kind of reduction is determined through the shapes of tensor and count
    matx::sum(count, matx::as_int(tensor > 2));

    // A single value (scalar) is a tensor of rank 0: 
    auto result_idx = matx::make_tensor<matx::index_t>();
    auto result = matx::make_tensor<int>();
    matx::argmax(result, result_idx, count);

    cudaDeviceSynchronize();
    std::cout << "result = " << result() 
              << " at position " << result_idx() << "\r\n";
    // result = 4 at position 2

    return 0;
}

由于 MatX 使用延迟执行运算符,matx::as_int(tensor > 2) 被有效地融合到内核中,实现与在 Thrust 中使用 thrust::transform_iterator 相同的效果。

由于 MatX 知道问题的规律性而 Thrust 不知道,因此 MatX 解决方案可能比 Thrust 解决方案性能更高。它当然更优雅。也可以在已经分配的内存中构造张量,因此可以混合库,例如我通过将 thrust::raw_pointer_cast(vec.data()) 传递给张量的构造函数,在名为 vecthrust::vector 的内存中构造一个张量。