Self-unrolling CUDA n-dim same-type tuple creation using default Thrust in CUDA 8+?

Question

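I am trying to unroll the implementation of a function in order to optimize it in CUDA. Basically, I have a piece of shared memory that was initially slowing my code down, and by "unrolling" my implementation (reducing the total number of threads and having each thread do twice as much work) I was able to get substantial performance gains. I want to see whether more unrolling yields further gains, but I rely heavily on tuples to make this work. The process produces a lot of duplicated code, and I would like to reduce that duplication.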

Here is an example of the kind of thing that happens often in my code:

__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    //all do the same thing, with very similar args as well.
    T value1 = someoperation(thrust::get<0>(choice_arg),...);
    T value2 = someoperation(thrust::get<1>(choice_arg),...);
    T value3 = someoperation(thrust::get<2>(choice_arg),...);
    ...
    return thrust::make_tuple(value1, value2, value3,...);
}

I don't want to write all of this boilerplate by hand; I would like a function like this instead:

__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...){
    return someoperation<CHOICE_ARG_LENGTH>(choice_arg,...);
}

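I have seen how something like a template loop could help, but if I need to return a thrust::tuple, an ordinary template loop will not work. If Thrust had thrust::tuple_cat, that solution would work, but variadic template tuples have still not been merged even though the work was done in 2014, and I cannot even find any discussion about merging the cat implementation! So is it possible to get the behaviour I am looking for on the GPU without a thrust::tuple_cat implementation?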

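Note that I cannot use arrays for this: I used arrays initially and found that I got a 15% speedup for free after moving away from them, visible both in the Visual Profiler and in the real-world application of my algorithm. The code is very performance critical.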

Answer 1

If you can use CUDA 9 and C++14, you can do the following. For details, see for example std::integer_sequence.

#include <cstdio>  // printf
#include <utility>
#include <thrust/tuple.h>

template <typename T>
__device__ T some_operation(T a) {
  return a + 1;  // do something smart
}

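// I... is deduced from the index_sequence argument; the pack expansion calls
// some_operation once per tuple element.
template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, std::index_sequence<I...>) {
  return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

// Builds the indices 0, ..., tuple_size - 1 and forwards them to foo_impl.
template <typename Tuple>
__device__ auto foo(const Tuple& t) {
  return foo_impl(t,
                  std::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}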

__global__ void test_kernel() {
  auto result = foo(thrust::make_tuple(3., 2, 7));
  printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
         thrust::get<2>(result));
}

int main() {
  test_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
}

Compile with nvcc -std=c++14 ...
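The someoperation in the question also takes further arguments besides the tuple element. As a rough host-side sketch of how those could be passed through, assuming std::tuple as a stand-in for thrust::tuple and the hypothetical names scale_and_shift, apply_each_impl and apply_each (none of them come from Thrust or from the question), the extra arguments can travel in a second parameter pack:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical per-element operation: one tuple element plus shared arguments.
template <typename T>
T scale_and_shift(T x, T scale, T shift) {
  return x * scale + shift;
}

// I... is deduced from the index_sequence, Extra... from the trailing arguments.
// The inner "extra..." expands inside every call; the outer "..." expands over I.
template <typename Tuple, std::size_t... I, typename... Extra>
auto apply_each_impl(const Tuple& t, std::index_sequence<I...>,
                     const Extra&... extra) {
  return std::make_tuple(scale_and_shift(std::get<I>(t), extra...)...);
}

// Entry point: generates the indices 0, ..., tuple_size - 1.
template <typename Tuple, typename... Extra>
auto apply_each(const Tuple& t, const Extra&... extra) {
  return apply_each_impl(
      t, std::make_index_sequence<std::tuple_size<Tuple>::value>(), extra...);
}

// Usage: each element is multiplied by 2.0 and then shifted by 0.5.
inline void demo_apply() {
  auto r = apply_each(std::make_tuple(1.0, 2.0, 3.0), 2.0, 0.5);
  assert(std::get<0>(r) == 2.5);
  assert(std::get<1>(r) == 4.5);
  assert(std::get<2>(r) == 6.5);
}
```

For device code, the same shape would presumably carry over by marking the functions __device__ and swapping in thrust::tuple, thrust::get and thrust::make_tuple as in the answer's code, but that variant is untested here.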

For C++11

you need to:

provide your own index_sequence
use trailing return types.

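Here is a working version. Disclaimer: I wrote the index_sequence off the top of my head. You may prefer to take an implementation from the standard library.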

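You can find plenty of tutorials on index_sequence/integer_sequence online, for example on cppreference.com. The basic idea of index_sequence is that it lets you enumerate the elements of a tuple (or array). In foo, an index_sequence is created whose template arguments are 0, ..., thrust::tuple_size<Tuple>::value - 1. In foo_impl, you capture these indices in a variadic parameter pack and expand the pack to call some_operation on each tuple element.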
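The enumeration described above can be tried out on the host without any CUDA toolchain. The following is a minimal C++14 sketch, assuming std::tuple as a stand-in for thrust::tuple and hypothetical names (add_one, map_impl, map_add_one) chosen only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// make_index_sequence<N> is an alias for index_sequence<0, 1, ..., N-1>.
static_assert(std::is_same<std::make_index_sequence<3>,
                           std::index_sequence<0, 1, 2>>::value,
              "indices run from 0 to N-1");

// Hypothetical stand-in for the per-element operation.
template <typename T>
T add_one(T a) {
  return a + 1;
}

// The pack I... is deduced from the index_sequence argument. The trailing
// "..." expands to add_one(get<0>(t)), add_one(get<1>(t)), and so on.
template <typename Tuple, std::size_t... I>
auto map_impl(const Tuple& t, std::index_sequence<I...>) {
  return std::make_tuple(add_one(std::get<I>(t))...);
}

// Entry point: generates one index per tuple element.
template <typename Tuple>
auto map_add_one(const Tuple& t) {
  return map_impl(t, std::make_index_sequence<std::tuple_size<Tuple>::value>());
}

// Usage: every element is incremented and keeps its own type.
inline void demo_map() {
  auto r = map_add_one(std::make_tuple(3.0, 2, 7));
  assert(std::get<0>(r) == 4.0);
  assert(std::get<1>(r) == 3 && std::get<2>(r) == 8);
}
```

Because the expansion happens at compile time, no loop is involved: the compiler sees one call per element.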

#include <cstddef>
#include <cstdio>
#include <thrust/tuple.h>

namespace compat {
template <size_t... Indices>
struct index_sequence {};

namespace detail {
template <size_t N, typename Seq = index_sequence<>>
struct make_index_sequence_impl;

template <size_t N, size_t... Indices>
struct make_index_sequence_impl<N, index_sequence<Indices...>> {
  using type = typename make_index_sequence_impl<
      N - 1, index_sequence<N - 1, Indices...>>::type;
};

// Base case: stopping at N == 0 also makes make_index_sequence<0> valid.
template <size_t... Indices>
struct make_index_sequence_impl<0, index_sequence<Indices...>> {
  using type = index_sequence<Indices...>;
};
}

template <size_t N>
using make_index_sequence = typename detail::make_index_sequence_impl<N>::type;
}

template <typename T>
__device__ T some_operation(T a) {
  return a + 1;  // do something smart
}

template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, compat::index_sequence<I...>)
    -> decltype(thrust::make_tuple(some_operation(thrust::get<I>(t))...)) {
  return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

template <typename Tuple>
__device__ auto foo(const Tuple& t) -> decltype(foo_impl(
    t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>())) {
  return foo_impl(
      t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}

__global__ void test_kernel() {
  auto result = foo(thrust::make_tuple(3., 2, 7));
  printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
         thrust::get<2>(result));
}

int main() {
  test_kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
}
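For the same-type case in the question (a tuple of N values of type T), the tuple type itself can also be generated from an element type and a length instead of being written out or deduced. This is only a sketch under assumptions: it uses std::tuple and the C++14 std::index_sequence on the host, and the names always, repeat_impl and repeat_tuple are hypothetical helpers, not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Maps any index to T, so expanding over I... repeats T once per index.
template <typename T, std::size_t>
using always = T;

template <typename T, typename Seq>
struct repeat_impl;

template <typename T, std::size_t... I>
struct repeat_impl<T, std::index_sequence<I...>> {
  using type = std::tuple<always<T, I>...>;
};

// repeat_tuple<T, N> names std::tuple<T, T, ..., T> with N elements.
template <typename T, std::size_t N>
using repeat_tuple = typename repeat_impl<T, std::make_index_sequence<N>>::type;

static_assert(std::is_same<repeat_tuple<float, 3>,
                           std::tuple<float, float, float>>::value,
              "three copies of float");
```

Such an alias could serve as an explicit return type in place of the decltype-based trailing return types, provided the operation is known to return T for every element.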

