Sequential operation in GPU implementation

Question

I have to implement the following algorithm on a GPU:

for(int I = 0; I < 1000; I++){
    VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K is a constant
}

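Each iteration depends on the previous one, so parallelizing it is difficult. I am not sure whether atomic operations are applicable here. What can I do?

Edit:

VAR1 and VAR2 are both 1D arrays.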

VAR1[0] = 1

Answer 1

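This is in a category of problems called recurrence relations. Depending on the structure of the recurrence relation, there may exist closed-form solutions that describe how to compute each element individually (i.e. in parallel, without recursion). One of the early seminal papers on parallel computation of such problems was by Kogge and Stone, and there exist recipes and strategies for parallelizing specific forms.

Sometimes a recurrence relation is so simple that we can identify a closed-form formula or algorithm with a little bit of "inspection". This short tutorial gives a little more treatment of this idea.

In your case, let's see if we can spot anything just by mapping out what the first few terms of VAR1 should look like, substituting previous terms into newer terms: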

i      VAR1[i]
___________________
0        1
1        1 + VAR2[2K-1]
2        1 + VAR2[2K-1] + VAR2[2K]
3        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...

Hopefully what jumps out at you is that the VAR2[] terms above follow the pattern of a prefix sum.

This means one possible solution method could be given by:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)])   (for i > 0) notes:(1) (2)
VAR1[i] = 1                                (for i = 0)
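To make the equivalence above concrete, here is a minimal host-side sketch in plain C++ (not GPU code) that computes VAR1 both ways so the two can be compared. The function names and the `base` parameter, which stands for the fixed offset 2K-1, are my own for illustration, and it assumes VAR2 has at least base + n elements.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential reference: VAR1[i+1] = VAR1[i] + VAR2[base + i], with VAR1[0] = 1.
// `base` is a hypothetical name for the fixed offset 2*K - 1 from the question.
std::vector<int> recurrence(const std::vector<int>& var2, std::size_t base, std::size_t n) {
    std::vector<int> var1(n + 1);
    var1[0] = 1;
    for (std::size_t i = 0; i < n; ++i)
        var1[i + 1] = var1[i] + var2[base + i];
    return var1;
}

// Closed form: VAR1[i] = 1 + (sum of the first i terms of VAR2 starting at base).
std::vector<int> via_prefix_sum(const std::vector<int>& var2, std::size_t base, std::size_t n) {
    std::vector<int> var1(n + 1);
    var1[0] = 1;
    // Inclusive running sum of var2[base .. base+n-1], written to var1[1 .. n].
    std::partial_sum(var2.begin() + base, var2.begin() + base + n, var1.begin() + 1);
    for (std::size_t i = 1; i <= n; ++i)
        var1[i] += 1;  // add the initial value VAR1[0] = 1 to every term
    return var1;
}
```

Both functions should return the same vector for any valid input; the second one only needs a scan plus an element-wise add, which is the part that maps onto a GPU.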

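Now, a prefix sum can be done in parallel. (It is not a truly fully independent operation, but it can be parallelized. I don't want to argue too much about terminology or purity here; I'm offering one possible method of parallelization for the problem you stated, not the only way to do it.) To do a prefix sum in parallel on the GPU, I would use a library such as Thrust. You could write your own, although I wouldn't recommend it.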
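As a rough illustration of why a prefix sum parallelizes at all, here is a small host-side C++ sketch of a step-doubling (Hillis-Steele style) inclusive scan. This is not what a GPU library necessarily does internally, and the function name is made up for this example. The point is only that within each pass, every output element depends solely on the previous pass's buffer.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum in about log2(n) passes. Within one pass every element
// is computed only from the previous pass's buffer, so the inner loop has no
// loop-carried dependency and could be spread across GPU threads.
std::vector<int> stepwise_inclusive_scan(std::vector<int> data) {
    std::vector<int> next(data.size());
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        for (std::size_t i = 0; i < data.size(); ++i)  // independent per i
            next[i] = (i >= stride) ? data[i] + data[i - stride] : data[i];
        data.swap(next);  // the freshly written buffer becomes the input of the next pass
    }
    return data;
}
```

The inner loop is written sequentially here, but each iteration is independent of the others, which is the property a GPU implementation would exploit. Production libraries typically use more work-efficient variants, which is one reason not to roll your own.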

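Notes:

(1) Whether -1 or -2 is used as the offset to i in the prefix sum may be dictated by whether you use an inclusive or an exclusive scan (prefix sum) operation.
(2) VAR2 must be defined over the appropriate domain for this to make sense. However, that requirement is implicit in your problem statement.

Here is a trivial worked example. In this case, since the VAR2 indexing term 2K+(I-1) just represents a fixed offset to I (namely 2K-1), we simply use an offset of 0 for demonstration purposes, so VAR2 is just a simple array over the same domain as VAR1. Also for demonstration purposes, I define VAR2 to be an array of all 1s. The GPU parallel computation takes place in the VAR1 vector; the equivalent CPU computation is done on the fly in the cpu variable for validation purposes: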
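To illustrate the inclusive versus exclusive distinction, here is a small host-side C++ sketch with hand-written reference versions of both scans. The function names are my own; they are not library calls.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] = init + in[0] + ... + in[i-1], so out[0] = init.
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything strictly before position i
        running += in[i];
    }
    return out;
}
```

The exclusive result is the inclusive result shifted by one position, which is where the one-element difference in the index offset comes from. Note also that an exclusive scan started from init = 1 already yields "1 + prefix sum" with VAR1[0] = 1. Since thrust::exclusive_scan takes an initial value as its last argument, the separate "add 1" pass in the example in this answer could likely be folded into the scan that way.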

$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>

const int dsize = 1000;
using namespace thrust::placeholders;
int main(){

  thrust::device_vector<int> VAR2(dsize, 1);  // initialize VAR2 array to all 1's
  thrust::device_vector<int> VAR1(dsize);
  thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
  thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(),  _1 += 1);   // add 1 to every term
  int cpu = 1;
  for (int i = 1; i < dsize; i++){
    int gpu = VAR1[i];
    cpu += VAR2[i-1];  // exclusive scan: VAR1[i] holds 1 + sum of VAR2[0..i-1]
    if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$

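For an additional reference on using scan operations to solve linear recurrence problems, see section 1.4 of Blelloch's paper. A related question/answer gives an example of how to implement equation 1.5 in that paper for a more general first-order recurrence case, and another question considers the second-order recurrence case.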
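For readers wondering how a scan can handle a more general first-order recurrence, here is a host-side C++ sketch of one standard construction: treat each step x -> a*x + b as a pair (a, b) and scan with function composition, which is associative. I am not claiming this matches the paper's notation exactly; the type and function names are my own, and std::partial_sum stands in here for a parallel scan.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

using Affine = std::pair<long long, long long>;  // (a, b) represents x -> a*x + b

// Compose "apply f first, then g": g(f(x)) = (f.a * g.a) * x + (f.b * g.a + g.b).
// This operator is associative, which is what a parallel scan requires.
Affine compose(const Affine& f, const Affine& g) {
    return {f.first * g.first, f.second * g.first + g.second};
}

// Solve x[i] = a[i] * x[i-1] + b[i] for i = 0..n-1, where x[-1] = x0.
std::vector<long long> solve_first_order(const std::vector<Affine>& coeffs, long long x0) {
    std::vector<Affine> prefix(coeffs.size());
    // Inclusive scan under composition: prefix[i] maps x0 directly to x[i].
    std::partial_sum(coeffs.begin(), coeffs.end(), prefix.begin(), compose);
    std::vector<long long> x(coeffs.size());
    for (std::size_t i = 0; i < coeffs.size(); ++i)  // independent per i
        x[i] = prefix[i].first * x0 + prefix[i].second;
    return x;
}
```

The recurrence in the question is the special case where every a[i] is 1 and b[i] is the corresponding VAR2 term, in which case the composition reduces to ordinary addition and the scan is a plain prefix sum.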

parallel-processing

cuda

gpu