在 cuda 中使用 stride 进行多块缩减的正确方法是什么？

Question

大家好我正在尝试使用网格步幅方法和原子函数来进行多块缩减。
我知道执行此操作的通常方法是启动两个内核或使用 this note.(or this 教程中指示的 lastblock 方法）

不过，我认为这也可以通过使用带有原子代码的网格步幅来完成。
正如我测试的那样，它运行良好..
直到 some 号码，它给出了错误的答案。（这很奇怪）

我已经测试了一些 "n"s，发现我对 n = 1234565、1234566、1234567 的答案是错误的。
这是我做 n 和 1 的整个代码。所以答案应该是 n。
感谢任何帮助或评论。

#include<iostream>

__global__ void stride_sum(const double* input,
                           const int size,
                           double* sumOut){
    extern __shared__ double sm[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockDim.x * blockIdx.x + tid;

    //doing grid loop using stride method.
    for(unsigned int s=i;
            s<size;
            s+=blockDim.x*gridDim.x){
        sm[tid] = input[i];
        __syncthreads();

        //doing parallel reduction.
        for(unsigned int ss = blockDim.x/2;ss>0;ss>>=1){
            if(tid<ss && tid+ss<size) sm[tid] += sm[tid+ss];
            __syncthreads();
        }

        //atomically add results to sumOut.
        if(tid==0) atomicAdd(sumOut, sm[0]);
    }
}

int main(){

    unsigned int n = 1234567;
    int blockSize = 4;
    int nBlocks = (n + blockSize - 1) / blockSize;
    int sharedMemory = sizeof(double)*blockSize;

    double *data, *sum;

    cudaMallocManaged(&data, sizeof(double)*n);
    cudaMallocManaged(&sum, sizeof(double));

    std::fill_n(data,n,1.);
    std::fill_n(sum,1,0.);

    stride_sum<<<nBlocks, blockSize, sharedMemory>>>(data,n,sum);

    cudaDeviceSynchronize();

    printf("res: 10.f \n",sum[0]);

    cudaFree(data);
    cudaFree(sum);

    return 0;
}

Answer 1

您在实施过程中犯了很多错误。这将起作用：

__global__ void stride_sum(const double* input,
                           const int size,
                           double* sumOut)
{
    extern __shared__ volatile double sm[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockDim.x * blockIdx.x + tid;

    //doing grid loop using stride method.
    double val = 0.;
    for(unsigned int s=i; s<size; s+=blockDim.x*gridDim.x){
        val += input[i]; 
    }

    // Load partial sum to memory
    sm[tid] = val; 
    __syncthreads();

    //doing parallel reduction.
    for(unsigned int ss = blockDim.x/2;ss>0;ss>>=1){
        if(tid<ss && tid+ss<size) sm[tid] += sm[tid+ss];
        __syncthreads();
    }

   //atomically add results to sumOut.
   if(tid==0) atomicAdd(sumOut, sm[0]);
}

[从未编译过运行，使用风险自负]

简而言之——进行网格跨步求和，然后单个共享内存减少，然后单个原子更新。您的实现在一些地方有未定义的行为，尤其是有条件执行的 __syncthreads 调用和当某些线程脱离求和循环时使用未初始化的共享内存。

在 cuda 中使用 stride 进行多块缩减的正确方法是什么？

What is the proper way to use stride in cuda to do multiblock reduction?

cuda

reduction

stride