提高 CUDA 中 Compact/Scatter 的效率

Question

总结：

关于如何进一步改进 CUDA 中的基本分散操作有什么想法吗？特别是如果知道它只会用于将较大的阵列压缩成较小的阵列？或者为什么以下向量化内存操作和共享内存的方法不起作用？我觉得我可能缺少一些基本的东西，我们将不胜感激。

编辑 03/09/15：所以我发现了这个 Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics"。为此，我曾假设原子本质上会更慢，但我错了——尤其是因为我认为我不关心在模拟期间维护数组中的元素顺序。我将不得不再考虑一下，然后实施它，看看会发生什么！

编辑 01/04/16：我意识到我从未写过我的结果。不幸的是，在 Parallel for All 博客 post 中，他们将全局原子紧凑方法与 Thrust 前缀和紧凑方法进行了比较，后者实际上非常慢。 CUB 的 Device::IF 比 Thrust 快得多——我使用 CUB 的 Device::Scan + 自定义代码编写的前缀和版本也是如此。 warp-aggregrate 全局原子方法仍然快了大约 5-10%，但远不及我根据博客中的结果所希望的快 3-4 倍。我仍在使用前缀和方法，因为虽然不需要维护元素顺序，但我更喜欢前缀和结果的一致性，并且原子学的优势不是很大。我仍然尝试各种方法来改进 compact，但到目前为止，对于显着增加的代码复杂性，充其量只是边际改进 (2%)。

详情：

我正在 CUDA 中编写一个模拟，其中我压缩了我不再对每 40-60 个时间步模拟一次感兴趣的元素。从分析来看，分散运算似乎在压缩时占用了最多的时间——比过滤器内核或前缀和要多。现在我使用一个非常基本的散点函数：

    __global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
            int myID =  blockIdx.x*blockDim.x + threadIdx.x;
            for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
                 if(flag[id]){
                    new_freq[scan_Index[id]] = freq[id];
                 }
             } 
    }

freq_Index是旧数组的元素个数。标志数组是过滤器的结果。 Scan_ID 是标志数组上的前缀和的结果。

我为改进它所做的尝试是首先将标记的频率读入共享内存，然后从共享内存写入全局内存——这个想法是对全局内存的写入将在扭曲之间更加合并（例如，不是线程 0 写入位置 0，线程 128 写入位置 1，而是线程 0 写入 0，线程 1 写入 1）。我还尝试对读取和写入进行矢量化 - 而不是在可能的情况下从全局数组读取和写入 floats/ints I read/wrote float4/int4，所以一次四个数字。我认为这可能会通过减少传输大量内存的内存操作来加速分散。 "kitchen sink" 向量化内存 loads/stores 和共享内存的代码如下：

    const int compact_threads = 256;
    __global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int gID =  blockIdx.x*blockDim.x + threadIdx.x; //global ID
        int tID = threadIdx.x; //thread ID within block
        __shared__ float row[4*compact_threads];
        __shared__ int start_index[1];
        __shared__ int end_index[1];
        float4 myResult;
        int st_index;
        int4 myFlag;
        int4 index;
        for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
            if(tID == 0){
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                start_index[0] = index.x;
                st_index = index.x;
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[0] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            }
            __syncthreads();
            if(tID > 0){
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                st_index = start_index[0];
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[index.x-st_index] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
                if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
            }
            __syncthreads();
            int count = end_index[0] - st_index;

            int rem = st_index & 0x3; //equivalent to modulo 4
            int offset = 0;
            if(rem){ offset = 4 - rem; }

            if(tID < offset && tID < count){
                new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
            }

            int tempID = 4*tID+offset;
            if((tempID+3) < count){
                reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
            }

            tempID = tID + offset + (count-offset)/4*4;
            if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
        }
        int id = gID + freq_Index/4 * 4; 
        if(id < freq_Index){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

显然它变得有点复杂。 :) 虽然上面的内核在数组中有数十万个元素时看起来很稳定，但我注意到当数组中有数千万个元素时出现了竞争条件。我仍在努力追踪错误。

但无论如何，两种方法（共享内存或矢量化）结合使用或单独使用都没有提高性能。令我特别惊讶的是，向量化内存操作并没有带来好处。它对我编写的其他函数有帮助，但现在我想知道它是否有帮助，因为它在那些其他函数的计算步骤中增加了指令级并行性，而不是减少了内存操作。

Answer 1

我发现这个 poster (similar algorithm also discussed in this paper) 中提到的算法工作得很好，特别是对于压缩大型数组。它使用更少的内存来完成它，并且比我以前的方法 (5-10%) 稍微快一点。我对发帖者的算法进行了一些调整：1) 消除了阶段 1 中最终的 warp shuffle 减少，可以简单地对计算出的元素求和，2) 使函数能够处理不仅仅是大小为 a 的数组1024 的倍数 + 添加 grid-strided 循环，以及 3) 允许每个线程在第 3 阶段同时加载它们的寄存器，而不是一次加载一个。我还使用 CUB 而不是 Thrust 来计算包含总和以加快扫描速度。我可以做更多的调整，但现在这很好。

//kernel phase 1
int myID =  blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt=0;//;//

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;
        
        bool pred;
        if(index > true_length) pred = false;
        else pred = predicate(input[index]);
        mask = __ballot(pred); 

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}

//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

//kernel phase 3
int myID =  blockIdx.x*blockDim.x + threadIdx.x;

for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
#pragma unroll
    for(int offset = 1; offset < 32; offset<<=1){
        unsigned int n = __shfl_up(cnt, offset);
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID]; 
        }
    }
}

}

编辑：海报作者的一个子集有一个更新的 article，他们在其中检查了比上面写的更快的 compact 变体。但是，他们的新版本不保留顺序，因此对我自己没有用，我还没有实施它来测试它。也就是说，如果您的项目不依赖于对象顺序，他们较新的紧凑版本可能会加快您的算法。

提高 CUDA 中 Compact/Scatter 的效率

Improving the Efficiency of Compact/Scatter in CUDA

performance

cuda

scatter

stream-compaction