CUDA - optimize surface detection kernel
I'm trying to optimize my surface detection kernel; given an input binary 512w x 1024h image, I want to find the first bright surface in the image. The code I wrote launches 512 threads and searches a 3x3 neighborhood for the first bright pixel. The code works correctly, but at ~9.46 ms it's a bit slow and I'd like it to run faster.

Edit 1: Performance has improved to less than half the time my original kernel took to run. Robert's kernel runs in 4.032 ms on my Quadro K6000.

Edit 2: Managed to improve performance further by halving the number of threads. Now my (Robert's modified) kernel runs in 2.125 ms on my Quadro K6000.

The kernel is called with:

firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);

I'd like to use shared memory to improve the memory fetches; any ideas on how to optimize this code?
__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int rows2skip = 10;
  float thresh = 1.0f;

  //thread index: (0 -> 511)
  if (col < width) {

    if (col == 0) { // first col - 0
      for (int row = 0 + rows2skip; row < height - 2; row++) { // skip first rows2skip rows
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the right
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col+1)];   if(neibs[1] == thresh) { cnt++; } // right
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col+1)]; if(neibs[3] == thresh) { cnt++; } // bottom right
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom right
        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
    else if (col == (width-1)) { // last col
      for (int row = 0 + rows2skip; row < height - 2; row++) {
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the left
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col-1)];   if(neibs[1] == thresh) { cnt++; } // left
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; } // bottom left
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col-1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom left
        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
    // remaining threads are: (1 -> 510)
    else { // any col other than first or last column
      for (int row = 0 + rows2skip; row < height - 2; row++) {
        int cnt = 0;
        float neibs[9]; // not shared mem as it reduces speed
        // get nine neighbours - three in curr col, three each to left and right
        neibs[0] = threshImg[((row)*width) +(col-1)];   if(neibs[0] == thresh) { cnt++; }
        neibs[1] = threshImg[((row)*width) +(col)];     if(neibs[1] == thresh) { cnt++; }
        neibs[2] = threshImg[((row)*width) +(col+1)];   if(neibs[2] == thresh) { cnt++; }
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; }
        neibs[4] = threshImg[((row+1)*width) +(col)];   if(neibs[4] == thresh) { cnt++; }
        neibs[5] = threshImg[((row+1)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; }
        neibs[6] = threshImg[((row+2)*width) +(col-1)]; if(neibs[6] == thresh) { cnt++; }
        neibs[7] = threshImg[((row+2)*width) +(col)];   if(neibs[7] == thresh) { cnt++; }
        neibs[8] = threshImg[((row+2)*width) +(col+1)]; if(neibs[8] == thresh) { cnt++; }
        if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
  }
  __syncthreads();
}
Here is a worked example that demonstrates 2 of the 3 concepts discussed in the comments:
The first optimization to consider is that 512 threads is not enough to keep any GPU busy. We'd like to target 10,000 threads or more. The GPU is a latency-hiding machine, and when there are too few threads to help the GPU hide latency, your kernel becomes latency-bound, which is a kind of memory-bound problem. The most straightforward way to achieve this is to have each thread handle one pixel of the image (allowing 512*1024 threads in total) rather than one column (allowing only 512 threads in total). However, since this appears to "break" our "first-surface detection" algorithm, we have to make one more modification (item 2).
Once we have all pixels being processed in parallel, a direct adaptation of item 1 above means we no longer know which surface is "first", i.e. which "bright" surface (per column) is closest to row 0. This characteristic of the algorithm changes the problem from a simple transformation to a reduction (one reduction per column of the image, actually). We will allow each column to be processed in parallel by assigning one thread per pixel, but we will choose the resulting pixel satisfying the brightness test that is closest to row zero. A relatively simple method for this is to use atomicMin on a per-column array that tracks the minimum row (per column) at which a suitably bright pixel neighborhood is discovered.
The code below demonstrates the above 2 changes (and does not include any use of shared memory), and demonstrates (for me, on a Tesla K40) a range of speedup from about 1x to about 20x versus the OP's original kernel. The range of speedup is due to the fact that the algorithm's work varies with image structure. Both algorithms have early-exit strategies. The original kernel can do more or less work, due to the early-exit structure on the for-loops, depending on where (if at all) a "bright" pixel neighborhood is discovered in each column. So if all columns have bright neighborhoods near row 0, I see an "improvement" of about 1x (i.e. my kernel runs at about the same speed as the original). If all columns have (only) bright neighborhoods near the other "end" of the image, I see an improvement of about 20x. This may vary by GPU, since Kepler GPUs have improved global atomic throughput, which I'm using here. EDIT: due to this variable work, I've added a crude "early-exit" strategy as a trivial modification to my code. This brings the shortest execution time to approximate parity between the two kernels (i.e. about 1x).
Remaining optimizations might include:
Use of shared memory. This should be a trivial adaptation of the same tile-based shared-memory approach that is used, for example, for matrix multiplication. If you use square tiles, then you will want to adjust the kernel block/grid dimensions to make them "square-ish".
Improved reduction technique. For some image structures, the atomic method may be somewhat slow. This could possibly be improved by switching to a proper parallel reduction per column. You can do a "worst-case" test on my kernel by setting the test image to the threshold value everywhere. This should cause the original kernel to run the fastest and my kernel to run the slowest, but I didn't observe any significant slowdown of my kernel in this case; its execution time is quite stable. Again, this may be GPU-dependent.
Example:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define SKIP_ROWS 10
#define THRESH 1.0f
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  //thread index: (0 -> 511)
  if (col < width) {

    if (col == 0) { // first col - 0
      for (int row = 0 + rows2skip; row < height - 2; row++) { // skip first rows2skip rows
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the right
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col+1)];   if(neibs[1] == thresh) { cnt++; } // right
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col+1)]; if(neibs[3] == thresh) { cnt++; } // bottom right
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom right
        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
    else if (col == (width-1)) { // last col
      for (int row = 0 + rows2skip; row < height - 2; row++) {
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the left
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col-1)];   if(neibs[1] == thresh) { cnt++; } // left
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; } // bottom left
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col-1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom left
        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
    // remaining threads are: (1 -> 510)
    else { // any col other than first or last column
      for (int row = 0 + rows2skip; row < height - 2; row++) {
        int cnt = 0;
        float neibs[9]; // not shared mem as it reduces speed
        // get nine neighbours - three in curr col, three each to left and right
        neibs[0] = threshImg[((row)*width) +(col-1)];   if(neibs[0] == thresh) { cnt++; }
        neibs[1] = threshImg[((row)*width) +(col)];     if(neibs[1] == thresh) { cnt++; }
        neibs[2] = threshImg[((row)*width) +(col+1)];   if(neibs[2] == thresh) { cnt++; }
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; }
        neibs[4] = threshImg[((row+1)*width) +(col)];   if(neibs[4] == thresh) { cnt++; }
        neibs[5] = threshImg[((row+1)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; }
        neibs[6] = threshImg[((row+2)*width) +(col-1)]; if(neibs[6] == thresh) { cnt++; }
        neibs[7] = threshImg[((row+2)*width) +(col)];   if(neibs[7] == thresh) { cnt++; }
        neibs[8] = threshImg[((row+2)*width) +(col+1)]; if(neibs[8] == thresh) { cnt++; }
        if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
  }
  __syncthreads();
}
__global__ void firstSurfaceDetection_opt (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int row = threadIdx.y + (blockDim.y*blockIdx.y);

  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  if ((row >= rows2skip) && (row < height-2) && (col < width) && (row < firstSurfaceImgRow[col])) {
    int cnt = 0;
    int inc = 0;
    if (col == 0) inc = +1;
    if (col == (width-1)) inc = -1;
    if (inc){
      cnt = 3;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+inc)] == thresh) cnt++;
    }
    else {
      // get nine neighbours - three in curr col, three each to left and right
      if (threshImg[((row)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+1)] == thresh) cnt++;
    }
    if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
      atomicMin(firstSurfaceImgRow + col, row);
    }
  }
}
int main(int argc, char *argv[]){
float *threshImg, *h_threshImg, *firstSurfaceImg, *h_firstSurfaceImg;
int *firstSurfaceImgRow, *h_firstSurfaceImgRow;
int actualImHeight = 1024;
int actualImWidth = 512;
int row_set = 512;
if (argc > 1){
int my_val = atoi(argv[1]);
if ((my_val > SKIP_ROWS) && (my_val < actualImHeight - 3)) row_set = my_val;
}
h_firstSurfaceImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
h_threshImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
h_firstSurfaceImgRow = (int *)malloc(actualImWidth*sizeof(int));
cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float));
cudaMalloc(&firstSurfaceImg, actualImHeight*actualImWidth*sizeof(float));
cudaMalloc(&firstSurfaceImgRow, actualImWidth*sizeof(int));
cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
cudaMemset(firstSurfaceImg, 0, actualImHeight*actualImWidth*sizeof(float));
for (int i = 0; i < actualImHeight*actualImWidth; i++) h_threshImg[i] = 0.0f;
// insert "bright row" here
for (int i = (row_set*actualImWidth); i < ((row_set+3)*actualImWidth); i++) h_threshImg[i] = THRESH;
cudaMemcpy(threshImg, h_threshImg, actualImHeight*actualImWidth*sizeof(float), cudaMemcpyHostToDevice);
dim3 grid(1,1024);
//warm-up run
firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
cudaDeviceSynchronize();
cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
cudaDeviceSynchronize();
unsigned long long t2 = dtime_usec(0);
firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
cudaDeviceSynchronize();
t2 = dtime_usec(t2);
cudaMemcpy(h_firstSurfaceImgRow, firstSurfaceImgRow, actualImWidth*sizeof(float), cudaMemcpyDeviceToHost);
unsigned long long t1 = dtime_usec(0);
firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
cudaDeviceSynchronize();
t1 = dtime_usec(t1);
cudaMemcpy(h_firstSurfaceImg, firstSurfaceImg, actualImWidth*actualImHeight*sizeof(float), cudaMemcpyDeviceToHost);
printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);
// validate results
for (int i = 0; i < actualImWidth; i++)
if (h_firstSurfaceImgRow[i] < actualImHeight)
if (h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i] != THRESH)
{printf("mismatch at %d, was %f, should be %d\n", i, h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i], THRESH); return 1;}
return 0;
}
$ nvcc -arch=sm_35 -o t667 t667.cu
$ ./t667
t1 = 0.000978s, t2 = 0.000050s
$
Notes:
The example above inserts a "bright neighborhood" at row 512 of the image, so in my case it gives an intermediate speedup factor of nearly 20x (K40c).
For brevity, I have omitted proper cuda error checking. I recommend it, though.
The execution performance of either kernel depends considerably on whether it runs first. This probably has to do with caching and general warm-up effects. So to give sensible results, I ran my kernel first as an extra untimed warm-up run.
One of the reasons I haven't pursued a shared-memory optimization is that this problem is already fairly small, and at least for a large Kepler GPU like the K40, it fits almost entirely in L2 cache (especially for my kernel, since it uses a smaller output data structure). Given that, shared memory may not offer much of a performance boost.
EDIT: I've modified the code (again) so that the line (row) at which the bright boundary is inserted into the test image can be passed as a command-line parameter, and I've tested the code on 3 different devices, using 3 different settings for the bright row:
execution time on: K40 C2075 Quadro NVS 310
bright row = 15: 31/33 23/45 29/314
bright row = 512: 978/50 929/112 1232/805
bright row = 1000: 2031/71 2033/149 2418/1285
all times are microseconds (original kernel/optimized kernel)
CUDA 6.5, CentOS 6.2