OpenACC data movement
I'm new to OpenACC and I don't quite understand data movement and the "#pragma acc data" clauses.
I have a program written in C. An excerpt of the code is as follows:
#pragma acc data create(intersectionSet[0:intersectionsCount][0:4]) // line 122
#pragma acc kernels // line 123
for (int i = 0; i < intersectionsCount; i++) { // line 124
    intersectionSet[i][0] = 9; // line 125
}
The value of intersectionsCount is 210395. I compiled and ran the code above with:
pgcc -o rect_openacc -fast -Minfo -acc -ta=nvidia,time rect.c
and got this output:
time(us): 1,475,607
122: data region reached 1 time
31: kernel launched 210395 times
grid: [1] block: [128]
device time(us): total=1,475,315 max=15 min=7 avg=7
elapsed time(us): total=5,451,647 max=24,028 min=24 avg=25
123: compute region reached 1 time
124: kernel launched 1 time
grid: [1644] block: [128]
device time(us): total=292 max=292 min=292 avg=292
elapsed time(us): total=312 max=312 min=312 avg=312
156: data region reached 1 time
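After reading the output I have some questions:
- I don't know why it mentions line 31, because there is no acc pragma at line 31. Does it refer to something I cannot trace?
- The line "31: kernel launched 210395 times" says a kernel was launched 210395 times. I don't know whether it is normal for a kernel to be launched that many times; this part took 5,451,647 us of elapsed time, which seems long to me. The for loop is simple and shouldn't take that much time. Am I using the pragmas incorrectly?
Update
I do have several header files for the program, but none of them contain an "acc data" or "acc kernels" pragma.
After compiling the code with "-Minfo=all", the result is as follows: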
breakStringToCharArray:
11, include "stringHelper.h"
50, Loop not vectorized/parallelized: contains call
countChar:
11, include "stringHelper.h"
74, Loop not vectorized/parallelized: not countable
extractCharToIntRequiredInt:
11, include "stringHelper.h"
93, Loop not vectorized/parallelized: contains call
extractArray:
12, include "fileHelper.h"
49, Loop not vectorized/parallelized: contains call
isRectOverlap:
13, include "shapeHelper.h"
23, Generating acc routine vector
Generating Tesla code
getRectIntersection:
13, include "shapeHelper.h"
45, Generating acc routine vector
Generating Tesla code
getRectIntersectionInGPU:
13, include "shapeHelper.h"
69, Generating acc routine vector
Generating Tesla code
max:
13, include "shapeHelper.h"
98, Generating acc routine vector
Generating Tesla code
min:
13, include "shapeHelper.h"
118, Generating acc routine vector
Generating Tesla code
main:
64, Loop not vectorized/parallelized: contains call
108, Loop not vectorized/parallelized: contains call
122, Generating create(intersectionSet[:intersectionsCount][:4])
124, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
124, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
This is how I create intersectionSet:
intersectionSet = (int **)malloc(sizeof(int **) * intersectionsCount);
for (i = 0; i < intersectionsCount; i++) {
    intersectionSet[i] = (int *)malloc(sizeof(int *) * 4);
}
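What's happening is that, since you have a pointer to an array of pointers, "**" (at least I'm guessing that's what intersectionSet is), the compiler must first allocate the array of pointers on the device, then loop over each element and allocate the individual device arrays. Finally, it needs to launch a kernel for each element to set the pointer values on the device. Here's some pseudo-code to help illustrate: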
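devPtrPtr = deviceMalloc(numElements * pointerSize);   // device array of row pointers
for (i = 0; i < numElements; ++i) {
    devPtr = deviceMalloc(elementSize * dataTypeSize);  // one device allocation per row
    call deviceKernelToSetPointer<<<1,128>>>(devPtrPtr[i], devPtr);  // one small kernel launch per row
}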
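To help your code, I would switch the dimensions so that the column length is 4 and the row length is "intersectionsCount". This will also help data access on the device, since the "vector" loop should correspond to the stride-1 (contiguous) dimension to avoid memory divergence.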
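A minimal sketch of what switching the dimensions could look like. It is not taken from the original program: the names FIELDS, allocIntersectionSet, freeIntersectionSet and fillFirstField are hypothetical, calloc is used so the untouched rows start at zero, and the data clause is `copy` rather than the question's `create` so that the host can inspect the result afterwards. With only 4 row pointers, the runtime should need only 4 per-row allocations and pointer fix-ups instead of 210395.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name for the short dimension (the "4" in the question). */
enum { FIELDS = 4 };

/* Allocate FIELDS rows, each holding `count` contiguous, zeroed ints.
   Returns NULL if any allocation fails. */
int **allocIntersectionSet(int count)
{
    assert(count > 0);
    int **set = malloc(sizeof(int *) * FIELDS);
    if (set == NULL) return NULL;
    for (int f = 0; f < FIELDS; f++) {
        set[f] = calloc((size_t)count, sizeof(int));
        if (set[f] == NULL) {
            /* Release the rows allocated so far, then the pointer array. */
            while (f-- > 0) free(set[f]);
            free(set);
            return NULL;
        }
    }
    return set;
}

void freeIntersectionSet(int **set)
{
    if (set == NULL) return;
    for (int f = 0; f < FIELDS; f++) free(set[f]);
    free(set);
}

/* Same work as the loop in the question, with the indices swapped:
   the parallel loop index i now walks the contiguous (stride-1) dimension. */
void fillFirstField(int **set, int count)
{
    /* Assumption: `copy` instead of `create`, so results return to the host. */
    #pragma acc data copy(set[0:FIELDS][0:count])
    {
        #pragma acc kernels
        for (int i = 0; i < count; i++) {
            set[0][i] = 9;
        }
    }
}
```

A caller would allocate with `allocIntersectionSet(intersectionsCount)`, call `fillFirstField`, and release with `freeIntersectionSet`. Without an OpenACC-enabled compiler the pragmas are ignored and the loop simply runs on the host.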
Hope this helps,
Mat