Setting argument for kernel extremely slow (OpenCL)
In my OpenCL implementation of Dijkstra's algorithm, by far the slowest part is writing the one-dimensional reduced graph matrix into the kernel argument, i.e. into global memory.
My graph is a two-dimensional array; for OpenCL it is reduced to a one-dimensional array like this:
for (int q = 0; q < numberOfVertices; q++)
{
    for (int t = 0; t < numberOfVertices; t++)
    {
        reducedGraph[q * numberOfVertices + t] = graph[q][t];
    }
}
Putting it into a buffer:
cl::Buffer graphBuffer = cl::Buffer(context, CL_MEM_READ_WRITE, numberOfVertices * numberOfVertices * sizeof(int));
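(The flattened array then presumably gets copied into this buffer with a blocking write roughly like the one below; the post only shows setArg, so the exact upload call is an assumption, reusing the names graphBuffer, queue, reducedGraph and numberOfVertices from above.)
queue.enqueueWriteBuffer(graphBuffer, CL_TRUE, 0,
    numberOfVertices * numberOfVertices * sizeof(int), reducedGraph);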
Setting the argument then takes an extremely long time. For my test with 5,760,000 vertices, writing the data into the argument takes more than 3 seconds, while the algorithm itself takes less than a millisecond:
kernel_dijkstra.setArg(5, graphBuffer);
The kernel uses the graph as a global argument:
void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices)
Is there any way to speed this up? Thank you!
Edit: My kernel code:
// Kernel source, calculates minimum distance in segment and relaxes graph.
std::string kernel_code =
void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices) {
    for (int b = 0; b < numberOfVertices; b++) {
        int gid = get_global_id(0);
        int min = huge_int, min_index = -1;
        for (int v = gid * sizeOfChunks; v < sizeOfChunks * gid + sizeOfChunks; v++) {
            if (verticesSet[v] == false && dist[v] < min && dist[v] != 0) {
                min = dist[v];
                min_index = v;
            }
        }
        result[gid] = min_index;
        if (gid != 0) continue;
        min = huge_int;
        min_index = -1;
        int current_min;
        for (int a = 0; a < numberOfVertices; a++) {
            current_min = dist[result[a]];
            if (current_min < min && current_min != -1 && current_min != 0) { min = current_min; min_index = result[a]; }
        }
        verticesSet[min_index] = true;
        // relax graph with found global min.
        int a = 0;
        int min_dist = dist[min_index];
        int current_dist;
        int compare_dist;
        for (int i = min_index * numberOfVertices; i < min_index * numberOfVertices + numberOfVertices; i++) {
            current_dist = dist[a];
            compare_dist = graph[min_index * numberOfVertices + a];
            if (current_dist > min_dist + compare_dist && !verticesSet[a] && compare_dist != 0) {
                dist[a] = min_dist + compare_dist;
            }
            a++;
        }
    }
};
How I enqueue it:
numberOfComputeUnits = default_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
queue.enqueueNDRangeKernel(kernel_dijkstra, 0, cl::NDRange(numberOfVertices), numberOfComputeUnits);
The error here is that your memory allocation is far too large: 5.76M vertices would need a 133 TB buffer, because the buffer size grows quadratically with the number of vertices. Neither the C++ compiler nor OpenCL reports this as an error, and your kernel even appears to run just fine, but in reality it does not compute anything because there is not enough memory, and you will get random, undefined results.
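As a quick check of that figure, using the same size expression as in the cl::Buffer call above, but computed with 64-bit arithmetic so the illustration itself does not overflow:
// 5,760,000 * 5,760,000 * 4 bytes = 132,710,400,000,000 bytes ≈ 133 TB
long long numberOfVertices = 5760000LL;
long long graphBytes = numberOfVertices * numberOfVertices * (long long)sizeof(int);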
Normally a .setArg(...) call should not take more than a few milliseconds. It also helps to do the initialization part (buffer allocation, .setArg(...), etc.) only once at the start, and then run the kernel repeatedly or exchange data in the buffers without reallocating, as in the sketch below.
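A minimal sketch of that structure, reusing the names from the question (numberOfRuns and the loop around the kernel launch are purely illustrative):
// One-time setup: allocate the buffer and bind the kernel arguments once.
cl::Buffer graphBuffer(context, CL_MEM_READ_WRITE,
    (size_t)numberOfVertices * numberOfVertices * sizeof(int));
kernel_dijkstra.setArg(5, graphBuffer);
// ... set the remaining arguments once as well ...

// Repeated work: only transfer data and launch the kernel, no reallocation.
for (int run = 0; run < numberOfRuns; run++) {
    queue.enqueueWriteBuffer(graphBuffer, CL_TRUE, 0,
        (size_t)numberOfVertices * numberOfVertices * sizeof(int), reducedGraph);
    queue.enqueueNDRangeKernel(kernel_dijkstra, cl::NullRange,
        cl::NDRange(numberOfVertices), cl::NDRange(numberOfComputeUnits));
    queue.finish();
}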