Why does clGetPlatformInfo get called in every clEnqueue function?

Question

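We are profiling an OpenCL application that runs on an NVidia GPU, looking at both the host side and the device side. Based on gperftools, we were surprised to find that the host spends 44% of its time in clGetPlatformInfo, a function that our own code calls only once. According to the profile, it is called from clEnqueueCopyBuffer_hid, clEnqueueWriteBuffer_hid, and clEnqueueNDRangeKernel_hid (and presumably from every other clEnqueue function, but we call those less often). Since this takes up so much host time, and we now appear to be limited by host speed, I need to know whether there is a way to eliminate these extra calls.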

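Why is it called on every OpenCL call? (Presumably it is static information that could be stored in the context?) Is it possible that we initialized our context incorrectly?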

EDIT: I was asked to provide an MWE:

#include <CL/opencl.h>

#include <vector>
using namespace std;


int main ()
{
    cl_uint numPlatforms;
    clGetPlatformIDs (0, nullptr, &numPlatforms);

    vector<cl_platform_id> platformIdArray (numPlatforms);
    clGetPlatformIDs (numPlatforms, platformIdArray.data (), nullptr);

    // Assume the NVidia GPU is the first platform
    cl_platform_id platformId = platformIdArray[0];

    cl_uint numDevices;
    clGetDeviceIDs (platformId, CL_DEVICE_TYPE_GPU, 0, nullptr, &numDevices);

    vector<cl_device_id> deviceArray (numDevices);
    clGetDeviceIDs (platformId, CL_DEVICE_TYPE_GPU, numDevices, deviceArray.data (), nullptr);

    // Assume the NVidia GPU is the first device
    cl_device_id deviceId = deviceArray[0];

    cl_context context = clCreateContext (
        nullptr,
        1,
        &deviceId,
        nullptr,
        nullptr,
        nullptr);

    cl_command_queue commandQueue = clCreateCommandQueue (context, deviceId, {}, nullptr);

    cl_mem mem = clCreateBuffer (context, CL_MEM_READ_WRITE, sizeof(cl_int),
                                 nullptr, nullptr);

    cl_int i = 0;

    while (true)
    {
        clEnqueueWriteBuffer (
            commandQueue,
            mem,
            CL_TRUE,
            0,
            sizeof (i),
            &i,
            0,
            nullptr,
            nullptr);

        ++i;
    }
}

Running this MWE for a few seconds produces a profile in which 99% of the time is attributed to clGetPlatformInfo.

Answer 1

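Try passing NULL as the first argument to clCreateContext. The device ID is already being passed, so the first argument may be unnecessary and could be what triggers these extra calls to clGetPlatformInfo.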

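Another thing to try is linking against a non-Nvidia OpenCL library. You do not have to use the GPU vendor's OpenCL library; any library should work as long as it implements the functionality you use. With Nvidia there is little risk in doing this, because as of this writing the latest version Nvidia supports is OpenCL 1.2, which most (if not all) vendors already support. So you can try the OpenCL library from another vendor's SDK, such as Intel's or AMD's. If you are on Ubuntu, you can use ocl-icd-opencl-dev.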

======Update=========

Try specifying the platform when creating the context:

const cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platformId, 0}; 

cl_context context = clCreateContext (
    properties, // <-- here
    1,
    &deviceId,
    nullptr,
    nullptr,
    nullptr);

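It may be that, because no platform is specified when the context is created, the platform is queried every time it is needed, whereas it is not queried when it is specified as above.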

Answer 2

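We resolved the problem: gperftools was having trouble producing a correct backtrace. The code was not actually calling clGetPlatformInfo thousands of times, as gperftools claimed. From a conversation with bashbaug on the Khronos forums: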

When I run a test with gperftools using our GPU driver I see most of the time attributed to GTPin_Init, as you mentioned. I think this is because an OpenCL ICD has to export very few symbols, since calls into most OpenCL APIs occur through the ICD dispatch table.

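We used the profiling tool he suggested (the OpenCL Intercept Layer, found at https://github.com/intel/opencl-intercept-layer), which gave us a much better picture of the runtime characteristics of our kernels and helped us find some memory leaks. It was actually the memory leaks that were causing the slowdown: kernels appear to take a long time to launch when memory with a high reference count is passed to them as an argument.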

You can find the full conversation on the Khronos forums: https://community.khronos.org/t/why-does-clgetplatforminfo-get-called-in-every-clenqueue-function/105756/5
