Time required for OpenCL kernel deletion

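I am seeing unexpected performance from my OpenCL code (more precisely, I use boost::compute 1.67.0). For now, I just want to add two buffers element by element: c[i] = a[i] + b[i]. I noticed a slowdown compared to an existing SIMD implementation, so I timed each step separately to find out which one is expensive. Here is a sample of my code: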

    Chrono chrono2;
    chrono2.start();
    Chrono chrono;
    ipReal64 elapsed;

    // creating the OpenCL context and other stuff
    // ...
    
    std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
    {
        size_t idx = get_global_id(0);
        out[idx] = in1[idx] + in2[idx];
    }
    );

    boost::compute::program* program = new boost::compute::program;
    try {
        chrono.start();
        *program = boost::compute::program::create_with_source(kernel_src, context);
        elapsed = chrono.elapsed();
        std::cout << "Create program : " << elapsed << "s" << std::endl;
        chrono.start();
        program->build();
        elapsed = chrono.elapsed();
        std::cout << "Build program : " << elapsed << "s" << std::endl;
    }
    catch (boost::compute::opencl_error& e) {
        std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
        return;
    }

    boost::compute::kernel* kernel = new boost::compute::kernel;
    try {
        chrono.start();
        *kernel = program->create_kernel("add_knl");
        elapsed = chrono.elapsed();
        std::cout << "Create kernel : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
        return;
    }

    try {
        chrono.start();
        // Pass the argument to the kernel
        kernel->set_arg(0, bufIn1);
        kernel->set_arg(1, bufIn2);
        kernel->set_arg(2, bufOut);
        elapsed = chrono.elapsed();
        std::cout << "Set args : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
        return;
    }

    try {

        chrono.start();
        queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
        elapsed = chrono.elapsed();
        std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
        return;
    }
    
    std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;

    chrono.start();
    delete program;
    elapsed = chrono.elapsed();
    std::cout << "Delete program : " << elapsed << "s" << std::endl;

    chrono.start();
    delete kernel;
    elapsed = chrono.elapsed();
    std::cout << "Delete kernel  : " << elapsed << "s" << std::endl;

Here is a sample of the results (I run my program on an NVidia GeForce GT 630, with the NVidia SDK Toolkit):

    Create program           : 0.0013123s
    Build program            : 0.0015421s
    Create kernel            : 6.6e-06s
    Set args                 : 1.7e-06s
    Kernel calculation       : 0.0001639s
    [Function] Full duration : 0.0077794
    Delete program           : 4.1e-06s
    Delete kernel            : 0.0879901s

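I know my program is simple, and I don't expect kernel execution to be the most time-consuming step. However, I expected kernel deletion to take only a few milliseconds, like creating or building the program, not roughly 88 ms.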

Is this normal behavior?

Thanks

I'll point out that I have never used boost::compute, but it looks like a fairly thin wrapper over OpenCL, so the following should be correct:

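Enqueuing a kernel does not wait for it to complete. The enqueue function returns an event that you can wait on, or you can wait for all enqueued tasks to finish. You are not timing either of those, so your "Kernel calculation" figure only covers the submission. What is probably happening is that when you destroy the kernel, the destructor waits for all instances that are still queued or running to complete before it returns.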
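To separate the two costs, something along these lines might help. It is only a sketch: `time_launch` and `LaunchTiming` are names made up here, and `FakeQueue`, `FakeEvent` and `FakeKernel` are hypothetical stand-ins that let the example run without a GPU by pretending the device needs 50 ms. The template only assumes that the queue's `enqueue_1d_range_kernel()` returns an object with a `wait()` member, which as far as I can tell is the shape of `boost::compute::command_queue` and `boost::compute::event`.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Seconds spent submitting the kernel vs. seconds until it has finished.
struct LaunchTiming {
    double enqueue_s;   // time for the enqueue call alone
    double complete_s;  // time from enqueue start until wait() returned
};

// Generic over the queue type: it only needs enqueue_1d_range_kernel()
// returning something with wait().
template <class Queue, class Kernel>
LaunchTiming time_launch(Queue& queue, const Kernel& kernel, std::size_t global_size) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    auto event = queue.enqueue_1d_range_kernel(kernel, 0, global_size, 0);
    const auto t1 = clock::now();  // only the submission has happened so far
    event.wait();                  // block until the work has completed
    const auto t2 = clock::now();
    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t0).count() };
}

// Hypothetical stand-ins (NOT boost::compute types): enqueue returns at once,
// and the "device" is considered done 50 ms after the enqueue call.
struct FakeKernel {};

struct FakeEvent {
    std::chrono::steady_clock::time_point ready_at;
    void wait() const { std::this_thread::sleep_until(ready_at); }
};

struct FakeQueue {
    FakeEvent enqueue_1d_range_kernel(const FakeKernel&, std::size_t,
                                      std::size_t, std::size_t) {
        return { std::chrono::steady_clock::now() + std::chrono::milliseconds(50) };
    }
};
```

With the stand-ins, `time_launch(fake_queue, fake_kernel, 1024)` should report a tiny `enqueue_s` and a `complete_s` of about 0.05 s. In the question's code the call would presumably be `time_launch(queue, *kernel, sizeX*sizeY)`; calling `queue.finish()` before stopping the timer is the other option if you want to wait for everything in the queue. If the delete time drops once the wait is included, the destructor was just absorbing the kernel's run time.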