删除 OpenCL 内核所需的时间
Time recquired for OpenCL kernel deletion
我的 OpenCL 代码出现意外性能(更准确地说,我使用 boost::compute 1.67.0)。现在,我只想添加 2 个缓冲区 c[i] = a[i] + b[i] 的每个元素。
我注意到与现有 SIMD 实现相比速度有所降低,因此我将每个步骤分开以突出显示哪个步骤耗时。这是我的代码示例:
Chrono chrono2;
chrono2.start();
Chrono chrono;
ipReal64 elapsed;
// creating the OpenCL context and other stuff
// ...
std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
__kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
{
size_t idx = get_global_id(0);
out[idx] = in1[idx] + in2[idx];
}
);
boost::compute::program* program = new boost::compute::program;
try {
chrono.start();
*program = boost::compute::program::create_with_source(kernel_src, context);
elapsed = chrono.elapsed();
std::cout << "Create program : " << elapsed << "s" << std::endl;
chrono.start();
program->build();
elapsed = chrono.elapsed();
std::cout << "Build program : " << elapsed << "s" << std::endl;
}
catch (boost::compute::opencl_error& e) {
std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
return;
}
boost::compute::kernel* kernel = new boost::compute::kernel;
try {
chrono.start();
*kernel = program->create_kernel("add_knl");
elapsed = chrono.elapsed();
std::cout << "Create kernel : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
return;
}
try {
chrono.start();
// Pass the argument to the kernel
kernel->set_arg(0, bufIn1);
kernel->set_arg(1, bufIn2);
kernel->set_arg(2, bufOut);
elapsed = chrono.elapsed();
std::cout << "Set args : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
return;
}
try {
chrono.start();
queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
elapsed = chrono.elapsed();
std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
return;
}
std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;
chrono.start();
delete program;
elapsed = chrono.elapsed();
std::cout << "Delete program : " << elapsed << "s" << std::endl;
delete kernel;
elapsed = chrono.elapsed();
std::cout << "Delete kernel : " << elapsed << "s" << std::endl;
这里是结果示例(我 运行 我的程序在 NVidia GeForce GT 630 上运行,带有 NVidia SDK TookKit):
Create program : 0.0013123s
Build program : 0.0015421s
Create kernel : 6.6e-06s
Set args : 1.7e-06s
Kernel calculation : 0.0001639s
[Function] Full duration : 0.0077794
Delete program : 4.1e-06s
Delete kernel : 0.0879901s
我知道我的程序很简单,我不希望内核执行是最耗时的步骤。但是,我认为内核删除只需要几毫秒,例如创建或构建程序。
这是正常行为吗?
谢谢
我会指出我从未使用过 boost::compute,但它看起来像是 OpenCL 的一个相当薄的包装器,所以下面的 应该 是正确的:
入队 内核不会等待它完成。入队函数 returns 一个事件,您可以等待该事件,也可以等待所有入队任务完成。你没有为这些事情计时。可能发生的情况是,当您销毁内核时,它会等待所有仍在等待完成排队的实例,然后再从析构函数返回。
我的 OpenCL 代码出现意外性能(更准确地说,我使用 boost::compute 1.67.0)。现在,我只想添加 2 个缓冲区 c[i] = a[i] + b[i] 的每个元素。 我注意到与现有 SIMD 实现相比速度有所降低,因此我将每个步骤分开以突出显示哪个步骤耗时。这是我的代码示例:
Chrono chrono2;
chrono2.start();
Chrono chrono;
ipReal64 elapsed;
// creating the OpenCL context and other stuff
// ...
std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
__kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
{
size_t idx = get_global_id(0);
out[idx] = in1[idx] + in2[idx];
}
);
boost::compute::program* program = new boost::compute::program;
try {
chrono.start();
*program = boost::compute::program::create_with_source(kernel_src, context);
elapsed = chrono.elapsed();
std::cout << "Create program : " << elapsed << "s" << std::endl;
chrono.start();
program->build();
elapsed = chrono.elapsed();
std::cout << "Build program : " << elapsed << "s" << std::endl;
}
catch (boost::compute::opencl_error& e) {
std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
return;
}
boost::compute::kernel* kernel = new boost::compute::kernel;
try {
chrono.start();
*kernel = program->create_kernel("add_knl");
elapsed = chrono.elapsed();
std::cout << "Create kernel : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
return;
}
try {
chrono.start();
// Pass the argument to the kernel
kernel->set_arg(0, bufIn1);
kernel->set_arg(1, bufIn2);
kernel->set_arg(2, bufOut);
elapsed = chrono.elapsed();
std::cout << "Set args : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
return;
}
try {
chrono.start();
queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
elapsed = chrono.elapsed();
std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
return;
}
std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;
chrono.start();
delete program;
elapsed = chrono.elapsed();
std::cout << "Delete program : " << elapsed << "s" << std::endl;
delete kernel;
elapsed = chrono.elapsed();
std::cout << "Delete kernel : " << elapsed << "s" << std::endl;
这里是结果示例(我 运行 我的程序在 NVidia GeForce GT 630 上运行,带有 NVidia SDK TookKit):
Create program : 0.0013123s
Build program : 0.0015421s
Create kernel : 6.6e-06s
Set args : 1.7e-06s
Kernel calculation : 0.0001639s
[Function] Full duration : 0.0077794
Delete program : 4.1e-06s
Delete kernel : 0.0879901s
我知道我的程序很简单,我不希望内核执行是最耗时的步骤。但是,我认为内核删除只需要几毫秒,例如创建或构建程序。
这是正常行为吗?
谢谢
我会指出我从未使用过 boost::compute,但它看起来像是 OpenCL 的一个相当薄的包装器,所以下面的 应该 是正确的:
入队 内核不会等待它完成。入队函数 returns 一个事件,您可以等待该事件,也可以等待所有入队任务完成。你没有为这些事情计时。可能发生的情况是,当您销毁内核时,它会等待所有仍在等待完成排队的实例,然后再从析构函数返回。