Why is data exchange between CPU and GPU memory so slow?

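This is my first time using OpenCL on ARM (CPU: Qualcomm Snapdragon MSM8930, GPU: Adreno (TM) 305).

I found that OpenCL is really effective, but exchanging data between the CPU and the GPU takes far more time than I would have imagined.

Here is an example: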

cv::Mat mat(640,480,CV_8UC3,cv::Scalar(0,0,0));
cv::ocl::oclMat mat_ocl;

//cpu->gpu
mat_ocl.upload(mat);
//gpu->cpu
mat = (cv::Mat)mat_ocl;

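For such a small image, the upload takes 10 ms and the download takes 20 ms! That is far too long.

Can anyone tell me whether this is normal, or whether something is wrong here?

Thanks in advance!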

Added:

My measurement method is:

clock_t start,end;
start=clock();
mat_ocl.upload(mat);
end = clock();
__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);
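One caveat with the snippet above: `clock()` reports processor time used by the program rather than elapsed wall-clock time, so it may not reflect time spent waiting on the GPU. Below is a minimal sketch of a wall-clock alternative using `std::chrono::steady_clock`. The helper name `meanWallMs` is hypothetical, and the warm-up call is there on the assumption that the first OpenCL call may include one-time setup cost.

```cpp
#include <chrono>

// Calls `op` once untimed as a warm-up, then `iterations` more times,
// and returns the mean wall-clock time per timed call in milliseconds.
template <typename Op>
double meanWallMs(Op op, int iterations) {
    op();  // warm-up run, excluded from the measurement
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
}
```

It could be used as `double ms = meanWallMs([&] { mat_ocl.upload(mat); }, 20);`, which is only meaningful if `upload` blocks until the copy has finished (an assumption worth checking for the OpenCV version in use).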

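Actually, I am not using OpenCL directly but the ocl module in OpenCV (although it is said to be equivalent). Reading the OpenCV documentation, I found that it only tells us to convert cv::Mat to cv::ocl::oclMat (that is, upload the data from the CPU to the GPU) in order to compute on the GPU, but I could not find any memory mapping method in the ocl module documentation.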

Please provide your exact measurement method and results.

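From my experience developing OpenCL on ARM platforms (not Qcom), I can say that you should not expect too much from read/write operations. The memory bus is usually 64 bits wide, and DDR3 is not that fast.

Take advantage of the shared memory: choose mapping/unmapping rather than read/write.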

P.S. The actual execution time is measured with cl_event profiling:

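// Returns the device-side execution time, in nanoseconds, of the command
// associated with `event`. The command queue must have been created with
// CL_QUEUE_PROFILING_ENABLE, otherwise the profiling queries fail.
cl_ulong getTimeNanoSeconds(cl_event event)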
{
    cl_ulong start = 0, end = 0;

    cl_int ret = clWaitForEvents(1, &event);
    if (ret != CL_SUCCESS)
        throw(ret);

    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_START,
              sizeof(cl_ulong),
              &start,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_END,
              sizeof(cl_ulong),
              &end,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    return (end - start);
}

Well, I found a useful passage in the OpenCV documentation:

In a heterogeneous device environment, there may be cost associated with data transfer. This would be the case, for example, when data needs to be moved from host memory (accessible to the CPU), to device memory (accessible to a discrete GPU). in the case of integrated graphics chips, there may be performance issues, relating to memory coherency between access from the GPU “part” of the integrated device, or the CPU “part.” For best performance, in either case, it is recommended that you do not introduce data transfers between CPU and the discrete GPU, except in the beginning and the end of the algorithmic pipeline.

So this seems to explain why data transfer between the CPU and the GPU is so slow. But I still don't know how to solve the problem.