PyOpenCL Error when accessing result of kernel operation (RuntimeError: Unable to compute length of object)

Question

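I'm new to working with OpenCL and am trying to write a relatively simple data-processing script. The script works as follows:

Before the OpenCL part of the script:

Load the data
Prepare some arrays for processing

All of this happens before the comment: ## Step #1. Obtain an OpenCL platform.

OpenCL part:

Get the platform and device, and create a context
Create the kernel
Create the queue
Put the arrays to be processed into PyOpenCL buffers
Run the kernel
Retrieve the kernel output

Code: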


    # -*- coding: utf-8 -*-
    """
    Created on Sun Nov 17 16:14:50 2019

    @author: Mike
    based on example provided here:
        https://www.drdobbs.com/open-source/easy-opencl-with-python/240162614?pgno=2

    """

    import pyopencl as cl
    from pyopencl import cltypes
    from pyopencl import array
    from pyopencl.elementwise import ElementwiseKernel
    import numpy as np
    import time
    import matplotlib.pyplot as plt
    from reikna import fft

    if __name__ == "__main__":

        # Load some test data!
        rawdata = np.load('fig8_13_raw_5000nm_2p0.npy')
        posdata = np.zeros((2048,40))
        flat_rawdata = rawdata.flatten()
        #t0 = flat_rawdata[0:2048*40].astype(np.uint16)  # The first B-scan for use with the demo\
        t0=flat_rawdata.astype(np.uint16)
        n = 40  # Total A-scans in input (not all are included with B)
        x = 40  # A-scans in output

        B1 = posdata[2,:].astype(np.bool)
        B2 = posdata[3,:].astype(np.bool)
        b = np.logical_or(B1,B2)

        b = np.zeros(40).astype(bool)
        b[0:40] = 1

        # Load chirp matrix, containing wavelengths corresponding to spectrum bins for lambda->k interpolation
        lam = np.load('lam.npy')
        # Define apodization window
        window = np.hanning(2048).astype(np.float16)

        lam_min = np.amin(lam)
        lam_max = np.amax(lam)
        d_lam = lam_max-lam_min
        d_k = (1/lam_min - 1/lam_max)/2048
        k = np.array([1/((1/lam_max)+d_k*i) for i in range(2048)])

        nn0 = np.zeros(2048,dtype=np.uint16)
        nn1 = np.zeros(2048,dtype=np.uint16)

        for i in range(2048):
            res = np.abs(lam-k[i])
            minind = np.argmin(res)
            if res[minind]>=0:
                nn0[i]=minind-1
                nn1[i]=minind
            else:
                nn0[i]=minind
                nn1[i]=minind+1

        ## Step #1. Obtain an OpenCL platform.
        platform = cl.get_platforms()
        platform = platform[0]

        print('\nAcquired platform: \n\t'+platform.name)

        ## Step #2. Obtain a device id for at least one device (accelerator).
        device = platform.get_devices()
        device = device[0]

        print('Acquired device: \n\t'+device.name)
        extensions = ['\t'+x for x in device.extensions.split()]
        print('Device extensions: ')
        for i in range(len(extensions)):
            print(extensions[i])

        ## Step #3. Create a context for the selected device.
        context = cl.Context([device])
        print('Created context.')     

        ## Step #6. Create one or more kernels from the program functions.
        program = cl.Program(context, """
            __kernel void interp_hann(__global const double *lambda_spec,
            __global const double *win, __global const double *k,__global int *nn0,
            __global int *nn1,__global const double *lam, __global double *result, double d_lam)
            {
                int gid = get_global_id(0);
                int gid1 = gid % 2048;
                double y1 = lambda_spec[nn0[gid1]];  // y-values from neighbors in spectrum
                double y2 = lambda_spec[nn1[gid1]];
                double x1 = lam[nn0[gid1]];  // corresponding initial wavelength
                double x = k[gid1];  // linear-in-wavenumber interpolation point

                if (y1 == y2)
                {
                    result[gid] = y1*win[gid1];
                }
                else
                {
                    result[gid] = (y1 + (x - x1) / (y2 - y1) * d_lam) * win[gid1];
                }

            }
            """).build()

        ## Step #7. Create a command queue for the target device.
        queue = cl.CommandQueue(context)

        ## Step #8. Allocate device memory and move input data from the host to the device memory.   
        result = np.zeros(t0.shape,dtype=cl.cltypes.float)
        mem_flags = cl.mem_flags
        n0_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=nn0)
        n1_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=nn1)
        win_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=window)
        k_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=k)
        raw_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=t0)
        lam_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=lam)
        dest_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, result.nbytes)
        ## Step #9. Associate the arguments to the kernel with kernel object.

        ## Step #10. Deploy the kernel for device execution.

        # check that array sizes are all correct
        print(nn0.shape,nn1.shape,window.shape,k.shape,t0.shape,lam.shape,result.shape)

        # run kernel and get event to wait for
        evt=program.interp_hann(queue, (len(result),), None, raw_buf, win_buf, k_buf,
                            n0_buf, n1_buf, lam_buf, dest_buf, d_lam)

        ## Step #11. Move the kernel’s output data to host memory.

        cl.enqueue_copy(queue,result,dest_buf,wait_for=evt)

        print(result)
        ## Step #12. Release context, program, kernels and memory.
        ## PyOpenCL performs this step for you, and therefore,
        ## you don't need to worry about cleanup code

Problem: running the script gives me the following output and error:

Acquired platform: 
        NVIDIA CUDA
Acquired device: 
        GeForce RTX 2080
Device extensions: 
        cl_khr_global_int32_base_atomics
        cl_khr_global_int32_extended_atomics
        cl_khr_local_int32_base_atomics
        cl_khr_local_int32_extended_atomics
        cl_khr_fp64
        cl_khr_byte_addressable_store
        cl_khr_icd
        cl_khr_gl_sharing
        cl_nv_compiler_options
        cl_nv_device_attribute_query
        cl_nv_pragma_unroll
        cl_nv_d3d10_sharing
        cl_khr_d3d10_sharing
        cl_nv_d3d11_sharing
        cl_nv_copy_opts
        cl_nv_create_buffer
        cl_khr_int64_base_atomics
        cl_khr_int64_extended_atomics
Created context.
(2048,) (2048,) (2048,) (2048,) (40960000,) (2048,) (40960000,)
Traceback (most recent call last):

  File "C:\Users\Mike\Desktop\Drive\PC Workspace\Senior Design\opencl\test.py", line 134, in <module>
    cl.enqueue_copy(queue,result1,dest_buf,wait_for=evt)

  File "C:\Users\Mike\Anaconda3\lib\site-packages\pyopencl\__init__.py", line 1719, in enqueue_copy
    return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)

RuntimeError: Unable to compute length of object

I really can't figure out what this error means, and I'm not sure how to go about debugging it.

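The numbers printed before the traceback are the lengths of the arrays used in the kernel. What I'm trying to do is iterate over all 40960000 elements of t0/raw_buf using gid as the iterator, while gid1 = gid % 2048 stays in the range 0-2047 and indexes the length-2048 arrays. I suspect something is wrong in this part of the kernel, because I'm not really clear on how the get_global_id() function works.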
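
A host-side sketch of the indexing described above may help. get_global_id(0) returns the index of the current work-item along dimension 0, so with a global size of N the kernel body runs once for each gid in 0..N-1, in no guaranteed order. The NumPy snippet below only emulates that mapping on the CPU. The names emulate_indices and SPECTRUM_LEN are illustrative and are not part of PyOpenCL or the original script.

```python
import numpy as np

SPECTRUM_LEN = 2048  # samples per A-scan, as in the question


def emulate_indices(global_size, spectrum_len=SPECTRUM_LEN):
    """Emulate on the host what each work-item would compute for its indices."""
    gid = np.arange(global_size)   # what get_global_id(0) returns, one per work-item
    gid1 = gid % spectrum_len      # position inside the A-scan, 0..spectrum_len-1
    scan = gid // spectrum_len     # which A-scan the work-item belongs to
    return gid, gid1, scan


gid, gid1, scan = emulate_indices(3 * SPECTRUM_LEN)
print(gid1[:3].tolist(), gid1[2047:2050].tolist())  # [0, 1, 2] [2047, 0, 1]
print(scan[2047:2050].tolist())                     # [0, 1, 1]
```

One thing this makes visible: the kernel reads lambda_spec[nn0[gid1]], which only ever touches the first 2048 samples. If the intent is for each A-scan to be interpolated from its own samples (an assumption about the intent), the read index would presumably need the A-scan offset, scan * 2048, added.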

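I'm also unclear on how to choose the second and third arguments of the kernel call (under the comment: # run kernel and get event to wait for). I know they correspond to global_size and local_size, but I'm not sure how I should determine them for my application. Any advice (or recommended reading) would be much appreciated.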
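
As a rough guide to those two arguments: global_size is a tuple giving the total number of work-items per dimension, and local_size is either a tuple giving the work-group size or None, in which case the OpenCL implementation picks one. When a local size is given explicitly, OpenCL 1.x requires the global size to be a multiple of it, so the global size is commonly rounded up. The helper below is a sketch of that rounding; round_up_global_size is a hypothetical name, and 256 is only an example work-group size.

```python
def round_up_global_size(n_items, local_size):
    """Return the smallest multiple of local_size that is >= n_items."""
    return ((n_items + local_size - 1) // local_size) * local_size


print(round_up_global_size(40960000, 256))  # 40960000 (already a multiple of 256)
print(round_up_global_size(1000, 256))      # 1024
```

If the rounded-up size exceeds the array length, the kernel needs a guard such as "if (gid >= n) return;" with n passed as an extra argument, so that the padding work-items do not write out of bounds. For a one-dimensional elementwise job like this one, (len(result),) with None for the local size is a reasonable starting point.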

Answer 1

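I can't run your example because I don't have fig8_13_raw_5000nm_2p0.npy, but I got a similar error when I tried to make the copy wait on the event returned by the kernel call.

Try this: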

    program.interp_hann(queue, (len(result),), None, raw_buf, win_buf, k_buf,
                        n0_buf, n1_buf, lam_buf, dest_buf, d_lam)
    cl.enqueue_copy(queue,result,dest_buf)
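
Some background on why dropping wait_for helps: PyOpenCL documents wait_for as either None or a list of pyopencl.Event instances, so passing a bare event (wait_for=evt) is the likely source of "Unable to compute length of object". Wrapping the event in a list should work as well. The sketch below is a small stand-alone example and not the question's kernel; scale_on_device and KERNEL_SRC are made-up names. It also keeps the host dtypes in line with the kernel signature (np.float64 for double, and a NumPy scalar for the by-value argument), which the question's script does not do for nn0, nn1, window, t0 and result.

```python
import numpy as np

# Trivial kernel used only for illustration: dst[i] = src[i] * factor
KERNEL_SRC = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void scale(__global const double *src,
                    __global double *dst,
                    double factor)
{
    int gid = get_global_id(0);
    dst[gid] = src[gid] * factor;
}
"""


def scale_on_device(values, factor):
    """Multiply values by factor on the first available OpenCL device."""
    import pyopencl as cl  # imported here so the sketch loads without PyOpenCL

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    program = cl.Program(ctx, KERNEL_SRC).build()

    # Host arrays must match the kernel's pointer types: double <-> np.float64
    src = np.ascontiguousarray(values, dtype=np.float64)
    dst = np.empty_like(src)

    mf = cl.mem_flags
    src_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=src)
    dst_buf = cl.Buffer(ctx, mf.WRITE_ONLY, dst.nbytes)

    # By-value scalar arguments need a sized NumPy type, not a Python float
    evt = program.scale(queue, src.shape, None, src_buf, dst_buf, np.float64(factor))

    # wait_for takes a list of events, not a single event
    cl.enqueue_copy(queue, dst, dst_buf, wait_for=[evt])
    return dst


if __name__ == "__main__":
    print(scale_on_device(np.arange(8), 2.0))
```

On a default (in-order) command queue the commands run in the order they were enqueued, so the copy already happens after the kernel finishes. That is why omitting wait_for, as in the snippet above, gives the same result.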