OpenCL speed and OUT_OF_RESOURCES

Question

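I am new to OpenCL and am trying my first program. I implemented a simple sinc filtering of waveforms. The code works, but I have two questions:

1. As soon as I increase the size of the input matrix (numrows needs to go up to 100 000) I get (clEnqueueReadBuffer failed: OUT_OF_RESOURCES), even though the matrix is relatively small (a few MB). I think this is somehow related to the work-group size, but could someone elaborate on how I can fix it? Could it be a driver problem?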

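Update:

Leaving the group size as None crashes.
Adjusting the group size to (1,600) for the GPU and (1,50) for the Intel HD lets me reach about 6400 rows. For larger sizes, however, it crashes on the GPU, while the Intel HD just freezes and does nothing (0% in the resource monitor).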

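2. I have an Intel HD4600 and an Nvidia K1100M GPU available, but the Intel is about 2x faster. I partly understand that this is because, unlike with the external GPU, I do not need to copy the array into separate memory when using the internal Intel device. Still, I would have expected only a marginal difference. Is this normal, or should my code be better optimized for use on the GPU? (Solved)

Thanks for your help!!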

    from __future__ import absolute_import, print_function
    import numpy as np
    import pyopencl as cl
    import os
    os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'       
    import matplotlib.pyplot as plt

    def resample_opencl(y,key='GPU'):
            #
            # selecting to run on GPU or CPU
            #
            newlen = 1200  
            my_platform = cl.get_platforms()[0]
            device =my_platform.get_devices()[0] 

            for found_platform in cl.get_platforms():    
                if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):         
                    my_platform = found_platform
                    device =my_platform.get_devices()[0]
                    print("using GPU")


            #
            #Create context for GPU/CPU
            #
            ctx = cl.Context([device])

            #
            # Create queue for each kernel execution
            #
            queue = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
    #        queue = cl.CommandQueue(ctx)

            prg = cl.Program(ctx, """
            __kernel void resample(      
                int M,
                __global const float *y_g,
                __global float *res_g)

            {
                int row = get_global_id(0);
                int col = get_global_id(1);
                int gs = get_global_size(1);
                __private float tmp,tmp2,x;
                __private float t;
                t = (float)(col)/2+1;


                tmp=0;
                tmp2=0;


                for (int i=0; i<M ; i++)
                    {
                    x = (float)(i+1);
                    tmp2 = (t- x)*3.14159;
                    if (t == x) {
                        tmp += y_g[row*M + i]  ;
                                    }
                    else 
                        tmp += y_g[row*M +i]  * sin(tmp2)/tmp2;
                     }

                res_g[row*gs +  col] = tmp;


            }
            """).build()

            mf = cl.mem_flags

            y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
            res = np.zeros((np.shape(y)[0],newlen)).astype(np.float32)
            res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

            M = np.array(600).astype(np.int32)
            prg.resample(queue, res.shape, (1,200),M, y_g, res_g)


            event = cl.enqueue_copy(queue, res, res_g)
            print("success")
            event.wait()
            return res,event





    if __name__ == "__main__":
        #
        # this is the number i need to increase ( up to some 100 000)
        numrows = 2000  
        Gaussian = lambda t : 10 * np.exp(-(t - 50)**2 / (2. * 2**2))


        x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
        t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
        y= np.zeros(( numrows,np.size(x)))
        y[:] = Gaussian(x).astype(np.float32)
        y = y.astype(np.float32)

        res,event =  resample_opencl(y,'GPU')
        print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)

        #
        # test plot if it worked
        #
        plt.figure()
        plt.plot(x,y[1,:],'+')
        plt.plot(t,res[1,:])

Answer 1

Re 1.

Your newlen must be divisible by 200, because that is the local dimension you set, (1,200). I increased it to 9600 and it still worked fine.
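A quick way to catch this before launching the kernel is to test each global dimension against the matching local dimension. The sketch below is plain Python with no OpenCL calls. `check_work_sizes` is a hypothetical helper name, not part of pyopencl, and it assumes the OpenCL 1.x rule that each global size must be an exact multiple of its local size.

```python
def check_work_sizes(global_size, local_size):
    """Return the indices of dimensions where global_size is not a
    multiple of local_size (an empty list means the pair is valid)."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same number of dimensions")
    return [dim for dim, (g, l) in enumerate(zip(global_size, local_size))
            if g % l != 0]


# Shapes from the question: res.shape == (numrows, newlen), local size (1, 200)
print(check_work_sizes((2000, 1200), (1, 200)))   # both dimensions divide evenly
print(check_work_sizes((2000, 1250), (1, 200)))   # dimension 1 does not divide
```

Running such a check on `res.shape` before calling `prg.resample` gives a readable error instead of a failed launch.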

Update

After your update, I would suggest not specifying the local size and letting the implementation decide:

    prg.resample(queue, res.shape, None, M, y_g, res_g)

Performance may also improve if newlen and numrows are multiples of 16.
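If the sizes are not already multiples of 16, they can be rounded up. The helper below is a minimal sketch; `round_up` is a hypothetical name, and the example row count of 2001 is made up for illustration. Note the assumption that the buffers are allocated with the padded shape too (or that the kernel gets a bounds check), since the kernel in the question reads and writes for every work-item.

```python
def round_up(n, multiple=16):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


numrows, newlen = 2001, 1200          # 2001 is an illustrative value
padded_shape = (round_up(numrows), round_up(newlen))
print(padded_shape)                   # newlen is already a multiple of 16
```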

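There is no rule that an Nvidia GPU must outperform an Intel GPU, especially since, according to Wikipedia, the difference in GFLOPS between them is not large (549.89 vs 288–432). Such GFLOPS comparisons should be taken with a grain of salt, because an algorithm may suit one GPU better than another. In other words, these numbers may lead you to expect one GPU to be faster in general, but that can vary from algorithm to algorithm.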

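Kernel memory requirements for 100000 rows:

y_g: 100000 * 600 * 4 = 240000000 bytes ≈ 229 MiB
res_g: 100000 * 1200 * 4 = 480000000 bytes ≈ 457.8 MiB

The Quadro K1100M has 2 GB of global memory, which should be enough to handle 100000 rows. From what I found, the Intel HD 4600 is limited only by system memory, so I suspect that is not the problem either.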
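The same arithmetic can be reproduced for any row count with a few lines of Python. This is only a sketch: `buffer_bytes` is a hypothetical helper, and the 4-byte item size assumes float32 buffers as in the question.

```python
def buffer_bytes(rows, cols, itemsize=4):
    """Size in bytes of a dense rows x cols buffer (float32 by default)."""
    return rows * cols * itemsize


MIB = 1024 ** 2
numrows = 100000
y_bytes = buffer_bytes(numrows, 600)      # input buffer y_g
res_bytes = buffer_bytes(numrows, 1200)   # output buffer res_g

for name, nbytes in (("y_g", y_bytes), ("res_g", res_bytes)):
    print(f"{name}: {nbytes} bytes = {nbytes / MIB:.1f} MiB")
```

In pyopencl these figures could be compared with `device.global_mem_size` and `device.max_mem_alloc_size`; the latter limits a single buffer and is often smaller than the total global memory, so it may be worth checking as well.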

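Re 2.

The time measurement is incorrect. You are not measuring the kernel execution time but the time it takes to copy the data back to the host. So it is no surprise that this number is lower for the CPU. To measure the kernel execution time, do the following: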

    event = prg.resample(queue, res.shape, (1,200), M, y_g, res_g)
    event.wait()
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start) * 1e-9)

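I do not know how to measure the whole thing, including copying the data back to the host, with OpenCL profiling events in pyopencl, but plain Python gives similar results: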

    import time

    start = time.time()
    ...  # code to be measured
    end = time.time()
    print(end - start)
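One possible way to cover both the kernel and the copy with profiling events is to keep the event returned by the kernel launch as well as the one returned by `cl.enqueue_copy`, and take the span from the earliest start to the latest end. The sketch below assumes both events come from the same queue created with PROFILING_ENABLE and have already completed. `span_seconds` is a hypothetical helper, and the stand-in objects only mimic the `profile.start` and `profile.end` attributes (in nanoseconds) that pyopencl events expose; their timestamps are made up.

```python
from types import SimpleNamespace


def span_seconds(*events):
    """Seconds from the earliest profile.start to the latest profile.end."""
    start = min(e.profile.start for e in events)
    end = max(e.profile.end for e in events)
    return (end - start) * 1e-9


# Stand-in events with invented nanosecond timestamps, only to show the arithmetic
kernel_evt = SimpleNamespace(profile=SimpleNamespace(start=1_000, end=5_000))
copy_evt = SimpleNamespace(profile=SimpleNamespace(start=5_500, end=9_000))

print(span_seconds(kernel_evt))             # kernel only
print(span_seconds(kernel_evt, copy_evt))   # kernel plus copy back to host
```

With real events, this would be called after waiting on the copy event, so that both sets of timestamps are available.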

Answer 2

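I think I have found the problem:

Intel HD: turning off profiling fixes everything. The code runs without any problems.
K1100M GPU: it still crashes, but I suspect this may be a timeout problem, since I use the same video card for my display.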
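If the K1100M crash really is a display timeout (this is only a suspicion), one common workaround is to split the rows into smaller batches so that each kernel launch stays short. The sketch below only generates the row ranges; `row_chunks` is a hypothetical helper, and the chunk size of 6400 is taken from the row count that the question's update reported as working.

```python
def row_chunks(total_rows, chunk_rows):
    """Yield (start, stop) row ranges that cover total_rows in order."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, total_rows, chunk_rows):
        yield start, min(start + chunk_rows, total_rows)


chunks = list(row_chunks(100000, 6400))
print(len(chunks), chunks[0], chunks[-1])
```

Each range could then be passed to the resampling function as `y[start:stop]` and the partial results stacked on the host. Whether this avoids the crash depends on the driver and on how long a single batch runs.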

