Numba.vectorize for CUDA：return 数组的正确签名是什么？

Question

我有如下结构的函数，

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((len(X),N))
    out[:,0]  = X 
    for i in range(1,N):
        out[:,i] = X**i+out[:,i-1] 
    return out

我现在正尝试在我的 GPU 上运行。到目前为止我尝试的是以非向量化形式编写函数（即分别处理 X 的每个条目），并将 return 数组作为输入传递：

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

但是，我不知道该函数使用什么装饰器。如果我使用

@numba.vectorize([(float64,int64,float64[:])],target = 'cuda') 我得到 TypeError: Buffer dtype cannot be buffer
@numba.guvectorize([(float64,int64,float64[:])],'(),()->(n)',target = 'cuda') 我得到 NameError: undefined output symbols: n

适合我的目的的正确装饰器是什么？

我希望能够在最后以与 foo 大致相同的方式调用 foo_cuda，即传递一维数组 X、一个整数 N 和一个用结果填充的二维数组 out。

更新

我函数的 numpy.vectorize 版本是

def foo_np(x,N):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out = np.zeros(N)
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]
    return out
foo_ve = np.vectorize(foo_np,signature='(),()->(n)')

但是，我无法在 numba 中创建输出数组 (out = np.zeros(N))（cuda.local.array(N,dtype=float64) 也失败了），这使我无法使用 @numba.vectorize('void(float64,int64)',target='cuda')。我试图通过将输出数组传递给函数并将其添加到签名中来解决此问题（请参阅上面的尝试 1），但出现错误。

更新 2

实际作用如下：

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2] 
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

Answer 1

我对此进行了深入研究，所以我将分享我所拥有的。我不确定它是否是一个完整的答案，但它可能会解决你的一些问题。

最好的 numba-based 方法可能是使用 numba CUDA (jit) 编写您自己的 "custom" CUDA 内核。一个例子是 for reduction or here 用于矩阵乘法。要正确地做到这一点，需要学习一些有关 CUDA 编程的知识。然而，这似乎不是您想要进入的方向。

作为替代方案，您的问题侧重于使用 numba 矢量化生成 GPU 代码。 numba vectorize 装饰器用于对标量输入和输出进行操作的函数，向量化将它们应用于矩阵 input/matrix 输出。

对于不适合这个的功能，例如那些对向量或标量进行运算但产生向量的，或者那些对一个或两个向量进行运算并产生向量或标量输出的，numba 提供广义 guvectorize。

从你想要做的最简单的例子开始，我们可以用 guvectorize:

来实现

What I tried so far is to write the function in a non-vectorized form (i.e. treat each entry of X separately), and pass the return array as an input:

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

这个意图，取一个标量（每个函数调用）和return一个向量（每个函数调用）属于guvectorize的能力范围内（有一些limitations/caveats - 见注释在底部）。

这是一个工作示例，源自示例代码 here:

# cat t2.py
from __future__ import print_function

import sys

import numpy as np

from numba import guvectorize, cuda

if sys.version_info[0] == 2:
    range = xrange


#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        tmp1 = 0.
        tmp = inp[0]
        for i in range(out.shape[0]):
            tmp1 += tmp
            out[i] = tmp1
            tmp *= inp[0]

# set up input data
rows = 1280000 # shape[0]
cols = 4   # shape[1]
inp = np.zeros(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,0] = (i%4)+1
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('result'.center(80, '-'))
print(inp)

# nvprof --print-gpu-trace python t2.py
==4773== NVPROF is profiling process 4773, command: python t2.py
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4773== Profiling application: python t2.py
==4773== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
994.08ms  5.5731ms                    -               -         -         -         -  19.531MB  3.4224GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
1.00083s  159.20us          (20000 1 1)        (64 1 1)        22        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  cudapy::__main__::__gufunc_my_func2(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>) [48]
1.00100s  4.8017ms                    -               -         -         -         -  19.531MB  3.9722GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

# nvprof --metrics gst_efficiency python t2.py
==4787== NVPROF is profiling process 4787, command: python t2.py
==4787== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cudapy::__main__::__gufunc_my_func2(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)" (done)
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4787== Profiling application: python t2.py
==4787== Profiling result:
==4787== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: cudapy::__main__::__gufunc_my_func2(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
#

在 guvectorize 的上下文中回答您的具体问题：

What is the correct decorator to use for my purpose?

函数类型规范如下所示：

['<return-type>(<parameter 0 type>, <parameter 1 type>, ...)']

对于 guvectorize，return 类型应始终为 void。参数的类型应该与您打算使用函数的类型相匹配。对于 guvectorize，我们将类型视为我们将传递的实际输入和输出数据类型的 "slice"，"slice" 是单个函数调用将对其进行操作的对象。然后，矢量化对 input/output 数据的每个 "slice" 应用单独的函数调用，以覆盖 input/output 数据集的整个大小。那么，对于我的示例，我建议传递一个输入向量 (float32[:]) 和一个输出向量 (float32[:])。

函数签名显示输入的维度，然后是输出的维度：

(x)...->(x)

每个都可以是多维的（尽管对于矢量化来说应该仍然只表示 input/output 的 "slice"），标量可以用 () 表示。这里出现了一个问题，因为我们希望输出 "slice" 是一个结果向量，具有一定的长度，比如 n。 numba guvectorize 似乎不允许指定不属于已指定输入维度的输出维度 (n)。因此，虽然这个函数实际上只需要一个标量输入，但我选择通过为输入和输出传递一个向量 "slice" 来解决这个问题。事实上，这个函数，按照我编写的方式，可以使用相同的数据进行输入和输出，因此对于这种特殊情况，这个变通办法并没有真正 "overhead"。

关于implementation/performance的一些注意事项：

这 "parallelizes" 跨越第一个数组 (shape[0]) 维度。 所有计算都在 GPU 上执行。执行输出的每个向量切片的计算由 GPU 上的单个线程执行。但对于大型数据集（第一维），这将为 GPU 暴露大量并行工作（线程）。第二个维度的工作，即循环，是在每个线程的上下文中运行的。使用 vectorize 或 guvectorize 几乎肯定不可能在第二个维度中尝试并行化（创建例如前缀和），但使用 numba cuda (jit) 应该是可能的。我们可以通过研究上面工作示例中的（第一个）nvprof --print-gpu-trace 输出来确认并行化维度，并注意到它构成了 20000 个块，每个块有 64 个线程，总共有 1280000 个线程，与我们的第一个数组维度相匹配。
如前所述，此实现有些老套，因为我正在传递矢量输入，即使我只需要并使用标量。这是为了解决 numba 中的一个限制，据我所知，您不能指定像 ()->(n) 这样的维度签名。（注意：经过进一步研究，我认为这里正确的做法是定义一个 input-only ufunc，将输入 vector/matrix 作为单个参数传递的两倍，并且仅使用 (n) 作为维度签名，而不是 (n)->(n)。参见 here）。
从内存访问模式的角度来看，这个实现并不是最优的。从（第二个）nvprof --metrics gst_efficiency 输出中可以明显看出这一点。我认为这样做的原因也可能是 numba 设计（当前）的限制。当 numba 在此示例中通过 guvectorize 为 distribution/parallelization 呈现一个数组时，它将数组切成行，并将每一行传递给函数调用以进行处理。对于这个特定的例子，一个更有效的实现是转置我们的数组存储系统，并将数组切片为列，并让每个函数调用在列上工作。其原因与 GPU 行为和设计有关，我不会在这里回顾。然而，指标显示我们使用这种 row-per-thread 访问模式仅实现了 25% 的效率。使用转置数据和 column-per-thread 方法很容易实现 100% 的效率，但我不知道如何使用 numba guvectorize 实现这一点。我知道解决低效访问的唯一选择是恢复编写 numba CUDA 内核（即 CUDA jit）。（我尝试研究的另一种选择是看看我们是否可以指示 ufunc 在其签名中采用 column-vector，例如 void(float32[[:]],float32[[:]])，希望 numba 将按列切片，但没有运气有了那个。）

使用前面的示例作为模型，我们可以创建类似的东西来解决更新 2 中的 "actual function"，使用 guvectorize。我已经将这里的数组大小减少到 5 (= len(X)) x 4 (= N):

# cat t3.py
from __future__ import print_function

import sys
import numpy as np
import numba

from numba import guvectorize, cuda
import math

if sys.version_info[0] == 2:
    range = xrange

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        for i in range(2,out.shape[0]):
            out[i] = out[1]*out[i-1] - (i-1)*out[i-2]
        c = 1.
        for i in range(2,out.shape[0]):
            c *= i
            out[i] /= math.sqrt(c)

# set up input data
rows = 5   # shape[0]
cols = 4   # shape[1]
inp = np.ones(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,1] = i
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('gpu result'.center(80, '-'))
print(inp)

rrows = rows
rcols = cols
rin = np.zeros(rrows, dtype=np.float32)
for i in range(rin.shape[0]):
    rin[i] = i
rout = foo(rin, rcols)
print('cpu result'.center(80, '-'))
print(rout)


# python t3.py
-----------------------------------gpu result-----------------------------------
[[ 1.          0.         -0.70710677 -0.        ]
 [ 1.          1.          0.         -0.8164966 ]
 [ 1.          2.          2.1213202   0.8164966 ]
 [ 1.          3.          5.656854    7.3484693 ]
 [ 1.          4.         10.606602   21.22891   ]]
-----------------------------------cpu result-----------------------------------
[[ 1.          0.         -0.70710678 -0.        ]
 [ 1.          1.          0.         -0.81649658]
 [ 1.          2.          2.12132034  0.81649658]
 [ 1.          3.          5.65685425  7.34846923]
 [ 1.          4.         10.60660172 21.2289111 ]]
#

对 "slices" 的概念进行了额外的讨论，因为它们与 vectorize 和 guvectorize 相关，并介绍了一些示例。

Numba.vectorize for CUDA：return 数组的正确签名是什么？

Numba.vectorize for CUDA: What is the correct signature to return arrays?

python

cuda

numba