Numba-compatible implementation of np.tile?

Question

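I'm writing some image dehazing code based on this paper, and I started from an abandoned Py2.7 implementation. Since then, particularly with Numba, I've made some real performance improvements (which matters, because I have to run this on 8K images).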

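I'm fairly sure my last significant performance bottleneck is the box filter step (I've already shaved almost a minute off each image, but this last slow step takes about 30 s per image), and I'm close to getting it to run as nopython in Numba: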

import numpy as np
from numba import jit, njit, prange

@njit # Row dependencies means can't be parallel
def yCumSum(a):
    """
    Numba based computation of y-direction
    cumulative sum. Can't be parallel!
    """
    out = np.empty_like(a)
    out[0, :] = a[0, :]
    # Plain range: each row depends on the previous one
    for i in range(1, a.shape[0]):
        out[i, :] = a[i, :] + out[i - 1, :]
    return out

@njit(parallel=True)
def xCumSum(a):
    """
    Numba-based parallel computation
    of X-direction cumulative sum
    """
    out = np.empty_like(a)
    for i in prange(a.shape[0]):
        out[i, :] = np.cumsum(a[i, :])
    return out

@jit
def _boxFilter(m, r, gpu=hasGPU):
    if gpu:
        m = cp.asnumpy(m)
    out = __boxfilter__(m, r)
    if gpu:
        return cp.asarray(out)
    return out

@jit(fastmath=True)
def __boxfilter__(m, r):
    """
    Fast box filtering implementation, O(1) time per pixel (independent of r).
    Parameters
    ----------
    m:  a 2-D matrix data normalized to [0.0, 1.0]
    r:  radius of the window considered
    Return
    -----------
    The filtered matrix m'.
    """
    #H: height, W: width
    H, W = m.shape
    #the output matrix m'
    mp = np.empty(m.shape)

    #cumulative sum over y axis
    ySum = yCumSum(m) #np.cumsum(m, axis=0)
    #copy the accumulated values of the windows in y
    mp[0:r+1,: ] = ySum[r:(2*r)+1,: ]
    #differences in y axis
    mp[r+1:H-r,: ] = ySum[(2*r)+1:,: ] - ySum[ :H-(2*r)-1,: ]
    mp[(-r):,: ] = np.tile(ySum[-1,: ], (r, 1)) - ySum[H-(2*r)-1:H-r-1,: ]

    #cumulative sum over x axis
    xSum = xCumSum(mp) #np.cumsum(mp, axis=1)
    #copy the accumulated values of the windows in x
    mp[:, 0:r+1] = xSum[:, r:(2*r)+1]
    #difference over x axis
    mp[:, r+1:W-r] = xSum[:, (2*r)+1: ] - xSum[:, :W-(2*r)-1]
    mp[:, -r: ] = np.tile(xSum[:, -1][:, None], (1, r)) - xSum[:, W-(2*r)-1:W-r-1]
    return mp

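There is a lot going on at the edges, but if I could get the tile operation to work as a nopython call, I could make the whole boxfilter step nopython and get a big performance boost. I'd rather not do anything too specific, since I'd like to reuse this code elsewhere, but I wouldn't particularly object to limiting it to the 2D case. For whatever reason, I've just been staring at this and I'm not sure where to start.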

Answer 1

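np.tile is a bit too complicated to reimplement in full, but unless I'm misreading, it looks like you only need to take a vector and repeat it r times along a different axis.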

A Numba-compatible way to do this is to write

y = x.repeat(r).reshape((-1, r))

x is then repeated r times along the second dimension, so that y[i, j] == x[i].

Example:

In [2]: x = np.arange(5)

In [3]: x.repeat(3).reshape((-1, 3))
Out[3]:
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
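
To check that the same pattern also works inside a compiled function, here is a minimal sketch. The helper name `repeat_cols` is hypothetical, the explicit `(x.size, r)` shape is used in place of `-1` only to keep the reshape unambiguous, and the `try`/`except` fallback is an assumption added so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a no-op decorator when Numba is unavailable.
    def njit(func):
        return func


@njit
def repeat_cols(x, r):
    """Hypothetical helper: same result as np.tile(x[:, None], (1, r)) for 1-D x."""
    # repeat() duplicates each element r times in a flat array;
    # reshape() then places the r copies of x[i] in row i.
    return x.repeat(r).reshape((x.size, r))


x = np.arange(5)
y = repeat_cols(x, 3)
```

Comparing `y` against `np.tile(x[:, None], (1, 3))` with `np.array_equal` is a quick way to confirm the two agree.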

If you want x repeated along the first dimension instead, just take the transpose, y.T.
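
As a sketch of how the transposed form could stand in for `np.tile(ySum[-1, :], (r, 1))` in the question's y pass: the helper names `tile_rows` and `bottom_edge` are hypothetical, the slicing is copied from the question's `__boxfilter__`, and the `try`/`except` fallback is an assumption so the snippet runs without Numba too.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a no-op decorator when Numba is unavailable.
    def njit(func):
        return func


@njit
def tile_rows(x, r):
    """Hypothetical helper: same result as np.tile(x, (r, 1)) for 1-D x."""
    # Build the (n, r) column-repeated array, then transpose to (r, n).
    return x.repeat(r).reshape((x.size, r)).T


@njit
def bottom_edge(ySum, r):
    """Bottom r rows of the y pass, mirroring the question's expression."""
    H = ySum.shape[0]
    return tile_rows(ySum[-1, :], r) - ySum[H - 2 * r - 1:H - r - 1, :]


# Small made-up input, only to exercise the helpers.
ySum = np.cumsum(np.arange(30.0).reshape(6, 5), axis=0)
r = 2
edge = bottom_edge(ySum, r)
```

Because the tiled array is only used in a subtraction, broadcasting `ySum[-1, :] - ySum[H-(2*r)-1:H-r-1, :]` should give the same values without building the repeated array at all; that is a different technique from the one in this answer, so treat it as an option to benchmark rather than a recommendation.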

