Python - Quick Upscaling of Array with Numpy, No Image Library Allowed

Note regarding duplicates:

Similar topics exist, but they are not quite the same - especially since a loop still ends up being the fastest method there. Thanks.

Goal:

Quickly upscale an array from [small,small] to [big,big], without using an image library. It's very simple scaling: one small-array value becomes several big-array values, normalized by the number of big-array values it turns into. In other words, this is "flux conserving" in astronomy terms - a value of 16 from the small array spread across 4 values of the big array (factor of 2) would become four 4s, so the total amount of the value has been conserved.

Problem:

I have some working code that does the upscaling, but it doesn't run very fast compared to downscaling. Upscaling is actually easier than downscaling (which requires lots of sums in this basic case) - upscaling just puts the already-known data into big chunks of a preallocated array.

For a working example, a [2,2] array of [16,24; 8,16]:

16 , 24
8 , 16

would be upscaled by a factor of 2 to a [4,4] array with the values:

4 , 4 , 6 , 6
4 , 4 , 6 , 6
2 , 2 , 4 , 4
2 , 2 , 4 , 4

The fastest implementation I have is a for loop accelerated by numba's jit & prange. I would like to lean more on Numpy's precompiled functions to get this done. I'll also entertain Scipy stuff - but not its resize functions.

It seems like the perfect problem for strong matrix-manipulation functions, but I just haven't managed to make them do it quickly.

Also, the single-line numpy call is way funky, so don't be surprised. But that's what it took to get it to align correctly.

Code example:

Check further below for more optimized calls. Note that the case here builds a 20480x20480 float64 array, which can take up quite a bit of memory - but if a method is too memory intensive at this size it can still show off (since the matrices can be made smaller).

Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. With the example shown, the loop code takes ~18.9 seconds to run and the numpy code takes ~52.5 seconds.

% Main: run these

import timeit

timeitSetup = ''' 
from Regridder1 import Regridder1
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));

timeitSetup = ''' 
from Regridder2 import Regridder2
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));

% Fun: Regridder 1 - the for loop

import numpy as np
from numba import prange, jit

@jit(nogil=True)
def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.zeros(outSize); #preallocate
    outBlocks = inArray/outBlockSize; #precalc the resized blocks to go faster
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j]; #puts normalized value in a bunch of places

    return outArray;

% Fun: Regridder 2 - numpy

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = inArray.repeat(factor).reshape(inSize[0],factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor,inSize[1]*factor).T/outBlockSize;

    return outArray;
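For readability, the funky one-liner can be unpacked into equivalent steps (a sketch with throwaway intermediate names, doing the same repeats, reshapes, and transposes as above):

import numpy as np

def Regridder2_steps(inArray, factor):
    inSize = np.shape(inArray)
    outBlockSize = factor*factor

    step1 = inArray.repeat(factor)                        # repeat each element of the flattened array
    step2 = step1.reshape(inSize[0], factor*inSize[1])    # each value now repeated `factor` times along the rows
    step3 = step2.T                                       # transpose so the second repeat expands the other axis
    step4 = step3.repeat(factor)
    step5 = step4.reshape(inSize[0]*factor, inSize[1]*factor)
    return step5.T / outBlockSize                         # transpose back and normalize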

Insight into speeding this up would be much appreciated. Hopefully the code is decent; it was worked out in the text box.

Current best solution:

On my machine, the numba jit for-loop implementation (Regridder1), with jit applied only where it is needed, runs the timeit test in 18.0 seconds, while the numpy-only implementation (Regridder2) runs the timeit test in 18.5 seconds. The bonus is that, on the first call, the numpy-only implementation doesn't have to wait for jit to compile the code. Jit's cache=True keeps it from recompiling on subsequent runs, and the other options (nogil, nopython, prange) don't seem to help, but they don't seem to hurt either. Maybe in future numba updates they'll do better or something.

For simplicity and portability, Regridder2 is the best choice. It's nearly as fast, and it doesn't require numba to be installed (which my Anaconda install made me go and do) - so it helps portability.

% Fun: Regridder 1 - the for loop

import numpy as np

def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.empty(outSize) #preallocate
    outBlocks = inArray/outBlockSize #precalc the resized blocks to go faster
    factor = np.int64(factor) #convert to an integer to be safe (in case it's a 1.0 float)

    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) #call a function that has just the loop

    return outArray;
#END def Regridder1

from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) #nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks ):
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j];
        #END for j
    #END for i
    #scales the original data up, note for other languages you need i*factor+factor-1 because slicing
    return outArray; #return success
#END def RegridderUpscale

% Fun: Regridder 2 - numpy, based on @ZisIsNotZis's answer

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    #outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]; #whoops

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = np.broadcast_to( inArray[:,None,:,None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])); #single line call that gets the job done

    return outArray;
#END def Regridder2
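As a quick sanity check, this reproduces the [4,4] worked example from the question (assuming the Regridder2 above is in scope):

import numpy as np

a = np.array([[16., 24.], [8., 16.]])
print(Regridder2(a, 2))
# [[4. 4. 6. 6.]
#  [4. 4. 6. 6.]
#  [2. 2. 4. 4.]
#  [2. 2. 4. 4.]]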

I did some benchmarks of this with a 512x512 byte image (upscaling by 10x):

a = np.empty((512, 512), 'B')

Repeat twice

>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Repeat once + reshape

>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Both methods above copy the data twice, while both methods below copy it only once.

Fancy indexing

Since t can be reused (and precomputed), its creation is not timed.

>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

View + reshape

>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
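A quick way to see why the last approach only copies once: broadcast_to returns a read-only view of the original data, and it is the final reshape that materializes the single copy. A small check (not part of the timings above) using np.shares_memory:

import numpy as np

a = np.empty((512, 512), 'B')

# broadcast_to produces a zero-copy, read-only view of `a`
v = np.broadcast_to(a[:, None, :, None], (512, 10, 512, 10))
print(np.shares_memory(v, a))   # True  -> still viewing the original buffer

# the broadcast view cannot be expressed with plain strides after reshaping, so reshape copies once
b = v.reshape(5120, 5120)
print(np.shares_memory(b, a))   # False -> this is where the one copy happens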

It seems that view + reshape wins (at least on my machine). The test results on a 2048x2048 byte image are below, where view + reshape still wins:

2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while the results for a 2048x2048 float64 image are:

3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which, even though the item size is 8 times as large, didn't take much more time.

Some new functions which show that the order of the operations matters:

import numpy as np
from numba import jit

A=np.random.rand(2048,2048)

@jit
def reg1(A,factor):
    factor2=factor**2
    a,b = [factor*s for s in A.shape]
    B=np.empty((a,b),A.dtype)
    Bf=B.ravel()
    k=0
    for i in range(A.shape[0]):
        Ai=A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x=Ai[j]/factor2
                for _ in range(factor):
                    Bf[k]=x
                    k += 1
    return B   

def reg2(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,0),factor,1)

def reg3(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,1),factor,0)

def reg4(A,factor):
    shx,shy=A.shape
    stx,sty=A.strides
    B=np.broadcast_to((A/factor**2).reshape(shx,1,shy,1),
                      shape=(shx,factor,shy,factor))
    return B.reshape(shx*factor,shy*factor)

And the runs:

In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
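As a sanity check (assuming reg2, reg3, reg4, and Regridder2 from above are in scope), the variants agree exactly; they only differ in how, and how many times, the data gets copied:

import numpy as np

A = np.random.rand(256, 256)   # smaller array, just for the equality check

print(np.array_equal(reg2(A, 5), reg3(A, 5)))        # True
print(np.array_equal(reg2(A, 5), reg4(A, 5)))        # True
print(np.array_equal(reg2(A, 5), Regridder2(A, 5)))  # True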
"""