Python - Quick Upscaling of Array with Numpy, No Image Library Allowed
A note on duplicate flags:
There are similar topics, but none are exactly the same - particularly since the loop is still the fastest method here. Thank you.
The Goal:
Quickly upscale an array from [small, small] to [big, big], without using an image library. It's very simple scaling: one small value becomes several big values, after it has been normalized for the several big values it becomes. In other words, it is "flux conserving" in astronomy terms - a value of 16 from the small array, spread over 4 values of the big array (a factor of 2), becomes four 4's, so the total amount of the value is conserved.
The Problem:
I have working code that does the upscaling, but it doesn't run very fast compared to downscaling. Upscaling is actually easier than downscaling (which requires many sums in this basic case) - upscaling just places already-known data into big chunks of a preallocated array.
For a working example, a [2,2] array of [16,24;8,16]:
16 , 24
8 , 16
upscaled by a factor of 2 becomes the [4,4] array:
4 , 4 , 6 , 6
4 , 4 , 6 , 6
2 , 2 , 4 , 4
2 , 2 , 4 , 4
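As a sanity check, here is a minimal sketch (mine, not part of the code below) of that flux-conserving upscale on the [2,2] example, written with a plain np.repeat rather than the faster approaches discussed later:
import numpy as np
small = np.array([[16., 24.],
                  [ 8., 16.]])
factor = 2
# spread each value over a factor x factor block, dividing by factor**2 so the total "flux" is conserved
big = np.repeat(np.repeat(small / factor**2, factor, axis=0), factor, axis=1)
print(big)                     # the [4,4] array shown above
print(small.sum(), big.sum())  # both 64.0 - the amount of the value is conserved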
The fastest implementation so far is a for loop accelerated with numba's jit & prange. I would like to lean on Numpy's precompiled functions to get this done instead. I'd also entertain Scipy options - but not its resize functions.
This seems like a perfect problem for powerful matrix-manipulation functions, but I haven't managed to make any of them fast.
Also, the one-line numpy call is way funky, so don't be surprised by it - that's just what it took to get the alignment right.
Code Example:
Check further down for the more optimized calls. Note that the case here creates a 20480x20480 float64 array, which can take up a fair bit of memory - but it also shows off when a method is too memory intensive (as matrix approaches can be).
Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. In the example shown, the loop code takes about 18.9 seconds to run, while the numpy code takes about 52.5 seconds.
% MAIN: run these
import timeit
timeitSetup = '''
from Regridder1 import Regridder1
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));
timeitSetup = '''
from Regridder2 import Regridder2
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));
% FUN: Regridder 1 - for loops
import numpy as np
from numba import prange, jit
@jit(nogil=True)
def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.zeros(outSize); #preallocate
    outBlocks = inArray/outBlockSize; #precalc the resized blocks to go faster
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j]; #puts normalized value in a bunch of places
    return outArray;
% FUN: Regridder 2 - numpy
import numpy as np
def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = inArray.repeat(factor).reshape(inSize[0],factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor,inSize[1]*factor).T/outBlockSize;
    return outArray;
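A quick sanity check (my own sketch, assuming Regridder1 and Regridder2 are importable as in the timeit setup above) shows the loop version and the funky one-liner agree, and that the total is conserved:
import numpy as np
from Regridder1 import Regridder1
from Regridder2 import Regridder2
test = np.arange(16, dtype=np.float64).reshape(4, 4)
assert np.allclose(Regridder1(test, 2), Regridder2(test, 2))
print(test.sum(), Regridder2(test, 2).sum())  # both 120.0 - flux conserved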
Any insight into speeding this process up is greatly appreciated. Hopefully the code is decent - it was put together in the text box.
Current best solution:
On my computer, the numba jit for-loop implementation (Regridder1), with jit applied only where it's needed, runs the timeit test in 18.0 seconds, while the numpy-only implementation (Regridder2) runs the timeit test in 18.5 seconds. The bonus is that on the first call, the numpy-only implementation doesn't need to wait for jit to compile the code. Jit's cache=True keeps it from recompiling on subsequent runs, and the other options (nogil, nopython, prange) didn't seem to help, but didn't seem to hurt either. Maybe future numba updates will do better.
For simplicity and portability, Regridder2 is the best choice. It's nearly as fast, and it doesn't require installing numba (which my Anaconda install made me go do) - so it helps portability.
% FUN: Regridder 1 - for loops
import numpy as np
def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
    outBlockSize = factor*factor #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.empty(outSize) #preallocate
    outBlocks = inArray/outBlockSize #precalc the resized blocks to go faster
    factor = np.int64(factor) #convert to an integer to be safe (in case it's a 1.0 float)
    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) #call a function that has just the loop
    return outArray;
#END def Regridder1

from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) #nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks ):
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j];
        #END for j
    #END for i
    #scales the original data up; note for other languages you need i*factor+factor-1 because of slicing
    return outArray; #return success
#END def RegridderUpscale
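One note on the prange calls above: numba only actually parallelizes prange when parallel=True is passed to @jit (it appears in the commented-out options but is not enabled here). A hedged sketch of that variant, which I have not benchmarked, would look like:
from numba import jit, prange
@jit(nopython=True, parallel=True) #parallel=True is what activates prange
def RegridderUpscaleParallel(inSize, factor, outArray, outBlocks):
    for i in prange(inSize[0]): #iterations over i may be split across threads
        for j in range(inSize[1]):
            outArray[i*factor:(i*factor+factor), j*factor:(j*factor+factor)] = outBlocks[i,j]
    return outArray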
% FUN: Regridder 2 - numpy, based on @ZisIsNotZis's answer
import numpy as np
def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    #outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]; #whoops
    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.broadcast_to( inArray[:,None,:,None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])); #single line call that gets the job done
    return outArray;
#END def Regridder2
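For anyone wondering why this is fast: np.broadcast_to produces a zero-copy (read-only) view, and only the final reshape allocates the full-size output. A minimal sketch (mine) illustrating that on the [2,2] example:
import numpy as np
a = np.array([[16., 24.],
              [ 8., 16.]])
view = np.broadcast_to(a[:,None,:,None]/4, (2, 2, 2, 2))
print(view.base is not None)  # True - no full-size array has been allocated yet
out = view.reshape(4, 4)      # the reshape is what materializes the big copy
print(out)                    # the [4,4] example from the question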
I did some benchmarks of this using a 512x512 byte image (10x upscale):
a = np.empty((512, 512), 'B')
Repeat twice
>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Repeat once + reshape
>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Both of the methods above copy the data twice, while both of the methods below copy it only once.
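To make "copies twice" concrete, here's a quick sketch (mine, not part of the benchmark) of the intermediate array the double-repeat approach allocates:
import numpy as np
a = np.empty((512, 512), 'B')
tmp = a.repeat(10, 0)        # first copy: a (5120, 512) intermediate is allocated
out = tmp.repeat(10, 1)      # second copy: the final (5120, 5120) array
print(tmp.shape, out.shape)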
Fancy indexing
Since t can be reused (and pre-computed), it is not timed:
>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
View + reshape
>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It seems that view + reshape wins (at least on my machine). The test results for a 2048x2048 byte image are below, where view + reshape still wins:
2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while the results for a 2048x2048 float64 image are:
3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
which didn't take much more time even though the item size is 8 times larger.
Some new functions which show that the order of operations matters:
import numpy as np
from numba import jit
A=np.random.rand(2048,2048)
@jit
def reg1(A,factor):
    factor2=factor**2
    a,b = [factor*s for s in A.shape]
    B=np.empty((a,b),A.dtype)
    Bf=B.ravel()
    k=0
    for i in range(A.shape[0]):
        Ai=A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x=Ai[j]/factor2
                for _ in range(factor):
                    Bf[k]=x
                    k += 1
    return B

def reg2(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,0),factor,1)

def reg3(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,1),factor,0)

def reg4(A,factor):
    shx,shy=A.shape
    stx,sty=A.strides
    B=np.broadcast_to((A/factor**2).reshape(shx,1,shy,1),
        shape=(shx,factor,shy,factor))
    return B.reshape(shx*factor,shy*factor)
And running them:
In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
关于重复消息的注意事项:
相似的主题,不完全相同。特别是因为循环仍然是最快的方法。谢谢。
目标:
将数组从 [small,small] 快速放大到 [big,big],不要使用图像库。很简单的缩放,一个小的值会变成几个大的值,归一化成几个大的值后就变成了。换句话说,这是天文术语中的 "flux conserving" - 小数组的 16 值扩展到大数组的 4 个值(2 的因数)将是 4 个 4,因此值的数量已被保留。
问题:
我有一些工作代码来进行放大,但与缩小相比它们的工作速度不是很快。放大实际上比缩小更容易(在这种基本情况下需要很多总和) - 放大只需要将已知数据放入预分配数组的大块中。
对于一个工作示例,[16,24;8,16] 的 [2,2] 数组:
16 , 24
8 , 16
[4,4] 数组乘以 2 后的值为:
4 , 4 , 6 , 6
4 , 4 , 6 , 6
2 , 2 , 4 , 4
2 , 2 , 4 , 4
最快的实现是由 numba 的 jit & prange 加速的 for 循环。我想更好地利用 Numpy 的预编译函数来完成这项工作。我也会娱乐 Scipy 东西 - 但不是它的调整大小功能。
对于强大的矩阵操作函数来说,这似乎是一个完美的问题,但我还没有设法让它快速实现。
此外,单行 numpy 调用 way 很时髦,所以不要感到惊讶。但这就是让它正确对齐所需要的。
代码示例:
检查下面更优化的调用请注意,我这里的案例制作了一个 20480x20480 float64 数组,它可以占用相当多的内存 - 但如果一个方法可以炫耀内存太密集(因为矩阵可以)。
环境:Python3,Windows,i5-4960K @ 4.5 GHz。在显示的示例中,运行 循环代码的时间约为 18.9 秒,运行 numpy 代码的时间约为 52.5 秒。
% 主要:运行 这些
import timeit
timeitSetup = '''
from Regridder1 import Regridder1
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));
timeitSetup = '''
from Regridder2 import Regridder2
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));
% 有趣:Regridder 1 - for 循环
import numpy as np
from numba import prange, jit
@jit(nogil=True)
def Regridder1(inArray,factor):
inSize = np.shape(inArray);
outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
outArray = np.zeros(outSize); #preallcoate
outBlocks = inArray/outBlockSize; #precalc the resized blocks to go faster
for i in prange(0,inSize[0]):
for j in prange(0,inSize[1]):
outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j]; #puts normalized value in a bunch of places
return outArray;
% 乐趣:Regridder 2 - numpy
import numpy as np
def Regridder2(inArray,factor):
inSize = np.shape(inArray);
outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
outArray = inArray.repeat(factor).reshape(inSize[0],factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor,inSize[1]*factor).T/outBlockSize;
return outArray;
非常感谢对加速这一过程的洞察力。希望代码不错,在文本框中制定。
当前最佳方案:
在我的 comp 上,numba 的 jit for 循环实现 (Regridder1) 仅将 jit 应用于需要的地方 运行 timeit 测试时间为 18.0 秒,而 numpy 仅实现 (Regridder2) 运行s timeit 测试在 18.5 秒。好处是在第一次调用时,numpy 的唯一实现不需要等待 jit 编译代码。 Jit 的 cache=True 让它不会在后续的 运行 上编译。其他电话(nogil、nopython、prange)似乎没有帮助,但似乎也没有伤害。也许在未来的 numba 更新中他们会做得更好或者什么的。
为了简单性和便携性,Regridder2 是最佳选择。它几乎一样快,而且不需要安装 numba(我的 Anaconda 安装需要我去安装它)——所以它有助于便携性。
% 有趣:Regridder 1 - for 循环
import numpy as np
def Regridder1(inArray,factor):
inSize = np.shape(inArray);
outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];
outBlockSize = factor*factor #the block size where 1 inArray pixel is spread across # outArray pixels
outArray = np.empty(outSize) #preallcoate
outBlocks = inArray/outBlockSize #precalc the resized blocks to go faster
factor = np.int64(factor) #convert to an integer to be safe (in case it's a 1.0 float)
outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) #call a function that has just the loop
return outArray;
#END def Regridder1
from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) #nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks ):
for i in prange(0,inSize[0]):
for j in prange(0,inSize[1]):
outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j];
#END for j
#END for i
#scales the original data up, note for other languages you need i*factor+factor-1 because slicing
return outArray; #return success
#END def RegridderUpscale
% 有趣:Regridder 2 - numpy 基于@ZisIsNotZis 的回答
import numpy as np
def Regridder2(inArray,factor):
inSize = np.shape(inArray);
#outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]; #whoops
outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
outArray = np.broadcast_to( inArray[:,None,:,None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])); #single line call that gets the job done
return outArray;
#END def Regridder2
我使用 512x512
字节图像(10 倍放大)对此做了一些基准测试:
a = np.empty((512, 512), 'B')
重复两次
>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
重复一次 + 重塑
>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
上面两种方法都是复制两次,下面两种方法都是复制一次。
花式索引
由于t
可以重复使用(并预先计算),所以不计时
>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
查看 + 整形
>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
似乎 查看 + 重塑获胜(至少在我的机器上)。 2048x2048
字节图像的测试结果如下,其中 view + reshape 仍然获胜
2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
而 2048x2048
float64
图片的结果是
3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
虽然项目大小是原来的 8 倍,但并没有花费更多时间
一些表明操作顺序很重要的新函数:
import numpy as np
from numba import jit
A=np.random.rand(2048,2048)
@jit
def reg1(A,factor):
factor2=factor**2
a,b = [factor*s for s in A.shape]
B=np.empty((a,b),A.dtype)
Bf=B.ravel()
k=0
for i in range(A.shape[0]):
Ai=A[i]
for _ in range(factor):
for j in range(A.shape[1]):
x=Ai[j]/factor2
for _ in range(factor):
Bf[k]=x
k += 1
return B
def reg2(A,factor):
return np.repeat(np.repeat(A/factor**2,factor,0),factor,1)
def reg3(A,factor):
return np.repeat(np.repeat(A/factor**2,factor,1),factor,0)
def reg4(A,factor):
shx,shy=A.shape
stx,sty=A.strides
B=np.broadcast_to((A/factor**2).reshape(shx,1,shy,1),
shape=(shx,factor,shy,factor))
return B.reshape(shx*factor,shy*factor)
并运行:
In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""