如何向量化循环遍历 3D 点数组的 python 函数?
How to vectorize a python function that loops over an array of 3D points?
代码如下:
import numpy as np
from numpy.random import random
@profile
def point_func(point, points, funct):
return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
@profile
def point_afunc(ipoints, epoints, funct):
res = np.zeros(len(ipoints))
for idx, point in enumerate(ipoints):
res[idx] = point_func(point, epoints, funct)
return res
@profile
def main():
points = random((5000,3))
rpoint = random((1,3))
pres = point_func(rpoint, points, lambda r : r**3)
ares = point_afunc(points, points, lambda r : r**3)
if __name__=="__main__":
main()
我用 kernprof
分析了它并得到了这个:
Timer unit: 1e-06 s
Total time: 2.25667 s File: point-array-vectorization.py Function: point_func at line 4
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 @profile
5 def point_func(point, points, funct):
6 5001 2256667.0 451.2 100.0 return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
Total time: 2.27844 s File: point-array-vectorization.py Function: point_afunc at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def point_afunc(ipoints, epoints, funct):
10 1 5.0 5.0 0.0 res = np.zeros(len(ipoints))
11 5001 4650.0 0.9 0.2 for idx, point in enumerate(ipoints):
12 5000 2273789.0 454.8 99.8 res[idx] = point_func(point, epoints, funct)
13 1 0.0 0.0 0.0 return res
Total time: 2.28239 s File: point-array-vectorization.py Function: main at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 @profile
16 def main():
17 1 145.0 145.0 0.0 points = random((5000,3))
18 1 2.0 2.0 0.0 rpoint = random((1,3))
19
20 1 507.0 507.0 0.0 pres = point_func(rpoint, points, lambda r : r**3)
21
22 1 2281731.0 2281731.0 100.0 ares = point_afunc(points, points, lambda r : r**3)
所以这部分花费了大部分时间:
11 5001 4650.0 0.9 0.2 for idx, point in enumerate(ipoints):
12 5000 2273789.0 454.8 99.8 res[idx] = point_func(point, epoints, funct)
我想看看时间损失是不是因为在for
循环中调用funct
造成的。为此,我想使用 numpy.vectorize
向量化 point_afunc
。我试过了,但它似乎把点矢量化了:循环最终循环遍历各个点组件。
@profile
def point_afunc(ipoints, epoints, funct):
res = np.zeros(len(ipoints))
for idx, point in enumerate(ipoints):
res[idx] = point_func(point, epoints, funct)
return res
point_afunc = np.vectorize(point_afunc)
导致错误:
File "point-array-vectorization.py", line 24, in main
ares = point_afunc(points, points, lambda r : r**3)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2755, in __call__
return self._vectorize_call(func=func, args=vargs)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2825, in _vectorize_call
ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2785, in _get_ufunc_and_otypes
outputs = func(*inputs)
File "/usr/lib/python3.6/site-packages/line_profiler.py", line 115, in wrapper
result = func(*args, **kwds)
File "point-array-vectorization.py", line 10, in point_afunc
res = np.zeros(len(ipoints))
TypeError: object of type 'numpy.float64' has no len()
不知何故,不是对 ipoints
中的每个 点 进行矢量化,而是对点的分量进行矢量化?
编辑:尝试了下面@John Zwinck 的建议并使用了 numba。我使用 @jit
的执行时间比没有它时更长。如果我从所有函数中删除 @profile
装饰器,并用 @jit
替换 point_func
和 point_afunc
,这些是执行时间:
time ./point_array_vectorization.py
real 0m3.686s
user 0m3.584s
sys 0m0.077s
point-array-vectorization> time ./point_array_vectorization.py
real 0m3.683s
user 0m3.596s
sys 0m0.063s
point-array-vectorization> time ./point_array_vectorization.py
real 0m3.751s
user 0m3.658s
sys 0m0.070s
并移除所有 @jit
装饰器:
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.925s
user 0m2.874s
sys 0m0.030s
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.950s
user 0m2.902s
sys 0m0.029s
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.951s
user 0m2.886s
sys 0m0.042s
我需要更多地帮助 numba
编译器吗?
编辑: point_afunc
是否可以使用 numpy
以某种方式在没有 for 循环的情况下编写?
编辑:循环版与Peter的numpy
广播版比较,循环版更快:
Timer unit: 1e-06 s
Total time: 2.13361 s
File: point_array_vectorization.py
Function: point_func at line 7
Line # Hits Time Per Hit % Time Line Contents
==============================================================
7 @profile
8 def point_func(point, points, funct):
9 5001 2133615.0 426.6 100.0 return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
Total time: 2.1528 s
File: point_array_vectorization.py
Function: point_afunc at line 11
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 @profile
12 def point_afunc(ipoints, epoints, funct):
13 1 5.0 5.0 0.0 res = np.zeros(len(ipoints))
14 5001 4176.0 0.8 0.2 for idx, point in enumerate(ipoints):
15 5000 2148617.0 429.7 99.8 res[idx] = point_func(point, epoints, funct)
16 1 0.0 0.0 0.0 return res
Total time: 2.75093 s
File: point_array_vectorization.py
Function: new_point_afunc at line 18
Line # Hits Time Per Hit % Time Line Contents
==============================================================
18 @profile
19 def new_point_afunc(ipoints, epoints, funct):
20 1 2750926.0 2750926.0 100.0 return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)
Total time: 4.90756 s
File: point_array_vectorization.py
Function: main at line 22
Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def main():
24 1 170.0 170.0 0.0 points = random((5000,3))
25 1 4.0 4.0 0.0 rpoint = random((1,3))
26 1 546.0 546.0 0.0 pres = point_func(rpoint, points, lambda r : r**3)
27 1 2155829.0 2155829.0 43.9 ares = point_afunc(points, points, lambda r : r**3)
28 1 2750945.0 2750945.0 56.1 vares = new_point_afunc(points, points, lambda r : r**3)
29 1 71.0 71.0 0.0 assert(np.max(np.abs(ares-vares)) < 1e-15)
numpy.vectorize()
在性能方面没有任何用处:它只是构建隐藏的 Python for
循环的语法糖(或者更确切地说,语法氰化物)。帮不了你。
可能对您有很大帮助的一件事是 Numba。它可以即时编译您的原始代码,并且可能会大大加快速度。只需将 @profile
装饰器替换为 @numba.jit
.
您可以为此使用广播。广播是点矩阵的重塑,使维度 "broadcast" 相互对抗。例如ipoints[:, None, :] - epoints[None, :, :]
说:
- 将
ipoints
从 MxD 重塑为 Mx1xD
- 将
epoints
从 NxD 重塑为 1xNxD
- 减去每对点得到一个 MxNxD 数组
完整代码变为:
def new_point_afunc(ipoints, epoints, funct):
return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)
警告 - 在您的示例中,点的维数仅为 3,但对于更高的维数,这在内存方面可能不实用,因为这种 ipoints[:, None, :] - epoints[None, :, :]
方法创建了一个形状为 [=15= 的中间矩阵].
使用 Numba 的示例
此方法的性能取决于某人想要调用创建的函数的频率以及输入数据的大小。编译开销约为 1.67 秒,对于相对较小的数据或仅调用一次函数来说,这种方法并不适合。
我也使用了你的代码,稍作修改。使用 Numba 编写普通循环而不是像 np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
这样的多个矢量化命令会更快,无论是在运行时还是编译时间。此外,这个问题可以很容易地并行化,但这会额外增加编译量。
例子
import numpy as np
from numpy.random import random
import numba as nb
import time
def make_point_func(funct):
@nb.njit(fastmath=True)
def point_func(point, points):
return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
return point_func
def make_point_afunc(funct,point_func):
@nb.njit(fastmath=True)
def point_afunc(ipoints, epoints):
res = np.zeros(len(ipoints))
for idx in range(len(ipoints)):
res[idx] = point_func(ipoints[idx], epoints)
return res
return point_afunc
def main():
points = random((5000,3))
rpoint = random((1,3))
#Make functions
point_func=make_point_func(nb.njit(lambda r : r**3))
point_afunc=make_point_afunc(nb.njit(lambda r : r**3),point_func)
#first call
print("First call with compilation overhead")
t1=time.time()
pres = point_func(rpoint, points)
ares = point_afunc(points, points)
print(time.time()-t1)
print("Second call without compilation overhead")
t1=time.time()
#second call
pres = point_func(rpoint, points)
ares = point_afunc(points, points)
print(time.time()-t1)
if __name__=="__main__":
main()
性能
original: 1.7s
Numba first call: 1.87s
Numba further calls: 0.2s
代码如下:
import numpy as np
from numpy.random import random
@profile
def point_func(point, points, funct):
return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
@profile
def point_afunc(ipoints, epoints, funct):
res = np.zeros(len(ipoints))
for idx, point in enumerate(ipoints):
res[idx] = point_func(point, epoints, funct)
return res
@profile
def main():
points = random((5000,3))
rpoint = random((1,3))
pres = point_func(rpoint, points, lambda r : r**3)
ares = point_afunc(points, points, lambda r : r**3)
if __name__=="__main__":
main()
我用 kernprof
分析了它并得到了这个:
Timer unit: 1e-06 s
Total time: 2.25667 s File: point-array-vectorization.py Function: point_func at line 4
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 @profile
5 def point_func(point, points, funct):
6 5001 2256667.0 451.2 100.0 return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
Total time: 2.27844 s File: point-array-vectorization.py Function: point_afunc at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def point_afunc(ipoints, epoints, funct):
10 1 5.0 5.0 0.0 res = np.zeros(len(ipoints))
11 5001 4650.0 0.9 0.2 for idx, point in enumerate(ipoints):
12 5000 2273789.0 454.8 99.8 res[idx] = point_func(point, epoints, funct)
13 1 0.0 0.0 0.0 return res
Total time: 2.28239 s File: point-array-vectorization.py Function: main at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 @profile
16 def main():
17 1 145.0 145.0 0.0 points = random((5000,3))
18 1 2.0 2.0 0.0 rpoint = random((1,3))
19
20 1 507.0 507.0 0.0 pres = point_func(rpoint, points, lambda r : r**3)
21
22 1 2281731.0 2281731.0 100.0 ares = point_afunc(points, points, lambda r : r**3)
所以这部分花费了大部分时间:
11 5001 4650.0 0.9 0.2 for idx, point in enumerate(ipoints):
12 5000 2273789.0 454.8 99.8 res[idx] = point_func(point, epoints, funct)
我想看看时间损失是不是因为在for
循环中调用funct
造成的。为此,我想使用 numpy.vectorize
向量化 point_afunc
。我试过了,但它似乎把点矢量化了:循环最终循环遍历各个点组件。
@profile
def point_afunc(ipoints, epoints, funct):
res = np.zeros(len(ipoints))
for idx, point in enumerate(ipoints):
res[idx] = point_func(point, epoints, funct)
return res
point_afunc = np.vectorize(point_afunc)
导致错误:
File "point-array-vectorization.py", line 24, in main
ares = point_afunc(points, points, lambda r : r**3)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2755, in __call__
return self._vectorize_call(func=func, args=vargs)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2825, in _vectorize_call
ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2785, in _get_ufunc_and_otypes
outputs = func(*inputs)
File "/usr/lib/python3.6/site-packages/line_profiler.py", line 115, in wrapper
result = func(*args, **kwds)
File "point-array-vectorization.py", line 10, in point_afunc
res = np.zeros(len(ipoints))
TypeError: object of type 'numpy.float64' has no len()
不知何故,不是对 ipoints
中的每个 点 进行矢量化,而是对点的分量进行矢量化?
编辑:尝试了下面@John Zwinck 的建议并使用了 numba。我使用 @jit
的执行时间比没有它时更长。如果我从所有函数中删除 @profile
装饰器,并用 @jit
替换 point_func
和 point_afunc
,这些是执行时间:
time ./point_array_vectorization.py
real 0m3.686s
user 0m3.584s
sys 0m0.077s
point-array-vectorization> time ./point_array_vectorization.py
real 0m3.683s
user 0m3.596s
sys 0m0.063s
point-array-vectorization> time ./point_array_vectorization.py
real 0m3.751s
user 0m3.658s
sys 0m0.070s
并移除所有 @jit
装饰器:
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.925s
user 0m2.874s
sys 0m0.030s
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.950s
user 0m2.902s
sys 0m0.029s
point-array-vectorization> time ./point_array_vectorization.py
real 0m2.951s
user 0m2.886s
sys 0m0.042s
我需要更多地帮助 numba
编译器吗?
编辑: point_afunc
是否可以使用 numpy
以某种方式在没有 for 循环的情况下编写?
编辑:循环版与Peter的numpy
广播版比较,循环版更快:
Timer unit: 1e-06 s
Total time: 2.13361 s
File: point_array_vectorization.py
Function: point_func at line 7
Line # Hits Time Per Hit % Time Line Contents
==============================================================
7 @profile
8 def point_func(point, points, funct):
9 5001 2133615.0 426.6 100.0 return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
Total time: 2.1528 s
File: point_array_vectorization.py
Function: point_afunc at line 11
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 @profile
12 def point_afunc(ipoints, epoints, funct):
13 1 5.0 5.0 0.0 res = np.zeros(len(ipoints))
14 5001 4176.0 0.8 0.2 for idx, point in enumerate(ipoints):
15 5000 2148617.0 429.7 99.8 res[idx] = point_func(point, epoints, funct)
16 1 0.0 0.0 0.0 return res
Total time: 2.75093 s
File: point_array_vectorization.py
Function: new_point_afunc at line 18
Line # Hits Time Per Hit % Time Line Contents
==============================================================
18 @profile
19 def new_point_afunc(ipoints, epoints, funct):
20 1 2750926.0 2750926.0 100.0 return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)
Total time: 4.90756 s
File: point_array_vectorization.py
Function: main at line 22
Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def main():
24 1 170.0 170.0 0.0 points = random((5000,3))
25 1 4.0 4.0 0.0 rpoint = random((1,3))
26 1 546.0 546.0 0.0 pres = point_func(rpoint, points, lambda r : r**3)
27 1 2155829.0 2155829.0 43.9 ares = point_afunc(points, points, lambda r : r**3)
28 1 2750945.0 2750945.0 56.1 vares = new_point_afunc(points, points, lambda r : r**3)
29 1 71.0 71.0 0.0 assert(np.max(np.abs(ares-vares)) < 1e-15)
numpy.vectorize()
在性能方面没有任何用处:它只是构建隐藏的 Python for
循环的语法糖(或者更确切地说,语法氰化物)。帮不了你。
可能对您有很大帮助的一件事是 Numba。它可以即时编译您的原始代码,并且可能会大大加快速度。只需将 @profile
装饰器替换为 @numba.jit
.
您可以为此使用广播。广播是点矩阵的重塑,使维度 "broadcast" 相互对抗。例如ipoints[:, None, :] - epoints[None, :, :]
说:
- 将
ipoints
从 MxD 重塑为 Mx1xD - 将
epoints
从 NxD 重塑为 1xNxD - 减去每对点得到一个 MxNxD 数组
完整代码变为:
def new_point_afunc(ipoints, epoints, funct):
return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)
警告 - 在您的示例中,点的维数仅为 3,但对于更高的维数,这在内存方面可能不实用,因为这种 ipoints[:, None, :] - epoints[None, :, :]
方法创建了一个形状为 [=15= 的中间矩阵].
使用 Numba 的示例
此方法的性能取决于某人想要调用创建的函数的频率以及输入数据的大小。编译开销约为 1.67 秒,对于相对较小的数据或仅调用一次函数来说,这种方法并不适合。
我也使用了你的代码,稍作修改。使用 Numba 编写普通循环而不是像 np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
这样的多个矢量化命令会更快,无论是在运行时还是编译时间。此外,这个问题可以很容易地并行化,但这会额外增加编译量。
例子
import numpy as np
from numpy.random import random
import numba as nb
import time
def make_point_func(funct):
@nb.njit(fastmath=True)
def point_func(point, points):
return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
return point_func
def make_point_afunc(funct,point_func):
@nb.njit(fastmath=True)
def point_afunc(ipoints, epoints):
res = np.zeros(len(ipoints))
for idx in range(len(ipoints)):
res[idx] = point_func(ipoints[idx], epoints)
return res
return point_afunc
def main():
points = random((5000,3))
rpoint = random((1,3))
#Make functions
point_func=make_point_func(nb.njit(lambda r : r**3))
point_afunc=make_point_afunc(nb.njit(lambda r : r**3),point_func)
#first call
print("First call with compilation overhead")
t1=time.time()
pres = point_func(rpoint, points)
ares = point_afunc(points, points)
print(time.time()-t1)
print("Second call without compilation overhead")
t1=time.time()
#second call
pres = point_func(rpoint, points)
ares = point_afunc(points, points)
print(time.time()-t1)
if __name__=="__main__":
main()
性能
original: 1.7s
Numba first call: 1.87s
Numba further calls: 0.2s