Numba jit warnings interpretation in python
I have defined the following recursive array generator and am using Numba jit to try to speed up the processing (based on this SO answer):
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
res = np.empty(n, dtype="float32")
res[0] = 0
for i in range(1, n):
res[i] = a * res[i - 1] + (1 - a) * (b ** (i - 1))
return res
a = calc_func(0.988, 0.9988, 5000)
I get a bunch of warnings/errors that I do not quite understand. I would appreciate help explaining them and making them go away, so that (I assume) the computation gets even faster. Here they are:
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "calc_func" failed type inference due to: Invalid use of Function(<built-in function empty>) with argument(s) of type(s): (int64, dtype=Literal[str](float32))
* parameterized
In definition 0:
All templates rejected with literals.
In definition 1:
All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<built-in function empty>)
[2] During: typing of call at res = np.empty(n, dtype="float32")
File "thenameofmyscript.py", line 71:
def calc_func(a, b, n):
res = np.empty(n, dtype="float32")
^
@jit("float32:", nopython=False, nogil=True)
thenameofmyscript.py:69: NumbaWarning:
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "calc_func" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>
File "thenameofmyscript.py", line 73:
def calc_func(a, b, n):
<source elided>
res[0] = 0
for i in range(1, n):
^
@jit("float32:", nopython=False, nogil=True)
H:\projects\decay-optimizer\venv\lib\site-packages\numba\compiler.py:742: NumbaWarning: Function "calc_func" was compiled in object mode without forceobj=True, but has lifted loops.
File "thenameofmyscript.py", line 70:
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
^
self.func_ir.loc))
H:\projects\decay-optimizer\venv\lib\site-packages\numba\compiler.py:751: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
File "thenameofmyscript.py", line 70:
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
^
warnings.warn(errors.NumbaDeprecationWarning(msg, self.func_ir.loc))
thenameofmyscript.py:69: NumbaWarning: Code running in object mode won't allow parallel execution despite nogil=True.
@jit("float32:", nopython=False, nogil=True)
1. Optimize the function (algebraic simplification)
Modern CPUs are very fast at addition, subtraction and multiplication. Operations such as exponentiation should be avoided wherever possible.
Example
In this example I replaced the costly exponentiation with a simple multiplication: instead of recomputing b**(i - 1) in every iteration, the power is carried along in a scalar and updated with one multiplication per step. Simplifications like this can give quite significant speedups, but they may also change the result.
To start with, here is your implementation (as float64) without any signature; I will deal with signatures later in another simple example.
# @nb.njit() is a shortcut for @nb.jit(nopython=True)
@nb.njit()
def calc_func_opt_1(a, b, n):
    res = np.empty(n, dtype=np.float64)
    fact = b
    res[0] = 0.
    res[1] = a * res[0] + (1. - a) * 1.
    res[2] = a * res[1] + (1. - a) * fact
    for i in range(3, n):
        fact *= b
        res[i] = a * res[i - 1] + (1. - a) * fact
    return res
It is also a good idea to use scalars wherever possible.
@nb.njit()
def calc_func_opt_2(a, b, n):
    res = np.empty(n, dtype=np.float64)
    fact_1 = b
    fact_2 = 0.
    res[0] = fact_2
    fact_2 = a * fact_2 + (1. - a) * 1.
    res[1] = fact_2
    fact_2 = a * fact_2 + (1. - a) * fact_1
    res[2] = fact_2
    for i in range(3, n):
        fact_1 *= b
        fact_2 = a * fact_2 + (1. - a) * fact_1
        res[i] = fact_2
    return res
Timings
%timeit a = calc_func(0.988, 0.9988, 5000)
222 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a = calc_func_opt_1(0.988, 0.9988, 5000)
22.7 µs ± 45.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit a = calc_func_opt_2(0.988, 0.9988, 5000)
15.3 µs ± 35.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
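Since the simplification can change the result slightly, it is worth a quick sanity check before trusting the timings. A minimal sketch, assuming plain NumPy and the function names defined above:

import numpy as np

# Sanity check (a sketch): compare the optimized variants against the
# original implementation. Small deviations are expected because
# b**(i - 1) is now accumulated by repeated multiplication.
ref  = calc_func(0.988, 0.9988, 5000)
opt1 = calc_func_opt_1(0.988, 0.9988, 5000)
opt2 = calc_func_opt_2(0.988, 0.9988, 5000)

print(np.allclose(ref, opt1, rtol=1e-5))  # should hold within float32 accuracy
print(np.allclose(opt1, opt2))            # both optimized variants compute the same recurrence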
2. Are signatures recommendable?
In ahead-of-time compilation (AOT) signatures are necessary, but not in the usual JIT mode. The example above is not SIMD-vectorizable, so you won't see much positive or negative impact from a possibly suboptimal declaration of inputs and outputs.
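As a side note, here is a minimal AOT sketch. It assumes the legacy numba.pycc module (deprecated in recent Numba releases), and the module and function names are made up for illustration; in this mode the export signature is mandatory:

import numpy as np
from numba.pycc import CC  # legacy AOT compiler; assumes your Numba version still ships it

cc = CC('calc_module')  # hypothetical module name

@cc.export('calc_func', 'float64[:](float64, float64, intp)')  # signature is required for AOT
def calc_func_aot(a, b, n):
    res = np.empty(n, dtype=np.float64)
    res[0] = 0.
    for i in range(1, n):
        res[i] = a * res[i - 1] + (1. - a) * b**(i - 1)
    return res

if __name__ == '__main__':
    cc.compile()  # writes an importable extension module (e.g. calc_module.so/.pyd)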
Let's look at another example.
# Numba is able to SIMD-vectorize this loop if
# a, b, res are contiguous arrays
@nb.njit(fastmath=True)
def some_function_1(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res

@nb.njit("float64[:](float64[:],float64[:])", fastmath=True)
def some_function_2(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res
a = np.random.rand(10_000)
b = np.random.rand(10_000)

# Example for non-contiguous input
# a = np.random.rand(10_000)[0::2]
# b = np.random.rand(10_000)[0::2]
%timeit res=some_function_1(a,b)
5.59 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit res=some_function_2(a,b)
9.36 µs ± 47.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Why is the version with a signature slower?
Let's take a closer look at the signatures.
some_function_1.nopython_signatures
# [(array(float64, 1d, C), array(float64, 1d, C)) -> array(float64, 1d, C)]
# this is equivalent to
# "float64[::1](float64[::1],float64[::1])"
some_function_2.nopython_signatures
# [(array(float64, 1d, A), array(float64, 1d, A)) -> array(float64, 1d, A)]
If the memory layout is unknown at compile time, it is often impossible to SIMD-vectorize the algorithm. Of course, you can explicitly declare C-contiguous arrays, but then the function no longer works for non-contiguous inputs, which is usually not intended.
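For illustration, a sketch of the explicit C-contiguous declaration mentioned above (some_function_3 is a made-up name; the body mirrors the functions from the example). With [::1] the compiler knows the layout and can vectorize again, but a strided view is then rejected at dispatch:

# Explicitly declared C-contiguous arrays ([::1] instead of [:])
@nb.njit("float64[::1](float64[::1],float64[::1])", fastmath=True)
def some_function_3(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res

some_function_3(a, b)              # works: a and b are C-contiguous
# some_function_3(a[::2], b[::2])  # raises TypeError: no matching definition for 'A'-layout views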