性能:Matlab 与 Python
Performance: Matlab vs Python
我最近从 Matlab
切换到 Python
。在转换我的一个冗长代码时,我惊讶地发现 Python
非常慢。我分析并追踪了一个占用时间的函数的问题。从我的代码中的不同位置调用此函数(作为递归调用的其他函数的一部分)。 Profiler 建议在 Matlab
和 Python
.
中对该函数进行 300 次调用
简而言之,以下代码总结了手头的问题:
MATLAB
class包含函数:
classdef ExampleKernel1 < handle
methods (Static)
function [kernel] = kernel_2D(M,x,N,y)
kernel = zeros(M,N);
for i= 1 : M
for j= 1 : N
% Define the custom kernel function here
kernel(i , j) = sqrt((x(i , 1) - y(j , 1)) .^ 2 + ...
(x(i , 2) - y(j , 2)) .^2 );
end
end
end
end
end
和调用脚本 test.m:
xVec=[
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
];
N=size(xVec,1);
kex1=ExampleKernel1;
tic
for i=1:300
K=kex1.kernel_2D(N,xVec,N,xVec);
end
toc
给出输出
clear all
>> test
Elapsed time is 0.022426 seconds.
>> test
Elapsed time is 0.009852 seconds.
PYTHON 3.4
Class 包含函数 CustomKernels.py:
from numpy import zeros
from math import sqrt
class CustomKernels:
"""Class for defining the custom kernel functions"""
@staticmethod
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
kernel = zeros([M, N])
for i in range(0, M):
for j in range(0, N):
# Define the custom kernel function here
kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
和调用脚本 test.py:
import numpy as np
from CustomKernels import CustomKernels
from time import perf_counter
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
N = xVec.shape[0]
kex1 = CustomKernels.exampleKernelA
start=perf_counter()
for i in range(0,300):
K = kex1(N, xVec, N, xVec)
print(' %f secs' %(perf_counter()-start))
给出输出
%run test.py
0.940515 secs
%run test.py
0.884418 secs
%run test.py
0.940239 secs
结果
比较结果似乎 Matlab
在调用“clear all
”后快了大约 42 倍,如果脚本多次 运行 而没有调用“clear all
”。这至少是一个数量级,如果不是快两个数量级的话。这对我来说是一个非常令人惊讶的结果。我期待结果是相反的。
有人可以解释一下吗?
有人可以建议一种更快的方法来执行此操作吗?
旁注
我也尝试过使用 numpy.sqrt
这会使性能更差,因此我在 Python
.
中使用 math.sqrt
编辑
用于调用函数的 for
循环纯属虚构。它们只是为了“simulate” 300 调用该函数。如前所述,内核函数(Matlab
中的 kernel_2D
和 Python
中的 kex1
)从程序中的不同位置调用。为了使问题更短,我使用 for
循环“模拟”300 调用。由于内核矩阵的结构,内核函数中的 for
循环是必不可少且不可避免的。
编辑 2
您想摆脱那些 for
循环。试试这个:
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
i, j = np.indices((N, M))
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
您也可以使用广播来完成,这可能会更快,但来自 MATLAB
的直观性稍差。
经过进一步调查,我发现按照答案中的指示使用 indices
仍然较慢。
解决方法:使用meshgrid
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = meshgrid(y[:, 0], x[:, 0])
x1, y1 = meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
结果:非常非常快,比indices
方法快10倍。我得到的时间更接近 C.
但是: 使用 meshgrid
和 Matlab
比 C
和 Numpy
快 10 倍。
还在想为什么!
Matlab 使用商业 MKL 库。如果您使用免费 python 发行版,请检查您是否在 python 中使用了 MKL 或其他高性能 blas 库,或者它是默认的,这可能会慢得多。
比较 Jit 编译器
已经提到 Matlab 使用内部 Jit 编译器在此类任务上获得良好的性能。让我们将 Matlabs jit-compiler 与 Python jit-compiler (Numba) 进行比较。
代码
import numba as nb
import numpy as np
import math
import time
#If the arrays are somewhat larger it makes also sense to parallelize this problem
#cache ==True may also make sense
@nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
def exampleKernelB(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
@nb.njit()
def exampleKernelC(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
#Your test data
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
#compilation on first callable
#can be avoided with cache=True
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
t1=time.time()
for i in range(10_000):
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
性能
exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10000 rep., after few runs): 0.165s
与仅使用广播的 meshgrid 解决方案相比,我的速度提高了约 5 倍:
def exampleKernelD(M, x, N, y):
return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)
我最近从 Matlab
切换到 Python
。在转换我的一个冗长代码时,我惊讶地发现 Python
非常慢。我分析并追踪了一个占用时间的函数的问题。从我的代码中的不同位置调用此函数(作为递归调用的其他函数的一部分)。 Profiler 建议在 Matlab
和 Python
.
简而言之,以下代码总结了手头的问题:
MATLAB
class包含函数:
classdef ExampleKernel1 < handle
methods (Static)
function [kernel] = kernel_2D(M,x,N,y)
kernel = zeros(M,N);
for i= 1 : M
for j= 1 : N
% Define the custom kernel function here
kernel(i , j) = sqrt((x(i , 1) - y(j , 1)) .^ 2 + ...
(x(i , 2) - y(j , 2)) .^2 );
end
end
end
end
end
和调用脚本 test.m:
xVec=[
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
];
N=size(xVec,1);
kex1=ExampleKernel1;
tic
for i=1:300
K=kex1.kernel_2D(N,xVec,N,xVec);
end
toc
给出输出
clear all
>> test
Elapsed time is 0.022426 seconds.
>> test
Elapsed time is 0.009852 seconds.
PYTHON 3.4
Class 包含函数 CustomKernels.py:
from numpy import zeros
from math import sqrt
class CustomKernels:
"""Class for defining the custom kernel functions"""
@staticmethod
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
kernel = zeros([M, N])
for i in range(0, M):
for j in range(0, N):
# Define the custom kernel function here
kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
和调用脚本 test.py:
import numpy as np
from CustomKernels import CustomKernels
from time import perf_counter
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
N = xVec.shape[0]
kex1 = CustomKernels.exampleKernelA
start=perf_counter()
for i in range(0,300):
K = kex1(N, xVec, N, xVec)
print(' %f secs' %(perf_counter()-start))
给出输出
%run test.py
0.940515 secs
%run test.py
0.884418 secs
%run test.py
0.940239 secs
结果
比较结果似乎 Matlab
在调用“clear all
”后快了大约 42 倍,如果脚本多次 运行 而没有调用“clear all
”。这至少是一个数量级,如果不是快两个数量级的话。这对我来说是一个非常令人惊讶的结果。我期待结果是相反的。
有人可以解释一下吗?
有人可以建议一种更快的方法来执行此操作吗?
旁注
我也尝试过使用 numpy.sqrt
这会使性能更差,因此我在 Python
.
math.sqrt
编辑
用于调用函数的 for
循环纯属虚构。它们只是为了“simulate” 300 调用该函数。如前所述,内核函数(Matlab
中的 kernel_2D
和 Python
中的 kex1
)从程序中的不同位置调用。为了使问题更短,我使用 for
循环“模拟”300 调用。由于内核矩阵的结构,内核函数中的 for
循环是必不可少且不可避免的。
编辑 2
您想摆脱那些 for
循环。试试这个:
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
i, j = np.indices((N, M))
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
您也可以使用广播来完成,这可能会更快,但来自 MATLAB
的直观性稍差。
经过进一步调查,我发现按照答案中的指示使用 indices
仍然较慢。
解决方法:使用meshgrid
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = meshgrid(y[:, 0], x[:, 0])
x1, y1 = meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
结果:非常非常快,比indices
方法快10倍。我得到的时间更接近 C.
但是: 使用 meshgrid
和 Matlab
比 C
和 Numpy
快 10 倍。
还在想为什么!
Matlab 使用商业 MKL 库。如果您使用免费 python 发行版,请检查您是否在 python 中使用了 MKL 或其他高性能 blas 库,或者它是默认的,这可能会慢得多。
比较 Jit 编译器
已经提到 Matlab 使用内部 Jit 编译器在此类任务上获得良好的性能。让我们将 Matlabs jit-compiler 与 Python jit-compiler (Numba) 进行比较。
代码
import numba as nb
import numpy as np
import math
import time
#If the arrays are somewhat larger it makes also sense to parallelize this problem
#cache ==True may also make sense
@nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
def exampleKernelB(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
@nb.njit()
def exampleKernelC(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
#Your test data
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
#compilation on first callable
#can be avoided with cache=True
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
t1=time.time()
for i in range(10_000):
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
性能
exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10000 rep., after few runs): 0.165s
与仅使用广播的 meshgrid 解决方案相比,我的速度提高了约 5 倍:
def exampleKernelD(M, x, N, y):
return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)