Julia vs MATLAB: Why is my Julia code so slow?
I just started using Julia and converted my MATLAB code to Julia (basically line by line). I noticed that the Julia code is considerably slower (by a factor of roughly 50). The original problem is a dynamic programming problem in which I interpolate a function; the interpolation is where the code spends most of its time. So I tried to make a minimal example that shows the performance difference. The important things to note are that it is a spline approximation of the interpolated function, and that the grid is preferably irregular, i.e. not equally spaced. MATLAB code:
tic
spacing=1.5;
Nxx = 300;
Naa = 350;
Nalal = 200;
sigma = 10;
NoIter = 500;
xx=NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end
f_U = @(c) c.^(1-sigma)/(1-sigma);
W=NaN(Nxx,1);
W(:,1) = f_U(xx);
xprime = ones(Nalal,Naa);
for i=1:NoIter
    W_temp = interp1(xx,W(:,1),xprime,'spline');
end
toc
Elapsed time is 0.242288 seconds.
Julia code:
using Dierckx
function performance()
    const spacing=1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx=Array(Float64,Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    for i=2:Nxx
        xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
    end
    f_U(c) = c.^(1-sigma)/(1-sigma)
    W=Array(Float64,Nxx,1)
    W[:,1] = f_U(xx)
    W_temp = Array(Float64,Nalal,Naa)
    interp_spline = Spline1D(xx,W[:,1])
    xprime = ones(Nalal,Naa)
    for i=1:NoIter
        for j=1:Naa
            for k=1:Nalal
                W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
            end
        end
    end
end
@time(performance())
4.200732 seconds (70.02 M allocations: 5.217 GB, 4.35% gc time)
I made W a two-dimensional array because in my original problem it is a matrix. I did some research on different interpolation packages, but there are not that many options for irregular grids and splines. MATLAB's interp1 is obviously not available.
My problem is that I was thinking that if I write Julia code that gives the same results as my MATLAB code, Julia should generally be faster. Apparently that is not the case, so you need to pay some attention to your coding. I am not a programmer; of course I try to do my best, but I would like to know whether I am making some obvious mistake here that is easily fixed, or whether it will happen (too) often that I have to pay attention to my Julia coding, because then it might not be worth it for me to learn it. Similarly, if I can make Julia faster here (which I am pretty sure I can, e.g. the allocations seem rather large), I can probably also make MATLAB faster. My hope for Julia was that, for similar code, it would run faster than MATLAB.
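For what it's worth, the same pattern can be sketched in Python with SciPy, whose interpolation routines wrap the very same Netlib/Dierckx FITPACK Fortran code that Dierckx.jl calls. The sketch below (illustrative, not the code under discussion) fits the spline on the irregular grid once and then evaluates it over the whole xprime array in a single vectorized call, rather than once per point, which is one plausible source of the huge allocation count seen above:

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Irregular grid, same construction as in the question (dense near xmin)
spacing, Nxx, sigma = 1.5, 300, 10
xmin, xmax = 0.01, 400.0
xx = np.empty(Nxx)
xx[0] = xmin
for i in range(1, Nxx):
    xx[i] = xx[i-1] + (xmax - xx[i-1]) / (Nxx - i) ** spacing

f_U = lambda c: c**(1 - sigma) / (1 - sigma)
W = f_U(xx)

tck = splrep(xx, W)        # fit once; s=0 gives an interpolating cubic spline
xprime = np.ones((200, 350))
# one batched call into the Fortran splev instead of 70,000 scalar calls
W_temp = splev(xprime.ravel(), tck).reshape(xprime.shape)
```

Since every entry of xprime is 1.0, all entries of W_temp come out identical; the point is that one batched call amortizes the per-call overhead that a scalar evaluate loop pays 70,000 times per iteration.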
After some comments about the timings, I also ran this code:
using Dierckx
tic()
const spacing=1.5
const Nxx = 300
const Naa = 350
const Nalal = 200
const sigma = 10
const NoIter = 500
xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end
f_U(c) = c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)
W_temp = Array(Float64,Nalal,Naa)
interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end
toc()
elapsed time:
7.336371495 seconds
Even slower, hmm... This is expected, though: this version runs in global scope, where xx, W, interp_spline and friends are untyped globals, and avoiding non-constant globals is one of the first recommendations in Julia's performance tips.
Another edit: in this case, eliminating one loop actually makes it faster, but still nowhere near MATLAB. The code:
function performance2()
    const spacing=1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx=Array(Float64,Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    for i=2:Nxx
        xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
    end
    f_U(c) = c.^(1-sigma)/(1-sigma)
    W=Array(Float64,Nxx,1)
    W[:,1] = f_U(xx)
    W_temp = Array(Float64,Nalal,Naa)
    interp_spline = Spline1D(xx,W[:,1])
    xprime = ones(Nalal,Naa)
    for i=1:NoIter
        for j=1:Naa
            W_temp[:,j] = evaluate(interp_spline, xprime[:,j])
        end
    end
end
@time(performance2())
1.573347 seconds (700.04 k allocations: 620.643 MB, 1.08% gc time)
Another edit, now going through the loop the same number of times:
function performance3()
    const spacing=1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx=Array(Float64,Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    for i=2:Nxx
        xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
    end
    f_U(c) = c.^(1-sigma)/(1-sigma)
    W=Array(Float64,Nxx,1)
    W[:,1] = f_U(xx)
    W_temp = Array(Float64,Nalal,Naa)
    W_tempr = Array(Float64, Nalal*Naa)
    interp_spline = Spline1D(xx,W[:,1])
    xprime = ones(Nalal,Naa)
    xprimer = reshape(xprime, Nalal*Naa)
    for i=1:NoIter
        W_tempr = evaluate(interp_spline, xprimer)
    end
    W_temp = reshape(W_tempr, Nalal, Naa)
end
tic()
performance3()
toc()
elapsed time:
1.480349334 seconds
Still not comparable to MATLAB at all. By the way, in my actual problem I easily run the loop 50k times, and I access much bigger xprime matrices, although I am not sure whether that part makes a difference.
Because I am also learning Julia, I tried some possible speed-ups of the OP's code (for my own practice!). It seems that the bottleneck is essentially the underlying Fortran code. To verify this, I first rewrote the OP's code into a minimal form as follows:
using Dierckx
function perf()
    Nx = 300
    xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
    yinp = sin( xinp )
    interp = Spline1D( xinp, yinp )
    Nsample = 200 * 350
    x = ones( Nsample ) * pi / 6
    y = zeros( Nsample )
    for i = 1:500
        y[:] = evaluate( interp, x )
    end
    @show y[1:3] # The result should be 0.5 (= sin(pi/6))
end
@time perf()
@time perf()
@time perf()
where the size of the problem is kept the same, while the input x and y coordinates are changed so that the result is known exactly (0.5). On my machine, the result is
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.886956 seconds (174.20 k allocations: 277.414 MB, 3.55% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.659290 seconds (1.56 k allocations: 269.295 MB, 0.39% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.648145 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
From now on I will omit y[1:3] for brevity (I have confirmed that the obtained y[1:3] is correct in all cases). If we replace evaluate() with copy!(y,x), the result becomes
0.206723 seconds (168.26 k allocations: 10.137 MB, 10.27% gc time)
0.023068 seconds (60 allocations: 2.198 MB)
0.023013 seconds (60 allocations: 2.198 MB)
So essentially all the time is spent in evaluate(). Looking at the original code of this routine, we see that it calls splev() in Fortran, which in turn calls fpbspl() (both originally from Netlib). These routines are rather old (written around 1990) and seem not very well optimized for current computers (for example, there are a lot of IF branches, which probably makes vectorization difficult...). Because it is not trivial how to "vectorize" the code, I instead tried parallelization with OpenMP. The modified splev() looks like this, where the input points are simply split among threads:
      subroutine splev(t,n,c,k,x,y,m,e,ier)
c     subroutine splev evaluates in a number of points x(i),i=1,2,...,m
c     a spline s(x) of degree k, given in its b-spline representation.
      (... same here ...)
c     main loop for the different points.
c$omp parallel do default(shared)
c$omp.firstprivate(arg,ier,l1,l,ll,sp,h) private(i,j)
      do i = 1, m
c       fetch a new x-value arg.
        arg = x(i)
c       check if arg is in the support
        if (arg .lt. tb .or. arg .gt. te) then
          if (e .eq. 0) then
            goto 35
          else if (e .eq. 1) then
            y(i) = 0
            goto 80
          else if (e .eq. 2) then
            ier = 1
!           goto 100 !! I skipped this error handling for simplicity.
          else if (e .eq. 3) then
            if (arg .lt. tb) then
              arg = tb
            else
              arg = te
            endif
          endif
        endif
c       search for knot interval t(l) <= arg < t(l+1)
   35   if ( t(l) <= arg .or. l1 == k2 ) go to 40
        l1 = l
        l = l - 1
        go to 35
   40   if ( arg < t(l1) .or. l == nk1 ) go to 50
        l = l1
        l1 = l + 1
        go to 40
c       evaluate the non-zero b-splines at arg.
   50   call fpbspl(t, n, k, arg, l, h)
c       find the value of s(x) at x=arg.
        sp = 0.0d0
        ll = l - k1
        do 60 j = 1, k1
          ll = ll + 1
          sp = sp + c(ll)*h(j)
   60   continue
        y(i) = sp
   80   continue
      enddo
c$omp end parallel do
  100 return
      end
Now rebuilding the package with gfortran -fopenmp and calling the perf() above gives
$ OMP_NUM_THREADS=1 julia interp.jl
1.911112 seconds (174.20 k allocations: 277.414 MB, 3.49% gc time)
1.599154 seconds (1.56 k allocations: 269.295 MB, 0.41% gc time)
1.671308 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
$ OMP_NUM_THREADS=2 julia interp.jl
1.308713 seconds (174.20 k allocations: 277.414 MB, 5.14% gc time)
0.874616 seconds (1.56 k allocations: 269.295 MB, 0.46% gc time)
0.897505 seconds (1.56 k allocations: 269.295 MB, 0.21% gc time)
$ OMP_NUM_THREADS=4 julia interp.jl
0.749203 seconds (174.20 k allocations: 277.414 MB, 9.31% gc time)
0.446702 seconds (1.56 k allocations: 269.295 MB, 0.93% gc time)
0.439522 seconds (1.56 k allocations: 269.295 MB, 0.43% gc time)
$ OMP_NUM_THREADS=8 julia interp.jl
0.478504 seconds (174.20 k allocations: 277.414 MB, 14.66% gc time)
0.243258 seconds (1.56 k allocations: 269.295 MB, 1.81% gc time)
0.233157 seconds (1.56 k allocations: 269.295 MB, 0.89% gc time)
$ OMP_NUM_THREADS=16 julia interp.jl
0.379243 seconds (174.20 k allocations: 277.414 MB, 19.02% gc time)
0.129145 seconds (1.56 k allocations: 269.295 MB, 3.49% gc time)
0.124497 seconds (1.56 k allocations: 269.295 MB, 1.80% gc time)
# Julia: v0.4.0, Machine: Linux x86_64 (Xeon 2.60 GHz, 16 cores)
So the scaling looks fairly good (but please let me know if I am making some big mistake in using OpenMP this way...). If the results above are correct, it means that 8 threads are necessary for this Fortran code to match the speed of interp1() on the OP's machine. The good news, though, is that there is probably room for improvement of the Fortran code (even for serial runs). At any rate, the OP's program (in its final form) seems to be comparing the performance of the underlying interpolation routines, i.e. interp1() in MATLAB versus splev() in Fortran.
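As a cross-check on that conclusion: SciPy wraps the same Netlib FITPACK splev(), so the perf() benchmark above can be reproduced from Python (a sketch; timings are machine dependent). Note that pi/6 = 2*pi*25/300 coincides with grid point i = 25, which is why the interpolating spline returns sin(pi/6) = 0.5 essentially exactly:

```python
import time
import numpy as np
from scipy.interpolate import splrep, splev

Nx = 300
xinp = 2 * np.pi * np.arange(1, Nx + 1) / Nx
yinp = np.sin(xinp)
tck = splrep(xinp, yinp)           # same FITPACK fit as Dierckx's Spline1D

x = np.full(200 * 350, np.pi / 6)  # pi/6 is a grid point (i = 25)
t0 = time.perf_counter()
for _ in range(500):
    y = splev(x, tck)              # each call goes into the Fortran splev
print(y[:3], time.perf_counter() - t0)
```

Essentially all of the measured time here should be spent inside the same Fortran splev loop that the OpenMP patch above parallelizes.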