Why do simplified maths equations run (slightly) slower than their equivalents with far more operations in Julia-Lang?
In my C++ courses I was taught tricks such as avoiding repeated calculations, using more additions instead of more multiplications, and avoiding powers, in order to improve performance. However, when I tried them to optimize code in Julia-Lang, I was surprised to get the opposite result.
For example, here is a set of equations without any mathematical optimization (all code is written in Julia 1.1, not JuliaPro):
function OriginalFunction( a,b,c,d,E )
# Operations' count:
# sqrt: 4
# ^: 14
# * : 14
# / : 10
# +: 20
# -: 6
# = : 0+4
x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
return [ [x1;y1] [x2;y2] ]
end
I optimized them using several tricks, including:
(a*b + a*c) -> a*(b+c)
because addition is faster than multiplication.
a^2 -> a*a
to avoid the power operation.
- If a long expression is used at least twice, assign it to a variable to avoid repeated calculation. For example:
x = a * (1+c^2); y = b * (1+c^2)
->
temp = 1+c^2
x = a * temp; y = b * temp
- Convert Int to Float64 so the computer doesn't have to do it (either at run time or at compile time). For example:
1/x -> 1.0/x
The result is an equivalent set of equations with far fewer operations:
function SimplifiedFunction( a,b,c,d,E )
# Operations' count:
# sqrt: 1
# ^: 0
# *: 9
# /: 1
# +: 4
# -: 6
# = : 5+4
temp1 = sqrt(E)
temp2 = c*(b - d) + a
temp3 = 1.0/(1.0+c*c)
temp4 = d - (c*(c*(d - b) - a))*temp3
temp5 = (c*temp1)*temp3
x1 = temp3*(temp2-temp1)
y1 = temp4-temp5
x2 = temp3*(temp2+temp1)
y2 = temp4+temp5
return [ [x1;y1] [x2;y2] ]
end
Then I tested them with the following function, expecting the version with far fewer operations to run faster, or at least at the same speed:
function Test2Functions( NumberOfTests::Real )
local num = Int(NumberOfTests)
# -- Generate random numbers
local rands = Array{Float64,2}(undef, 5,num)
for i in 1:num
rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
end
local res1 = Array{Array{Float64,2}}(undef, num)
local res2 = Array{Array{Float64,2}}(undef, num)
# - Test OriginalFunction
@time for i in 1:num
a,b,c,d,E = rands[:,i]
res1[i] = OriginalFunction( a,b,c,d,E )
end
# - Test SimplifiedFunction
@time for i in 1:num
a,b,c,d,E = rands[:,i]
res2[i] = SimplifiedFunction( a,b,c,d,E )
end
return res1, res2
end
Test2Functions( 1e6 )
However, it turned out that the 2 functions use the same number of memory allocations, but the simplified one spends more time in garbage collection and runs about 5% slower:
julia> Test2Functions( 1e6 )
1.778731 seconds (7.00 M allocations: 503.540 MiB, 47.35% gc time)
1.787668 seconds (7.00 M allocations: 503.540 MiB, 50.92% gc time)
julia> Test2Functions( 1e6 )
1.969535 seconds (7.00 M allocations: 503.540 MiB, 52.05% gc time)
2.221151 seconds (7.00 M allocations: 503.540 MiB, 56.68% gc time)
julia> Test2Functions( 1e6 )
1.946441 seconds (7.00 M allocations: 503.540 MiB, 55.23% gc time)
2.099875 seconds (7.00 M allocations: 503.540 MiB, 59.33% gc time)
julia> Test2Functions( 1e6 )
1.836350 seconds (7.00 M allocations: 503.540 MiB, 53.37% gc time)
2.011242 seconds (7.00 M allocations: 503.540 MiB, 58.43% gc time)
julia> Test2Functions( 1e6 )
1.856081 seconds (7.00 M allocations: 503.540 MiB, 53.44% gc time)
2.002087 seconds (7.00 M allocations: 503.540 MiB, 58.21% gc time)
julia> Test2Functions( 1e6 )
1.833049 seconds (7.00 M allocations: 503.540 MiB, 53.55% gc time)
1.996548 seconds (7.00 M allocations: 503.540 MiB, 58.41% gc time)
julia> Test2Functions( 1e6 )
1.846894 seconds (7.00 M allocations: 503.540 MiB, 53.53% gc time)
2.053529 seconds (7.00 M allocations: 503.540 MiB, 58.30% gc time)
julia> Test2Functions( 1e6 )
1.896265 seconds (7.00 M allocations: 503.540 MiB, 54.11% gc time)
2.083253 seconds (7.00 M allocations: 503.540 MiB, 58.10% gc time)
julia> Test2Functions( 1e6 )
1.910244 seconds (7.00 M allocations: 503.540 MiB, 53.79% gc time)
2.085719 seconds (7.00 M allocations: 503.540 MiB, 58.36% gc time)
Could anyone tell me why? Even in performance-critical code, 5% of speed is probably not worth fighting for, but I am still curious: how can I help the Julia compiler generate faster code?
The reason is that you run into garbage collection in the second loop (but not in the first one). If you call GC.gc() before each loop you get more comparable results:
function Test2Functions( NumberOfTests::Real )
local num = Int(NumberOfTests)
# -- Generate random numbers
local rands = Array{Float64,2}(undef, 5,num)
for i in 1:num
rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
end
local res1 = Array{Array{Float64,2}}(undef, num)
local res2 = Array{Array{Float64,2}}(undef, num)
# - Test OriginalFunction
GC.gc()
@time for i in 1:num
a,b,c,d,E = rands[:,i]
res1[i] = OriginalFunction( a,b,c,d,E )
end
# - Test SimplifiedFunction
GC.gc()
@time for i in 1:num
a,b,c,d,E = rands[:,i]
res2[i] = SimplifiedFunction( a,b,c,d,E )
end
return res1, res2
end
# call this twice as the first time you may have precompilation issues
Test2Functions( 1e6 )
Test2Functions( 1e6 )
In general, however, it is best to use the BenchmarkTools.jl package for benchmarking.
julia> function OriginalFunction()
a,b,c,d,E = rand(5)
x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
return [ [x1;y1] [x2;y2] ]
end
OriginalFunction (generic function with 2 methods)
julia>
julia> function SimplifiedFunction()
a,b,c,d,E = rand(5)
temp1 = sqrt(E)
temp2 = c*(b - d) + a
temp3 = 1.0/(1.0+c*c)
temp4 = d - (c*(c*(d - b) - a))*temp3
temp5 = (c*temp1)*temp3
x1 = temp3*(temp2-temp1)
y1 = temp4-temp5
x2 = temp3*(temp2+temp1)
y2 = temp4+temp5
return [ [x1;y1] [x2;y2] ]
end
SimplifiedFunction (generic function with 2 methods)
julia>
julia> using BenchmarkTools
julia> @btime OriginalFunction()
136.211 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
-0.609035 0.954271
0.724708 0.926523
julia> @btime SimplifiedFunction()
137.201 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
0.284514 1.58639
0.922347 0.979835
julia> @btime OriginalFunction()
137.301 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
-0.109814 0.895533
0.365399 1.08743
julia> @btime SimplifiedFunction()
136.429 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
0.516157 1.07871
0.219441 0.361133
And we can see that they have comparable performance. In general, you can expect the Julia and LLVM compilers to do most of these kinds of optimizations for you (of course there is no guarantee that this always happens, but in this case it seems to).
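If you want to time only the arithmetic in the original five-argument versions (rather than the rand(5) call inside the zero-argument wrappers above), BenchmarkTools lets you interpolate the inputs with $ so they are treated as constants by the benchmark. A minimal sketch, assuming the OriginalFunction( a,b,c,d,E ) and SimplifiedFunction( a,b,c,d,E ) from the question are already defined:
using BenchmarkTools
a, b, c, d, E = rand(5)                        # E ∈ [0, 1), so sqrt(E) is defined
@btime OriginalFunction($a, $b, $c, $d, $E)    # inputs interpolated, only the function body is timed
@btime SimplifiedFunction($a, $b, $c, $d, $E)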
EDIT
I simplified the functions as follows:
function OriginalFunction( a,b,c,d,E )
x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
x1, y1, x2, y2
end
function SimplifiedFunction( a,b,c,d,E )
temp1 = sqrt(E)
temp2 = c*(b - d) + a
temp3 = 1.0/(1.0+c*c)
temp4 = d - (c*(c*(d - b) - a))*temp3
temp5 = (c*temp1)*temp3
x1 = temp3*(temp2-temp1)
y1 = temp4-temp5
x2 = temp3*(temp2+temp1)
y2 = temp4+temp5
x1, y1, x2, y2
end
to focus only on the core of the computation, and ran @code_native on them. Here they are (comments removed to shorten them).
.text
pushq %rbp
movq %rsp, %rbp
subq 2, %rsp
vmovaps %xmm10, -16(%rbp)
vmovaps %xmm9, -32(%rbp)
vmovaps %xmm8, -48(%rbp)
vmovaps %xmm7, -64(%rbp)
vmovaps %xmm6, -80(%rbp)
vmovsd 56(%rbp), %xmm8 # xmm8 = mem[0],zero
vxorps %xmm4, %xmm4, %xmm4
vucomisd %xmm8, %xmm4
ja L229
vmovsd 48(%rbp), %xmm9 # xmm9 = mem[0],zero
vmulsd %xmm9, %xmm3, %xmm5
vsubsd %xmm5, %xmm1, %xmm5
vmulsd %xmm3, %xmm2, %xmm6
vaddsd %xmm5, %xmm6, %xmm10
vmulsd %xmm3, %xmm3, %xmm6
movabsq $526594656, %rax # imm = 0x1F633260
vmovsd (%rax), %xmm7 # xmm7 = mem[0],zero
vaddsd %xmm7, %xmm6, %xmm0
vdivsd %xmm0, %xmm7, %xmm7
vsqrtsd %xmm8, %xmm8, %xmm4
vsubsd %xmm4, %xmm10, %xmm5
vmulsd %xmm5, %xmm7, %xmm8
vmulsd %xmm9, %xmm6, %xmm5
vdivsd %xmm0, %xmm5, %xmm5
vsubsd %xmm5, %xmm9, %xmm5
vmulsd %xmm3, %xmm1, %xmm1
vdivsd %xmm0, %xmm1, %xmm1
vaddsd %xmm5, %xmm1, %xmm1
vmulsd %xmm2, %xmm6, %xmm2
vdivsd %xmm0, %xmm2, %xmm2
vaddsd %xmm1, %xmm2, %xmm1
vmulsd %xmm3, %xmm4, %xmm2
vdivsd %xmm0, %xmm2, %xmm0
vsubsd %xmm0, %xmm1, %xmm2
vaddsd %xmm10, %xmm4, %xmm3
vmulsd %xmm3, %xmm7, %xmm3
vaddsd %xmm1, %xmm0, %xmm0
vmovsd %xmm8, (%rcx)
vmovsd %xmm2, 8(%rcx)
vmovsd %xmm3, 16(%rcx)
vmovsd %xmm0, 24(%rcx)
movq %rcx, %rax
vmovaps -80(%rbp), %xmm6
vmovaps -64(%rbp), %xmm7
vmovaps -48(%rbp), %xmm8
vmovaps -32(%rbp), %xmm9
vmovaps -16(%rbp), %xmm10
addq 2, %rsp
popq %rbp
retq
L229:
movabsq $throw_complex_domainerror, %rax
movl $72381680, %ecx # imm = 0x45074F0
vmovapd %xmm8, %xmm1
callq *%rax
ud2
ud2
nop
and
.text
pushq %rbp
movq %rsp, %rbp
subq , %rsp
vmovaps %xmm7, -16(%rbp)
vmovaps %xmm6, -32(%rbp)
vmovsd 56(%rbp), %xmm0 # xmm0 = mem[0],zero
vxorps %xmm4, %xmm4, %xmm4
vucomisd %xmm0, %xmm4
ja L178
vmovsd 48(%rbp), %xmm4 # xmm4 = mem[0],zero
vsqrtsd %xmm0, %xmm0, %xmm0
vsubsd %xmm4, %xmm2, %xmm5
vmulsd %xmm3, %xmm5, %xmm5
vaddsd %xmm1, %xmm5, %xmm5
vmulsd %xmm3, %xmm3, %xmm6
movabsq $526593928, %rax # imm = 0x1F632F88
vmovsd (%rax), %xmm7 # xmm7 = mem[0],zero
vaddsd %xmm7, %xmm6, %xmm6
vdivsd %xmm6, %xmm7, %xmm6
vsubsd %xmm2, %xmm4, %xmm2
vmulsd %xmm3, %xmm2, %xmm2
vsubsd %xmm1, %xmm2, %xmm1
vmulsd %xmm3, %xmm1, %xmm1
vmulsd %xmm1, %xmm6, %xmm1
vsubsd %xmm1, %xmm4, %xmm1
vmulsd %xmm3, %xmm0, %xmm2
vmulsd %xmm2, %xmm6, %xmm2
vsubsd %xmm0, %xmm5, %xmm3
vmulsd %xmm3, %xmm6, %xmm3
vsubsd %xmm2, %xmm1, %xmm4
vaddsd %xmm5, %xmm0, %xmm0
vmulsd %xmm0, %xmm6, %xmm0
vaddsd %xmm1, %xmm2, %xmm1
vmovsd %xmm3, (%rcx)
vmovsd %xmm4, 8(%rcx)
vmovsd %xmm0, 16(%rcx)
vmovsd %xmm1, 24(%rcx)
movq %rcx, %rax
vmovaps -32(%rbp), %xmm6
vmovaps -16(%rbp), %xmm7
addq , %rsp
popq %rbp
retq
L178:
movabsq $throw_complex_domainerror, %rax
movl $72381680, %ecx # imm = 0x45074F0
vmovapd %xmm0, %xmm1
callq *%rax
ud2
ud2
nopl (%rax,%rax)
You probably do not want to digest it in detail, but you can see that the simplified function uses slightly fewer instructions, though only a few, which may be surprising if you compare the original source code. For example, both versions call sqrt only once (so the multiple calls to sqrt in the first function were optimized away).
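You can check this kind of common-subexpression elimination on a toy case yourself. A minimal sketch (the function twice_sqrt is made up for illustration): even though sqrt appears twice in the source, the generated assembly is expected to contain only a single vsqrtsd.
using InteractiveUtils                     # for @code_native outside the REPL
twice_sqrt(E) = sqrt(E) + 2.0 * sqrt(E)    # sqrt(E) written twice on purpose
@code_native twice_sqrt(2.0)               # look for a single vsqrtsd instruction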
Much of the reason is that Julia automatically performs some of your optimizations for you (in particular, I know that fixed integer powers are compiled into efficient multiplication sequences). Constant propagation probably also lets the compiler turn 1 into 1.0. In general, as long as type inference is possible, Julia's compiler is very aggressive about making code fast.
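As a minimal sketch of the integer-power claim (the function square below is made up for illustration): a literal exponent such as c^2 is lowered through Base.literal_pow, so the emitted LLVM IR should show a plain multiplication rather than a call to a pow routine.
using InteractiveUtils   # for @code_llvm outside the REPL
square(c) = c^2          # literal power, lowered via Base.literal_pow
@code_llvm square(3.0)   # expect a single fmul, no call to a power function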