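CPU time in quadruple vs. double precision
I am running some long-term simulations in which I am trying to achieve the highest possible accuracy in the solution of systems of ODEs. I am trying to find out how much time quadruple (128-bit) precision calculations take compared with double (64-bit) precision. I googled around a bit and found several opinions: some say it takes 4 times longer, others 60-70 times... So I decided to take matters into my own hands and wrote a simple Fortran benchmark program: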
program QUAD_TEST
implicit none
integer,parameter :: dp = selected_real_kind(15)
integer,parameter :: qp = selected_real_kind(33)
integer :: cstart_dp,cend_dp,cstart_qp,cend_qp,crate
real :: time_dp,time_qp
real(dp) :: sum_dp,sqrt_dp,pi_dp,mone_dp,zero_dp
real(qp) :: sum_qp,sqrt_qp,pi_qp,mone_qp,zero_qp
integer :: i
! ==============================================================================
! == TEST 1. ELEMENTARY OPERATIONS ==
sum_dp = 1._dp
sum_qp = 1._qp
call SYSTEM_CLOCK(count_rate=crate)
write(*,*) 'Testing elementary operations...'
call SYSTEM_CLOCK(count=cstart_dp)
do i=1,50000000
sum_dp = sum_dp - 1._dp
sum_dp = sum_dp + 1._dp
sum_dp = sum_dp*2._dp
sum_dp = sum_dp/2._dp
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP sum: ',sum_dp
write(*,*) 'DP time: ',time_dp,' seconds'
call SYSTEM_CLOCK(count=cstart_qp)
do i=1,50000000
sum_qp = sum_qp - 1._qp
sum_qp = sum_qp + 1._qp
sum_qp = sum_qp*2._qp
sum_qp = sum_qp/2._qp
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP sum: ',sum_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)
! == TEST 2. SQUARE ROOT ==
sqrt_dp = 2._dp
sqrt_qp = 2._qp
write(*,*) 'Testing square root ...'
call SYSTEM_CLOCK(count=cstart_dp)
do i = 1,10000000
sqrt_dp = sqrt(sqrt_dp)
sqrt_dp = 2._dp
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP sqrt: ',sqrt_dp
write(*,*) 'DP time: ',time_dp,' seconds'
call SYSTEM_CLOCK(count=cstart_qp)
do i = 1,10000000
sqrt_qp = sqrt(sqrt_qp)
sqrt_qp = 2._qp
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP sqrt: ',sqrt_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)
! == TEST 3. TRIGONOMETRIC FUNCTIONS ==
pi_dp = acos(-1._dp); mone_dp = 1._dp; zero_dp = 0._dp
pi_qp = acos(-1._qp); mone_qp = 1._qp; zero_qp = 0._qp
write(*,*) 'Testing trigonometric functions ...'
call SYSTEM_CLOCK(count=cstart_dp)
do i = 1,10000000
mone_dp = cos(pi_dp)
zero_dp = sin(pi_dp)
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP cos: ',mone_dp
write(*,*) 'DP sin: ',zero_dp
write(*,*) 'DP time: ',time_dp,' seconds'
call SYSTEM_CLOCK(count=cstart_qp)
do i = 1,10000000
mone_qp = cos(pi_qp)
zero_qp = sin(pi_qp)
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP cos: ',mone_qp
write(*,*) 'QP sin: ',zero_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)
end program QUAD_TEST
Results of a typical run, after compiling with gfortran 4.8.4 without any optimization flags:
Testing elementary operations...
DP sum: 1.0000000000000000
DP time: 0.572000027 seconds
QP sum: 1.00000000000000000000000000000000000
QP time: 4.32299995 seconds
DP is 7.55769205 times faster.
Testing square root ...
DP sqrt: 2.0000000000000000
DP time: 5.20000011E-02 seconds
QP sqrt: 2.00000000000000000000000000000000000
QP time: 2.60700011 seconds
DP is 50.1346169 times faster.
Testing trigonometric functions ...
DP cos: -1.0000000000000000
DP sin: 1.2246467991473532E-016
DP time: 2.79600000 seconds
QP cos: -1.00000000000000000000000000000000000
QP sin: 8.67181013012378102479704402604335225E-0035
QP time: 5.90199995 seconds
DP is 2.11087275 times faster.
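Something must be going on here. My guess is that gfortran computes sqrt with an optimized algorithm that has probably not been implemented for quadruple precision. That may not be the case for sin and cos, but why are elementary operations 7.6 times slower in quadruple precision while trigonometric functions are only 2 times slower? If the algorithm used for the trigonometric functions were the same for quadruple and double precision, I would expect their CPU time to increase sevenfold as well.
What is the average slowdown in scientific calculations when 128-bit precision is used, compared with 64-bit?
I am running this on an Intel i7-4771 @ 3.50GHz.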
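More of an extended comment than an answer, but...
Current CPUs provide a lot of hardware acceleration for double-precision floating-point arithmetic. Some even provide facilities for extended precision. Beyond that, you are limited to software implementations, which are (as you noticed) considerably slower.
The exact factor of this slowdown, however, is almost impossible to predict in the general case. It depends on your CPU (e.g. which acceleration it has built in) and on the software stack. For double precision you typically use different math libraries than for quadruple precision, and these may use different algorithms for the basic operations.
For a particular operation or algorithm on given hardware, using the same algorithm in both precisions, you can probably derive a number, but it will certainly not be universally valid.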
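Since the outcome depends on the compiler and hardware, it can help to check first which real kinds your compiler actually maps the requests to. The following is only a minimal sketch: the program name `kind_check` and the extra 18-digit request `ep` are my own additions for illustration, and it assumes the compiler supports a kind with 33 decimal digits.

```fortran
program kind_check
  implicit none
  ! Request real kinds by minimum decimal precision.
  ! selected_real_kind returns a negative value if no such kind exists.
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: ep = selected_real_kind(18)  ! hypothetical extra: extended precision, if available
  integer, parameter :: qp = selected_real_kind(33)

  write(*,*) 'kind for 15 digits: ', dp
  write(*,*) 'kind for 18 digits: ', ep
  write(*,*) 'kind for 33 digits: ', qp

  ! The literals below only compile if dp and qp are valid (positive) kinds.
  write(*,*) 'DP decimal digits: ', precision(1._dp), ' mantissa bits: ', digits(1._dp)
  write(*,*) 'QP decimal digits: ', precision(1._qp), ' mantissa bits: ', digits(1._qp)
end program kind_check
```

If the 33-digit request comes back negative, the compiler has no quadruple-precision kind at all and the benchmark in the question will not compile on that system.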
Interestingly, if you change:
sqrt_qp = sqrt(sqrt_qp)
sqrt_qp = 2._qp
to
sqrt_qp = sqrt(2._qp)
the computation gets faster!
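One plausible reason, which nothing in this thread confirms: with a literal argument the compiler can evaluate sqrt(2._qp) once at compile time, so the loop no longer calls the quadruple-precision square root routine at all. Below is a minimal sketch of a loop that keeps the call alive, assuming the compiler supports a 33-digit real kind. The program name, the `volatile` variable `x` and the accumulator `acc` are my own illustration and are not part of the original benchmark.

```fortran
program sqrt_volatile
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: qp = selected_real_kind(33)
  real(qp), volatile :: x      ! volatile: x must be re-read on every use, so sqrt(x) cannot be folded
  real(qp) :: acc
  integer :: i
  integer(int64) :: c0, c1, rate

  x = 2._qp
  acc = 0._qp
  call system_clock(count_rate=rate)
  call system_clock(count=c0)
  do i = 1, 10000000
     acc = acc + sqrt(x)       ! the result is used, so the call cannot be dropped
  end do
  call system_clock(count=c1)

  write(*,*) 'acc: ', acc
  write(*,*) 'QP sqrt time: ', real(c1 - c0)/real(rate), ' seconds'
end program sqrt_volatile
```

Timing this variant next to the sqrt(2._qp) version should show whether the speedup comes from the square root itself or from the call being optimized away.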