Give initial value of array and vectorization in Fortran
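My question is: what is the fastest way to give an array its initial value in Fortran 90 or later, both in serial code and with OpenMP? I can try
(a) A = 0.0; or
(b) nested loops over A(i, j...) = 0.0, with the loop order adjusted to suit vectorization (first index innermost).
I vaguely remember that the compiler will try to vectorize (a), but after several Google searches I cannot find a reference. Below is a serial test (sorry, the code is messy and not procedure-oriented, and some variable names are carried over from earlier replies):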
Program vectorization_test
  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
  real :: A(20,20,20,20), sum_time
  integer :: i,j,k,l,n,m, m_iter
  Integer( li ) :: start, finish, rate
  m_iter = 10
  n = 20
  sum_time = 0.0
  do m = 1, m_iter
    Call System_clock( start, rate )
    A = 0.0
    Call System_clock( finish, rate )
    write(*,*) 'time 1', Real( finish - start, wp ) / rate
    sum_time = sum_time + Real( finish - start, wp ) / rate
  end do
  write(*,*) 'average time', sum_time / m_iter
  sum_time = 0.0
  do m = 1, m_iter
    Call System_clock( start, rate )
    do l = 1, n
      do k = 1, n
        do j = 1, n
          do i = 1, n
            A(i,j,k,l) = 0.0
          end do
        end do
      end do
    end do
    Call System_clock( finish, rate )
    write(*,*) 'time 2', Real( finish - start, wp ) / rate
    sum_time = sum_time + Real( finish - start, wp ) / rate
  end do
  write(*,*) 'average time 2', sum_time / m_iter
  sum_time = 0.0
  do m = 1, m_iter
    Call System_clock( start, rate )
    do l = 1, n
      do j = 1, n
        do k = 1, n
          do i = 1, n
            A(i,j,k,l) = 0.0
          end do
        end do
      end do
    end do
    Call System_clock( finish, rate )
    write(*,*) 'time 3', Real( finish - start, wp ) / rate
    sum_time = sum_time + Real( finish - start, wp ) / rate
  end do
  write(*,*) 'average time 3', sum_time / m_iter
  sum_time = 0.0
  do m = 1, m_iter
    Call System_clock( start, rate )
    do i = 1, n
      do j = 1, n
        do k = 1, n
          do l = 1, n
            A(i,j,k,l) = 0.0
          end do
        end do
      end do
    end do
    Call System_clock( finish, rate )
    write(*,*) 'time 4', Real( finish - start, wp ) / rate
    sum_time = sum_time + Real( finish - start, wp ) / rate
  end do
  write(*,*) 'average time 4', sum_time / m_iter
end program vectorization_test
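With gfortran-11 -o3 on my laptop with 16 GB of memory I got average time 3.76699973E-05, average time 2 5.98790008E-04, average time 3 6.55650045E-04, average time 4 3.10386019E-03. On a computing center machine with 384 GB of memory I got average time 4.75034976E-05, average time 2 4.47604398E-04, average time 3 4.70327737E-04, average time 4 4.14085982E-04. Larger sizes show a similar trend.
I am not sure whether this holds for other compilers. It seems that the innermost loop is the most critical one for vectorization.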
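So my questions are:
(1) is there any reference on vectorization and array initialization that covers this;
(2) if I use OpenMP, should I use a loop over a single index, something like A(i,:,:,:) = 0.0?
P.S. Array initialization is most likely not the bottleneck, so this question is mostly out of curiosity.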
Try making the first index vary fastest:
Call System_clock( start, rate )
do l = 1, n
  do k = 1, n
    do j = 1, n
      do i = 1, n
        A(i,j,k,l) = 0.0
      end do
    end do
  end do
end do
Call System_clock( finish, rate )
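Fortran is column-major, so elements that differ only in the first index sit next to each other in memory. Varying the first index in the innermost loop therefore makes good use of the CPU cache and avoids excess main-memory accesses, which can be on the order of 100 times slower than cache accesses.
In the end I don't think it will make much of a difference, because compilers are very good at optimizing this kind of code.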
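To make the column-major point concrete, here is a minimal sketch. The program name and the small 2x3 array are made up for illustration. It fills an array in storage order with reshape and then walks it with each loop order, so you can see which order visits memory contiguously.

```fortran
program column_major_demo
  implicit none
  integer :: a(2,3), walk(6)
  integer :: i, j, pos

  ! reshape fills a in array element order, i.e. first index fastest,
  ! so the value stored in each element is its position in memory.
  a = reshape([1, 2, 3, 4, 5, 6], shape(a))

  ! First index innermost: visits memory positions 1 2 3 4 5 6.
  pos = 0
  do j = 1, 3
    do i = 1, 2
      pos = pos + 1
      walk(pos) = a(i, j)
    end do
  end do
  print '(a,6i2)', 'first index innermost: ', walk
  if (any(walk /= [1, 2, 3, 4, 5, 6])) error stop 'unexpected order'

  ! Second index innermost: jumps through memory with stride 2.
  pos = 0
  do i = 1, 2
    do j = 1, 3
      pos = pos + 1
      walk(pos) = a(i, j)
    end do
  end do
  print '(a,6i2)', 'second index innermost:', walk
  if (any(walk /= [1, 3, 5, 2, 4, 6])) error stop 'unexpected order'

  print *, 'storage order confirmed'
end program column_major_demo
```

For the 4-D array in the question the same rule applies: with i innermost the stores are contiguous, while with l innermost each store is n*n*n elements away from the previous one.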
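In my tests with ifort, using a parallel release build, I got two sets of results depending on the floating-point setting.
I measured initializations per second: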
Method | /fp:fast | /fp:precise | Description
---|---|---|---
LOOP | 440.9171 | 403.2258 | Four loops
ATOM | 443.4590 | 432.5259 | a=x
SPAN | 443.8526 | 457.8755 | a(:,:,:,:)=x
PARA | 445.0378 | 438.4042 | $omp parallel
Code listing:
program Console1
  implicit none
  ! Variables
  integer, parameter :: n = 60, repeat=1000
  integer :: iter
  real :: x, a(n,n,n,n)
  integer(8) :: tic, toc, rate
  ! Body of Console1
  x = 4*atan(1.0)
  ! Each block times `repeat` fills and prints fills per second ("ips")
  call SYSTEM_CLOCK(tic,rate)
  do iter=1, repeat
    call r_fill_loop(a,x)
  end do
  call SYSTEM_CLOCK(toc,rate)
  print *, "LOOP", (rate*repeat)/real(toc-tic), "ips"
  call SYSTEM_CLOCK(tic,rate)
  do iter=1, repeat
    call r_fill_atom(a,x)
  end do
  call SYSTEM_CLOCK(toc,rate)
  print *, "ATOM", (rate*repeat)/real(toc-tic), "ips"
  call SYSTEM_CLOCK(tic,rate)
  do iter=1, repeat
    call r_fill_span(a,x)
  end do
  call SYSTEM_CLOCK(toc,rate)
  print *, "SPAN", (rate*repeat)/real(toc-tic), "ips"
  call SYSTEM_CLOCK(tic,rate)
  do iter=1, repeat
    call r_fill_parallel(a,x)
  end do
  call SYSTEM_CLOCK(toc,rate)
  print *, "PARA", (rate*repeat)/real(toc-tic), "ips"
contains
  ! Explicit four-level loop nest, first index innermost
  pure subroutine r_fill_loop(a,x)
    real, intent(out) :: a(:,:,:,:)
    real, intent(in) :: x
    integer :: n, m, g, h
    integer :: i,j,k,l
    n = size(a,1)
    m = size(a,2)
    g = size(a,3)
    h = size(a,4)
    do l=1, h
      do k=1, g
        do j=1, m
          do i=1,n
            a(i,j,k,l) = x
          end do
        end do
      end do
    end do
  end subroutine
  ! Whole-array assignment
  pure subroutine r_fill_atom(a,x)
    real, intent(out) :: a(:,:,:,:)
    real, intent(in) :: x
    a = x
  end subroutine
  ! OpenMP worksharing over the outermost loop.
  ! Not declared pure: compilers such as gfortran reject OpenMP
  ! parallel directives inside PURE procedures. The directives are
  ! plain comments unless OpenMP is enabled at compile time.
  subroutine r_fill_parallel(a,x)
    real, intent(out) :: a(:,:,:,:)
    real, intent(in) :: x
    integer :: n, m, g, h
    integer :: i,j,k,l
    n = size(a,1)
    m = size(a,2)
    g = size(a,3)
    h = size(a,4)
    !$OMP PARALLEL
    !$OMP DO
    do l=1, h
      do k=1, g
        do j=1, m
          do i=1,n
            a(i,j,k,l) = x
          end do
        end do
      end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine
  ! Assignment through an explicit full array section
  pure subroutine r_fill_span(a,x)
    real, intent(out) :: a(:,:,:,:)
    real, intent(in) :: x
    a(:,:,:,:) = x
  end subroutine
end program Console1
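For the OpenMP part of the question, a minimal sketch of one way to split the fill across threads is shown below. The names omp_fill_demo and fill_omp are made up for illustration, and the choice of collapse(2) over the two outer loops is my assumption about a reasonable split, not something measured above. The directives only take effect when OpenMP is enabled at compile time (for example -fopenmp with gfortran); otherwise they are ordinary comments and the code runs serially.

```fortran
program omp_fill_demo
  implicit none
  integer, parameter :: n = 20
  real, allocatable :: a(:,:,:,:)
  real :: x

  allocate(a(n,n,n,n))
  x = 4*atan(1.0)
  call fill_omp(a, x)

  ! Every element was assigned exactly x, so exact comparison is safe.
  if (any(a /= x)) error stop 'fill_omp left elements unset'
  print *, 'fill ok, elements:', size(a)
contains
  ! Not pure: OpenMP parallel directives are not allowed in PURE procedures.
  subroutine fill_omp(a, x)
    real, intent(out) :: a(:,:,:,:)
    real, intent(in) :: x
    integer :: i, j, k, l
    ! Threads share the combined (l,k) iteration space; each thread
    ! still writes contiguous runs because i stays innermost.
    !$omp parallel do collapse(2) private(i, j, k, l) shared(a, x)
    do l = 1, size(a,4)
      do k = 1, size(a,3)
        do j = 1, size(a,2)
          do i = 1, size(a,1)
            a(i,j,k,l) = x
          end do
        end do
      end do
    end do
    !$omp end parallel do
  end subroutine fill_omp
end program omp_fill_demo
```

Parallelizing over the last indices rather than writing A(i,:,:,:) = 0.0 per thread keeps each thread's stores contiguous. Wrapping a = x in !$omp parallel workshare is another option, though how well that is parallelized depends on the compiler.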
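A side note on precision and round-off error. At the end I computed sum(a) and compared it with the expected value n*n*n*n*x = 40715040.79.
With /fp:fast=2 I get sum(a) = 40738716.0.
With /fp:precise I get sum(a) = 46579532.0.
This is quite surprising: the precise floating-point model gives a much less accurate result than the fast model.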
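One possible explanation, which is my assumption and not something I verified against the compiler's generated code: in the precise model the sum is accumulated strictly in order in single precision, so once the running total exceeds 2**24 every added x is rounded to the coarse spacing of the accumulator, while the fast model is allowed to reorder the sum into several partial sums that lose less. The sketch below (program and variable names are made up) compares a strictly sequential single-precision sum with a double-precision accumulator over the same data.

```fortran
program sum_precision_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  integer, parameter :: n = 60
  integer, parameter :: total = n**4      ! 12,960,000 elements
  real(sp), allocatable :: a(:)
  real(sp) :: x, s_seq
  real(dp) :: s_dp, expected
  integer :: i

  allocate(a(total))
  x = 4*atan(1.0_sp)
  a = x

  ! Reference value: the single-precision x times the element count,
  ! evaluated in double precision.
  expected = real(total, dp) * real(x, dp)

  ! Strictly ordered accumulation in single precision.
  s_seq = 0.0_sp
  do i = 1, total
    s_seq = s_seq + a(i)
  end do

  ! Same data, double-precision accumulator.
  s_dp = 0.0_dp
  do i = 1, total
    s_dp = s_dp + real(a(i), dp)
  end do

  print *, 'expected          ', expected
  print *, 'single accumulator', s_seq, ' error', real(s_seq, dp) - expected
  print *, 'double accumulator', s_dp,  ' error', s_dp - expected

  if (abs(s_dp - expected) >= 1.0_dp) error stop 'double sum inaccurate'
  print *, 'double accumulator within tolerance'
end program sum_precision_demo
```

If the sequential single-precision error turns out to be of the same size as the /fp:precise result above, that would support the explanation. Either way, accumulating in a wider type is a cheap way to make the check independent of the floating-point model.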
Here are the compiler options I used:
[IFORT]
/nologo /O3 /Qparallel /heap-arrays200 /fp:fast=2 /module:x64\Release\ /object:
x64\Release\ /Fdx64\Release\vc150.pdb /libs:dll /threads /c /Qlocation,link,C:\
Program Files (x86)\Microsoft Visual Studio17\Community\VC\Tools\MSVC.16.
27023\bin\HostX64\x64 /Qm64