Giving initial values to an array and vectorization in Fortran

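My question is about the fastest way to give initial values to an array in Fortran 90 or later, for both serial code and OpenMP. I can try

(a) A = 0.0; or

(b) nested loops with A(i, j, ...) = 0.0, ordering the loops to suit vectorization (first index in the innermost loop)

I somehow remember, but cannot find a reference after a few Google searches, that the compiler will try to vectorize (a). Below is a test at the serial level (sorry that the code is messy and not procedural; some variable names were carried over from earlier replies).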

Program vectorization_test

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  real :: A(20,20,20,20), sum_time
  integer :: i,j,k,l,n,m, m_iter
  Integer( li ) :: start, finish, rate
  

  m_iter = 10
  n = 20
  sum_time = 0.0
  do m = 1, m_iter

    Call System_clock( start, rate )
    A= 0.0
    Call System_clock( finish, rate )  
  
    write(*,*) 'time 1', Real( finish - start, wp ) / rate   
    sum_time = sum_time +  Real( finish - start, wp ) / rate   
  end do 

  write(*,*) 'average time', sum_time / m_iter



  sum_time = 0.0  
  do m = 1, m_iter
    Call System_clock( start, rate )
    do l = 1, n
      do k = 1, n
         do j = 1, n
           do i = 1, n
             A(i,j,k,l) = 0.0
           end do 
         end do   
      end do      
    end do        
 
    Call System_clock( finish, rate )  
  
    write(*,*) 'time 2', Real( finish - start, wp ) / rate  
    sum_time = sum_time +  Real( finish - start, wp ) / rate 
  end do   

  write(*,*) 'average time 2', sum_time / m_iter
  

  sum_time = 0.0  
  do m = 1, m_iter
    Call System_clock( start, rate )
    do l = 1, n
      do j = 1, n      
        do k = 1, n
           do i = 1, n
             A(i,j,k,l) = 0.0
           end do 
         end do   
      end do      
    end do        
 
    Call System_clock( finish, rate )  
  
    write(*,*) 'time 3', Real( finish - start, wp ) / rate  
    sum_time = sum_time +  Real( finish - start, wp ) / rate 
  end do   

  write(*,*) 'average time 3', sum_time / m_iter

  

  sum_time = 0.0  
  do m = 1, m_iter
    Call System_clock( start, rate )
    do i = 1, n
      do j = 1, n      
        do k = 1, n
           do l = 1, n
             A(i,j,k,l) = 0.0
           end do 
         end do   
      end do      
    end do        
 
    Call System_clock( finish, rate )  
  
    write(*,*) 'time 4', Real( finish - start, wp ) / rate  
    sum_time = sum_time +  Real( finish - start, wp ) / rate 
  end do   
  write(*,*) 'average time 4', sum_time / m_iter
    
end program vectorization_test

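On a laptop with 16 GB of memory, using gfortran-11 -o3, I got (times in seconds) average time 3.76699973E-05, average time 2 5.98790008E-04, average time 3 6.55650045E-04, average time 4 3.10386019E-03. On a computing center machine with 384 GB of memory, I got average time 4.75034976E-05, average time 2 4.47604398E-04, average time 3 4.70327737E-04, average time 4 4.14085982E-04. Larger array sizes show a similar trend.

I am not sure whether this holds for other compilers. It seems that the innermost loop is the most critical one for vectorization.

So my questions are: (1) is there any reference on vectorization and initialization of arrays; (2) if I use OpenMP, should I loop over a single index, something like A(i,:,:,:) = 0.0?

P.S. Array initialization is most likely not the bottleneck, so this question is mostly out of curiosity.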

Try to make the first index the one that varies fastest

Call System_clock( start, rate )
do l = 1, n
  do k = 1, n      
    do j = 1, n
       do i = 1, n
         A(i,j,k,l) = 0.0
       end do 
     end do   
  end do      
end do        
Call System_clock( finish, rate ) 

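Since Fortran is column-major, elements that differ only in the first index sit next to each other in memory. Running the first index in the innermost loop therefore makes good use of the CPU cache and avoids excessive main-memory accesses, which can be on the order of 100 times slower than cache accesses.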
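To make the column-major point concrete, here is a minimal sketch (the program name and the small 2x3 array are only for illustration). It encodes the indices in each value and then flattens the array in array element order, so the output shows which elements are neighbours in memory.

```fortran
program column_major_demo
  implicit none
  integer :: a(2,3), flat(6), i, j

  ! Encode the indices in the value: a(i,j) = 10*i + j
  do j = 1, 3
    do i = 1, 2
      a(i,j) = 10*i + j
    end do
  end do

  ! reshape walks the array in array element order (first index fastest),
  ! which is also the storage order of a contiguous array
  flat = reshape(a, [6])

  print '(6(i0,1x))', flat   ! prints: 11 21 12 22 13 23
end program column_major_demo
```

The first digit (the first index) changes between neighbouring elements, which is why the loop over the first index belongs innermost.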

In the end, I don't think this will make much of a difference, because compilers are very good at optimizing code.
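For the OpenMP part of the question, one option that follows from the same column-major argument is to split the work over the last index rather than the first, so that each thread writes a contiguous block. This is only a sketch under that assumption, not a measured recommendation; the program name is made up, and the directives take effect only when OpenMP is enabled at compile time (for example -fopenmp with gfortran).

```fortran
program omp_fill_demo
  implicit none
  integer, parameter :: n = 20
  real :: a(n,n,n,n)
  integer :: l

  ! Each iteration fills one contiguous slab of n**3 elements.
  ! A(i,:,:,:) would instead touch elements that are n apart in memory.
  !$omp parallel do
  do l = 1, n
    a(:,:,:,l) = 0.0
  end do
  !$omp end parallel do

  print *, 'all zero: ', all(a == 0.0)
end program omp_fill_demo
```

For an array this small the thread start-up cost may well outweigh any gain, so it is worth timing against the plain A = 0.0.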

In my tests with ifort, in a parallel Release build, I got two sets of results depending on the floating-point setting.

I measured initializations per second:

Method  /fp:fast  /fp:precise  Description
LOOP    440.9171  403.2258     Four loops
ATOM    443.4590  432.5259     a=x
SPAN    443.8526  457.8755     a(:,:,:,:)=x
PARA    445.0378  438.4042     $omp parallel

Code listing:

program Console1

implicit none

! Variables
integer, parameter :: n = 60, repeat=1000
integer :: iter
real :: x, a(n,n,n,n)
integer(8) :: tic, toc, rate

! Body of Console1
x = 4*atan(1.0)   ! fill value (pi)

! Time each fill method over `repeat` calls and
! report initializations per second (ips)
call SYSTEM_CLOCK(tic,rate)
do iter=1, repeat
    call r_fill_loop(a,x)
end do
call SYSTEM_CLOCK(toc,rate)
print *, "LOOP", (rate*repeat)/real(toc-tic), "ips"

call SYSTEM_CLOCK(tic,rate)
do iter=1, repeat
    call r_fill_atom(a,x)
end do
call SYSTEM_CLOCK(toc,rate)
print *, "ATOM", (rate*repeat)/real(toc-tic), "ips"

call SYSTEM_CLOCK(tic,rate)
do iter=1, repeat
    call r_fill_span(a,x)
end do
call SYSTEM_CLOCK(toc,rate)
print *, "SPAN", (rate*repeat)/real(toc-tic), "ips"

call SYSTEM_CLOCK(tic,rate)
do iter=1, repeat
    call r_fill_parallel(a,x)
end do
call SYSTEM_CLOCK(toc,rate)
print *, "PARA", (rate*repeat)/real(toc-tic), "ips"

contains

pure subroutine r_fill_loop(a,x)
real, intent(out) :: a(:,:,:,:)
real, intent(in) :: x
integer :: n, m, g, h
integer :: i,j,k,l

    n = size(a,1)
    m = size(a,2)
    g = size(a,3)
    h = size(a,4)
    
    do l=1, h
        do k=1, g
            do j=1, m
                do i=1,n
                    a(i,j,k,l) = x
                end do
            end do
        end do
    end do    

end subroutine

pure subroutine r_fill_atom(a,x)
real, intent(out) :: a(:,:,:,:)
real, intent(in) :: x
    a = x
end subroutine

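! Not declared pure: OpenMP parallel/do directives may not appear in pure procedures.
! The directives only take effect when OpenMP is enabled at compile time
! (e.g. /Qopenmp for ifort); otherwise they are treated as comments.
subroutine r_fill_parallel(a,x)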
real, intent(out) :: a(:,:,:,:)
real, intent(in) :: x
integer :: n, m, g, h
integer :: i,j,k,l

    n = size(a,1)
    m = size(a,2)
    g = size(a,3)
    h = size(a,4)
    
    !$OMP PARALLEL
    !$OMP DO 
    do l=1, h
        do k=1, g
            do j=1, m
                do i=1,n
                    a(i,j,k,l) = x
                end do
            end do
        end do
    end do  
    !$OMP END DO
    !$OMP END PARALLEL
end subroutine

pure subroutine r_fill_span(a,x)
real, intent(out) :: a(:,:,:,:)
real, intent(in) :: x

    a(:,:,:,:) = x

end subroutine


end program Console1

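A side note on precision and rounding error. At the end I computed sum(a) and compared it with the expected value n*n*n*n*x = 40715040.79.

With /fp:fast=2 I got sum(a) = 40738716.0

With /fp:precise I got sum(a) = 46579532.0

This is very surprising: the precise floating-point model gives a far less accurate result than the fast model.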
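A likely explanation, not verified against the ifort builds above, is the accumulator width. A real32 value has a 24-bit significand, so once a running total passes 2**24 = 16777216 the spacing between representable values is 2 or more, and every added element of about 3.14 is rounded to a multiple of that spacing. A strict model has to add the elements one by one in order, while a fast model may keep several partial sums, each of which stays smaller. The sketch below (the program and variable names are made up) compares a sequential real32 accumulator with a real64 one on the same data; the exact figures depend on compiler and flags, so no output is quoted.

```fortran
program sum_precision_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  integer, parameter :: n = 60
  real(sp), allocatable :: a(:,:,:,:)
  real(sp) :: x, s_single
  real(dp) :: s_double, expected
  integer :: i, j, k, l

  allocate(a(n,n,n,n))
  x = 4*atan(1.0_sp)
  a = x

  s_single = 0.0_sp
  s_double = 0.0_dp
  do l = 1, n
    do k = 1, n
      do j = 1, n
        do i = 1, n
          s_single = s_single + a(i,j,k,l)            ! single-precision running total
          s_double = s_double + real(a(i,j,k,l), dp)  ! double-precision running total
        end do
      end do
    end do
  end do

  expected = real(n, dp)**4 * real(x, dp)
  print *, 'expected          ', expected
  print *, 'real32 accumulator', s_single
  print *, 'real64 accumulator', s_double
  print *, 'intrinsic sum(a)  ', sum(a)
end program sum_precision_demo
```

If an accurate total matters, accumulating in real64 as in s_double sidesteps the issue regardless of the floating-point model.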

Here are the compiler options I used:

 [IFORT]
 /nologo /O3 /Qparallel /heap-arrays200 /fp:fast=2 /module:x64\Release\ /object:
 x64\Release\ /Fdx64\Release\vc150.pdb /libs:dll /threads /c /Qlocation,link,C:\
 Program Files (x86)\Microsoft Visual Studio17\Community\VC\Tools\MSVC.16.
 27023\bin\HostX64\x64 /Qm64