Why does a subroutine with an array as an input give faster performance than the same subroutine with an automatic local array?

Question

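I am rewriting some legacy code to improve readability and, I hope, make it easier to maintain.

I am trying to reduce the number of input arguments to a subroutine, but I found that changing subroutine sub(N, ID) --> subroutine sub(N) makes performance noticeably worse.

ID is only used inside sub, so I don't see the point of passing it in as an argument. Is it possible to use sub(N) without hurting performance? (For my use case, N < 10, and performance is 5-10 times worse.)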

Performance comparison:

sub_1
- N = 4, 0.9 s
- N = 20, 1.0 s
- N = 200, 2.1 s
sub_2
- N = 4, 0.07 s
- N = 20, 0.18 s
- N = 200, 1.3 s

I am using Mac OS 10.14.6 and gfortran 5.2.0.

program test
  integer, parameter  :: N = 1
  real, dimension(N)  :: ID


  call CPU_time(t1)

  do i = 1, 10000000
    CALL sub_1(N)
  end do

  call CPU_time(t2)
  write ( *, * ) 'Elapsed real time =', t2 - t1



  call CPU_time(t1)

  do i = 1, 10000000
    CALL sub_2(N, ID)
  end do

  call CPU_time(t2)
  write ( *, * ) 'Elapsed real time =', t2 - t1

end program test



SUBROUTINE sub_1(N)
  integer,            intent(in)      :: N
  real, dimension(N)                  :: ID

  ID = 0.0

END SUBROUTINE sub_1



SUBROUTINE sub_2(N, ID)
  integer,            intent(in)      :: N
  real, dimension(N), intent(in out)  :: ID

  ID = 0.0

END SUBROUTINE sub_2

Answer 1

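I assume it has to do with array allocation.

Allocating memory takes time in itself. When you pass the array as-is to subroutine sub_2, I think the subroutine most likely does not need to allocate any memory for it. This may assume the array is created on the heap rather than the stack, but I'm not 100% sure.

Subroutine sub_1, on the other hand, has to allocate space for the array afresh on every call.

Unfortunately I'm not very well versed in optimization, so I hope someone else will either agree with me or tell me I'm wrong ;)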
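
If per-call allocation really is the cost, one way to keep the shorter sub(N) interface is to give the local array a fixed compile-time size instead of a run-time one. The sketch below assumes N never exceeds a small known limit (the question mentions N < 10). NMAX, work_mod and sub_fixed are made-up names for illustration only, and whether this actually helps depends on the compiler and flags. gfortran also documents a -fstack-arrays option that places arrays of unknown size on the stack, which may be worth timing separately.

```fortran
! Sketch only: work_mod, sub_fixed and NMAX are hypothetical names,
! not part of the original code.
module work_mod
  implicit none
  integer, parameter :: NMAX = 10   ! assumed upper bound on N (question says N < 10)
contains
  subroutine sub_fixed(N, total)
    integer, intent(in)  :: N
    real,    intent(out) :: total
    real :: ID(NMAX)                ! size fixed at compile time, not an automatic array

    if (N < 1 .or. N > NMAX) error stop 'sub_fixed: N outside 1..NMAX'
    ID(1:N) = 1.0                   ! only the first N elements are touched
    total = sum(ID(1:N))            ! result is returned so the work is observable
  end subroutine sub_fixed
end module work_mod

program demo_fixed
  use work_mod
  implicit none
  real :: total

  call sub_fixed(4, total)
  ! Self-check: four elements set to 1.0 must sum to 4.0.
  if (abs(total - 4.0) > 1.0e-6) error stop 'demo_fixed: unexpected result'
  print *, 'sum of the first 4 elements =', total
end program demo_fixed
```

The trade-off is that the bound has to be chosen up front and checked, which is why the sketch stops with an error if N falls outside 1..NMAX.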

Answer 2

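This looks like a "feature" of the old version of gfortran you are using. If I use a later version, at least with N=10, the timings are much more comparable: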

ian@eris:~/work/stack$ head s.f90
program test
  integer, parameter  :: N = 10
  real, dimension(N)  :: ID


  call CPU_time(t1)

  do i = 1, 10000000
    CALL sub_1(N)
  end do
ian@eris:~/work/stack$ gfortran-5 --version
GNU Fortran (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
Copyright (C) 2015 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

ian@eris:~/work/stack$ gfortran-5 -O3 s.f90
ian@eris:~/work/stack$ ./a.out
 Elapsed real time =  0.149489999    
 Elapsed real time =   1.99675560E-06
ian@eris:~/work/stack$ gfortran-6 --version
GNU Fortran (Ubuntu 6.5.0-2ubuntu1~18.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ gfortran-6 -O3 s.f90
ian@eris:~/work/stack$ ./a.out
 Elapsed real time =   7.00005330E-06
 Elapsed real time =   5.00003807E-06
ian@eris:~/work/stack$ gfortran-7 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ gfortran-7 -O3 s.f90
ian@eris:~/work/stack$ ./a.out
 Elapsed real time =   8.00006092E-06
 Elapsed real time =   6.00004569E-06
ian@eris:~/work/stack$ gfortran-8 --version
GNU Fortran (Ubuntu 8.3.0-6ubuntu1~18.04.1) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ gfortran-8 -O3 s.f90
ian@eris:~/work/stack$ ./a.out
 Elapsed real time =   9.00030136E-06
 Elapsed real time =   6.00004569E-06

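However, I would take all of the above with a bucket-full of salt. The optimizer has most likely worked out that in this simple case it doesn't actually need to do anything, and so has simply thrown away all the operations you want to time. The only benchmark that can really tell you anything is the code you actually want to run.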
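
To make that concern concrete, here is a rough sketch of a timing harness in which each subroutine's result is actually used, so the compiler has less freedom to delete the work being timed. The names bench_mod, sub_1_used, sub_2_used and the checksum variable acc are my own additions, not from the question. An optimizer may still inline and simplify parts of this, so treat the numbers as indicative only.

```fortran
! Sketch only: bench_mod, sub_1_used and sub_2_used are hypothetical names.
module bench_mod
  implicit none
contains
  ! Automatic local array (as in sub_1), but the result leaves the subroutine.
  subroutine sub_1_used(N, x, total)
    integer, intent(in)  :: N
    real,    intent(in)  :: x
    real,    intent(out) :: total
    real, dimension(N)   :: ID
    ID = x
    total = sum(ID)
  end subroutine sub_1_used

  ! Caller-supplied array (as in sub_2), same work.
  subroutine sub_2_used(N, ID, x, total)
    integer, intent(in)                :: N
    real, dimension(N), intent(in out) :: ID
    real,    intent(in)                :: x
    real,    intent(out)               :: total
    ID = x
    total = sum(ID)
  end subroutine sub_2_used
end module bench_mod

program bench
  use bench_mod
  implicit none
  integer, parameter :: N = 4, NCALLS = 10000000
  real             :: ID(N), total, t1, t2
  double precision :: acc
  integer          :: i

  acc = 0.0d0
  call cpu_time(t1)
  do i = 1, NCALLS
    call sub_1_used(N, real(mod(i, 7)), total)   ! input varies with i
    acc = acc + total                            ! result feeds the checksum
  end do
  call cpu_time(t2)
  print *, 'sub_1_used: CPU time =', t2 - t1, ' checksum =', acc

  acc = 0.0d0
  call cpu_time(t1)
  do i = 1, NCALLS
    call sub_2_used(N, ID, real(mod(i, 7)), total)
    acc = acc + total
  end do
  call cpu_time(t2)
  print *, 'sub_2_used: CPU time =', t2 - t1, ' checksum =', acc
end program bench
```

Printing the checksum keeps the result observable, and both loops should print the same checksum since they do the same arithmetic. Even so, this only approximates what the real code does.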

Answer 3

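sub_1 and sub_2 are not comparable. In sub_1 you are allocating ID, initializing all of its elements, and then throwing it away when the subroutine returns (because it is local to the subroutine).

Since the ID array is never used, the compiler can optimize away its creation and initialization. If you compile with -O3, that is what gfortran does. The generated code for sub_1 just does a return.

In sub_2 it still has to set all elements of ID to 0.0.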


fortran

gfortran