python 使用 f2py 调用时 fortran 矩阵乘积变慢

Question

我一直在尝试使用 f2py 将用于向量和矩阵乘法的优化 fortran 代码与 python 接口。为了获得对我的目的有用的性能比较，我在一个周期内执行相同的产品 100000 次。使用完整的 Fortran 代码，该产品需要 2.4 秒（ifort），而使用 f2py 则需要大约 11 秒。仅供参考，使用 numpy 大约需要 20 秒。我要求 fortran 和 python 部分写入循环前后的时间差，使用 f2py 它们都写入 11 秒，因此代码不会在传递数组时浪费时间。我试图了解这是否是存储 numpy 数组的方式，但我不明白这个问题。你有什么主意吗？提前致谢

fortran 主程序

program Main
    implicit none
    save

    integer :: seed, i, j, k
    integer, parameter :: states =15
    integer, parameter :: tessere = 400
    real, dimension(tessere,states,states) :: matrix
    real, dimension(states) :: vector
    real :: start, finish
    real  :: prod(tessere)

    do i=1,tessere
       do j=1,states
          do k=1,states
              matrix(i,j,k) = i+j+k
          end do
       enddo
    end do
    do i=1,states
        vector(i) = i
    enddo
    call doubleSum(vector,vector,matrix,states,tessere,prod)

end program

fortran 子例程：

subroutine doubleSum(ket, bra, M , states, tessere,prod)
    integer :: its, j, k,t
    integer :: states
    integer :: tessere
    real, dimension(tessere,states,states) :: M
    real, dimension(states) :: ket
    real, dimension(states) :: bra
    real, dimension(tessere) :: prod
    real,dimension(tessere,states) :: ctmp

    call cpu_time(start)
    do t=1,100000
        ctmp=0.d0
        do k=1,states
             do j=1,states
                do its=1,tessere
                   ctmp(its,k)=ctmp(its,k)+ M(its,k,j)*ket(j)
                enddo
             enddo
        enddo
        do its=1,tessere
            prod(its)=dot_product(bra,ctmp(its,:))
        enddo
    enddo
    call cpu_time(finish)
    print '("Time = ",f6.3," seconds.")',finish-start
end subroutine

python 脚本

import numpy as np
import time
import cicloS


M= np.random.rand(400,15,15)
ket=np.random.rand(15)

#M=np.asfortranarray(M)
#ket=np.asfortranarray(ket)

import time


start=time.time()  
prod=cicloS.doublesum(ket,ket,M)
end=time.time()
print(end-start)

.pyf 文件由 f2py 生成并编辑

!    -*- f90 -*-
! Note: the context of this file is case sensitive.

python module cicloS 
    interface  
        subroutine doublesum(ket,bra,m,states,tessere,prod) 
            real dimension(states),intent(in) :: ket
            real dimension(states),depend(states),intent(in) :: bra
            real dimension(tessere,states,states),depend(states,states),intent(in) :: m
            integer, optional,check(len(ket)>=states),depend(ket) :: states=len(ket)
            integer, optional,check(shape(m,0)==tessere),depend(m) :: tessere=shape(m,0)
            real dimension(tessere),intent(out) :: prod
        end subroutine doublesum
    end interface
end python module cicloS

Answer 1

OP 指出，在代码的独立编译版本和 F2PY 编译版本之间观察到的执行时间差异是由于使用了不同的编译器和编译器标志。

为了获得一致的结果，从而回答问题，有必要确保 F2PY 使用所需的 1) 编译器和 2) 编译器标志。

第 1 部分：指定 F2PY

应使用哪个 Fortran 编译器

可以通过执行例如显示目标系统上 F2PY 可用的 Fortran 编译器列表。 python -m numpy.f2py -c --help-fcompiler。在我的系统上，这会产生 (t运行cated):

Fortran compilers found:
  --fcompiler=gnu95    GNU Fortran 95 compiler (7)
  --fcompiler=intelem  Intel Fortran Compiler for 64-bit apps (19.0.1.144)

您可以通过向编译命令添加适当的 --fcompiler 标志来指示 F2PY 使用哪个可用的 Fortran 编译器。对于使用 ifort 例如（假设您已经创建并编辑了 cicloS.pyf 文件）：

python -m numpy.f2py --fcompiler=intelem -c cicloS.pyf sub.f90

第 2 部分：指定 F2PY

使用的附加 编译器标志

请注意，上一步中 --help-fcompiler 的输出还显示了 F2PY 为每个可用编译器定义的默认编译器标志（参见 compiler_f90）。再次在我的系统上，这是（t运行分类并简化为最相关的标志）：

gnu95: -O3 -funroll-loops
英特尔：-O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512

您可以在编译命令中使用 --opt 标志为 F2PY 指定首选优化标志（另请参阅 documentation 中的 --f90flags），现在变成例如：

python -m numpy.f2py --fcompiler=intelem --opt='-O1' -c cicloS.pyf sub.f90

比较运行独立版本和 F2PY 版本的时间：

使用 ifort -O1 sub.f90 main.f90 -o main 和 第 2 部分 中的 F2PY 编译版本编译独立可执行文件，我得到以下输出：

./main
Time =  5.359 seconds.

python test.py
Time =  5.297 seconds.
5.316878795623779

然后，使用 ifort -O3 sub.f90 main.f90 -o main 和来自 第 1 部分 的（默认）F2PY 编译版本编译一个独立的可执行文件，我得到这些结果：

./main
Time =  1.297 seconds.

python test.py
Time =  1.219 seconds.
1.209657907485962

因此显示了独立版本和 F2PY 版本的相似结果，以及编译器标志的影响。

评论临时数组

虽然不是您观察到的减速的原因，但请注意 F2PY 被迫在您的 Python 示例中制作数组 M（和 ket）的临时副本两个原因：

您传递给 cicloS.doublesum() 的 3D 数组 M 是默认的 NumPy 数组，采用 C 排序（行优先）。由于 Fortran 使用列优先顺序，F2PY 将制作数组副本。注释掉的 np.asfortranarray() 应该可以解决这部分问题。
数组副本（也适用于 ket）的下一个原因是 Python（默认 64 位，双精度浮点数）和 Fortran（real 给出默认精度，可能是 32 位浮点数）示例的边。因此，再次制作副本以说明这一点。

通过向 F2PY 编译命令添加 -DF2PY_REPORT_ON_ARRAY_COPY=1 标志（也在 documentation 中），您可以在创建数组副本时收到通知。在您的情况下，可以通过更改 Python 中 M 和 ket 矩阵的 dtype 来完全避免数组副本（即 M=np.asfortranarray(M, dtype=np.float32)) 和 ket=np.asfortranarray(ket, dtype=np.float32)) ，或者通过使用适当的 kind 在 Fortran 代码中定义 real 变量（例如，将 use, intrinsic :: iso_fortran_env, only : real64 添加到子例程和主程序并使用 real(kind=real64) 定义实数。

python 使用 f2py 调用时 fortran 矩阵乘积变慢

fortran matrix product slows when called with f2py by python

python

f2py

第 1 部分：指定 F2PY

第 2 部分：指定 F2PY

比较 运行 独立版本和 F2PY 版本的时间：

评论临时数组

比较运行独立版本和 F2PY 版本的时间：