Single vs OpenMP vs MPI - Fortran
I am new to MPI programming. I have to test 3 codes: a sequential one, an OpenMP one, and an MPI one. These 3 codes (not the real codes, just examples) are given below.
Sequential code
program no_parallel
implicit none
integer, parameter :: dp = selected_real_kind(15,307)
integer :: i, j
real(kind = dp) :: time1, time2
real(kind = dp), dimension(1000000) :: a
!Initialisation
do i = 1, 1000000
a(i) = sqrt( dble(i) / 3.0d+0 )
end do
call cpu_time( time1 )
do j = 1, 1000
do i = 1, 1000000
a(i) = a(i) + sqrt( dble(i) )
end do
end do
call cpu_time( time2 )
print *, a(1000000)
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end program no_parallel
OpenMP code
program openmp
implicit none
integer, parameter :: dp = selected_real_kind(15,307)
integer :: i, j
real(kind = dp) :: time1, time2, omp_get_wtime
real(kind = dp), dimension(1000000) :: a
!Initialisation
do i = 1, 1000000
a(i) = sqrt( dble(i) / 3.0d+0 )
end do
time1 = omp_get_wtime()
!$omp parallel
do j = 1, 1000
!$omp do schedule( runtime )
do i = 1, 1000000
a(i) = a(i) + sqrt( dble(i) )
end do
!$omp end do
end do
!$omp end parallel
time2 = omp_get_wtime()
print *, a(1000000)
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end program openmp
MPI code
program MPI
implicit none
include "mpif.h"
integer, parameter :: dp = selected_real_kind(15,307)
integer :: ierr, num_procs, my_id, destination, tag, source, stat, i, j
real(kind = dp) :: time1, time2
real(kind = dp), dimension(1000000) :: a
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )
!Initialisation
do i = 1, 1000000
a(i) = sqrt( dble(i) / 3.0d+0 )
end do
destination = 0
tag = 999
source = 3
stat = MPI_STATUS_SIZE
time1 = MPI_Wtime()
do j = 1, 1000
do i = 1 + my_id, 1000000, num_procs
a(i) = a(i) + sqrt( dble(i) )
end do
end do
call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
if( my_id == source ) then
call MPI_SEND ( a(1000000), 1, MPI_DOUBLE_PRECISION, destination, tag, MPI_COMM_WORLD, ierr )
end if
if( my_id == destination ) then
call MPI_RECV ( a(1000000), 1, MPI_DOUBLE_PRECISION, source, tag, MPI_COMM_WORLD, stat, ierr )
end if
time2 = MPI_Wtime()
if( my_id == 0) then
print *, a(1000000) !, 'from ID =', my_id
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end if
stop
call MPI_FINALIZE ( ierr )
end program MPI
I compiled these codes with Intel Fortran Compiler 17.0.3 and the -O0 optimization flag. Both the OpenMP and MPI codes were executed on a 4-core Haswell desktop. I obtained CPU times of 8.08 s, 2.1 s, and 3.2 s for the sequential, OpenMP, and MPI codes respectively. Actually, I expected the OpenMP and MPI codes to give nearly similar timings; however, they did not. My questions:
1. Regarding the MPI code, if I want to print out the result of a(1000000), is there a smarter way to do it without the call MPI_SEND and call MPI_RECV?
2. Do you know which part of the MPI code could still be optimized?
3. Regarding source in the MPI code, can it be defined automatically? In this case it is easy for me: since the number of processors is 4, a(1000000) must be assigned to thread 3.
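(I suppose that, for the cyclic distribution above, the owning rank could be computed instead of hard-coded; this is just my guess, untested:)
! in the loop do i = 1 + my_id, 1000000, num_procs, element i is handled
! by rank mod(i - 1, num_procs), so the owner of a(1000000) would be
source = mod(1000000 - 1, num_procs)   ! = 3 for num_procs = 4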
Thank you in advance.
I have found that it usually takes more work in a SUBROUTINE or FUNCTION for the parallelism to pay off, so in this example focusing on vectorization is the best approach.
The motto is, "Vectorize inner - Parallelise outer" (VIPO).
For the second case, I would suggest the following:
MODULE MyOMP_Funcs
IMPLICIT NONE
PRIVATE
integer, parameter, PUBLIC :: dp = selected_real_kind(15,307)
real(kind = dp), dimension(1000000) :: a
PUBLIC MyOMP_Init, MyOMP_Sum
CONTAINS
!=================================
SUBROUTINE MyOMP_Init(N,A)
IMPLICIT NONE
integer , INTENT(IN ) :: N
real(kind = dp), dimension(n), INTENT(INOUT) :: A
integer :: I
!Initialisation
DO i = 1, n
A(i) = sqrt( dble(i) / 3.0d+0 )
ENDDO
RETURN
END SUBROUTINE MyOMP_Init
!=================================
SUBROUTINE MyOMP_Sum(N,A,SumA)
!$OMP DECLARE SIMD(MyOMP_Sum) UNIFORM(N,SumA) linear(ref(A))
USE OMP_LIB
IMPLICIT NONE
integer , INTENT(IN ) :: N
!DIR$ ASSUME_ALIGNED A: 64
real(kind = dp), dimension(n), INTENT(IN ) :: A
real(kind = dp) , INTENT( OUT) :: SumA
integer :: I
SumA = 0.0
!Maybe also try... !DIR$ VECTOR ALWAYS
!$OMP SIMD REDUCTION(+:SumA)
Sum_Loop: DO i = 1, N
SumA = SumA + A(i) + sqrt( dble(i) )
ENDDO Sum_Loop
!$OMP END SIMD !<-- You probably do not need this
RETURN
END SUBROUTINE MyOMP_Sum
!=================================
SUBROUTINE My_NOVEC_Sum(N,A,SumA)
IMPLICIT NONE
integer , INTENT(IN ) :: N
!DIR$ ASSUME_ALIGNED A: 64
real(kind = dp), dimension(n), INTENT(IN ) :: A
real(kind = dp) , INTENT( OUT) :: SumA
integer :: I
SumA = 0.0
!DIR$ NOVECTOR
Sum_Loop: DO i = 1, N
SumA = SumA + A(i) + sqrt( dble(i) )
ENDDO Sum_Loop
RETURN
END SUBROUTINE My_NOVEC_Sum
!=================================
END MODULE MyOMP_Funcs
!=================================
!=================================
program openmp
!USE OMP_LIB
USE MyOMP_Funcs
implicit none
integer , PARAMETER :: OneM = 1000000
integer , PARAMETER :: OneK = 1000
integer :: i, j
real(kind = dp) :: time1, time2, omp_get_wtime
!DIR$ ATTRIBUTES ALIGN : 64 :: A
real(kind = dp), dimension(OneM) :: A
real(kind = dp) :: SumA
!Initialisation
CALL MyOMP_Init(OneM, A)
time1 = omp_get_wtime()
! !$omp parallel
! do j = 1, OneK
CALL MyOMP_Sum(OneM, A, SumA)
! end do
! !$omp end parallel
!!--> Put timing loops here
time2 = omp_get_wtime()
print *, a(1000000)
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end program openmp
Once you have a SIMD REDUCTION version running, you can then try layering parallelism on top of it.
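A minimal sketch of what that layering might look like (my own illustration, not part of the original code; the chunk count is made up and assumed to divide n evenly):
! Thread-parallel outer loop over chunks, SIMD-vectorised inner sum.
program layered_sum
   implicit none
   integer, parameter :: dp = selected_real_kind(15,307)
   integer, parameter :: n = 1000000, nchunks = 8   ! nchunks assumed to divide n
   integer, parameter :: chunk = n / nchunks
   real(kind=dp) :: a(n), total, part
   integer :: i, c, lo
   do i = 1, n
      a(i) = sqrt( dble(i) / 3.0d+0 )
   end do
   total = 0.0_dp
   !$omp parallel do reduction(+:total) private(part, lo, i)
   do c = 0, nchunks - 1
      lo = c * chunk
      part = 0.0_dp
      ! inner sum over one contiguous chunk, vectorised
      !$omp simd reduction(+:part)
      do i = 1, chunk
         part = part + a(lo+i) + sqrt( dble(lo+i) )
      end do
      total = total + part
   end do
   !$omp end parallel do
   print *, total
end program layered_sum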
If the module is part of a library, the compiler settings are independent of the program.
Actually, your MPI program doesn't make much sense to me. Why do all ranks have the same complete array? Why do you copy the complete array? And why only between this particular source and destination?
The program does not compute anything useful, so it is hard to say what the correct program would be (it does not correctly compute anything useful).
In many MPI programs you never send and receive whole arrays. Not even whole local arrays, just some boundaries between them.
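For illustration (my own sketch, not from the question): a typical boundary exchange sends only the halo values between neighbouring ranks, for example with MPI_Sendrecv, so only two numbers per rank cross the wire, never the whole array. Periodic wraparound is used here for simplicity:
program halo_demo
   use mpi
   implicit none
   integer, parameter :: dp = selected_real_kind(15,307)
   integer, parameter :: my_n = 250000
   real(kind=dp) :: a(0:my_n+1)   ! owned cells 1..my_n plus two ghost cells
   integer :: ierr, my_id, num_procs, left, right
   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_id, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)
   left  = mod(my_id - 1 + num_procs, num_procs)
   right = mod(my_id + 1, num_procs)
   a = real(my_id, dp)
   ! send my last owned value right, receive left neighbour's into a(0)
   call MPI_Sendrecv(a(my_n), 1, MPI_DOUBLE_PRECISION, right, 0, &
                     a(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   ! send my first owned value left, receive right neighbour's into a(my_n+1)
   call MPI_Sendrecv(a(1),      1, MPI_DOUBLE_PRECISION, left,  1, &
                     a(my_n+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   call MPI_Finalize(ierr)
end program halo_demo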
So I came up with this. Note the use mpi, and that I removed the magic number 1000000 from everywhere.
I also removed the stop. A stop right before end is just a bad habit, but harmless; putting it before MPI_Finalize() is actively harmful.
And, most importantly, I distributed the work differently. Every rank has its own part of the array.
program Test_MPI
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(15,307)
integer :: ierr, num_procs, my_id, i, j
real(kind = dp) :: time1, time2
real(kind = dp), dimension(:), allocatable :: a
integer, parameter :: n = 1000000
integer :: my_n, ns
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )
my_n = n / num_procs
ns = my_id * my_n
if (my_id == num_procs-1) my_n = n - ns
allocate(a(my_n))
!Initialisation
do i = 1, my_n
a(i) = sqrt( real(i+ns, dp) / 3.0d+0 )
end do
time1 = MPI_Wtime()
do j = 1, 1000
do i = 1 , my_n
a(i) = a(i) + sqrt( real(i+ns, dp) )
end do
end do
call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
time2 = MPI_Wtime()
if( my_id == 0) then
!!!! why??? print *, a(my_n)
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end if
call MPI_FINALIZE ( ierr )
end program Test_MPI
And yes, there is no communication in it. I can't think of a reason why there should be any; if there should, you have to tell us why. It should scale almost perfectly.
Maybe you want to gather the final array in one rank? Many people do that, but it is often not needed at all. It is also not clear why it would be needed in your case.
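If you did want that, a hedged sketch (mine, not part of the answer) using MPI_Gatherv could slot into Test_MPI just before MPI_Finalize; it assumes two extra declarations at the top, integer, allocatable :: counts(:), displs(:) and real(kind=dp), allocatable :: a_full(:):
allocate(counts(num_procs), displs(num_procs))
! collect every rank's block length on rank 0
call MPI_Gather(my_n, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, &
                0, MPI_COMM_WORLD, ierr)
if (my_id == 0) then
   displs(1) = 0
   do i = 2, num_procs
      displs(i) = displs(i-1) + counts(i-1)
   end do
   allocate(a_full(n))
else
   allocate(a_full(1))   ! recv buffer is only significant on the root
end if
! gather the distributed blocks into a_full on rank 0
call MPI_Gatherv(a, my_n, MPI_DOUBLE_PRECISION, &
                 a_full, counts, displs, MPI_DOUBLE_PRECISION, &
                 0, MPI_COMM_WORLD, ierr)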
Finally, I got the solution to my problem. I had not realized before that the way the do loop of the serial code:
do i = 1, 1000000
a(i) = a(i) + sqrt( dble(i) )
end do
was distributed in the MPI code with a cyclic distribution:
do i = 1 + my_id, 1000000, num_procs
a(i) = a(i) + sqrt( dble(i) )
end do
was the problem. I assume this is because more cache misses occur: with 4 processes, under the cyclic distribution rank 0 touches a(1), a(5), a(9), ..., striding through the whole 8 MB array, whereas under a block distribution it works on one contiguous quarter of it. Therefore, I applied a block distribution to the MPI code instead of the cyclic one, which is more efficient (for this case!!!). I now write the revised MPI code as:
program Revised_MPI
use mpi
implicit none
integer, parameter :: dp = selected_real_kind(15,307), array_size = 1000000
integer :: ierr, num_procs, my_id, ista, iend, i, j
integer, dimension(:), allocatable :: ista_idx, iend_idx
real(kind = dp) :: time1, time2
real(kind = dp), dimension(:), allocatable :: a
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )
!Distribute loop with block distribution
call para_range ( 1, array_size, num_procs, my_id, ista, iend )
allocate ( a( ista : iend ), ista_idx( num_procs ), iend_idx( num_procs ) )
!Initialisation and saving ista and iend
do i = ista, iend
a(i) = sqrt( dble(i) / 3.0d+0 )
ista_idx( my_id + 1 ) = ista
iend_idx( my_id + 1 ) = iend
end do
time1 = MPI_Wtime()
!Performing main calculation for all processors (including master and slaves)
do j = 1, 1000
do i = ista_idx( my_id + 1 ), iend_idx( my_id + 1 )
a(i) = a(i) + sqrt( dble(i) )
end do
end do
call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
time2 = MPI_Wtime()
if( my_id == num_procs - 1 ) then
print *, a( array_size )
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
end if
call MPI_FINALIZE ( ierr )
deallocate ( a )
end program Revised_MPI
!-----------------------------------------------------------------------------------------
subroutine para_range ( n1, n2, num_procs, my_id, ista, iend )
implicit none
integer :: n1, n2, num_procs, my_id, ista, iend, &
iwork1, iwork2
iwork1 = ( n2 - n1 + 1 ) / num_procs
iwork2 = mod( n2 - n1 + 1, num_procs )
ista = my_id * iwork1 + n1 + min( my_id, iwork2 )
iend = ista + iwork1 - 1
if( iwork2 > my_id ) then
iend = iend + 1
end if
end subroutine para_range
!-----------------------------------------------------------------------------------------
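As a quick sanity check of para_range (my own arithmetic): with n1 = 1, n2 = 1000000, and num_procs = 4, we get iwork1 = 250000 and iwork2 = 0, so the ranks receive 1-250000, 250001-500000, 500001-750000, and 750001-1000000. The last element a(array_size) therefore lives on rank num_procs - 1, which is why that rank does the printing.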
Now, the MPI code achieves a (nearly) similar CPU time to the OpenMP one. It also works well with the optimization flags -O3 and -fast.
Thank you all for your help. :)