尽管 OpenMP 中的每一步都存在障碍,但可能存在竞争条件
Possible race condition despite barrier at every step in OpenMP
我正在尝试使用 openMP 编写一个堡垒运行 应用程序。我在 MPI 方面拥有丰富的经验,但仍在为 OMP 同步而苦苦挣扎。请考虑以下示例:
module ps_mod
contains
subroutine cg(A,b,N)
use iso_fortran_env
implicit none
real(real64) :: A(:,:), b(:)
integer(int32) :: N
integer :: i
real(real64) :: alpha
real(real64), allocatable :: v(:), p0(:)
allocate(v(N))
allocate(p0(N))
!$omp parallel shared(v,p0,A,b,i)
v(:) = 10000.
alpha = 1.0_real64
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, alpha, N)
!$omp barrier
alpha = 10.0/alpha
!$omp barrier
print *, "alpha", alpha
!$omp barrier
!$omp end parallel
end subroutine cg
subroutine dotp(x,y,res,n)
!res = sum(x*y)
use iso_fortran_env
implicit none
integer(int32) :: n, j
real(real64) :: x(n), y(n), res
res=0.0_real64
!$omp barrier
!$omp do private(j) reduction(+:res)
do j = 1, n
res = res + x(j)*y(j)
enddo
!$omp barrier
end subroutine dotp
end module ps_mod
!------------------------ main function
program main_omp
use iso_fortran_env
use ps_mod
implicit none
real(real64), allocatable :: mat(:,:), y(:)
integer(int32) :: n, i
n = 8000
allocate(mat(n, n))
allocate(y(n))
mat = 0.0_real64
do i = 1,n
mat(i,i) = 1.0_real64
enddo
y = 0.2_real64
call cg(mat, y, n)
end program main_omp
它采用简单的矩阵和向量并对它们执行一些计算,减少一个变量 alpha 的输出。
我使用 gfort运行 7.3.1 gfortran -O3 -fopenmp main_omp.f90
和 运行 编译它,使用 export OMP_NUM_THREADS=5
5 个线程;我得到的输出在运行之间发生变化 for i in $(seq 1 20);do ./a.out ;done
.
...
alpha 6.2500000000000005E-007
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 6.2500000000000005E-007
alpha 6.2500000000000005E-007
...
对于较小的线程也一样,但很少见。但是,对于单线程,我总是得到相同的结果。因此我不认为它有内存分配不当的问题(在我的主程序中,我也用 libasan 测试过,它说没有泄漏)。因此,我相信这是一个开放的 MP 竞争条件,但我无法诊断出在哪里。正如你所看到的,我已经明确地在所有地方设置了障碍。
更新:
在与@Giles 交谈后,我发现如果我使用虚拟变量并显式将 alpha 设为私有,则以下构造确实有效,但我不知道为什么。
real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
v(:) = 10000.
alpha = 1.0_real64
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, s, N) !<----- dummy s
!$omp barrier
alpha = s !<------ assign dummy s value to private alpha
!$omp barrier
alpha = 10.0/alpha
您可能在每一行都有一个同步,但您没有在每个内存访问之间进行同步,而这正是您需要绝对确定的顺序。特别是行
alpha = 1.0 / alpha
有多个线程可能同时读取和写入一个共享变量;内存访问没有顺序,只是因为它们在同一行上。一个简单的例子:
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ cat race.f90
Program in_line_racing
Implicit None
Real :: alpha, s
alpha = 3.0
!$omp parallel default( none ) shared( alpha ) private( s )
alpha = 1.0 / alpha
s = alpha
Write( *, * ) s
!$omp end parallel
End Program in_line_racing
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran -fopenmp race.f90
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ export OMP_NUM_THREADS=4
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
0.333333343
0.333333343
0.333333343
3.00000000
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
3.00000000
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
0.333333343
0.333333343
你需要减少的结果是一个共享变量,所以你的更新版本可以工作,因为你首先确保减少的变量在所有线程上都是完全最新的,然后对私有变量执行逆操作不能忍受竞争条件。
欢迎使用线程编程。错误很容易!
之前:
!$omp parallel shared(v,p0,A,b,i)
v(:) = 10000. ! race, write to shared v(:)
alpha = 1.0_real64 ! race, write to shared alpha
p0(:) = b(:) ! race, write to shared p0(:)
!$omp barrier
call dotp(p0, v, alpha, N) ! not race, reduction(+:res)
!$omp barrier
alpha = 10.0/alpha ! race, write to shared alpha
!$omp barrier
print *, "alpha", alpha, loc(alpha), loc(v(1)), loc(p0(1))
之后:
real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
v(:) = 10000. ! race, write to shared v(:)
alpha = 1.0_real64 ! not race, write to private alpha
p0(:) = b(:) ! race, write to shared p0(:)
!$omp barrier
call dotp(p0, v, s, N) ! not race, reduction(+:res)
!$omp barrier
alpha = s ! not race, write to private alpha
!$omp barrier ! barrier don't needed
alpha = 10.0/alpha ! not race, write to private alpha
“v(:) =”、“p0(:) =”和“res=0.0_real64”中的竞争 - 重复写入相同的值只会导致性能损失。
我正在尝试使用 openMP 编写一个堡垒运行 应用程序。我在 MPI 方面拥有丰富的经验,但仍在为 OMP 同步而苦苦挣扎。请考虑以下示例:
module ps_mod
contains
subroutine cg(A,b,N)
use iso_fortran_env
implicit none
real(real64) :: A(:,:), b(:)
integer(int32) :: N
integer :: i
real(real64) :: alpha
real(real64), allocatable :: v(:), p0(:)
allocate(v(N))
allocate(p0(N))
!$omp parallel shared(v,p0,A,b,i)
v(:) = 10000.
alpha = 1.0_real64
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, alpha, N)
!$omp barrier
alpha = 10.0/alpha
!$omp barrier
print *, "alpha", alpha
!$omp barrier
!$omp end parallel
end subroutine cg
subroutine dotp(x,y,res,n)
!res = sum(x*y)
use iso_fortran_env
implicit none
integer(int32) :: n, j
real(real64) :: x(n), y(n), res
res=0.0_real64
!$omp barrier
!$omp do private(j) reduction(+:res)
do j = 1, n
res = res + x(j)*y(j)
enddo
!$omp barrier
end subroutine dotp
end module ps_mod
!------------------------ main function
program main_omp
use iso_fortran_env
use ps_mod
implicit none
real(real64), allocatable :: mat(:,:), y(:)
integer(int32) :: n, i
n = 8000
allocate(mat(n, n))
allocate(y(n))
mat = 0.0_real64
do i = 1,n
mat(i,i) = 1.0_real64
enddo
y = 0.2_real64
call cg(mat, y, n)
end program main_omp
它采用简单的矩阵和向量并对它们执行一些计算,减少一个变量 alpha 的输出。
我使用 gfort运行 7.3.1 gfortran -O3 -fopenmp main_omp.f90
和 运行 编译它,使用 export OMP_NUM_THREADS=5
5 个线程;我得到的输出在运行之间发生变化 for i in $(seq 1 20);do ./a.out ;done
.
...
alpha 6.2500000000000005E-007
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 6.2500000000000005E-007
alpha 6.2500000000000005E-007
...
对于较小的线程也一样,但很少见。但是,对于单线程,我总是得到相同的结果。因此我不认为它有内存分配不当的问题(在我的主程序中,我也用 libasan 测试过,它说没有泄漏)。因此,我相信这是一个开放的 MP 竞争条件,但我无法诊断出在哪里。正如你所看到的,我已经明确地在所有地方设置了障碍。
更新:
在与@Giles 交谈后,我发现如果我使用虚拟变量并显式将 alpha 设为私有,则以下构造确实有效,但我不知道为什么。
real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
v(:) = 10000.
alpha = 1.0_real64
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, s, N) !<----- dummy s
!$omp barrier
alpha = s !<------ assign dummy s value to private alpha
!$omp barrier
alpha = 10.0/alpha
您可能在每一行都有一个同步,但您没有在每个内存访问之间进行同步,而这正是您需要绝对确定的顺序。特别是行
alpha = 1.0 / alpha
有多个线程可能同时读取和写入一个共享变量;内存访问没有顺序,只是因为它们在同一行上。一个简单的例子:
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ cat race.f90
Program in_line_racing
Implicit None
Real :: alpha, s
alpha = 3.0
!$omp parallel default( none ) shared( alpha ) private( s )
alpha = 1.0 / alpha
s = alpha
Write( *, * ) s
!$omp end parallel
End Program in_line_racing
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran -fopenmp race.f90
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ export OMP_NUM_THREADS=4
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
0.333333343
0.333333343
0.333333343
3.00000000
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
3.00000000
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
0.333333343
0.333333343
你需要减少的结果是一个共享变量,所以你的更新版本可以工作,因为你首先确保减少的变量在所有线程上都是完全最新的,然后对私有变量执行逆操作不能忍受竞争条件。
欢迎使用线程编程。错误很容易!
之前:
!$omp parallel shared(v,p0,A,b,i)
v(:) = 10000. ! race, write to shared v(:)
alpha = 1.0_real64 ! race, write to shared alpha
p0(:) = b(:) ! race, write to shared p0(:)
!$omp barrier
call dotp(p0, v, alpha, N) ! not race, reduction(+:res)
!$omp barrier
alpha = 10.0/alpha ! race, write to shared alpha
!$omp barrier
print *, "alpha", alpha, loc(alpha), loc(v(1)), loc(p0(1))
之后:
real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
v(:) = 10000. ! race, write to shared v(:)
alpha = 1.0_real64 ! not race, write to private alpha
p0(:) = b(:) ! race, write to shared p0(:)
!$omp barrier
call dotp(p0, v, s, N) ! not race, reduction(+:res)
!$omp barrier
alpha = s ! not race, write to private alpha
!$omp barrier ! barrier don't needed
alpha = 10.0/alpha ! not race, write to private alpha
“v(:) =”、“p0(:) =”和“res=0.0_real64”中的竞争 - 重复写入相同的值只会导致性能损失。