Possible race condition despite barrier at every step in OpenMP

I am trying to write a Fortran application using OpenMP. I have extensive experience with MPI, but I am still struggling with OpenMP synchronization. Consider the following example:

module ps_mod

    contains
    subroutine cg(A,b,N)
        use iso_fortran_env
        implicit none
        real(real64)        :: A(:,:), b(:)
        integer(int32)      :: N

        integer                   :: i
        real(real64)              :: alpha
        real(real64), allocatable :: v(:), p0(:)


        allocate(v(N))
        allocate(p0(N))

        !$omp parallel shared(v,p0,A,b,i)
        v(:)   = 10000.
        alpha = 1.0_real64
        p0(:) = b(:)

        !$omp barrier
        call dotp(p0, v, alpha, N)
        !$omp barrier

        alpha = 10.0/alpha
        !$omp barrier

        print *, "alpha", alpha

        !$omp barrier
        !$omp end parallel

    end subroutine cg

    subroutine dotp(x,y,res,n)
        !res = sum(x*y)
        use iso_fortran_env
        implicit none
        integer(int32)  :: n, j
        real(real64)    :: x(n), y(n), res

        res=0.0_real64
        !$omp barrier
        !$omp do private(j) reduction(+:res)
        do j = 1, n
            res = res + x(j)*y(j)
        enddo
        !$omp barrier
    end subroutine dotp

end module ps_mod

!------------------------ main function
program main_omp
    use iso_fortran_env
    use ps_mod
    implicit none
    real(real64), allocatable           :: mat(:,:), y(:)
    integer(int32) :: n, i

    n = 8000
    allocate(mat(n, n))
    allocate(y(n))

    mat = 0.0_real64
    do i = 1,n
        mat(i,i) = 1.0_real64
    enddo
    y = 0.2_real64

    call cg(mat, y, n)
end program main_omp

It takes a simple matrix and vector, performs some computations on them, and prints the reduced output of a single variable, alpha.

I compile it with gfortran 7.3.1 using gfortran -O3 -fopenmp main_omp.f90 and run it with 5 threads (export OMP_NUM_THREADS=5). The output changes between runs of for i in $(seq 1 20); do ./a.out; done:

...
 alpha   6.2500000000000005E-007
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   6.2500000000000005E-007
 alpha   6.2500000000000005E-007
...

The same thing happens with fewer threads, just more rarely. With a single thread, however, I always get the same result, so I do not think this is a memory allocation problem (I also tested my main program with libasan, which reports no leaks). I therefore believe it is an OpenMP race condition, but I cannot diagnose where. As you can see, I have explicitly placed barriers everywhere.


Update:

After talking to @Giles, I found that the following construct does work if I use a dummy variable and explicitly make alpha private, but I do not know why:

real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
        v(:)   = 10000.
        alpha = 1.0_real64
        p0(:) = b(:)

        !$omp barrier
        call dotp(p0, v, s, N) !<----- dummy s
        !$omp barrier
        alpha = s  !<------ assign dummy s value to private alpha
        !$omp barrier
        alpha = 10.0/alpha

You may have a synchronization at every line, but you do not have synchronization between every memory access, and it is the order of the memory accesses that you need to be absolutely certain about. In particular, the line

alpha = 10.0/alpha

has multiple threads potentially reading and writing a shared variable at the same time; the memory accesses are not ordered simply because they appear on the same line. A simple example:

ijb@LAPTOP-GUG8KQ9I:~/work/stack$ cat race.f90 
Program in_line_racing

  Implicit None

  Real :: alpha, s

  alpha = 3.0

  !$omp parallel default( none ) shared( alpha ) private( s )
  alpha = 1.0 / alpha
  s = alpha
  Write( *, * ) s
  !$omp end parallel

End Program in_line_racing
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran -fopenmp race.f90 
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ export OMP_NUM_THREADS=4
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
  0.333333343    
  0.333333343    
  0.333333343    
   3.00000000    
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
   3.00000000    
   3.00000000    
  0.333333343    
  0.333333343    
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
   3.00000000    
   3.00000000    
  0.333333343    
  0.333333343    
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
   3.00000000    
  0.333333343    
   3.00000000    
  0.333333343    
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
   3.00000000    
  0.333333343    
  0.333333343    
  0.333333343 

The result you need from the reduction is a shared variable, so your updated version works because you first make sure the reduced variable is fully up to date on all threads, and only then perform the inversion on a private variable, which cannot suffer from a race condition.

Welcome to threaded programming. Bugs are easy!

Before:

    !$omp parallel shared(v,p0,A,b,i)
    v(:)   = 10000.     ! race, write to shared v(:)
    alpha = 1.0_real64  ! race, write to shared alpha 
    p0(:) = b(:)        ! race, write to shared p0(:)

    !$omp barrier
    call dotp(p0, v, alpha, N) ! not race, reduction(+:res)
    !$omp barrier

    alpha = 10.0/alpha  ! race, write to shared alpha
    !$omp barrier

    print *, "alpha", alpha, loc(alpha), loc(v(1)), loc(p0(1))

After:

    real(real64)::s
    !$omp parallel shared(v,p0,A,b,i) private(alpha)
    v(:)   = 10000.     ! race, write to shared v(:)
    alpha = 1.0_real64  ! not race, write to private alpha
    p0(:) = b(:)        ! race, write to shared p0(:)

    !$omp barrier
    call dotp(p0, v, s, N) ! not race, reduction(+:res)
    !$omp barrier
    alpha = s           ! not race, write to private alpha
    !$omp barrier       ! barrier not needed
    alpha = 10.0/alpha  ! not race, write to private alpha

The races in "v(:) =", "p0(:) =" and "res = 0.0_real64", where every thread repeatedly writes the same value, only cost performance.