In parallel computing, why is the execution time longer when using all threads (4) than when using only half of them (2)?

For example, I am using this code (CPU: 4 cores, one thread per core):

program main
use omp_lib
implicit none
integer, parameter:: ma=100, n=10000, mb= 100
integer:: istart, iend 
real, dimension (ma,n) :: a 
real, dimension (n,mb) :: b
real, dimension (ma,mb) :: c = 0. 

integer:: i,j,k, threads=2, ppt, thread_num

integer:: toc, tic, rate 
real:: time_parallel, time 

call random_number (a) 
call random_number (b)


!/////////////////////// 1- PARALLEL PRIVATE ///////////////////////
CALL system_clock(count_rate=rate)
call system_clock(tic)

ppt = ma/threads
  !$ call omp_set_num_threads(threads)
  
  ! j and k are listed explicitly; as sequential loop indices they would be private anyway
  !$omp parallel default(shared) private(istart, iend, &
  !$omp thread_num, i, j, k)
  
    !$ thread_num = omp_get_thread_num()
    !$ istart = thread_num*ppt +1 
    !$ iend = min(ma, thread_num*ppt + ppt) 

  do i= istart,iend
    do j= 1,mb
      do k= 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do 
    end do
  end do 
  
!$omp end parallel
! Stop the clock before printing so that I/O is not included in the measurement
call system_clock(toc)
time_parallel = real(toc-tic)/real(rate)
print*, 'Result in parallel mode' 
!$ print*, c(85:90,40)  


!/////////////////////// 2-normal execution ///////////////////////
 c = 0
CALL system_clock(count_rate=rate)
call system_clock(tic)

  do i= 1,ma
    do j= 1,mb
      do k= 1,n
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do 
    end do
  end do 
  
  
call system_clock(toc)
time =  real(toc-tic)/real(rate)
print*, 'Result in serial mode'
print*, c(85:90,40)  
print*, '------------------------------------------------'
print*, 'Threads: ', threads, '|  Time Parallel Private', time_parallel, 's '
print*, '                         Time Normal  ', time, 's'
!----------------------------------------------------------------


end program main

I get the following results:

First run:

 Result in parallel mode

   2477.89478       2528.50391       2511.84204       2528.12061       2500.79517       
2510.69971    

 Result in serial mode

   2477.89478       2528.50391       2511.84204       2528.12061       2500.79517       
2510.69971    


 Threads:            2 |  Time Parallel Private  0.379999995     s 

 Time Normal    0.603999972     s

Second run:

 Result in parallel mode

   2492.20679       2496.56152       2500.58203       2516.51685       2516.43604       
2530.71313    

 Result in serial mode

   2492.20679       2496.56152       2500.58203       2516.51685       2516.43604       
2530.71313    

 ------------------------------------------------

 Threads:            4 |  Time Parallel Private   1.11500001     s 

 Time Normal    0.486000001     s

Compiled with:

gfortran -Wall -fopenmp -g -O2 -o prog.exe prueba.f90 
./prog.exe

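If you have N cores and use N threads, some of your threads will be switched out from time to time so that other processes and threads can run, because the operating system and other programs also need CPU time. It is therefore better to use fewer threads than the number of available cores.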
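One way to try this suggestion is the sketch below. It is not from the original post: the program name, the "cores minus one" rule, the `parallel do` over columns with a scalar accumulator `s`, and the tolerance used in the check are all assumptions made for illustration. It asks the OpenMP runtime for the number of processors, leaves one of them free, and compares the parallel and serial timings using `omp_get_wtime`.

```fortran
program fewer_threads_sketch
  use omp_lib
  implicit none
  integer, parameter :: ma = 100, n = 10000, mb = 100
  real, allocatable :: a(:,:), b(:,:), c(:,:), c_ref(:,:)
  integer :: i, j, k, nprocs, threads
  double precision :: t0, t_par, t_ser
  real :: s
  logical :: ok

  allocate(a(ma,n), b(n,mb), c(ma,mb), c_ref(ma,mb))
  call random_number(a)
  call random_number(b)

  ! Assumption: keep one core free for the OS and other processes.
  nprocs  = omp_get_num_procs()
  threads = max(1, nprocs - 1)
  call omp_set_num_threads(threads)

  ! Parallel product: each thread handles whole columns of c and
  ! accumulates into a private scalar before storing the result.
  t0 = omp_get_wtime()
  !$omp parallel do default(shared) private(i, j, k, s) schedule(static)
  do j = 1, mb
     do i = 1, ma
        s = 0.0
        do k = 1, n
           s = s + a(i,k)*b(k,j)
        end do
        c(i,j) = s
     end do
  end do
  !$omp end parallel do
  t_par = omp_get_wtime() - t0

  ! Serial reference with the same loop order.
  t0 = omp_get_wtime()
  do j = 1, mb
     do i = 1, ma
        s = 0.0
        do k = 1, n
           s = s + a(i,k)*b(k,j)
        end do
        c_ref(i,j) = s
     end do
  end do
  t_ser = omp_get_wtime() - t0

  ! Loose relative tolerance (an assumption) to allow for rounding differences.
  ok = maxval(abs(c - c_ref)) <= 1.0e-3 * maxval(abs(c_ref))

  print *, 'Processors:', nprocs, ' Threads used:', threads
  print *, 'Parallel time (s):', t_par
  print *, 'Serial time (s):  ', t_ser
  print *, 'Results match:', ok
  if (.not. ok) error stop 'parallel and serial results differ'
end program fewer_threads_sketch
```

It can be built the same way as the program in the question, for example `gfortran -fopenmp -O2 -o sketch.exe sketch.f90` (the file name is hypothetical). Timings vary between runs and machines, so it is worth repeating the measurement several times with different values of `threads` before drawing conclusions.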