Sequential dot_product in OpenACC Fortran loop
In a Fortran program, I have a big loop containing several dot_product
calls on small vectors that are generated inside the loop:
program test
  implicit none

  real :: array1(2, 2), array2(2, 2), res(2)
  real :: subarray1(2), subarray2(2)
  integer :: i

  array1 = 1
  array2 = 2

  !$acc data copyin(array1, array2) copyout(res)
  !$acc kernels
  !$acc loop independent private(subarray1, subarray2)
  do i = 1, 2
    subarray1(:) = array1(:, i)
    subarray2(:) = array2(:, i)
    res(i) = dot_product(subarray1, subarray2)
  enddo
  !$acc end kernels
  !$acc end data
  print "(2(g0, x))", res
endprogram
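When I compile this with the PGI compiler, the accelerated implementation of dot_product
appears to use its own accelerated (vector) loop, which prevents the main loop from being parallelized across both gang and vector: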
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable
As the log shows, it uses an implicit reduction and shared memory for the loop-private vectors.
Is there a way to force dot_product to run sequentially?
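As long as you don't mind the array-syntax assignments also running sequentially, just add "gang vector" to the loop directive. The outer loop is then scheduled across both gang and vector, and the compiler runs the inner array operations and the dot_product as "loop seq":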
% cat test.f90
program test
  implicit none

  real :: array1(2, 2), array2(2, 2), res(2)
  real :: subarray1(2), subarray2(2)
  integer :: i

  array1 = 1
  array2 = 2

  !$acc data copyin(array1, array2) copyout(res)
  !$acc kernels loop gang vector private(subarray1, subarray2)
  do i = 1, 2
    subarray1(:) = array1(:, i)
    subarray2(:) = array2(:, i)
    res(i) = dot_product(subarray1, subarray2)
  enddo
  !$acc end data
  print "(2(g0, x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         14, !$acc loop seq
         16, !$acc loop seq
     13, Local memory used for subarray2,subarray1
     14, Loop is parallelizable
     16, Loop is parallelizable
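
If you would rather state the sequential intent explicitly instead of relying on how the compiler schedules dot_product, one possible alternative (not part of the answer above, just a sketch) is to write the dot product as an inner loop marked "loop seq". The program name test_seq, the parameters n and m, the accumulator s, and the use of "parallel loop" instead of "kernels loop" are all assumptions made for this illustration, and it drops the temporary subarrays entirely:

```fortran
program test_seq
  implicit none

  ! n = vector length, m = number of outer iterations (illustrative names)
  integer, parameter :: n = 2, m = 2
  real :: array1(n, m), array2(n, m), res(m)
  real :: s
  integer :: i, k

  array1 = 1
  array2 = 2

  !$acc data copyin(array1, array2) copyout(res)
  ! Outer loop is spread across gangs and vector lanes.
  !$acc parallel loop gang vector private(s)
  do i = 1, m
    s = 0.0
    ! Inner loop is explicitly sequential: each thread computes one dot product.
    !$acc loop seq
    do k = 1, n
      s = s + array1(k, i) * array2(k, i)
    enddo
    res(i) = s
  enddo
  !$acc end data

  print "(2(g0, 1x))", res
endprogram test_seq
```

With the inputs above, each element of res should be 4 (1*2 + 1*2). Compiling with the same "nvfortran -acc -Minfo=accel" command lets you check that the inner loop is actually reported as "loop seq"; I have not verified the exact log this variant produces.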