Max reduce in CUDA Fortran
I'm trying to perform a reduction in CUDA Fortran; what I've done so far is a two-step reduction (see the CUDA kernels below).
In the first kernel I do some simple computation, and I declare a shared array for a block of threads to store the values of abs(a - anew); once the threads are synchronized, I compute the max of this shared array and store it in an intermediate array of dimension gridDim%x * gridDim%y.
In the second kernel, I read this array (in a single block of threads) and try to compute its max.
Here is the full code:
module commons
  integer, parameter :: dp=kind(1.d0)
  integer, parameter :: nx=1024, ny=1024
  integer, parameter :: block_dimx=16, block_dimy=32
end module commons

module kernels
  use commons
contains
  attributes(global) subroutine kernel_gpu_reduce(a, anew, error, nxi, nyi)
    implicit none
    integer, value, intent(in) :: nxi, nyi
    real(dp), dimension(nxi,nyi), intent(in) :: a
    real(dp), dimension(nxi,nyi), intent(inout) :: anew
    real(dp), dimension(nxi/block_dimx+1,nyi/block_dimy+1), intent(inout) :: error
    real(dp), shared, dimension(block_dimx,block_dimy) :: err_sh
    integer :: i, j, k, tx, ty

    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
    tx = threadIdx%x
    ty = threadIdx%y

    if (i > 1 .and. i < nxi .and. j > 1 .and. j < nyi) then
       anew(i,j) = 0.25d0*(a(i-1,j) + a(i+1,j) &
            &            + a(i,j-1) + a(i,j+1))
       err_sh(tx,ty) = abs(anew(i,j) - a(i,j))
    endif
    call syncthreads()

    error(blockIdx%x,blockIdx%y) = maxval(err_sh)
  end subroutine kernel_gpu_reduce

  attributes(global) subroutine max_reduce(local_error, error, nxi, nyi)
    implicit none
    integer, value, intent(in) :: nxi, nyi
    real(dp), dimension(nxi,nyi), intent(in) :: local_error
    real(dp), intent(out) :: error
    real(dp), shared, dimension(nxi) :: shared_error
    integer :: tx, i

    tx = threadIdx%x

    shared_error(tx) = 0.d0
    if (tx >= 1 .and. tx <= nxi) shared_error(tx) = maxval(local_error(tx,:))
    call syncthreads()

    error = maxval(shared_error)
  end subroutine max_reduce
end module kernels
program laplace
  use cudafor
  use kernels
  use commons
  implicit none

  real(dp), allocatable, dimension(:,:) :: a, anew
  real(dp) :: error=1.d0
  real(dp), device, allocatable, dimension(:,:) :: adev, adevnew
  real(dp), device, allocatable, dimension(:,:) :: edev
  real(dp), allocatable, dimension(:,:) :: ehost
  real(dp), device :: error_dev
  integer :: i
  integer :: num_device, h_status, ierrSync, ierrAsync
  type(dim3) :: dimGrid, dimBlock

  num_device = 0
  h_status = cudaSetDevice(num_device)

  dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)
  dimBlock = dim3(block_dimx, block_dimy, 1)

  allocate(a(nx,ny), anew(nx,ny))
  allocate(adev(nx,ny), adevnew(nx,ny))
  allocate(edev(dimGrid%x,dimGrid%y), ehost(dimGrid%x,dimGrid%y))

  do i = 1, nx
     a(i,:) = 1.d0
     anew(i,:) = 1.d0
  enddo

  adev = a
  adevnew = anew

  call kernel_gpu_reduce<<<dimGrid, dimBlock>>>(adev, adevnew, edev, nx, ny)

  ierrSync = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
       & 'Sync kernel error - 1st kernel:', cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
       & 'Async kernel error - 1st kernel:', cudaGetErrorString(ierrAsync)

  call max_reduce<<<1, dimGrid%x>>>(edev, error_dev, dimGrid%x, dimGrid%y)

  ierrSync = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
       & 'Sync kernel error - 2nd kernel:', cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
       & 'Async kernel error - 2nd kernel:', cudaGetErrorString(ierrAsync)

  error = error_dev
  print*, 'error from kernel: ', error
  ehost = edev
  error = maxval(ehost)
  print*, 'error from host: ', error

  deallocate(a, anew, adev, adevnew, edev, ehost)
end program laplace
I first had a problem with the kernel configuration of the second kernel (<<<1, dimGrid>>>); I modified the code following Robert's answer. Now I get a memory access error:
Async kernel error - 2nd kernel:
an illegal memory access was encountered
0: copyout Memcpy (host=0x666bf0, dev=0x4203e20000, size=8) FAILED: 77(an illegal memory access was encountered)
And, if I run it with cuda-memcheck:
========= Invalid __shared__ write of size 8
========= at 0x00000060 in kernels_max_reduce_
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x00000008 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x2c5) [0x14ad95]
for every thread.
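For reference, the cuda-memcheck run above is simply the compiled binary wrapped by the checker (assuming the executable name from the compile line below):

cuda-memcheck ./laplace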
The code is compiled with PGI Fortran 14.9 and CUDA 6.5, on a Tesla K20 card (CUDA capability 3.5). I compile it with:
pgfortran -Mcuda -ta:nvidia,cc35 laplace.f90 -o laplace
You can do proper cuda error checking in CUDA Fortran. You should do so in your code.
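For reference, the pattern is only a few lines per kernel launch (a minimal sketch; the code in the question already follows it):

ierrSync = cudaGetLastError()        ! catches launch (synchronous) errors
ierrAsync = cudaDeviceSynchronize()  ! catches execution (asynchronous) errors
if (ierrSync /= cudaSuccess) write(*,*) 'Sync error: ', cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) 'Async error: ', cudaGetErrorString(ierrAsync)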
One problem is that you are trying to launch too many threads (per block) in your second kernel:

call max_reduce<<<1, dimGrid>>>(edev, error_dev, dimGrid%x, dimGrid%y)
                     ^^^^^^^

The dimGrid parameter has previously been computed to be:

dimGrid = dim3(nx/block_dimx+1, ny/block_dimy+1, 1)

Substituting in actual values, we have:

dimGrid = dim3(1024/16 + 1, 1024/32 + 1)

i.e.

dimGrid = dim3(65, 33)

But you are not allowed to request 65*33 = 2145 threads per block. The maximum is either 512 or 1024, depending on the device architecture target you are compiling for.

Because of this error, your second kernel is not running at all.
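A configuration that stays within the limit is the one the updated question already uses: a single block of dimGrid%x threads. One further point, offered as an assumption since it is not covered above: shared_error in max_reduce is an automatic shared array (its size nxi is a run-time argument), and in CUDA Fortran such arrays are carved out of dynamic shared memory, whose size in bytes is supplied as the third launch-configuration argument. A sketch of such a launch:

! One block of dimGrid%x (= 65) threads: within the per-block limit.
! The third chevron argument reserves dimGrid%x * 8 bytes of dynamic
! shared memory (8 bytes per real(dp) element) for the run-time-sized
! shared array shared_error(nxi).  This sizing is an assumption, not
! something established in the answer above.
call max_reduce<<<1, dimGrid%x, dimGrid%x*8>>>(edev, error_dev, dimGrid%x, dimGrid%y)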