A CUDA error when a large array is used as input data
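I have code that does some calculations on the GPU using numba and CUDA 8.0 with Python 3.5. When the input is an array of size (50, 27), it runs successfully and gives the correct result. When I change the input data to size (200, 340), it raises an error.
I use shared memory in my code. Is there not enough shared memory? Or are the grid size and block size wrong? I don't know how to tell, or how to choose appropriate sizes for the grid and blocks.
I tried a smaller grid size and block size, and the error is the same.
What should I do to solve this problem? Thanks for any suggestions.
I simplified my code, and it produces the same error. The size of the input data is easy to set here: df = np.random.random_sample((300, 200)) + 10
Code: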
import os,sys,time,math
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = 8
tpb = (4,32)
tsize = (3,4)
hsize = (1,4)

@cuda.jit
def calcu_T(D, T):
    gw = cuda.gridDim.x
    bx = cuda.blockIdx.x
    tx = cuda.threadIdx.x
    bw = cuda.blockDim.x
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    c_num = D.shape[1]
    c_index = bx
    while c_index<c_num*c_num:
        c_x = int(c_index/c_num)
        c_y = c_index%c_num
        if c_x==c_y:
            T[c_x,c_y] = 0.0
        else:
            X = D[:,c_x]
            Y = D[:,c_y]
            hbuf = cuda.shared.array(hsize, float32)
            h = tx
            Xi = X[h:]
            Xi1 = X[:-h]
            Yih = Y[:-h]
            sbuf = cuda.shared.array(tsize, float32)
            L = len(Xi)
            #mean
            if ty==0:
                Xi_m = 0.0
                Xi1_m = 0.0
                Yih_m = 0.0
                for i in range(L):
                    Xi_m += Xi[i]
                    Xi1_m += Xi1[i]
                    Yih_m += Yih[i]
                Xi_m = Xi_m/L
                Xi1_m = Xi1_m/L
                Yih_m = Yih_m/L
                sbuf[0,tx] = Xi_m
                sbuf[1,tx] = Xi1_m
                sbuf[2,tx] = Yih_m
            cuda.syncthreads()
            sl = cuda.shared.array(tpb, float32)
            r_index = ty
            s_l = 0.0
            while r_index<L:
                s1 = 0.0
                for i in range(L):
                    s1 += (Xi[r_index]+Xi1[i])/sbuf[0,tx]
                s_l += s1
                r_index +=bh
            sl[tx,ty] = s_l
            cuda.syncthreads()
            #
            if ty==0:
                ht = 0.0
                for i in range(bh):
                    ht += sl[tx,i]
                hbuf[0,tx] = ht/L
            cuda.syncthreads()
            #max
            if tx==0 and ty==0:
                m_t = 0.0
                for index,ele in enumerate(hbuf[0]):
                    if index==0:
                        m_t = ele
                    elif ele>m_t:
                        m_t = ele
                T[c_x,c_y] = m_t
        c_index +=gw

df = np.random.random_sample((300, 200)) + 10
D = np.array(df, dtype=np.float32)
r,c = D.shape
T = np.empty([c,c])
dD = cuda.to_device(D)
dT = cuda.device_array_like(T)
calcu_T[bpg, tpb](dD,dT)
dT.copy_to_host(T)
Error:
Traceback (most recent call last):
  File "G:\myworkspace\python3.5\forte\forte170327\test10fortest8.py", line 118, in <module>
    dT.copy_to_host(T)
  File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\devicearray.py", line 198, in copy_to_host
    _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
  File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 1481, in device_to_host
    fn(host_pointer(dst), device_pointer(src), size, *varargs)
  File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 259, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 296, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
My device information:
Device 0:
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
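As a quick sanity check on the shared-memory question, the footprint of the three `cuda.shared.array` buffers can be worked out by hand from the shapes in the code and compared with the per-block limits reported above. This is only a sketch: the helper name `shared_bytes` is made up for illustration and is not part of numba.

```python
from functools import reduce
from operator import mul

FLOAT32_BYTES = 4

# Shapes copied from the posted code.
tpb = (4, 32)
tsize = (3, 4)
hsize = (1, 4)

# Limits copied from the device query output.
SHARED_PER_BLOCK = 49152
MAX_THREADS_PER_BLOCK = 1024

def shared_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed by one shared array of the given shape (hypothetical helper)."""
    return reduce(mul, shape, 1) * itemsize

# The kernel declares one float32 shared array for each of these shapes.
total_shared = sum(shared_bytes(s) for s in (tpb, tsize, hsize))
threads_per_block = tpb[0] * tpb[1]

print(total_shared, "bytes of shared memory per block, limit", SHARED_PER_BLOCK)
print(threads_per_block, "threads per block, limit", MAX_THREADS_PER_BLOCK)
```

The shared array shapes are fixed constants that do not depend on the input size, so shared memory use is the same for the small and the large input.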
There is nothing wrong with your code. If I run your code on my GTX 970, I get this:
In [11]: main??
Signature: main()
Source:
def main():
    df = np.random.random_sample((300, 200)) + 10
    D = np.array(df, dtype=np.float32)
    r,c = D.shape
    T = np.empty([c,c])
    dD = cuda.to_device(D)
    dT = cuda.device_array_like(T)
    calcu_T[bpg, tpb](dD,dT)
    dT.copy_to_host(T)
File: ~/SO/crash.py
Type: function
In [12]: %timeit -n 3 -r 3 main()
3 loops, best of 3: 6.61 s per loop
That is, there is no runtime error, but the Python code containing the kernel takes 6.6 seconds to run. If I profile the code with the CUDA profiler:
$ nvprof python crash.py
==13828== NVPROF is profiling process 13828, command: python crash.py
All finished
==13828== Profiling application: python crash.py
==13828== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 6.59109s 1 6.59109s 6.59109s 6.59109s cudapy::__main__::calcu_T1(Array<float, int=2, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>)
0.00% 26.271us 1 26.271us 26.271us 26.271us [CUDA memcpy DtoH]
0.00% 21.279us 1 21.279us 21.279us 21.279us [CUDA memcpy HtoD]
==13828== API calls:
Time(%) Time Calls Avg Min Max Name
98.51% 6.59118s 1 6.59118s 6.59118s 6.59118s cuMemcpyDtoH
1.42% 94.890ms 1 94.890ms 94.890ms 94.890ms cuDevicePrimaryCtxRetain
0.05% 3.4116ms 1 3.4116ms 3.4116ms 3.4116ms cuModuleLoadDataEx
0.01% 417.96us 1 417.96us 417.96us 417.96us cuLinkCreate
0.00% 227.57us 1 227.57us 227.57us 227.57us cuLinkAddData
0.00% 195.72us 2 97.859us 95.710us 100.01us cuMemAlloc
0.00% 190.10us 1 190.10us 190.10us 190.10us cuLinkComplete
0.00% 139.04us 1 139.04us 139.04us 139.04us cuMemGetInfo
0.00% 53.193us 1 53.193us 53.193us 53.193us cuMemcpyHtoD
0.00% 29.538us 1 29.538us 29.538us 29.538us cuDeviceGetName
0.00% 17.895us 1 17.895us 17.895us 17.895us cuLaunchKernel
0.00% 2.0250us 1 2.0250us 2.0250us 2.0250us cuCtxPushCurrent
0.00% 2.0150us 5 403ns 255ns 752ns cuFuncGetAttribute
0.00% 1.6260us 2 813ns 547ns 1.0790us cuDeviceGetCount
0.00% 1.1430us 1 1.1430us 1.1430us 1.1430us cuModuleGetFunction
0.00% 951ns 2 475ns 372ns 579ns cuDeviceGet
0.00% 796ns 1 796ns 796ns 796ns cuLinkDestroy
0.00% 787ns 1 787ns 787ns 787ns cuDeviceComputeCapability
You can see that the kernel you posted takes roughly 6.6 seconds to run.
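To see why the small input works and the large one does not, one can count how many times the innermost loop body runs. The sketch below is derived from the loop structure of the posted kernel (one pass per off-diagonal column pair, per lag `h = tx`, over `L * L` elements with `L = rows - h`). The function name `inner_iterations` is an assumption for illustration.

```python
def inner_iterations(rows, cols, lags=4):
    """Count executions of the innermost `s1 += ...` statement in calcu_T.

    Assumes the loop structure of the posted kernel: every off-diagonal
    (c_x, c_y) pair is processed for each lag h in range(lags), and the
    nested loops run L * L times with L = rows - h.
    """
    pairs = cols * (cols - 1)
    per_pair = sum((rows - h) ** 2 for h in range(lags))
    return pairs * per_pair

small = inner_iterations(50, 27)    # the input that works
large = inner_iterations(300, 200)  # the input used in the posted code
print(small, large, large / small)
```

By this count the (300, 200) input needs on the order of 2,000 times more inner-loop iterations than the (50, 27) input, so a kernel that previously finished in milliseconds can plausibly run for several seconds.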
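You haven't given any details, but I would guess that you are running on Windows, that your GPU is a display GPU, and that your code runs slowly enough to hit the WDDM display manager watchdog timeout limit. This is very well documented and has been asked about hundreds of times before, for example here.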
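One software-side workaround that is often suggested for watchdog timeouts is to split the work into several shorter kernel launches, so that no single launch runs longer than the limit. The sketch below only does the planning arithmetic. The 2-second limit is the commonly cited Windows default, the safety factor is an arbitrary assumption, and `plan_launches` is a hypothetical helper, not a numba or CUDA API.

```python
import math

def plan_launches(total_items, est_total_seconds, limit_seconds=2.0, safety=0.5):
    """Split `total_items` work items into chunks that should each finish
    well inside the watchdog limit.

    est_total_seconds is a measured or estimated time for all items in a
    single launch; limit_seconds and safety are assumptions to tune per
    machine. Returns a list of (start, stop) index ranges.
    """
    budget = limit_seconds * safety
    n_launches = max(1, math.ceil(est_total_seconds / budget))
    chunk = max(1, math.ceil(total_items / n_launches))
    return [(start, min(start + chunk, total_items))
            for start in range(0, total_items, chunk)]

# 200 * 200 column pairs, 6.6 s measured for one launch on the GTX 970 above.
ranges = plan_launches(200 * 200, 6.6)
print(len(ranges), ranges[0], ranges[-1])
```

Each range would then drive one launch, which assumes the kernel is changed to accept a start and stop value for `c_index`. The timing should be measured on the actual GPU, since the figure used here comes from a different card.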
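Your search engine of choice and the CUDA Windows getting started guide will give you information on the alternatives for improving the situation from an operating-system and hardware perspective. The most obvious one, however, is simply to improve your code so that it runs faster.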
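As one illustration of making the code faster: in the simplified kernel as posted, the innermost loop adds `Xi[r_index]` to every element of `Xi1`, so the double loop can be collapsed algebraically into two plain sums. The pure-Python sketch below checks that identity on made-up data. It applies only to the simplified kernel in the question, and the real computation may not simplify the same way.

```python
import math

def ht_double_loop(xi, xi1, mean):
    """Mirror of the kernel's nested loops for one lag, O(L^2)."""
    L = len(xi)
    s_l = 0.0
    for r in range(L):
        s1 = 0.0
        for i in range(L):
            s1 += (xi[r] + xi1[i]) / mean
        s_l += s1
    return s_l / L

def ht_closed_form(xi, xi1, mean):
    """Same quantity in O(L).

    sum_r sum_i (xi[r] + xi1[i]) = L * sum(xi) + L * sum(xi1),
    and dividing by L and by mean gives the expression below.
    """
    return (sum(xi) + sum(xi1)) / mean

# Made-up data in the same value range as the question (values near 10).
xi = [10.0 + 0.001 * k for k in range(300)]
xi1 = [10.5 - 0.002 * k for k in range(300)]
mean = sum(xi) / len(xi)

slow = ht_double_loop(xi, xi1, mean)
fast = ht_closed_form(xi, xi1, mean)
print(math.isclose(slow, fast, rel_tol=1e-9))
```

If the same rewrite holds for the real kernel, the work per column pair drops from quadratic to linear in the number of rows, which would shorten the kernel substantially.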