Using cublas GEMM in a Python CUDA kernel
I have a simple matrix multiplication code as follows:
TPB = 32

@cuda.jit('void(double[:, :], double[:,:], double[:, :])', device=True)
def GPU_Mat2(A, B, C):
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    ROW = bx * TPB + tx
    COL = by * TPB + ty
    res = 0
    for k in range(A.shape[1]):
        if ROW < A.shape[0] and COL < B.shape[1]:
            res += A[ROW, k] * B[k, COL]
    cuda.syncthreads()
    if ROW < A.shape[0] and COL < B.shape[1]:
        C[ROW, COL] = res
    cuda.syncthreads()
Then I call this function twice in another kernel:
@cuda.jit('void(double[:, :], double[:,:], double[:, :], double[:, :])')
def call_Mat2(A, B, C, D):
    for _ in range(200):
        GPU_Mat2(A, B, C)
        GPU_Mat2(C, B, D)  # Is this correct?
Unfortunately, this procedure does not give me the correct answer when compared against the same computation on the host. Even when I use cuda.syncthreads() after each GPU_Mat2 call, the answer is still wrong. My question is: is it possible to use the output of a kernel call (here C) in another kernel as an input?
def main():
    N = 300
    A = np.asfortranarray(np.random.random_sample((N, N)))
    B = np.asfortranarray(np.random.random_sample((N, N)))
    C_GPU = np.zeros((N, N), dtype=np.double, order='F')
    D_GPU = np.zeros((N, N), dtype=np.double, order='F')
    numThreads = [TPB, TPB]
    numBlocks = [(A.shape[0] + TPB - 1) // TPB, (B.shape[1] + TPB - 1) // TPB]
    d_A = cuda.to_device(A)
    d_B = cuda.to_device(B)
    d_C = cuda.to_device(C_GPU)
    d_D = cuda.to_device(D_GPU)
    call_Mat2[numBlocks, numThreads](d_A, d_B, d_C, d_D)
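For reference, this is a minimal NumPy sketch of the host-side result the kernels are being compared against. Note that every pass of the 200-iteration loop recomputes the same values, so a single pass defines the expected output (the `C_ref`/`D_ref` names are illustrative, not from the original code):

```python
import numpy as np

N = 300
A = np.asfortranarray(np.random.random_sample((N, N)))
B = np.asfortranarray(np.random.random_sample((N, N)))

# Host-side reference: C = A @ B, then D = C @ B.
# Each of the 200 loop iterations recomputes these same values,
# so one pass is enough for the comparison.
C_ref = A @ B
D_ref = C_ref @ B
```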
Secondly, based on this, it is possible to call "blas GEMM" inside a kernel, but I could not find a similar example in a Python script. Does Python support this kind of call?
Thanks for your help.
According to the documentation:
Note: newer CUDA devices support device-side kernel launching; this feature is called dynamic parallelism but Numba does not support it currently)
So no, you currently cannot call other device libraries or @cuda.jit functions from numba-compiled CUDA Python.