Can't get same values as numpy elementwise matrix multiplication using numba
I have been looking into numba and trying to implement a simple elementwise matrix multiplication. When using 'vectorize' I get the same results as numpy multiplication, but when I use 'cuda.jit' they are not the same; many of the values are zeros. I provide a minimal working example below. Any help with this problem would be appreciated. I am using numba 0.35.0 and python 2.7.
from __future__ import division
from __future__ import print_function
import numpy as np
from numba import vectorize, cuda, jit
M = 80
N = 40
P = 40
# Set the number of threads in a block
threadsperblock = 32
# Calculate the number of thread blocks in the grid
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock
@vectorize(['float32(float32,float32)'], target='cuda')
def VectorMult3d(a, b):
    return a*b
@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])')
def mult_gpu_3d(a, b, c):
    [x, y, z] = cuda.grid(3)
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
        c[x, y, z] = a[x, y, z] * b[x, y, z]
if __name__ == '__main__':
    A = np.random.normal(size=(M, N, P)).astype(np.float32)
    B = np.random.normal(size=(M, N, P)).astype(np.float32)
    numpy_C = A*B
    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu)
    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu)
    cudajit_C = C_gpu.copy_to_host()
    print('------- using cuda.jit -------')
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C)))
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P))
    print('------- using cuda.jit -------\n')
    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu)
    vectorize_C = vectorize_C_gpu.copy_to_host()
    print('------- using vectorize -------')
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C)))
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P))
    print('------- using vectorize -------\n')
    import numba; print("numba version: "+numba.__version__)
Here is how you could debug the problem.
Consider a smaller and simplified example with:
- reduced array sizes, e.g. (2, 3, 1), so you can actually print the values and read them
- simple, deterministic contents, e.g. "all ones", to compare across runs
- additional kernel arguments for debugging
from __future__ import (division, print_function)
import numpy as np
from numba import cuda
M = 2
N = 3
P = 1
threadsperblock = 1
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock
@cuda.jit
def mult_gpu_3d(a, b, c, grid_ran, grid_multed):
    grid = cuda.grid(3)
    x, y, z = grid
    grid_ran[x] = 1
    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]):
        grid_multed[x] = 1
        c[grid] = a[grid] * b[grid]
if __name__ == '__main__':
    A = np.ones((M, N, P), np.int32)
    B = np.ones((M, N, P), np.int32)
    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu = cuda.to_device(np.zeros_like(A))
    # Tells whether the thread at index i has run
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32))
    # Tells whether the thread at index i has performed the multiplication
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32))
    mult_gpu_3d[blockspergrid, threadsperblock](
        A_gpu, B_gpu, C_gpu, grid_ran, grid_multed)
    print("grid_ran.shape : ", grid_ran.shape)
    print("grid_multed.shape : ", grid_multed.shape)
    print("C_gpu.shape : ", C_gpu.shape)
    print("grid_ran : ", grid_ran.copy_to_host())
    print("grid_multed : ", grid_multed.copy_to_host())
    C = C_gpu.copy_to_host()
    print("C transpose flat : ", C.T.flatten())
    print("C : \n", C)
Output:
grid_ran.shape : (6,)
grid_multed.shape : (6,)
C_gpu.shape : (2, 3, 1)
grid_ran : [1 1 1 1 1 1]
grid_multed : [1 1 0 0 0 0]
C transpose flat : [1 1 0 0 0 0]
C :
[[[1]
[0]
[0]]
[[1]
[0]
[0]]]
You can see that the device grid shape does not correspond to the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional, (M, N, P). That is, the first dimension of the grid covers indices in the range 0..M*N*P-1 (0..5, for a total of 6 values in my example), while the first dimension of an array only covers 0..M-1 (0..1, for a total of 2 values in my example). This mistake would normally lead to out-of-bounds access, but you have protected your kernel with a condition that cuts down the offending threads:
if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2])
This line prevents threads with indices above M-1 (1 in my example) from running (well, sort of [1]), which is why no values are written there and you get many zeros in the resulting array.
Possible solutions:
- In general, you could use a multi-dimensional kernel grid configuration, i.e. a 3D vector for blockspergrid instead of a scalar [2] (see the first sketch below).
- In particular, since elementwise multiplication is a map operation and does not depend on array shapes, you could flatten all 3 arrays into 1D arrays, run your kernel on a 1D grid, then reshape the result back [3], [4] (see the second sketch below).
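A minimal sketch of the first approach, assuming an 8x8x4 block shape (my choice; any block shape whose product stays within the 1024-threads-per-block limit would do):

import numpy as np
from numba import cuda

M, N, P = 80, 40, 40
# 3D launch configuration: threads per block and blocks per grid
# are both 3-tuples here. (8, 8, 4) = 256 threads per block is an
# assumption, not a requirement.
threadsperblock = (8, 8, 4)
blockspergrid = (
    (M + threadsperblock[0] - 1) // threadsperblock[0],
    (N + threadsperblock[1] - 1) // threadsperblock[1],
    (P + threadsperblock[2] - 1) // threadsperblock[2],
)

@cuda.jit
def mult_gpu_3d(a, b, c):
    x, y, z = cuda.grid(3)  # now indexes a genuinely 3D grid
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
        c[x, y, z] = a[x, y, z] * b[x, y, z]

A = np.random.normal(size=(M, N, P)).astype(np.float32)
B = np.random.normal(size=(M, N, P)).astype(np.float32)
C = np.zeros_like(A)
mult_gpu_3d[blockspergrid, threadsperblock](A, B, C)  # host arrays are copied automatically
print(np.allclose(A * B, C))  # expected: True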
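And a minimal sketch of the second, flattening approach (the kernel name mult_gpu_1d is mine):

import numpy as np
from numba import cuda

@cuda.jit
def mult_gpu_1d(a, b, c):
    i = cuda.grid(1)
    if i < c.shape[0]:
        c[i] = a[i] * b[i]

M, N, P = 80, 40, 40
A = np.random.normal(size=(M, N, P)).astype(np.float32)
B = np.random.normal(size=(M, N, P)).astype(np.float32)

threadsperblock = 32
blockspergrid = (M * N * P + threadsperblock - 1) // threadsperblock

# Run the kernel on 1D views of the inputs, then reshape the result back.
C_flat = np.zeros(M * N * P, dtype=np.float32)
mult_gpu_1d[blockspergrid, threadsperblock](A.ravel(), B.ravel(), C_flat)
C = C_flat.reshape(M, N, P)
print(np.allclose(A * B, C))  # expected: True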
References:
- [1]
- [2] Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
- [3] numpy.ndarray.flatten
- [4] numpy.ravel