如何以与 numpy linalg "inv" 或 "pinv" 函数相同的精度执行 PyCUDA 4x4 矩阵求逆
How to perform PyCUDA 4x4 matrix inversion with same accuracy than numpy linalg "inv" or "pinv" function
我的代码执行 4x4 矩阵求逆(128、256、512 次)时遇到了准确性问题。当我使用原始版本时,即 numpy 函数 np.linalg.inv
或 np.linalg.pinv
,一切正常。
不幸的是,使用下面的 CUDA 代码,我将 nan
和 inf
值转换为倒矩阵。
更明确地说,我将这个矩阵求逆:
2.120771107884677649e+09 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 3.557266600921528288e+27 3.557266600921528041e+07 3.557266600921528320e+17
0.000000000000000000e+00 3.557266600921528041e+07 3.557266600921528288e+27 3.557266600921528041e+07
0.000000000000000000e+00 3.557266600921528320e+17 3.557266600921528041e+07 1.778633300460764144e+27
如果我使用经典的 numpy“inv
”,我会得到以下倒置的 3x3 矩阵:
4.715266047722758306e-10 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 2.811147187396482366e-28 -2.811147186834252285e-48 -5.622294374792964645e-38
0.000000000000000000e+00 -2.811147186834252285e-48 2.811147187396482366e-28 -5.622294374230735768e-48
0.000000000000000000e+00 -5.622294374792964645e-38 -5.622294374230735768e-48 5.622294374792964732e-28
为了检查这个逆矩阵的有效性,我将它乘以原始矩阵,结果就是单位矩阵。
但是使用 CUDA GPU 求逆,我在求逆之后得到这个矩阵:
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
-inf -inf -9.373764907941219970e-01 -inf
inf nan -inf nan
所以,我想提高我的 CUDA 内核或 python 代码的精度以避免这些 nan
和 inf
值。
这里是CUDA内核代码和我主要代码的调用部分(我已经用numpy inv
函数评论了经典方法:
# Create arrayFullCross_vec array
arrayFullCross_vec = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Create arrayFullCross_vec array
invCrossMatrix_gpu = np.zeros((dimBlocks*dimBlocks*integ_prec**2))
# Create arrayFullCross_vec array
invCrossMatrix = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Build observables covariance matrix
arrayFullCross_vec = buildObsCovarianceMatrix4_vec(k_ref, mu_ref, ir)
"""
# Compute integrand from covariance matrix
for r_p in range(integ_prec):
for s_p in range(integ_prec):
# original version (without GPU)
invCrossMatrix[:,:,r_p,s_p] = np.linalg.inv(arrayFullCross_vec[:,:,r_p,s_p])
"""
# GPU version
invCrossMatrix_gpu = gpuinv4x4(arrayFullCross_vec.flatten(),integ_prec**2)
invCrossMatrix = invCrossMatrix_gpu.reshape(dimBlocks,dimBlocks,integ_prec,integ_prec)
"""
这里是 CUDA 内核代码和 gpuinv4x4
函数:
kernel = SourceModule("""
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}
const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = double or double only
typedef double T;
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}
""")
# python function for inverting 4x4 matrices
# n should be an even number
def gpuinv4x4(inp, n):
# internal constants not to be modified
hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
# Convert parameters into numpy array
# float32
"""
inpd = np.array(inp, dtype=np.float32)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float32)
"""
# float64
"""
inpd = np.array(inp, dtype=np.float64)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float64)
"""
# float128
inpd = np.array(inp, dtype=np.float128)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float128)
# Get kernel function
matinv4x4 = kernel.get_function("inv4x4")
# Define block, grid and compute
blockDim = (256,1,1) # do not change
gridDim = ((n/16)+1,1,1)
# Kernel function
matinv4x4 (
cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
block=blockDim, grid=gridDim)
return output
如您所见,我试图通过将 np.float32
替换为 np.float64
或 np.float128
来提高反转操作的准确性,但问题仍然存在。
我也用typedef double T;
替换了typedef float T;
,但没有成功。
我怎样才能对这些矩阵执行正确的求逆并且主要避免“nan
”和“inf
”值?我认为这是一个真正的精度问题,但我找不到如何规避这个问题。
这个问题有之前的相关问题here and (to a lesser extent) here。我不清楚为什么问题的标题指的是 3x3,问题中的粗体文本指的是 3x3,但提出的问题是 4x4 矩阵求逆(正如之前对 OP 所述,此代码可以 仅 用于反转 4x4 矩阵)。我将假设示例案例是所需的案例。
根据我的测试,唯一需要做的就是获取 previous answer code 并将其转换为使用 double
(或者在 pycuda 中,float64
)而不是 float
(或在 pycuda 中,float32
)。我认为这应该是显而易见的,因为示例矩阵值超出了 float32
类型的范围。这是一个有效的例子:
$ cat t10.py
import numpy as np
# import matplotlib.pyplot as plt
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}
const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = float or double only
typedef double T; // *** change this typedef to convert between float and double
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}
""")
# host code
def gpuinv4x4(inp, n):
# internal constants not to be modified
hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
# Convert parameters into numpy array
# *** change next line between float32 and float64 to match float or double
inpd = np.array(inp, dtype=np.float64)
hpatd = np.array(hpat, dtype=np.uint32)
# *** change next line between float32 and float64 to match float or double
output = np.empty((n*16), dtype= np.float64)
# Get kernel function
matinv4x4 = kernel.get_function("inv4x4")
# Define block, grid and compute
blockDim = (256,1,1) # do not change
gridDim = ((n/16)+1,1,1)
# Kernel function
matinv4x4 (
cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
block=blockDim, grid=gridDim)
return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 2.120771107884677649e+09, 0.0, 0.0, 0.0, 0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17, 0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07, 0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27)
n = 2
result = gpuinv4x4(inp, n)
print(result)
$ python t10.py
[ -3.00000000e+00 -5.00000000e-01 1.50000000e+00 1.00000000e+00
1.00000000e+00 2.50000000e-01 -2.50000000e-01 -5.00000000e-01
3.00000000e+00 2.50000000e-01 -1.25000000e+00 -5.00000000e-01
-3.00000000e+00 -0.00000000e+00 1.00000000e+00 1.00000000e+00
4.71526605e-10 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 2.81114719e-28 -2.81114719e-48 -5.62229437e-38
0.00000000e+00 -2.81114719e-48 2.81114719e-28 -5.62229437e-48
0.00000000e+00 -5.62229437e-38 -5.62229437e-48 5.62229437e-28]
$
除了更新输入矩阵以使用此问题中提供的矩阵之外,请注意,与之前的答案相比,仅更改了 3 行代码以将 32 位浮点数转换为 64 位浮点数。这 3 行中的每一行都由包含 triple-asterisk (***) 的注释标记。
此外,如果您担心此处显示的前 9 位左右数字以外的准确性,我还没有对此进行调查。可能此代码在数值上并不适合所有情况或需要。
顺便说一句,问题中的代码还在 python 部分显示了 float128
类型。没有与之对应的原生 CUDA 类型。
我的代码执行 4x4 矩阵求逆(128、256、512 次)时遇到了准确性问题。当我使用原始版本时,即 numpy 函数 np.linalg.inv
或 np.linalg.pinv
,一切正常。
不幸的是,使用下面的 CUDA 代码,我将 nan
和 inf
值转换为倒矩阵。
更明确地说,我将这个矩阵求逆:
2.120771107884677649e+09 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 3.557266600921528288e+27 3.557266600921528041e+07 3.557266600921528320e+17
0.000000000000000000e+00 3.557266600921528041e+07 3.557266600921528288e+27 3.557266600921528041e+07
0.000000000000000000e+00 3.557266600921528320e+17 3.557266600921528041e+07 1.778633300460764144e+27
如果我使用经典的 numpy“inv
”,我会得到以下倒置的 3x3 矩阵:
4.715266047722758306e-10 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 2.811147187396482366e-28 -2.811147186834252285e-48 -5.622294374792964645e-38
0.000000000000000000e+00 -2.811147186834252285e-48 2.811147187396482366e-28 -5.622294374230735768e-48
0.000000000000000000e+00 -5.622294374792964645e-38 -5.622294374230735768e-48 5.622294374792964732e-28
为了检查这个逆矩阵的有效性,我将它乘以原始矩阵,结果就是单位矩阵。
但是使用 CUDA GPU 求逆,我在求逆之后得到这个矩阵:
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
-inf -inf -9.373764907941219970e-01 -inf
inf nan -inf nan
所以,我想提高我的 CUDA 内核或 python 代码的精度以避免这些 nan
和 inf
值。
这里是CUDA内核代码和我主要代码的调用部分(我已经用numpy inv
函数评论了经典方法:
# Create arrayFullCross_vec array
arrayFullCross_vec = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Create arrayFullCross_vec array
invCrossMatrix_gpu = np.zeros((dimBlocks*dimBlocks*integ_prec**2))
# Create arrayFullCross_vec array
invCrossMatrix = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Build observables covariance matrix
arrayFullCross_vec = buildObsCovarianceMatrix4_vec(k_ref, mu_ref, ir)
"""
# Compute integrand from covariance matrix
for r_p in range(integ_prec):
for s_p in range(integ_prec):
# original version (without GPU)
invCrossMatrix[:,:,r_p,s_p] = np.linalg.inv(arrayFullCross_vec[:,:,r_p,s_p])
"""
# GPU version
invCrossMatrix_gpu = gpuinv4x4(arrayFullCross_vec.flatten(),integ_prec**2)
invCrossMatrix = invCrossMatrix_gpu.reshape(dimBlocks,dimBlocks,integ_prec,integ_prec)
"""
这里是 CUDA 内核代码和 gpuinv4x4
函数:
kernel = SourceModule("""
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}
const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = double or double only
typedef double T;
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}
""")
# python function for inverting 4x4 matrices
# n should be an even number
def gpuinv4x4(inp, n):
# internal constants not to be modified
hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
# Convert parameters into numpy array
# float32
"""
inpd = np.array(inp, dtype=np.float32)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float32)
"""
# float64
"""
inpd = np.array(inp, dtype=np.float64)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float64)
"""
# float128
inpd = np.array(inp, dtype=np.float128)
hpatd = np.array(hpat, dtype=np.uint32)
output = np.empty((n*16), dtype= np.float128)
# Get kernel function
matinv4x4 = kernel.get_function("inv4x4")
# Define block, grid and compute
blockDim = (256,1,1) # do not change
gridDim = ((n/16)+1,1,1)
# Kernel function
matinv4x4 (
cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
block=blockDim, grid=gridDim)
return output
如您所见,我试图通过将 np.float32
替换为 np.float64
或 np.float128
来提高反转操作的准确性,但问题仍然存在。
我也用typedef double T;
替换了typedef float T;
,但没有成功。
我怎样才能对这些矩阵执行正确的求逆并且主要避免“nan
”和“inf
”值?我认为这是一个真正的精度问题,但我找不到如何规避这个问题。
这个问题有之前的相关问题here and (to a lesser extent) here。我不清楚为什么问题的标题指的是 3x3,问题中的粗体文本指的是 3x3,但提出的问题是 4x4 矩阵求逆(正如之前对 OP 所述,此代码可以 仅 用于反转 4x4 矩阵)。我将假设示例案例是所需的案例。
根据我的测试,唯一需要做的就是获取 previous answer code 并将其转换为使用 double
(或者在 pycuda 中,float64
)而不是 float
(或在 pycuda 中,float32
)。我认为这应该是显而易见的,因为示例矩阵值超出了 float32
类型的范围。这是一个有效的例子:
$ cat t10.py
import numpy as np
# import matplotlib.pyplot as plt
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}
const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = float or double only
typedef double T; // *** change this typedef to convert between float and double
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}
""")
# host code
def gpuinv4x4(inp, n):
# internal constants not to be modified
hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
# Convert parameters into numpy array
# *** change next line between float32 and float64 to match float or double
inpd = np.array(inp, dtype=np.float64)
hpatd = np.array(hpat, dtype=np.uint32)
# *** change next line between float32 and float64 to match float or double
output = np.empty((n*16), dtype= np.float64)
# Get kernel function
matinv4x4 = kernel.get_function("inv4x4")
# Define block, grid and compute
blockDim = (256,1,1) # do not change
gridDim = ((n/16)+1,1,1)
# Kernel function
matinv4x4 (
cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
block=blockDim, grid=gridDim)
return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 2.120771107884677649e+09, 0.0, 0.0, 0.0, 0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17, 0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07, 0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27)
n = 2
result = gpuinv4x4(inp, n)
print(result)
$ python t10.py
[ -3.00000000e+00 -5.00000000e-01 1.50000000e+00 1.00000000e+00
1.00000000e+00 2.50000000e-01 -2.50000000e-01 -5.00000000e-01
3.00000000e+00 2.50000000e-01 -1.25000000e+00 -5.00000000e-01
-3.00000000e+00 -0.00000000e+00 1.00000000e+00 1.00000000e+00
4.71526605e-10 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 2.81114719e-28 -2.81114719e-48 -5.62229437e-38
0.00000000e+00 -2.81114719e-48 2.81114719e-28 -5.62229437e-48
0.00000000e+00 -5.62229437e-38 -5.62229437e-48 5.62229437e-28]
$
除了更新输入矩阵以使用此问题中提供的矩阵之外,请注意,与之前的答案相比,仅更改了 3 行代码以将 32 位浮点数转换为 64 位浮点数。这 3 行中的每一行都由包含 triple-asterisk (***) 的注释标记。
此外,如果您担心此处显示的前 9 位左右数字以外的准确性,我还没有对此进行调查。可能此代码在数值上并不适合所有情况或需要。
顺便说一句,问题中的代码还在 python 部分显示了 float128
类型。没有与之对应的原生 CUDA 类型。