Cuda Tensor Cores: Matrix size only 16x16
I have this very simple code to multiply two matrices with CUDA Tensor Cores:
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int M = 16;
constexpr int N = 16;
constexpr int K = 16;
/*
* Matrix A = M x N, B = N x K, C = M x K => OUT = M x K
*/
__global__ void wmma_matrix_mult(half *a, half *b, float *out) {
    // Declare the fragments
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;
    // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);
    // Load the inputs (the last argument is the leading dimension)
    wmma::load_matrix_sync(a_frag, a, N);
    wmma::load_matrix_sync(b_frag, b, N);
    // Perform the matrix multiplication
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // Store the output
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}
As soon as M, N and K are anything other than 16, compilation fails with:
error: incomplete type is not allowed
error: no instance of function template "nvcuda::wmma::fill_fragment" matches the argument list
argument types are: (<error-type>, float)
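Does this mean that the dimensions of A and B must always be 16x16? I thought 4x4 or 8x8 would be allowed as well?
I compile it like this:
nvcc -arch=sm_75 -c ./src/main.cu -o ./src/build/main.o
So the architecture should be fine.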
I thought 4x4 or 8x8 would be allowed as well?
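Unfortunately not. Let's read some documentation.
For half-precision inputs with a single-precision accumulator, as in your use case, only the following sizes are supported: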
Matrix A    Matrix B    Accumulator    Matrix Size (m-n-k)
__half      __half      float          16x16x16
__half      __half      float          32x8x16
__half      __half      float          8x32x16
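The "incomplete type" error does not say which shape was rejected, so it can help to check the shape yourself before the fragments are declared. The following is a minimal sketch that only encodes the three rows of the table above. The helper name is_supported_half_float_shape is made up for this illustration and is not part of the WMMA API; the code is plain host-side C++ and should also compile unchanged inside a .cu file.

```cpp
// Hypothetical helper, not part of nvcuda::wmma: returns true when (m, n, k)
// is one of the three shapes listed above for __half inputs with a float
// accumulator.
constexpr bool is_supported_half_float_shape(int m, int n, int k) {
    return (m == 16 && n == 16 && k == 16) ||
           (m == 32 && n == 8 && k == 16) ||
           (m == 8 && n == 32 && k == 16);
}

constexpr int M = 16;
constexpr int N = 16;
constexpr int K = 16;

// Fails at compile time with a readable message if the shape is not in the table.
static_assert(is_supported_half_float_shape(M, N, K),
              "unsupported WMMA m-n-k shape for half inputs with a float accumulator");
```

With M = N = K = 8, for example, the static_assert fires with the message above instead of the less obvious fragment errors.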