Cuda Tensor Cores: Matrix size only 16x16

Question

I have this very simple code to multiply two matrices with Cuda Tensor Cores:

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int M = 16;
constexpr int N = 16;
constexpr int K = 16;

/*
 *  WMMA convention: A = M x K, B = K x N, C = M x N => OUT = M x N
 */
__global__ void wmma_matrix_mult(half *a, half *b, float *out) {

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs (the last argument is the leading dimension)
   wmma::load_matrix_sync(a_frag, a, M);   // column-major A: M elements per column
   wmma::load_matrix_sync(b_frag, b, N);   // row-major B: N elements per row

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output (row-major, N elements per row)
   wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}

As soon as M, N and K are anything other than 16, compilation fails with:

error: incomplete type is not allowed

error: no instance of function template "nvcuda::wmma::fill_fragment" matches the argument list
            argument types are: (<error-type>, float)

Does this mean that A and B always have to be 16x16? I thought 4x4 or 8x8 would be allowed as well?

I compile it like this:

nvcc -arch=sm_75 -c ./src/main.cu -o ./src/build/main.o

So the architecture should be fine.

Answer 1

I thought 4x4 or 8x8 would be allowed as well?

Unfortunately not. Let's read some documentation.

For half-precision inputs with a single-precision accumulator, as in your use case, only the following sizes are supported:

Matrix A    Matrix B    Accumulator    Matrix Size (m-n-k)
__half      __half      float          16x16x16
__half      __half      float          32x8x16
__half      __half      float          8x32x16
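
To make the table concrete, here is a minimal sketch that uses the 32x8x16 shape instead of 16x16x16. It assumes the WMMA convention that A is M x K, B is K x N and the accumulator is M x N, and that the kernel is launched with exactly one warp. The kernel name wmma_32x8x16 and the host helper max_abs_error are hypothetical names chosen for this illustration, and CUDA error checking is left out for brevity.

```cuda
#include <cmath>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One of the supported half/float shapes from the table above.
constexpr int M = 32;
constexpr int N = 8;
constexpr int K = 16;

// A is M x K (column-major), B is K x N (row-major), OUT is M x N (row-major).
// Must be launched with exactly one full warp (32 threads).
__global__ void wmma_32x8x16(const half *a, const half *b, float *out) {
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, M);  // leading dimension of column-major A
    wmma::load_matrix_sync(b_frag, b, N);  // leading dimension of row-major B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}

// Hypothetical host helper: runs the kernel on small integer inputs and
// returns the largest absolute difference from a CPU reference result.
float max_abs_error() {
    std::vector<float> fa(M * K), fb(K * N), hout(M * N, 0.0f);
    std::vector<half> ha(M * K), hb(K * N);
    // Small integers are exactly representable in half precision.
    for (int i = 0; i < M * K; ++i) { fa[i] = float(i % 3); ha[i] = __float2half(fa[i]); }
    for (int i = 0; i < K * N; ++i) { fb[i] = float(i % 5); hb[i] = __float2half(fb[i]); }

    half *da = nullptr, *db = nullptr;
    float *dout = nullptr;
    cudaMalloc(&da, ha.size() * sizeof(half));
    cudaMalloc(&db, hb.size() * sizeof(half));
    cudaMalloc(&dout, hout.size() * sizeof(float));
    cudaMemcpy(da, ha.data(), ha.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), hb.size() * sizeof(half), cudaMemcpyHostToDevice);

    wmma_32x8x16<<<1, 32>>>(da, db, dout);  // one block containing one warp
    cudaMemcpy(hout.data(), dout, hout.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);

    float err = 0.0f;
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) {
                // A(m, k) is stored at k * M + m, B(k, n) at k * N + n.
                ref += fa[k * M + m] * fb[k * N + n];
            }
            err = std::fmax(err, std::fabs(ref - hout[m * N + n]));
        }
    }
    return err;
}
```

This should build with the same nvcc -arch=sm_75 invocation as in the question. Changing M, N and K to a combination that is not in the table (for example 8x8x8) should reproduce the "incomplete type is not allowed" error, since the fragment types exist only for the listed shapes. A 4x4 or 8x8 problem therefore has no direct mapping; one possible workaround, which is not part of the original answer, is to zero-pad the inputs up to a supported tile.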
