CUDA Tensor Cores: What is the effect of NumBlocks and ThreadsPerBlock?

Question

I am wondering what effect NumBlocks and ThreadsPerBlock have on this simple matrix multiplication routine:

__global__ void wmma_matrix_mult(half *a, half *b, half *out) {

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, N);
   wmma::load_matrix_sync(b_frag, b, N);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}

Calling with:

`wmma_matrix_mult<<<1, 1>>>`: Incorrect
`wmma_matrix_mult<<<1, 2>>>`: Incorrect
`wmma_matrix_mult<<<1, 4>>>`: Incorrect
`wmma_matrix_mult<<<1, 8>>>`: Incorrect
`wmma_matrix_mult<<<1, 16>>>`: Incorrect
`wmma_matrix_mult<<<1, 32>>>`: Correct

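Why does the number of threads per block even matter if every thread is doing the same execution? As you can see, I am not doing anything with threadIdx.x in the kernel.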

Answer 1

Tensor core operations happen at the warp level. The w in wmma signifies that. Referring to the documentation:

This requires co-operation from all threads in a warp.

Each tensor core unit can accept one matrix multiply operation (i.e. wmma::mma_sync) per clock cycle from a warp.

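This means that a full warp (32 threads) must be available and participating for the operation to make sense (i.e. to be legal). All wmma:: operations are collective operations: the entire warp is expected to execute them, and that is required for correct usage.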
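To illustrate the full-warp requirement, here is a minimal sketch of a complete program that launches the question's kernel with exactly one warp and compares the result against a CPU reference. The tile shape `M = N = K = 16` is an assumption (the question does not show these constants), `single_warp_matches_cpu` is a hypothetical helper name, and the build is assumed to target a tensor core GPU (for example `nvcc -arch=sm_70` or newer).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Assumption: a single 16x16x16 tile; the question does not show M, N, K.
constexpr int M = 16, N = 16, K = 16;

__global__ void wmma_matrix_mult(const half *a, const half *b, half *out) {
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));
    wmma::load_matrix_sync(a_frag, a, K);  // row-major A is M x K, so its stride is K
    wmma::load_matrix_sync(b_frag, b, N);  // row-major B is K x N, so its stride is N
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}

// Hypothetical helper: launches exactly one warp and checks against a CPU reference.
bool single_warp_matches_cpu() {
    half h_a[M * K], h_b[K * N], h_out[M * N];
    // Small integers keep every product and partial sum exactly representable in half.
    for (int i = 0; i < M * K; ++i) h_a[i] = __float2half(float(i % 4));
    for (int i = 0; i < K * N; ++i) h_b[i] = __float2half(float(i % 3));

    half *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    wmma_matrix_mult<<<1, 32>>>(d_a, d_b, d_out);  // 32 threads = one full warp
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int r = 0; r < M && ok; ++r)
        for (int c = 0; c < N && ok; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k)
                ref += float((r * K + k) % 4) * float((k * N + c) % 3);
            ok = (__half2float(h_out[r * N + c]) == ref);
        }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return ok;
}

int main() {
    printf("%s\n", single_warp_matches_cpu() ? "match" : "mismatch");
    return 0;
}
```

The sketch deliberately launches only the 32-thread configuration. Block sizes below 32 leave the warp incomplete, which breaks the collective contract of the wmma:: calls, so their output should not be relied on.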

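If you have multiple warps participating (e.g. a threadblock size of 64, or 128, etc.), you are effectively asking for multiple operations to be done, just as in any other CUDA code.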
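As a rough sketch of what "multiple operations" can look like, the kernel below gives each warp in a block its own pair of input tiles, selected by a warp index derived from threadIdx.x. The batched data layout, the 16x16x16 tile shape, the float accumulator, and the names `wmma_per_warp` and `per_warp_tiles_match` are all assumptions made for illustration; they are not from the original question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int TILE = 16;                 // assumed 16x16x16 tile shape
constexpr int TILE_ELEMS = TILE * TILE;  // elements in one 16x16 tile

// Each warp multiplies its own pair of tiles; warp_id picks which pair.
__global__ void wmma_per_warp(const half *a, const half *b, float *out) {
    int warp_id = threadIdx.x / warpSize;  // identical for all 32 lanes of a warp

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a + warp_id * TILE_ELEMS, TILE);
    wmma::load_matrix_sync(b_frag, b + warp_id * TILE_ELEMS, TILE);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(out + warp_id * TILE_ELEMS, c_frag, TILE, wmma::mem_row_major);
}

// Hypothetical helper: one block with `warps` warps, one tile product per warp.
bool per_warp_tiles_match(int warps) {
    int total = warps * TILE_ELEMS;
    std::vector<half> h_a(total), h_b(total);
    std::vector<float> h_out(total, 0.0f);
    for (int w = 0; w < warps; ++w)
        for (int i = 0; i < TILE_ELEMS; ++i) {
            h_a[w * TILE_ELEMS + i] = __float2half(float(w + 1));  // tile w of A is all (w + 1)
            h_b[w * TILE_ELEMS + i] = __float2half(1.0f);          // every tile of B is all ones
        }

    half *d_a, *d_b;
    float *d_out;
    cudaMalloc(&d_a, total * sizeof(half));
    cudaMalloc(&d_b, total * sizeof(half));
    cudaMalloc(&d_out, total * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), total * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), total * sizeof(half), cudaMemcpyHostToDevice);

    wmma_per_warp<<<1, warps * 32>>>(d_a, d_b, d_out);  // e.g. 4 warps = 128 threads
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_out.data(), d_out, total * sizeof(float), cudaMemcpyDeviceToHost);

    // Each output element of tile w is a sum of 16 products (w + 1) * 1.
    for (int w = 0; w < warps && ok; ++w)
        for (int i = 0; i < TILE_ELEMS && ok; ++i)
            ok = (h_out[w * TILE_ELEMS + i] == float(TILE * (w + 1)));

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return ok;
}

int main() {
    printf("%s\n", per_warp_tiles_match(4) ? "match" : "mismatch");
    return 0;
}
```

Without the warp_id offset, as in the question's kernel, every warp in the block would redo the same tile and write the same output.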

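As with any other CUDA code, launching an operation with multiple blocks is just a way to scale up the work being done, and it is of course necessary if you want to engage the resources of a GPU that has multiple SMs. Since tensor core units are a per-SM resource, multiple blocks are needed to see a CUDA GPU deliver anything approaching its full rated throughput for tensor core ops.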
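One way to picture the scaling is a tiled matrix multiply in which every block holds a single warp and blockIdx selects the 16x16 output tile that warp produces. This is only a sketch under stated assumptions: row-major matrices whose dimensions are multiples of 16, a float accumulator, one warp per block (chosen for clarity, not speed), and hypothetical names `wmma_tiled_gemm` and `tiled_gemm_matches_cpu`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int TILE = 16;  // assumed 16x16x16 tile shape

// One warp per block. Assumes row-major A (m x k), B (k x n), C (m x n)
// with m, n, k all multiples of 16.
__global__ void wmma_tiled_gemm(const half *a, const half *b, float *c, int n, int k) {
    int tile_row = blockIdx.y;  // which band of 16 rows of C
    int tile_col = blockIdx.x;  // which band of 16 columns of C

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along the shared dimension, accumulating one 16x16x16 product per step.
    for (int kk = 0; kk < k; kk += TILE) {
        wmma::load_matrix_sync(a_frag, a + tile_row * TILE * k + kk, k);
        wmma::load_matrix_sync(b_frag, b + kk * n + tile_col * TILE, n);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(c + tile_row * TILE * n + tile_col * TILE, c_frag, n,
                            wmma::mem_row_major);
}

// Hypothetical helper: runs the tiled multiply and checks it against a CPU reference.
bool tiled_gemm_matches_cpu(int m, int n, int k) {
    std::vector<half> h_a(m * k), h_b(k * n);
    std::vector<float> h_c(m * n, 0.0f);
    // Small integers keep all products and sums exact in half and float.
    for (int i = 0; i < m * k; ++i) h_a[i] = __float2half(float(i % 3));
    for (int i = 0; i < k * n; ++i) h_b[i] = __float2half(float(i % 5));

    half *d_a, *d_b;
    float *d_c;
    cudaMalloc(&d_a, h_a.size() * sizeof(half));
    cudaMalloc(&d_b, h_b.size() * sizeof(half));
    cudaMalloc(&d_c, h_c.size() * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), h_a.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), h_b.size() * sizeof(half), cudaMemcpyHostToDevice);

    dim3 grid(n / TILE, m / TILE);                       // one block per output tile
    wmma_tiled_gemm<<<grid, 32>>>(d_a, d_b, d_c, n, k);  // each block is one full warp
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_c.data(), d_c, h_c.size() * sizeof(float), cudaMemcpyDeviceToHost);

    for (int r = 0; r < m && ok; ++r)
        for (int col = 0; col < n && ok; ++col) {
            float ref = 0.0f;
            for (int kk = 0; kk < k; ++kk)
                ref += float((r * k + kk) % 3) * float((kk * n + col) % 5);
            ok = (h_c[r * n + col] == ref);
        }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}

int main() {
    printf("%s\n", tiled_gemm_matches_cpu(64, 64, 64) ? "match" : "mismatch");
    return 0;
}
```

For a 64x64 output this launches a 4x4 grid, so 16 independent warps are available to be distributed across SMs. A tuned kernel would typically put several warps in each block and stage tiles through shared memory.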

Why does the number of threads per block even matter if every thread is doing the same execution?

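Every thread is not doing the same thing. The wmma:: collective ops hide code under the hood that specializes each thread's behavior according to the warp lane it belongs to. For example, the thread in warp lane 0 will select different elements of a fragment to be associated with (i.e. to load and store) than any thread in any other warp lane.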
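The per-lane specialization can be observed directly, since a fragment exposes its per-thread storage through `num_elements` and `x[]`. The sketch below loads a 16x16 float accumulator tile whose element values equal their row-major index, then has every lane report what landed in its `x[0]`. The element-to-lane mapping is unspecified by the API, so the printed indices are only an observation on a given GPU; `record_lane_elements` and `distinct_first_elements` are hypothetical names.

```cuda
#include <cstdio>
#include <mma.h>

using namespace nvcuda;

// Every lane records how many fragment elements it owns and the value of its first one.
__global__ void record_lane_elements(const float *src, float *first, int *count) {
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;
    wmma::load_matrix_sync(frag, src, 16, wmma::mem_row_major);

    int lane = threadIdx.x % warpSize;
    first[lane] = frag.x[0];
    count[lane] = frag.num_elements;
}

// Hypothetical helper: prints each lane's report and returns how many
// distinct x[0] values were seen across the 32 lanes.
int distinct_first_elements() {
    float h_src[256], h_first[32] = {0};
    int h_count[32] = {0};
    for (int i = 0; i < 256; ++i) h_src[i] = float(i);  // value = row-major index

    float *d_src, *d_first;
    int *d_count;
    cudaMalloc(&d_src, sizeof(h_src));
    cudaMalloc(&d_first, sizeof(h_first));
    cudaMalloc(&d_count, sizeof(h_count));
    cudaMemcpy(d_src, h_src, sizeof(h_src), cudaMemcpyHostToDevice);

    record_lane_elements<<<1, 32>>>(d_src, d_first, d_count);  // one full warp
    cudaDeviceSynchronize();
    cudaMemcpy(h_first, d_first, sizeof(h_first), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);

    int distinct = 0;
    for (int i = 0; i < 32; ++i) {
        printf("lane %2d owns %d elements, x[0] came from matrix index %d\n",
               i, h_count[i], int(h_first[i]));
        bool seen = false;
        for (int j = 0; j < i; ++j) seen = seen || (h_first[j] == h_first[i]);
        if (!seen) ++distinct;
    }

    cudaFree(d_src);
    cudaFree(d_first);
    cudaFree(d_count);
    return distinct;
}

int main() {
    printf("distinct x[0] values across lanes: %d\n", distinct_first_elements());
    return 0;
}
```

Although the kernel source never branches on the lane, different lanes end up holding different parts of the tile, which is why a partial warp cannot cover the whole matrix.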

