CUDA分块矩阵乘法讲解

Question

我正在尝试了解来自 CUDA SDK 8.0 的 this sample 代码是如何工作的：

template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;

// Index of the last sub-matrix of A processed by the block
int aEnd   = aBegin + wA - 1;

// Step size used to iterate through the sub-matrices of A
int aStep  = BLOCK_SIZE;

// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;

// Step size used to iterate through the sub-matrices of B
int bStep  = BLOCK_SIZE * wB;

....
....

内核的这一部分对我来说相当棘手。我知道矩阵 A 和 B 表示为数组 (*float)，而且我还知道使用共享内存来计算点积的概念，这要归功于共享内存块。

我的问题是我不理解代码的开头，尤其是 3 个特定变量（aBegin、aEnd 和 bBegin）。有人可以给我一个可能执行的示例图，以帮助我了解索引在这种特定情况下的工作方式吗？谢谢

Answer 1

这是一张图，用于了解为 CUDA 内核的第一个变量设置的值以及执行的整体计算：

矩阵使用行优先顺序存储。 CUDA 代码假设矩阵大小可以除以 BLOCK_SIZE.

矩阵 A、B 和 C 实际上根据内核 CUDA 网格分成块。 C 的所有块都可以并行计算。对于给定的 C 深灰色块，主循环遍历 A 和 B 的几个浅灰色块（步调一致）。每个块使用 BLOCK_SIZE * BLOCK_SIZE 个线程并行计算。

bx 和 by 是当前块在 CUDA 网格中基于块的位置。 tx 和 ty 是由当前线程在 CUDA 网格的当前计算块中计算的单元格的基于单元格的位置。

下面是对aBegin变量的详细分析： aBegin 指的是矩阵 A 的 第一个计算块 的第一个单元格的内存位置。它设置为 wA * BLOCK_SIZE * by，因为每个块包含 BLOCK_SIZE * BLOCK_SIZE 个单元格，水平方向有 wA / BLOCK_SIZE 个块，A 当前计算块上方有 by 个块。因此，(BLOCK_SIZE * BLOCK_SIZE) * (wA / BLOCK_SIZE) * by = BLOCK_SIZE * wA * by.

同样的逻辑适用于 bBegin：它设置为 BLOCK_SIZE * bx，因为在矩阵 B.

的第一个计算块的第一个单元格之前，内存中有 bx 个大小为 BLOCK_SIZE 的块

a 在循环中递增 aStep = BLOCK_SIZE，因此下一个计算块是 A 当前计算块右侧（在图上）的下一个。 b 在同一循环中递增 bStep = BLOCK_SIZE * wB，因此下一个计算块是 B.[=44= 当前计算块底部（在图上）的后续块]

CUDA分块矩阵乘法讲解

CUDA tiled matrix multiplication explanation

parallel-processing

cuda

nvidia

gpu-shared-memory