将矩阵复制到主机所花费的时间随着矩阵使用次数的增加而增加

Question

我正在使用 PyCUDA、CUDAMat 和 Numba 和运行对 GPU 矩阵乘法进行基准测试，将其转化为一些我无法解释的行为。
我独立计算了 3 个不同步骤所花费的时间 - 将 2 个矩阵发送到设备内存，计算点积，并将结果复制回主机内存。
点积步骤的基准测试是在一个循环中完成的，因为我的应用程序将在发回结果之前进行多次乘法运算。

随着我增加循环次数，点积时间如预期的那样线性增加。但我无法理解的部分是，将最终结果发送回主机内存所需的时间也随着循环计数线性增加，即使它只是将一个矩阵复制回主机内存。无论您执行多少次矩阵乘法循环，结果的大小都是恒定的，因此这是没有意义的。它的行为就好像返回最终结果需要返回循环中每一步的所有中间结果。

需要注意的一些有趣的事情是，它所花费的时间增加有一个峰值。当我在一个循环中超过 ~1000 个点积时，复制最终结果所需的时间达到峰值。另一件事是，如果在点积循环内我重新初始化保存结果的矩阵，则此行为停止，并且无论完成多少次乘法，复制回时间都是相同的。
例如 -

for i in range(1000):
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
result = gc.get()

最后要注意的是 PyCUDA 和 Numba 都会发生这种情况，但 CUDAMat 不会发生这种情况。我可以做一百万次乘法，检索最终结果仍然需要相同的时间。 CUDAMat 有一个内置的矩阵乘法，这可能是原因，但对于 PyCUDA 和 Numba，我使用的是他们自己的文档中提供的矩阵乘法代码。

这是我的 PyCUDA 代码

from __future__ import division
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time
import pycuda.autoinit

kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{

  const int wA = %(MATRIX_SIZE)s;
  const int wB = %(MATRIX_SIZE)s;  

  // Block index
  const int bx = blockIdx.x;
  const int by = blockIdx.y;

  // Thread index
  const int tx = threadIdx.x;
  const int ty = threadIdx.y;

  // Index of the first sub-matrix of A processed by the block
  const int aBegin = wA * %(BLOCK_SIZE)s * by;
  // Index of the last sub-matrix of A processed by the block
  const int aEnd = aBegin + wA - 1;
  // Step size used to iterate through the sub-matrices of A
  const int aStep = %(BLOCK_SIZE)s;

  // Index of the first sub-matrix of B processed by the block
  const int bBegin = %(BLOCK_SIZE)s * bx;
  // Step size used to iterate through the sub-matrices of B
  const int bStep = %(BLOCK_SIZE)s * wB;

  // The element of the block sub-matrix that is computed
  // by the thread
  float Csub = 0;
  // Loop over all the sub-matrices of A and B required to
  // compute the block sub-matrix
  for (int a = aBegin, b = bBegin;
       a <= aEnd;
       a += aStep, b += bStep) 
    {
      // Shared memory for the sub-matrix of A
      __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
      // Shared memory for the sub-matrix of B
      __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

      // Load the matrices from global memory to shared memory
      // each thread loads one element of each matrix
      As[ty][tx] = A[a + wA * ty + tx];
      Bs[ty][tx] = B[b + wB * ty + tx];
      // Synchronize to make sure the matrices are loaded
      __syncthreads();

      // Multiply the two matrices together;
      // each thread computes one element
      // of the block sub-matrix
      for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
        Csub += As[ty][k] * Bs[k][tx];

      // Synchronize to make sure that the preceding
      // computation is done before loading two new
      // sub-matrices of A and B in the next iteration
      __syncthreads();
    }

  // Write the block sub-matrix to global memory;
  // each thread writes one element
  const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
  C[c + wB * ty + tx] = Csub;
}
"""


MATRIX_SIZE = 512
TILE_SIZE = 8
BLOCK_SIZE = TILE_SIZE
np.random.seed(100)
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE,
    'BLOCK_SIZE': BLOCK_SIZE,
}
mod = compiler.SourceModule(kernel_code)
matrixmul = mod.get_function("MatrixMulKernel")


#copy to device memory
total = time.clock()
ga = gpuarray.to_gpu(a_cpu)
gb = gpuarray.to_gpu(b_cpu)
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
copy_to = time.clock() - total

#matrix multiplication
mult = time.clock()
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
mult = time.clock() - mult

#copy result back to host memory
copy_from = time.clock()
res = gc.get()
copy_from = time.clock() - copy_from
total = time.clock() - total

#print out times for all 3 steps and the total time taken
print(copy_to)
print(mult)
print(copy_from)
print(total)

Answer 1

GPU 内核启动是异步。这意味着您认为您在 for 循环周围捕获的测量值（执行乘法所花费的时间）并非如此。这只是将内核启动发布到队列中所花费的时间。

实际内核执行时间正在 "absorbed" 进入您对设备-> 主机复制时间的最终测量（因为 D->H 复制强制所有内核在它开始之前完成，并且它阻止了CPU 线程）。

关于 "peak" 行为，当您将足够多的内核启动到队列中时，它最终会停止异步并开始阻塞 CPU 线程，因此您的 "execution time" 测量值开始上升.这解释了变化的峰值行为。

为了 "fix" 这个，如果你在你的 for 循环之后和这一行之前立即插入一个 pycuda driver.Context.synchronize()：

mult = time.clock() - mult

你会看到你的执行时间随着for循环的增加而增加，而你的D->H复制时间将保持不变。

将矩阵复制到主机所花费的时间随着矩阵使用次数的增加而增加

Time taken to copy matrix to host increases by how many times the matrix was used

python

benchmarking

cuda

pycuda