How does printf work on CUDA compute capability >= 2?

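In the early days printf was not supported, so we either ran the CUDA program under the emulator or copied variables back and forth and printed them on the host side.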
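For reference, the copy-back workaround could look roughly like the sketch below. It is only an illustration: the kernel `computeSquares`, the helper `runSquares`, and the buffer names are hypothetical and not taken from any real code base.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the value it would like to "print".
__global__ void computeSquares(int *debugOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        debugOut[i] = i * i;  // write to device memory instead of printing
    }
}

// Runs the kernel and copies the per-thread values into hostOut.
// Returns 0 on success, 1 on any CUDA error.
int runSquares(int *hostOut, int n)
{
    int *devOut = nullptr;
    if (cudaMalloc(&devOut, n * sizeof(int)) != cudaSuccess) return 1;

    computeSquares<<<1, n>>>(devOut, n);

    // A blocking cudaMemcpy on the default stream waits for the kernel first.
    cudaError_t err = cudaMemcpy(hostOut, devOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    const int n = 8;
    int hostOut[n];
    if (runSquares(hostOut, n) != 0) return 1;

    // All printing happens on the host.
    for (int i = 0; i < n; ++i) {
        printf("thread %d: %d\n", i, hostOut[i]);
    }
    return 0;
}
```

The obvious downside is that every value to be inspected needs its own slot in a buffer that is sized and copied by hand.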

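Now that CUDA (arch 2 and higher) supports printf, I am curious how this works. I mean, how do GPU printfs end up on the screen internally? And what was the limiting factor on compute capability 1?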

From the CUDA C Programming Guide:

printf prints formatted output from a kernel to a host-side output stream.

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

...
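As a rough illustration of the buffering described in the quote, here is a minimal sketch. It assumes that an explicit synchronization such as `cudaDeviceSynchronize()` is among the flush points (the list itself is elided in the quote above); the names `helloKernel` and `launchAndFlush` are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device-side printf needs compute capability 2.0 or higher.
__global__ void helloKernel()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Launches the kernel, then synchronizes so the buffered output is flushed.
// Returns the status reported by the synchronization.
cudaError_t launchAndFlush()
{
    helloKernel<<<2, 4>>>();

    // The launch is asynchronous: nothing has necessarily reached stdout yet.
    // Assumption: synchronizing here is one of the actions that flushes
    // the device-side printf buffer to the host output stream.
    return cudaDeviceSynchronize();
}

int main()
{
    cudaError_t err = launchAndFlush();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The order in which the eight lines appear should not be relied on, since the guide gives no ordering guarantee between threads.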

Internally printf() uses a shared data structure and so it is possible that calling printf() might change the order of execution of threads. In particular, a thread which calls printf() might take a longer execution path than one which does not call printf(), and that path length is dependent upon the parameters of the printf(). Note, however, that CUDA makes no guarantees of thread execution order except at explicit __syncthreads() barriers, so it is impossible to tell whether execution order has been modified by printf() or by other scheduling behaviour in the hardware.

The following API functions get and set the size of the buffer used to transfer the printf() arguments and internal metadata to the host (default is 1 megabyte):

  • cudaDeviceGetLimit(size_t* size, cudaLimitPrintfFifoSize)
  • cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t size)
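
A short sketch of how these two calls might be used together. The helper name `getPrintfFifoSize` and the 8 MB figure are arbitrary choices for illustration, and the runtime may adjust the requested size, so the value read back is not assumed to match exactly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the current printf FIFO size in bytes, or 0 if the query fails.
size_t getPrintfFifoSize()
{
    size_t size = 0;
    if (cudaDeviceGetLimit(&size, cudaLimitPrintfFifoSize) != cudaSuccess) {
        return 0;
    }
    return size;
}

int main()
{
    printf("printf FIFO before: %zu bytes\n", getPrintfFifoSize());

    // Request a larger buffer (8 MB is an arbitrary example value).
    // This is done before launching any kernel that calls printf.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize,
                                         8u * 1024u * 1024u);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("printf FIFO after:  %zu bytes\n", getPrintfFifoSize());
    return 0;
}
```

Enlarging the buffer is mainly useful when a kernel prints so much that older output would otherwise be overwritten before the next flush.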