Why does printf() work within a kernel, but using std::cout doesn't?

Question

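I've been exploring parallel programming and have written basic kernels in CUDA and SYCL. I ran into a situation where I had to print from inside a kernel, and I noticed that std::cout does not work inside the kernel whereas printf does. For example, consider the following SYCL code, which works: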

void print(float*A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler){
       auto accessor = Buffer.get_access<access::mode::read>(Handler);
       Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1>idx){
           printf("%f", accessor[idx[0]]);
       });
    });
}

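Whereas if I replace the printf with std::cout<<accessor[idx[0]], it raises a compile-time error: Accessing non-const global variable is not allowed within SYCL device code. A similar thing happens with CUDA kernels. This got me thinking that what may be the difference between printf and std::cout which causes such behavior.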

Also, if I wanted to implement a custom print function to be called from the GPU, how should I do it?
TIA

Answer 1

There is no __device__ version of std::cout, so printf is the only option in device code.

Answer 2

This got me thinking that what may be the difference between printf and std::cout which causes such behavior.

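Yes, there is a difference. The printf() that runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (whose code is closed, if it exists in CUDA C at all). That function uses a hardware mechanism on NVIDIA GPUs: a buffer that kernel threads print into, which gets sent back to the host side, where the CUDA driver forwards it to the standard output file descriptor of the process that launched the kernel.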
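To make the buffer idea concrete, here is a rough host-only model in plain C++. It is only a sketch of the general pattern described above (threads append to a shared buffer, the host drains it later) and makes no claim about NVIDIA's real buffer layout or driver behavior. The names PrintBuffer, device_print and host_flush are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical model of a device-side print buffer. NOT NVIDIA's actual layout.
struct PrintBuffer {
    static constexpr std::size_t kCapacity = 256;
    char data[kCapacity];
    std::atomic<std::size_t> used{0};  // number of bytes reserved so far
};

// "Device side": a thread atomically reserves space, then copies its bytes in.
// Returns false (and writes nothing) when the message does not fit.
inline bool device_print(PrintBuffer& buf, const char* msg) {
    const std::size_t n = std::strlen(msg);
    std::size_t start = buf.used.load();
    do {
        if (start + n > PrintBuffer::kCapacity) return false;  // full: drop the message
    } while (!buf.used.compare_exchange_weak(start, start + n));
    std::memcpy(buf.data + start, msg, n);
    return true;
}

// "Host side": after the kernel is done, forward the bytes to stdout and reset.
// Returns the number of bytes that were forwarded.
inline std::size_t host_flush(PrintBuffer& buf) {
    const std::size_t n = buf.used.exchange(0);
    std::fwrite(buf.data, 1, n, stdout);
    return n;
}
```

In this model nothing reaches stdout until host_flush() runs, and messages that do not fit in the fixed-size buffer are dropped.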

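std::cout does not get this sort of compiler-assisted replacement/hijacking, and its code is simply irrelevant on the GPU.

However, I have implemented a std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links.

This means I've had to answer your second question for myself: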

If I wanted to implement a custom print function to be called from the GPU, how should I do it?

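Unless you have access to undisclosed NVIDIA internals, the only way is to build on printf() calls, in place of the C standard library or system calls you would use on the host side. You basically need to layer your entire stream over that one low-level I/O primitive. It's far from trivial.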

Answer 3

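In SYCL you cannot use std::cout for output from code that is not running on the host, for reasons similar to those listed in the answer for CUDA code.

This means that if you are running kernel code on the "device" (e.g. a GPU), you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.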

printf

cuda

cout

sycl