CUDA cudaMemcpyAsync using single stream to host

Question

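I have a kernel that writes its results to two output arguments (dev_out_1 and dev_out_2), launched in a single stream. I want to copy the data back from the device to the host in parallel. My requirement is to use a single stream and still copy both results back to the host in parallel.

How would you handle this kind of problem?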

// The kernel launch and both copies are issued to the same (default) stream, 0.
SomeCudaCall<<<25, 34>>>(input, dev_out_1, dev_out_2);
cudaMemcpyAsync(toHere_1, dev_out_1, sizeof(int), cudaMemcpyDeviceToHost, 0);
cudaMemcpyAsync(toHere_2, dev_out_2, sizeof(int), cudaMemcpyDeviceToHost, 0);

Answer 1

I wanted to copy back the data from the device to host in parallel

That isn't possible.

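NVIDIA GPUs can only use one DMA engine for device-to-host transfers (even where the GPU has multiple DMA engines), and that DMA engine can only perform one transfer at a time. So it is impossible to have "parallel" copies in the same direction over the PCI Express bus. In addition, operations issued to a single stream execute in issue order, so the two copies in your code would run one after the other in any case.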
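For reference, here is a minimal sketch of what the single-stream version looks like in practice. It is an illustration under assumptions, not the asker's actual code: some_cuda_call is a hypothetical stand-in for SomeCudaCall, run_single_stream is a helper name made up for this example, and the host buffer is allocated with cudaMallocHost because cudaMemcpyAsync is only reliably asynchronous with respect to the host when the host memory is page-locked. The two copies still execute one after the other. The only overlap you get is between the GPU work and whatever the CPU does before the synchronize.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's SomeCudaCall: a single thread
// writes one int to each of the two output locations.
__global__ void some_cuda_call(int input, int *dev_out_1, int *dev_out_2)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *dev_out_1 = input + 1;
        *dev_out_2 = input + 2;
    }
}

// Hypothetical helper: runs the kernel and both device-to-host copies in ONE
// stream. Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t run_single_stream(int input, int *result_1, int *result_2)
{
    int *dev_out_1 = nullptr;
    int *dev_out_2 = nullptr;
    int *host_pinned = nullptr;   // page-locked staging buffer for two ints
    cudaStream_t stream = nullptr;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dev_out_1, sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void **)&dev_out_2, sizeof(int));
    if (err == cudaSuccess) err = cudaMallocHost((void **)&host_pinned, 2 * sizeof(int));

    if (err == cudaSuccess) {
        // Same launch configuration as in the question, but on an explicit stream.
        some_cuda_call<<<25, 34, 0, stream>>>(input, dev_out_1, dev_out_2);
        err = cudaGetLastError();
    }

    // Both copies are queued behind the kernel in the same stream. They run
    // in issue order: first dev_out_1, then dev_out_2, never simultaneously.
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(&host_pinned[0], dev_out_1, sizeof(int),
                              cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(&host_pinned[1], dev_out_2, sizeof(int),
                              cudaMemcpyDeviceToHost, stream);

    // The host is free to do unrelated work here; the results are only
    // valid after the stream has been synchronized.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    if (err == cudaSuccess) {
        *result_1 = host_pinned[0];
        *result_2 = host_pinned[1];
    }

    // Release resources regardless of success or failure.
    if (host_pinned) cudaFreeHost(host_pinned);
    cudaFree(dev_out_1);
    cudaFree(dev_out_2);
    if (stream) cudaStreamDestroy(stream);
    return err;
}
```

Calling run_single_stream(10, &a, &b) should leave 11 in a and 12 in b. If the real goal is to reduce transfer overhead rather than to get true parallelism, one option is to have the kernel write both results into a single contiguous device buffer, so that one cudaMemcpyAsync call brings everything back.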


cuda

cuda-streams