Shall I pool CUDA streams?

How lightweight are creating and destroying CUDA streams? For CPU threads, for example, these operations are heavyweight, which is why threads are usually pooled. Should I pool CUDA streams as well, or is it fast enough to create a stream whenever I need one and destroy it afterwards?

Whether stream creation is fast may not matter much: creating streams once and reusing them will always be faster than repeatedly creating and destroying them.

Whether amortizing that latency actually matters depends on your application.
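
As a minimal sketch of the create-once, reuse-many-times pattern in plain CUDA (not PyTorch) -- the kernel, the pool size of 4, and the iteration count are arbitrary placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kNumStreams = 4;            // pool size is arbitrary here
    cudaStream_t streams[kNumStreams];

    // Pay the creation cost once, up front.
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamCreate(&streams[i]);

    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Reuse the same streams across many iterations instead of
    // creating and destroying a stream per launch.
    for (int iter = 0; iter < 100; ++iter) {
        cudaStream_t s = streams[iter % kNumStreams];
        dummyKernel<<<(n + 255) / 256, 256, 0, s>>>(d_data, n);
    }
    cudaDeviceSynchronize();

    // Pay the destruction cost once, at shutdown.
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```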

NVIDIA's guidance is that you should pool CUDA streams. Below is a comment from https://github.com/pytorch/pytorch/issues/9646:

There is a cost to creating, retaining, and destroying CUDA streams in PyTorch master. In particular:

  • Tracking CUDA streams requires atomic refcounting
  • Destroying a CUDA stream can (rarely) cause implicit device synchronization
The refcounting issue has been raised as a concern for expanding stream tracing to allow streaming backwards, for example, and it's clearly best to avoid implicit device synchronization as it causes an often unexpected performance degradation.

For static frameworks the recommended best practice is to create all the needed streams upfront and destroy them after the work is done. This pattern is not immediately applicable to PyTorch, but a per device stream pool would achieve a similar effect.
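
To make the "per device stream pool" idea concrete, here is a rough C++ sketch of such a pool -- not PyTorch's actual implementation; the pool size of 8 and the round-robin handout policy are assumptions. Streams are created once per device and handed out on request, so callers never create or destroy streams on the hot path:

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <vector>

class StreamPool {
public:
    explicit StreamPool(int device, int poolSize = 8)
        : device_(device), streams_(poolSize), next_(0) {
        int prev;
        cudaGetDevice(&prev);
        cudaSetDevice(device_);
        // Creation cost is paid once, when the pool is built.
        for (auto &s : streams_) cudaStreamCreate(&s);
        cudaSetDevice(prev);
    }

    ~StreamPool() {
        // Destruction cost is paid once, when the pool is torn down.
        for (auto &s : streams_) cudaStreamDestroy(s);
    }

    // Round-robin handout; callers share streams and never destroy them.
    cudaStream_t get() {
        return streams_[next_.fetch_add(1) % streams_.size()];
    }

private:
    int device_;
    std::vector<cudaStream_t> streams_;
    std::atomic<size_t> next_;
};
```

A caller would keep one pool per device (e.g. `StreamPool pool(0);`), fetch a stream with `pool.get()`, and launch work on it, never touching `cudaStreamCreate`/`cudaStreamDestroy` in the steady state.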