Is cuDevicePrimaryCtxRetain() used for having persistent CUDA context objects between multiple processes?

Question

Using only the driver API, I profiled a single process that calls cuCtxCreate. The cuCtxCreate overhead is nearly comparable to a 300 MB data copy to/from the GPU:

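In the CUDA documentation here, it says (for cuDevicePrimaryCtxRetain) "Retains the primary context on the device, creating it **if necessary**". Is this the expected behavior when the same process is invoked repeatedly from the command line (for example, running a process 1000 times to explicitly process 1000 different input images)? Does the device need CU_COMPUTEMODE_EXCLUSIVE_PROCESS for this to work as intended (re-using the same context across multiple invocations)?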

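For now, the profile above looks the same even if I invoke the process multiple times. Even without the profiler, timings show a completion time of around 1 second.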

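Edit: According to the documentation, the primary context is one per device per process. Does this mean there won't be a problem when using multiple threaded single application?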

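What is the time limit for re-use of the primary context? Is 1 second between processes OK, or does the gap have to be a few milliseconds to keep the primary context alive?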

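I am already caching the PTX code in a file, so the only remaining overhead appears to be cuMemAlloc(), malloc() and cuMemHostRegister(). Re-using the context left over from the previous invocation of the same process would therefore improve the timings nicely.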

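Edit-2: The documentation for cuDevicePrimaryCtxRetain says "The caller must call cuDevicePrimaryCtxRelease() when done using the context." Is caller here any process? Can I just use retain in first called process and use release on the last called process in a list of hundreds of sequentially called processes? Does the system need a reset if the last process could not be launched and cuDevicePrimaryCtxRelease was never called?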

Edit-3:

Is primary context intended for this?

process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to decrease counter and unload at last)

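Everything is compiled for sm_30 and the device is a Grid K520.
The GPU was at its boost frequency during cuCtxCreate().
The project was a 64-bit release-mode build compiled on Windows Server 2016, with the CUDA driver installed in Windows 7 compatibility mode (this was the only way to get the K520 working with Windows Server 2016).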

Answer 1

tl;dr: No, it is not.

Is cuDevicePrimaryCtxRetain() used for having persistent CUDA context objects between multiple processes?

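No. It is intended to allow the driver API to bind to a context which a library using the runtime API has already lazily created. Nothing more than that. Once upon a time it was necessary to create contexts with the driver API and then have the runtime bind to them. Now, with these APIs, you don't have to do that. You can, for example, see how this is done in Tensorflow here.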
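As a rough illustration of that binding pattern inside a single process, here is a minimal sketch using only driver API calls. It assumes device ordinal 0 and an arbitrary 1 MiB allocation, and the `check` helper is a hypothetical convenience, not part of CUDA. It is not taken from the Tensorflow code mentioned above.

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: abort with a readable message if a driver call fails. */
static void check(CUresult res, const char *what)
{
    if (res != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(res, &msg);
        fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr buf;

    check(cuInit(0), "cuInit");
    check(cuDeviceGet(&dev, 0), "cuDeviceGet"); /* assumption: device 0 */

    /* Take a reference on this process's primary context for the device.
       If the runtime API already created it, the same context is returned;
       otherwise it is created here. */
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");

    /* Retaining does not make the context current, so bind it to this thread. */
    check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    /* Driver API calls on this thread now operate in the primary context. */
    check(cuMemAlloc(&buf, 1 << 20), "cuMemAlloc"); /* arbitrary 1 MiB */
    check(cuMemFree(buf), "cuMemFree");

    /* Unbind the context and drop the reference taken above. */
    check(cuCtxSetCurrent(NULL), "cuCtxSetCurrent(NULL)");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

Building this requires the CUDA toolkit headers and linking against the driver library (for example `-lcuda` on Linux or `cuda.lib` on Windows). The retain and release calls are paired within the same process.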

Does this mean there won't be a problem when using multiple threaded single application?

The driver API has been fully thread safe since about CUDA 2.0.

Is caller here any process? Can I just use retain in first called process and use release on the last called process in a list of hundreds of sequentially called processes?

No. Contexts are always unique to a given process. They can't be shared between processes in this way.
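One way to observe this per-process behavior is to query the primary context state with cuDevicePrimaryCtxGetState() at different points in a run. The sketch below assumes device ordinal 0; `check` and `report` are hypothetical helpers. If contexts are per-process as described, each fresh invocation would be expected to report the primary context as inactive at start, no matter how recently a previous invocation ran or whether it called release.

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: abort with a readable message if a driver call fails. */
static void check(CUresult res, const char *what)
{
    if (res != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(res, &msg);
        fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
        exit(EXIT_FAILURE);
    }
}

/* Hypothetical helper: print whether the primary context is active
   in the calling process. */
static void report(CUdevice dev, const char *when)
{
    unsigned int flags = 0;
    int active = 0;
    check(cuDevicePrimaryCtxGetState(dev, &flags, &active),
          "cuDevicePrimaryCtxGetState");
    printf("%-20s primary context active: %d\n", when, active);
}

int main(void)
{
    CUdevice dev;
    CUcontext first, second;

    check(cuInit(0), "cuInit");
    check(cuDeviceGet(&dev, 0), "cuDeviceGet"); /* assumption: device 0 */

    report(dev, "at process start:");

    /* Two retains in the same process take two references on one context. */
    check(cuDevicePrimaryCtxRetain(&first, dev), "cuDevicePrimaryCtxRetain");
    check(cuDevicePrimaryCtxRetain(&second, dev), "cuDevicePrimaryCtxRetain");
    printf("same handle from both retains: %s\n",
           first == second ? "yes" : "no");
    report(dev, "after two retains:");

    /* Each retain is matched by one release in this same process. */
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    report(dev, "after two releases:");
    return 0;
}
```

Running the program several times in a row and comparing the first line of output from each run is a simple check of whether any context state survives between invocations.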

Is primary context intended for this?

process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to decrease counter and unload at last)

No.
