PyTorch C++ (LibTorch): using inter-op parallelism

I am developing a machine learning system using the C++ API of PyTorch (libtorch).

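Recently I have been looking into libtorch's performance, CPU utilization, and GPU usage. Through my research I learned that Torch uses two kinds of parallelism on CPUs:

inter-op parallelism
intra-op parallelism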

My main questions are:

what is the difference between the two
how can I utilize inter-op parallelism

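I know that I can set the number of threads used for intra-op parallelism with the torch::set_num_threads() function (which, as far as I understand, is implemented with the OpenMP backend). When I monitor my model's performance, I can clearly see that it uses the number of threads I specify with this function, and changing the number of intra-op threads makes a noticeable difference in performance.

There is also a function torch::set_num_interop_threads(), but no matter how many inter-op threads I specify, I never see any difference in performance.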

I have read the PyTorch documentation article on this topic, but it is still unclear to me how to utilize the inter-op thread pool.

The documentation says:

PyTorch uses a single thread pool for the inter-op parallelism, this thread pool is shared by all inference tasks that are forked within the application process.

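I have two questions about this part:

Do I need to create new threads myself to utilize the inter-op threads, or does Torch somehow do it for me internally?
If I need to create new threads myself, how do I do it in C++ so that the new thread comes from the inter-op thread pool?

In the Python example they use the fork function from the torch.jit module, but I cannot find anything similar in the C++ API.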


difference between these two

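As the figure shows:

intra-op: parallelization happens within a single operation (such as matmul or any other "per-tensor" operation)
inter-op: you have multiple operations whose computations can be interleaved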
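
To make the intra-op side concrete, here is a minimal standard-library sketch of one logical operation (a sum) whose work is split across several threads. The helper name parallel_sum and the chunking scheme are hypothetical and only illustrate the idea; this is not how PyTorch implements its kernels, and it does not use libtorch at all.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: a single operation (a sum) parallelized internally,
// which is the idea behind intra-op parallelism.
double parallel_sum(const std::vector<double>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    // Ceiling division so that every element belongs to exactly one chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        // Each worker reduces its own slice into its own slot, so no locking is needed.
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread partial results into the final answer.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

From the caller's point of view this is still one blocking call, e.g. parallel_sum(values, 4). That is why intra-op parallelism needs no changes to the model code.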

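inter-op "example":

op1 starts and returns a "future" object (an object we can query for the result once the operation has finished)
op2 starts immediately afterwards (because op1 is now non-blocking)
op2 finishes
we can query op1 for its result (hopefully it has already finished, or is at least close to finishing)
we add the op1 and op2 results together (or do whatever else we want with them)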
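
The steps above can be sketched with plain C++ futures. This is only an illustration of the pattern, under the assumption that op1 and op2 are independent; the bodies of op1 and op2 and the name run_both are made up for the example. Note that std::async starts its own thread and has nothing to do with PyTorch's inter-op thread pool; in TorchScript the corresponding calls are torch.jit.fork and torch.jit.wait.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for two independent operations of a model.
double op1(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0);  // sum of elements
}

double op2(const std::vector<double>& x) {
    return std::inner_product(x.begin(), x.end(), x.begin(), 0.0);  // sum of squares
}

double run_both(const std::vector<double>& x) {
    // op1 starts on another thread and immediately hands back a future.
    std::future<double> pending = std::async(std::launch::async, [&x] { return op1(x); });
    // op2 runs on the calling thread while op1 may still be in flight.
    const double r2 = op2(x);
    // Query op1's result; get() blocks only if op1 has not finished yet.
    const double r1 = pending.get();
    // Combine both results.
    return r1 + r2;
}
```

The point is that the caller has to structure the code this way. Nothing gets overlapped unless the two operations are launched explicitly as separate tasks.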

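Because of the above:

intra-op works without any additions (it is handled by PyTorch) and should improve performance
inter-op is user-driven (the model's architecture, forward in particular), so the architecture must be designed with inter-op in mind!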

how can I utilize inter-op parallelism

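Unless you built your model with inter-op in mind (for example using Futures; see the first code snippet in the link you posted), you will not see any performance improvement.

Most probably:

your model is written in Python, converted to TorchScript, and only inference is done in C++
you should write (or refactor the existing) inter-op code in Python, e.g. using torch.jit.fork and torch.jit.wait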

do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?

I am not sure whether this is currently possible in C++; I could not find any torch::jit::fork or related functionality.

If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the interop thread pool?

Unlikely, since the goal of the C++ API is to mimic the Python API as closely as possible. You may have to dig deeper into the related source code and/or post a feature request on their GitHub repository if needed.

