Distributed TensorFlow: the difference between in-graph replication and between-graph replication

Question

While reading the Replicated training section of the official TensorFlow How-to, I got confused by two concepts: in-graph replication and between-graph replication.

The link above says:

In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); ...

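Does this mean there are multiple tf.Graphs in the between-graph replication approach? If yes, where are the corresponding codes in the provided examples?
While there is already a between-graph replication example in above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?

Thanks in advance!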

Edit_1: more questions

Thanks a lot for your detailed explanations and gist code, @mrry @YaroslavBulatov! After reading your responses, I have the following two questions:

There is the following statement in Replicated training:

Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.

I have two sub-questions about the phrases "a similar graph" and "a single copy" in the passage above.

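(A) Why do we say each client builds a similar graph, but not the same graph? I would expect the graph built in each client in the Replicated training example to be the same, because the graph construction code below is shared by all workers: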

# Build model...

loss = ...

global_step = tf.Variable(0)

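(B) Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs? If not, can we use in-graph replication to train on multiple GPUs within each machine and, at the same time, between-graph replication to train across machines? I ask because @mrry indicated that in-graph replication is essentially the same as the approach used in the CIFAR-10 example model for multiple GPUs.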

Answer 1

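First of all, for some historical context: "in-graph replication" is the first approach that we tried in TensorFlow, and it did not achieve the performance that many users required, so the more complicated "between-graph" approach is the currently recommended way to perform distributed training. Higher-level libraries such as tf.learn use the "between-graph" approach for distributed training.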

To answer your specific questions:

Does this mean there are multiple tf.Graphs in the between-graph replication approach? If yes, where are the corresponding codes in the provided examples?

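Yes. The typical between-graph replication setup uses a separate TensorFlow process for each worker replica, and each of these processes builds a separate tf.Graph for the model. Usually each process uses the global default graph (accessible through tf.get_default_graph()), so the graph is not created explicitly.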
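
To make the "one graph per process" idea concrete, here is a minimal sketch in plain Python. It does not import TensorFlow: a "graph" is modeled as a dict from op name to device string, and each function call stands in for a separate worker process. The names `PS_TASKS`, `place_variables`, and `build_worker_graph`, as well as the round-robin placement rule, are illustrative assumptions and not TensorFlow APIs.

```python
# Toy model, no TensorFlow required: a "graph" is a dict of op name -> device.
# The device strings only mimic TensorFlow's naming scheme.

PS_TASKS = 2  # assumed number of parameter-server tasks


def place_variables(var_names, ps_tasks=PS_TASKS):
    """Assign variables to ps tasks round-robin, in creation order.

    Assumption for illustration: because every worker creates the same
    variables in the same order, every worker computes the same mapping.
    """
    return {
        name: "/job:ps/task:%d" % (i % ps_tasks)
        for i, name in enumerate(var_names)
    }


def build_worker_graph(task_index):
    """What the client for one /job:worker task would build on its own."""
    graph = place_variables(["w1", "b1", "w2"])
    # The compute-intensive part appears once, pinned to the local worker.
    graph["loss"] = "/job:worker/task:%d" % task_index
    return graph


# Each call stands in for a separate process with its own default graph.
graph0 = build_worker_graph(0)
graph1 = build_worker_graph(1)
```

The two graphs are distinct objects that agree on where the parameters live and differ only in where the compute-intensive part is pinned.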

(In principle, you could use a single TensorFlow process with a single tf.Graph and multiple tf.Session objects that share the same underlying graph, as long as you configure the options for each session differently, but this is an uncommon setup.)

While there is already a between-graph replication example in above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?

For historical reasons, there are not many examples of in-graph replication (Yaroslav's gist is one exception). A program using in-graph replication will typically include a loop that creates the same graph structure for each worker (e.g. the loop on line 74 of the gist), and use variable sharing between the workers.
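
The per-worker loop can be sketched as follows. This is a plain-Python toy, not the code from the gist and not TensorFlow itself: the single "graph" is a dict from op name to device string, and `build_in_graph` plus all op names are hypothetical. In real code the loop body would typically sit under a device scope such as `tf.device(...)`.

```python
# Toy model of in-graph replication: ONE client builds ONE graph that
# contains a replica of the compute-intensive part for every worker.

def build_in_graph(num_workers):
    # A single set of parameters, pinned to the parameter server.
    graph = {"weights": "/job:ps/task:0"}
    # The loop that creates the same structure once per worker.
    for i in range(num_workers):
        graph["replica_%d/loss" % i] = "/job:worker/task:%d" % i
    # One update op that consumes the results of all replicas.
    graph["train_op"] = "/job:ps/task:0"
    return graph


graph = build_in_graph(3)
replicas = sorted(name for name in graph if name.startswith("replica_"))
```

The main difference from between-graph replication is visible in the shape of the result: here one graph names every worker's device, whereas in the between-graph setup each worker's graph names only its own.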

The one place where in-graph replication persists is for using multiple devices in a single process (e.g. multiple GPUs). The CIFAR-10 example model for multiple GPUs is an example of this pattern (see its loop over GPU devices).

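(In my opinion, the inconsistency between how multiple workers and multiple devices in a single worker are treated is unfortunate. In-graph replication is simpler to understand than between-graph replication, because it doesn't rely on implicit sharing between the replicas. Higher-level libraries, such as tf.learn and TF-Slim, hide some of these issues, and offer hope that we can provide a better replication scheme in the future.)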

Why do we say each client builds a similar graph, but not the same graph?

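Because they aren't required to be identical (and there is no integrity check that enforces this). In particular, each worker might create a graph with different explicit device assignments ("/job:worker/task:0", "/job:worker/task:1", etc.). The chief worker might create additional operations that are not created on (or used by) the non-chief workers. However, in most cases, the graphs are logically (i.e. modulo device assignments) the same.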
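
A small plain-Python sketch of "similar but not the same": two toy worker graphs differ in device assignment and in chief-only ops, yet compare equal once those differences are erased. Everything here is hypothetical (the op names, the choice of task 0 as chief, and the helpers `build_graph` and `logical_view`); TensorFlow performs no such comparison.

```python
# Toy model: a "graph" is a dict of op name -> device string.

CHIEF_ONLY = {"init_op", "save_checkpoint"}  # hypothetical chief-only ops


def build_graph(task_index):
    graph = {
        "weights": "/job:ps/task:0",
        "loss": "/job:worker/task:%d" % task_index,
    }
    if task_index == 0:  # assumption: task 0 acts as the chief
        for name in CHIEF_ONLY:
            graph[name] = "/job:worker/task:0"
    return graph


def logical_view(graph):
    """Drop chief-only ops and erase the worker task index from devices."""
    view = {}
    for name, device in graph.items():
        if name in CHIEF_ONLY:
            continue
        if device.startswith("/job:worker"):
            device = "/job:worker"
        view[name] = device
    return view


chief, non_chief = build_graph(0), build_graph(1)
same_as_built = chief == non_chief
same_logically = logical_view(chief) == logical_view(non_chief)
```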

Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

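Typically, each worker has a separate graph that contains a single copy of the compute-intensive part of the model. The graph for worker i does not contain the nodes for worker j (assuming i ≠ j). (An exception would be the case where you're using between-graph replication for distributed training, and in-graph replication for using multiple GPUs in each worker. In that case, the graph for a worker would typically contain N copies of the compute-intensive part of the graph, where N is the number of GPUs in that worker.)

Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs?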

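The example code only covers training on multiple machines, and says nothing about how to train on multiple GPUs in each machine. However, the techniques compose easily. In this part of the example: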
```
# Build model...
loss = ...
```
...you could add a loop over the GPUs in the local machine, to achieve distributed training across multiple workers, each with multiple GPUs.
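
As a rough illustration of that combination, the plain-Python sketch below models one worker's graph when between-graph replication is used across machines and an in-graph loop is used over the local GPUs. It does not import TensorFlow; `build_worker_graph`, the "tower" op names, and the gradient-averaging op are assumptions made for the example, and the device strings only mimic TensorFlow's naming scheme.

```python
# Toy model: each worker process builds its own graph (between-graph),
# and inside that graph loops over its local GPUs (in-graph).

def build_worker_graph(task_index, num_gpus):
    worker = "/job:worker/task:%d" % task_index
    # Parameters stay on the parameter server, shared by all workers.
    graph = {"weights": "/job:ps/task:0"}
    # The added loop: one copy of the compute-intensive part per local GPU.
    for g in range(num_gpus):
        graph["tower_%d/loss" % g] = "%s/gpu:%d" % (worker, g)
    # Hypothetical op that combines the per-GPU gradients on this worker.
    graph["average_gradients"] = worker + "/cpu:0"
    graph["train_op"] = "/job:ps/task:0"
    return graph


# Worker 1 with two GPUs; other workers would run the same code
# in their own processes with their own task_index.
graph = build_worker_graph(task_index=1, num_gpus=2)
towers = sorted(name for name in graph if name.startswith("tower_"))
```

With N local GPUs the worker's graph holds N copies of the compute-intensive part, while it still contains no nodes belonging to any other worker.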
