Where are the gradient queues and token queue created in TensorFlow between-graph synchronous training?

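I have been working on a TensorFlow between-graph synchronous training application. Synchronous training is implemented by the class SyncReplicasOptimizerV2. From the documentation of SyncReplicasOptimizerV2, I understand that a set of gradient queues and a token queue are created for synchronization.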

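I am wondering:

Where are these queues located, on the chief worker task or on the PS tasks? If the gradient queues are on the chief worker, then as far as I know the chief worker task also has to handle checkpointing, initialization, summaries, and so on.
Is this single chief worker task likely to become a performance bottleneck?
Is there any network communication between the different worker tasks (other than the chief), and if so, what is it for?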

P.S. All of my questions concern between-graph replicated training, with each task running on a different machine.

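First of all, the new implementation of synchronous training in tf.train.SyncReplicasOptimizerV2 does not actually use a set of queues for the variables. It uses a more efficient stateful object called a "conditional accumulator", which avoids storing the unaggregated partial gradients in memory and improves the behavior in some corner cases involving stale gradients.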
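To make the difference from a plain queue concrete, here is a minimal pure-Python sketch of the idea. It is a toy model, not TensorFlow's implementation: the class name ToyConditionalAccumulator is made up, the method names only loosely mirror apply_grad and take_grad on tf.ConditionalAccumulator, and the toy raises an error where the real op would block. It keeps a single running sum instead of a list of gradients, and it drops gradients computed against an older step.

```python
class ToyConditionalAccumulator:
    """Toy stand-in for a conditional accumulator (hypothetical, not TF code)."""

    def __init__(self, size):
        self.global_step = 0          # step the accumulator currently expects
        self._sum = [0.0] * size      # running sum; no per-worker copies are kept
        self._count = 0               # number of fresh gradients folded in

    def apply_grad(self, grad, local_step):
        """Fold a gradient into the sum unless it is stale. Returns True if used."""
        if local_step < self.global_step:
            return False              # stale gradient: silently dropped
        self._sum = [s + g for s, g in zip(self._sum, grad)]
        self._count += 1
        return True

    def take_grad(self, num_required):
        """Return the average gradient, then reset and advance the step."""
        if self._count < num_required:
            # The real op would block here; the toy just raises.
            raise RuntimeError("not enough gradients accumulated yet")
        avg = [s / self._count for s in self._sum]
        self._sum = [0.0] * len(self._sum)
        self._count = 0
        self.global_step += 1
        return avg


acc = ToyConditionalAccumulator(size=2)
acc.apply_grad([1.0, 2.0], local_step=0)               # worker A, step 0
acc.apply_grad([3.0, 4.0], local_step=0)               # worker B, step 0
average = acc.take_grad(num_required=2)                # [2.0, 3.0]
late_used = acc.apply_grad([10.0, 10.0], local_step=0) # straggler from step 0
print(average, late_used)
```

Memory use stays at one gradient-sized buffer per variable no matter how many workers contribute, and the straggler's gradient from step 0 is rejected once the accumulator has moved on to step 1.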

Each variable's conditional accumulator is placed on the same device as that variable, which is usually a PS task; thus the many conditional accumulators will be sharded according to the same policy used for sharding the variables. The token queue for synchronization, on which the workers block before starting the next step, is created on the same device as the global step variable, which is also usually a single PS task.
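In general, the chief worker task has very little work to do to coordinate synchronous training. When performing synchronous training, no additional data flows through the chief worker (in a typical setup, where e.g. tf.train.replica_device_setter() is used to assign devices to variables).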
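Synchronous training does not generate any additional worker-to-worker network traffic. Of course, you could choose to place different parts of your model on different workers for model-parallel training, in which case TensorFlow would add the appropriate communication. However, the image models for which we commonly use synchronous training (such as Inception) do not need model parallelism and run more efficiently on a single GPU.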
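The placement described above can be illustrated with a small pure-Python sketch. It assumes round-robin assignment of variables to PS tasks, which is the default strategy of tf.train.replica_device_setter(). The variable names, the number of PS tasks, and the helper place_round_robin are all hypothetical, and nothing here calls TensorFlow.

```python
from itertools import cycle


def place_round_robin(names, num_ps):
    """Assign each variable name to a PS task in creation order (toy model)."""
    devices = cycle("/job:ps/task:%d" % i for i in range(num_ps))
    return {name: next(devices) for name in names}


# Hypothetical model variables, listed in creation order.
variables = ["global_step", "conv1/weights", "conv1/biases", "fc/weights"]
var_devices = place_round_robin(variables, num_ps=2)

# One accumulator per trainable variable, co-located with that variable.
accumulator_devices = {
    name: dev for name, dev in var_devices.items() if name != "global_step"
}

# The token queue lives wherever the global step variable lives.
token_queue_device = var_devices["global_step"]

# Nothing used for synchronization ends up on a worker (chief included).
sync_devices = list(accumulator_devices.values()) + [token_queue_device]
on_workers = [dev for dev in sync_devices if dev.startswith("/job:worker")]

for name, dev in accumulator_devices.items():
    print("accumulator for %-14s -> %s" % (name, dev))
print("token queue -> %s" % token_queue_device)
print("sync state hosted on workers: %d" % len(on_workers))
```

In this toy layout the accumulators are spread over both PS tasks exactly as the variables are, and the chief worker hosts none of the synchronization state, which is why it is not expected to become a bottleneck.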