TensorFlow distributed passing devices

Question

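I recently installed the distributed version of TensorFlow. Following the trend, I tried to implement it with multiple GPUs on multiple computers, and also found some additional specifications. I can run a server and a worker on 2 different computers, with 2 GPUs and 1 GPU respectively, and use a gRPC session to allocate and run the program in remote or local mode.

On the remote computer I run a local TensorFlow server with: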

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='local|localhost:2500' --job_name=local --task_id=0 &

and on the server with:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500,prs|192.168.170.226:2500' --job_name=worker --task_id=0 \
--job_name=prs --task_id=0 &

However, when I try to specify devices so that the graph runs on both computers at the same time, Python shows the error:

 Could not satisfy explicit device specification '/job:worker/task:0'

when I use:

with tf.device("/job:prs/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:prs/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10], name='bias'))
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:0/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)

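The same happens even if I change the job name. So I am wondering whether something else is required, or whether I did something wrong when initializing the cluster.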

Answer 1

worker is really the name of the cluster (the job, in TensorFlow's terms).

Your first bazel invocation should look like this:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=0 &

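which you run on the first node, 192.168.170.193.

Your cluster name is worker, and it contains the addresses of both nodes. Each task ID then refers to one of the two running nodes: task 0 is the first address in the list and task 1 is the second. You have to start the gRPC server on both nodes, specifying a different task ID on each. That is, you then run: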

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=1 &

on your second node, 192.168.170.226.
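To make the mapping between the --cluster_spec string and the task IDs concrete, here is a minimal sketch in plain Python. It does not use TensorFlow. It assumes the flag format seen in the commands in this thread: ',' separates jobs, '|' separates a job name from its tasks, and ';' separates the tasks of one job. The helper names parse_cluster_spec and task_table are made up for illustration.

```python
def parse_cluster_spec(spec):
    """Parse 'job|host:port;host:port,job2|host:port' into {job: [addresses]}.

    Assumed format (taken from the commands in this thread): ',' separates
    jobs, '|' separates a job name from its tasks, ';' separates tasks.
    """
    cluster = {}
    for job_entry in spec.split(","):
        job_name, _, hosts = job_entry.partition("|")
        if not job_name or not hosts:
            raise ValueError("malformed job entry: %r" % job_entry)
        cluster[job_name] = hosts.split(";")
    return cluster


def task_table(cluster):
    """Return (device prefix, address) pairs, one per task.

    The task index is the position of the address within its job's list.
    """
    return [
        ("/job:%s/task:%d" % (job, index), address)
        for job, addresses in cluster.items()
        for index, address in enumerate(addresses)
    ]


spec = "worker|192.168.170.193:2500;192.168.170.226:2501"
cluster = parse_cluster_spec(spec)
for prefix, address in task_table(cluster):
    print(prefix, "->", address)
```

Each printed prefix is what goes at the start of a tf.device string, and the address next to it is the node where the server with that --task_id has to be running.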

Then run:

with tf.device("/job:worker/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:worker/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10]), name='bias')
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:1/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
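Before building the graph, it can help to check every device string against the cluster layout, since a job or task that the cluster does not define is one way to end up with a "Could not satisfy explicit device specification" error. The following is a rough sketch in plain Python, not TensorFlow's own placement logic. The CLUSTER dictionary is written out by hand from the commands above, and check_device is a hypothetical helper. It only compares names and task indices; it does not verify that the servers are running or that the GPUs exist.

```python
import re

# Cluster layout from the commands above, written out by hand:
# job name -> list of task addresses (list position = task ID).
CLUSTER = {"worker": ["192.168.170.193:2500", "192.168.170.226:2501"]}

# Matches /job:<name>/task:<n> with an optional /device:<type>:<n> suffix.
DEVICE_RE = re.compile(
    r"^/job:(?P<job>[^/]+)/task:(?P<task>\d+)"
    r"(?:/device:(?P<type>\w+):(?P<index>\d+))?$"
)


def check_device(device, cluster):
    """Return None if the job/task part of `device` exists in `cluster`,
    otherwise a short string describing the mismatch."""
    match = DEVICE_RE.match(device)
    if match is None:
        return "not of the form /job:<name>/task:<n>[/device:<type>:<n>]"
    job, task = match.group("job"), int(match.group("task"))
    if job not in cluster:
        return "job %r is not defined in the cluster" % job
    if task >= len(cluster[job]):
        return "job %r has only %d task(s)" % (job, len(cluster[job]))
    return None


devices = [
    "/job:worker/task:0/device:gpu:0",
    "/job:worker/task:0/device:gpu:1",
    "/job:worker/task:1/device:gpu:0",
    "/job:prs/task:0/device:gpu:0",  # job name from the question, absent here
]
for device in devices:
    problem = check_device(device, CLUSTER)
    print(device, "->", "ok" if problem is None else problem)
```

With the single worker job defined above, the three device strings from the answer pass, while the prs device from the question is reported as belonging to an undefined job.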

python

tensorflow