Tensorflow distributed training hangs with RecvTensor cancelled warning

I'm training a model with the tf.estimator API of TensorFlow 2.1.0 in parameter-server distributed mode. The code that launches training looks like this:

# model_fn contains the model logic
GATNE = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.model_dir, params=model_params, config=config)

# obtain different train and eval files for different workers
# files are stored on shared hdfs
train_files, eval_files = get_files_for_worker()

# input_fn raises an end-of-input exception after the given number of epochs (sketched after the listing)
train_spec = tf.estimator.TrainSpec(
            input_fn=lambda: input_fn(train_files, batch_size=args.batch_size, num_epochs=args.epochs),
            max_steps=None)

eval_spec = tf.estimator.EvalSpec(
            input_fn=lambda: input_fn(eval_files, batch_size=args.batch_size, num_epochs=1),
            steps=1000,
            exporters=exporter,
            throttle_secs=300)

tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)

print("Done.")

I'm training the model with 2 ps servers and 5 workers (indices 0 through 4). Worker 0 is the chief, worker 1 is the evaluator, and the rest are regular training workers.
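For context, the cluster layout corresponds to a TF_CONFIG roughly like this sketch (the hostnames are placeholders, and the evaluator runs as a separate "evaluator" task that does not appear in the cluster spec):

import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps0:2222", "ps1:2222"],
        "chief": ["worker0:2222"],
        "worker": ["worker2:2222", "worker3:2222", "worker4:2222"],
    },
    # each process sets its own entry, e.g. this one is worker 2
    "task": {"type": "worker", "index": 0},
})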

After the training job starts, worker 4 finishes its training successfully. Here are some of its logs (I added log lines such as ******************* [2021-04-01 12:05] global_step: 4331 ************************* during training to track the steps):

..............
2021-04-01 12:05:13.483 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4331 *************************
2021-04-01 12:05:55.315 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4535 *************************
2021-04-01 12:06:40.621 INFO:tensorflow:******************* [2021-04-01 12:06] global_step: 4733 *************************
2021-04-01 12:07:30.341 INFO:tensorflow:******************* [2021-04-01 12:07] global_step: 4936 *************************
2021-04-01 12:08:16.981 INFO:tensorflow:loss = 59.152977, step = 5136 (434.022 sec)
2021-04-01 12:08:16.981 INFO:tensorflow:******************* [2021-04-01 12:08] global_step: 5136 *************************
2021-04-01 12:09:04.206 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5337 *************************
2021-04-01 12:09:51.85 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5537 *************************
2021-04-01 12:10:38.679 INFO:tensorflow:******************* [2021-04-01 12:10] global_step: 5742 *************************
2021-04-01 12:11:25.781 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5948 *************************
2021-04-01 12:12:07.822 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6123 *************************
2021-04-01 12:12:54.258 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6327 *************************
2021-04-01 12:13:41.683 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6528 *************************
2021-04-01 12:14:21.60 INFO:tensorflow:Loss for final step: 62.46709.
2021-04-01 12:14:21.60 Done.

However, all the other workers print the following warning and then hang forever: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842.

Partial log from worker 2:

......
2021-04-01 12:11:14.380 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5899 *************************
2021-04-01 12:11:56.730 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 6077 *************************
2021-04-01 12:12:39.956 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6269 *************************
2021-04-01 12:13:26.957 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6468 *************************
2021-04-01 12:14:14.77 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6668 *************************
2021-04-01 12:14:50.611 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6821 *************************
2021-04-01 12:15:19.129 2021-04-01 12:15:19.128988: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.129 2021-04-01 12:15:19.129495: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132203: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132261: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132388: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842

Worker 3:

2021-04-01 12:14:22.586 INFO:tensorflow:loss = 60.513123, step = 6703 (462.406 sec)
2021-04-01 12:14:22.586 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6703 *************************
2021-04-01 12:14:58.652 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6855 *************************
2021-04-01 12:15:30.105 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 6984 *************************
2021-04-01 12:15:53.527 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 7084 *************************
2021-04-01 12:16:17.568 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7184 *************************
2021-04-01 12:16:41.582 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7284 *************************
2021-04-01 12:17:05.140 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7384 *************************
2021-04-01 12:17:22.976 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7485 *************************
2021-04-01 12:17:36.575 2021-04-01 12:17:36.574901: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363

These workers just hang, and I don't know why. The ps server logs also contain some similar warnings:

ps server 0:

2021-04-01 11:48:01.233 2021-04-01 11:48:01.232972: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.957 2021-04-01 12:14:20.957449: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.133 2021-04-01 12:15:19.132949: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133103: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133131: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133141: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133151: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133174: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:35.837 2021-04-01 12:17:35.836897: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577149: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577221: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577233: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577312: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577323: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577330: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577348: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577355: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577363: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577369: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.579 2021-04-01 12:17:36.579436: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

ps server 1:

2021-04-01 11:46:28.368 2021-04-01 11:46:28.368034: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.955 2021-04-01 12:14:20.955100: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 42171233248859519
2021-04-01 12:14:20.958 2021-04-01 12:14:20.957943: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130513: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130639: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130660: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130680: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134413: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134734: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134767: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134853: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134870: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134908: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134976: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134993: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135043: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135084: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135133: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135177: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135212: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:17:35.837 2021-04-01 12:17:35.837269: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578073: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578184: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578216: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.580 2021-04-01 12:17:36.580204: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

The logs of the chief and the evaluator look fine. I googled the RecvTensor cancelled warning but couldn't find enough relevant information. Maybe it's a network problem? But the warning shows up every single time I launch the training job, and the workers hang every time, so I don't think it's accidental.

I ran the job several times, and there is always exactly one worker that finishes successfully; right after that, the other workers hang.

Can anyone help me with this problem?

Update: I ran my job a few more times and confirmed that one training worker always finishes first, and then the other training workers start to hang. So I figured the worker that finishes training somehow affects the others. What if the first worker to finish never actually "finishes"? I added a 10-minute sleep after the training code on every training worker, just to keep the finished worker's process alive. It worked: while the first finished worker sleeps, the other training workers keep training normally, and they all complete within those 10 minutes. This is my workaround for now; I'm still open to better solutions. A sketch of the change is below.
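A minimal sketch of the workaround, assuming each process can tell its own role (task_type here is a placeholder for however the script knows that):

import time

tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)

# keep the finished worker's process (and its gRPC server) alive so the
# workers that are still training can keep communicating with the cluster
if task_type == "worker":  # placeholder: only regular training workers sleep
    time.sleep(10 * 60)

print("Done.")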