Distributed training on AI Engine is slowed down because the evaluation is not distributed

Question

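We have been training a neural network on AI Engine using a dataset of 96,000,000 data points. The network is trained in a distributed fashion and, as is customary, we use 20% of the dataset as evaluation data. To train in a distributed way, we use TensorFlow estimators and the method tf.estimator.train_and_evaluate. Because our dataset is very large, our evaluation set is also very large. Looking at the CPU usage of the master node versus the worker nodes, and testing with an evaluation dataset of only 100 samples, it appears that the evaluation is not distributed and happens only on the master node. As a result, the number of consumed ML units is roughly 5 times higher with the standard-size evaluation set (20% of the total data) than with only 100 evaluation data points, while the amount of training data is the same.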
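
To put the sizes in perspective, the split described above works out as follows. This is plain arithmetic on the figures stated in the question; the batch size of 1024 is a made-up value for illustration only, since the question does not state eval_batch_size.

```python
import math

TOTAL_POINTS = 96_000_000   # dataset size stated in the question
EVAL_PERCENT = 20           # 20% held out for evaluation

# Hypothetical value for illustration; the question does not give eval_batch_size.
ASSUMED_EVAL_BATCH_SIZE = 1024

eval_points = TOTAL_POINTS * EVAL_PERCENT // 100
train_points = TOTAL_POINTS - eval_points

# Batches needed for one full pass over each evaluation set.
full_eval_batches = math.ceil(eval_points / ASSUMED_EVAL_BATCH_SIZE)
small_eval_batches = math.ceil(100 / ASSUMED_EVAL_BATCH_SIZE)

print(f"training points:   {train_points:,}")
print(f"evaluation points: {eval_points:,}")
print(f"batches per full evaluation pass: {full_eval_batches:,}")
print(f"batches for the 100-sample test:  {small_eval_batches:,}")
```

Under that assumed batch size, a full evaluation pass needs several orders of magnitude more batches than the 100-sample test, which is why a single node doing all of the evaluation becomes noticeable.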

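We see two possible solutions to this problem:

Doing the evaluation distributed as well. Is that technically possible on AI Platform?
Finding a smaller evaluation dataset that is representative. Is there a best-practice approach for constructing this smaller dataset?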
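
For the second option, one simple way to get a smaller evaluation set is a uniform random subsample of the full evaluation data. The sketch below uses reservoir sampling (Algorithm R), which picks k records uniformly at random in a single pass without loading everything into memory. It is only an illustration: reservoir_sample and the stand-in record stream are hypothetical names, not part of the question's code, and a uniform sample is representative only on average. If some slices of the data are rare, a stratified sample may be a better fit.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Stand-in for the real evaluation records (hypothetical data).
full_eval_records = range(1_000_000)
small_eval_records = reservoir_sample(full_eval_records, k=10_000, seed=42)
print(len(small_eval_records))
```

In a setup like the one in the question, the subsample would presumably be written out once and its files passed as eval_paths, so that every evaluation run scores the same records and the metric stays comparable across checkpoints.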

Below is what I believe to be the relevant part of the code. The function input_fn returns a tf.data.Dataset that is already batched.

import tensorflow as tf

# Save a checkpoint every 1000 steps and keep the 10 most recent ones.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=1000, keep_checkpoint_max=10, tf_random_seed=random_seed
)

myestimator = _get_estimator(
    hidden_neurons, run_config, learning_rate, output_dir, my_rmse
)

# The input_fn given to a tf.estimator spec must be a callable without arguments,
# so we wrap our input_fn in a lambda.
callable_train_input_fn = lambda: input_fn(
    filenames=train_paths,
    num_epochs=num_epochs,
    batch_size=train_batch_size,
    num_parallel_reads=num_parallel_reads,
    random_seed=random_seed,
    input_format=input_format,
)
callable_eval_input_fn = lambda: input_fn(
    filenames=eval_paths,
    num_epochs=num_epochs,
    batch_size=eval_batch_size,
    shuffle=False,
    num_parallel_reads=num_parallel_reads,
    random_seed=random_seed,
    input_format=input_format,
)

train_spec = tf.estimator.TrainSpec(
    input_fn=callable_train_input_fn, max_steps=max_steps_train
)

# steps caps the number of batches consumed per evaluation run.
eval_spec = tf.estimator.EvalSpec(
    input_fn=callable_eval_input_fn,
    steps=max_steps_eval,
    throttle_secs=throttle_secs,
    exporters=[exporter],
    name="taxifare-eval",
)

tf.estimator.train_and_evaluate(myestimator, train_spec, eval_spec)

Answer 1

TF is not that comfortable for distributed learning. Take a look at mxnet. There's a nice intro here.

Answer 2

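After reading the comments and investigating further, it seems that the evaluation does not slow down the training itself. Instead, the evaluation happens twice: once during training and always once at the end of training. The job therefore takes longer simply because it has to wait for the evaluation to finish. Thanks, everyone, for the comments.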
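
The effect described in this answer can be illustrated with a back-of-the-envelope model. All timings below are invented for illustration and are not measurements from the question; they were picked so the ratio lands near the roughly 5x reported there. The model assumes, as the answer describes, that the job waits for each evaluation to finish, and that evaluation runs twice.

```python
def total_wall_time(train_secs: float, eval_secs: float, num_evals: int) -> float:
    """Wall-clock time when the job waits for each evaluation to finish."""
    return train_secs + num_evals * eval_secs


# Hypothetical timings, not measurements from the question.
TRAIN_SECS = 3600.0        # pure training time
NUM_EVALS = 2              # once during training, once at the end
LARGE_EVAL_SECS = 7200.0   # one pass over the 20% evaluation set
SMALL_EVAL_SECS = 1.0      # one pass over 100 samples

large = total_wall_time(TRAIN_SECS, LARGE_EVAL_SECS, NUM_EVALS)
small = total_wall_time(TRAIN_SECS, SMALL_EVAL_SECS, NUM_EVALS)
print(f"with large eval set: {large:.0f} s")
print(f"with small eval set: {small:.0f} s")
print(f"ratio: {large / small:.1f}x")
```

In this toy model the training time is identical in both cases; the whole difference comes from the time spent waiting on the two evaluation passes.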


python

distributed-computing

tensorflow

google-cloud-ml