What is the mechanism by which `tf.estimator.train_and_evaluate` controls the training and evaluation cycle?

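When I trained an SSD object detection model for 20K steps with the TensorFlow Object Detection API, I noticed that the training speed varies: it trains quickly for the first 10 minutes, completing about 500 steps (i.e., about 0.83 steps/second). Then it slows down, taking about 40~50 minutes to perform a single training step, evaluate the model on the evaluation dataset, and save a checkpoint to disk. So I interrupted training after a few steps and then resumed it.
Each time, it trains quickly for the first 10 minutes and then slows down sharply, as shown in the figure.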

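Training is driven by TensorFlow's Estimator API, `tf.estimator.train_and_evaluate()`. Can anyone explain how it works? How does the Estimator control the training and evaluation cycle? I don't want to evaluate the model at every step!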

If you look at the `EvalSpec` and `TrainSpec` mentioned in the discussion, there is an argument `throttle_secs` (it belongs to `EvalSpec`) that decides when evaluation is called. Refer to this discussion, which has many details about the Estimator methods. Tuning `throttle_secs` is the way to control the train and eval cycle.

In general, `train_and_evaluate` works by building a graph of the training and evaluation operations. The training graph is created only once, but the evaluation graph is recreated every time you need to evaluate. This means each evaluation loads the checkpoint that was created during training, which may be one reason why it takes so long. `InMemoryEvaluatorHook` (`tf.estimator.experimental.InMemoryEvaluatorHook`) may help with your problem, since it does not reload the checkpoint each time evaluation is called.
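To make the throttling idea concrete, here is a small pure-Python toy model. It is not TensorFlow's actual code. It assumes, as a simplification of the local (non-distributed) mode, that an evaluation is only considered right after a checkpoint has been saved, and that it is skipped when fewer than `throttle_secs` seconds have passed since the previous evaluation. The function `simulate_eval_schedule` and the parameter `secs_per_step` are hypothetical names made up for this sketch; `save_checkpoints_steps` and `throttle_secs` mirror the real `RunConfig` and `EvalSpec` arguments. The time spent evaluating is ignored.

```python
def simulate_eval_schedule(total_steps, secs_per_step,
                           save_checkpoints_steps, throttle_secs):
    """Toy model: return the training steps at which an evaluation would run.

    Assumptions (simplified, not TensorFlow's real implementation):
      * a checkpoint is written every `save_checkpoints_steps` steps;
      * evaluation is only considered right after a checkpoint is written;
      * evaluation is skipped if the previous one ran less than
        `throttle_secs` seconds ago.
    """
    eval_steps = []
    last_eval_time = None  # no evaluation has run yet
    for step in range(1, total_steps + 1):
        if step % save_checkpoints_steps != 0:
            continue  # no checkpoint at this step, so nothing to evaluate
        now = step * secs_per_step  # simulated wall-clock time in seconds
        if last_eval_time is not None and now - last_eval_time < throttle_secs:
            continue  # throttled: the last evaluation was too recent
        eval_steps.append(step)
        last_eval_time = now
    return eval_steps


# 2000 steps at 1 s/step, a checkpoint every 500 steps.
print(simulate_eval_schedule(2000, 1, 500, throttle_secs=0))     # [500, 1000, 1500, 2000]
print(simulate_eval_schedule(2000, 1, 500, throttle_secs=600))   # [500, 1500]
print(simulate_eval_schedule(2000, 1, 500, throttle_secs=3600))  # [500]
```

In this model there are two knobs for evaluating less often: save checkpoints less frequently, or raise `throttle_secs`. With a real Estimator the corresponding settings would be `save_checkpoints_steps` or `save_checkpoints_secs` on `tf.estimator.RunConfig` and `throttle_secs` on `tf.estimator.EvalSpec`; how closely the real schedule follows this sketch depends on the TensorFlow version.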