How many iterations are needed to train TensorFlow with the entire MNIST data set (60000 images)?

Question

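The MNIST set contains 60,000 images as its training set. When training my TensorFlow model, I want to run enough training steps to train the model on the entire training set. The deep learning example on the TensorFlow website uses 20,000 iterations with a batch size of 50 (1,000,000 images in total). When I try more than 30,000 iterations, my digit predictions fail (every handwritten digit is predicted as 0). My question is: with a batch size of 50, how many iterations should I use to train a TensorFlow model on the entire MNIST set?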

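# TensorFlow 1.x; assumes: from tensorflow.examples.tutorials.mnist import input_data
self.mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
for i in range(FLAGS.training_steps):
    # Draw the next mini-batch of 50 (image, label) pairs.
    batch = self.mnist.train.next_batch(50)
    self.train_step.run(feed_dict={self.x: batch[0], self.y_: batch[1], self.keep_prob: 0.5})
    # Save a checkpoint every 1000 steps.
    if (i + 1) % 1000 == 0:
        saver.save(self.sess, FLAGS.checkpoint_dir + 'model.ckpt', global_step=i)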
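To relate a step count to passes over the data, it can help to convert iterations into epochs. The following is a minimal sketch, not part of the original question: the helper names `epochs_seen` and `steps_for_epochs` are hypothetical, and the data set size is the 60,000 figure quoted above.

```python
def epochs_seen(steps, batch_size, dataset_size):
    """Return how many passes over the data `steps` mini-batches amount to."""
    return steps * batch_size / dataset_size


def steps_for_epochs(epochs, batch_size, dataset_size):
    """Return the number of mini-batch steps needed for `epochs` passes (rounded up)."""
    return -(-epochs * dataset_size // batch_size)  # ceiling division


if __name__ == "__main__":
    # 20,000 steps of 50 images over 60,000 images.
    print(epochs_seen(20_000, 50, 60_000))
    # Steps needed to see every image once at batch size 50.
    print(steps_for_epochs(1, 50, 60_000))
```

Under these assumptions, 20,000 steps at batch size 50 is roughly 16.7 passes over the data, 30,000 steps is 25 passes, and a single pass takes 1,200 steps.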

Answer 1

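I think it depends on your stopping criterion. You can stop training when the loss no longer improves, or you can hold out a validation data set and stop training when the validation accuracy stops increasing.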

Answer 2

With machine learning you tend to run into severe diminishing returns. For example, here is a list of accuracies from one of my CNNs:

Epoch 0 current test set accuracy :  0.5399
Epoch 1 current test set accuracy :  0.7298
Epoch 2 current test set accuracy :  0.7987
Epoch 3 current test set accuracy :  0.8331
Epoch 4 current test set accuracy :  0.8544
Epoch 5 current test set accuracy :  0.8711
Epoch 6 current test set accuracy :  0.888
Epoch 7 current test set accuracy :  0.8969
Epoch 8 current test set accuracy :  0.9064
Epoch 9 current test set accuracy :  0.9148
Epoch 10 current test set accuracy :  0.9203
Epoch 11 current test set accuracy :  0.9233
Epoch 12 current test set accuracy :  0.929
Epoch 13 current test set accuracy :  0.9334
Epoch 14 current test set accuracy :  0.9358
Epoch 15 current test set accuracy :  0.9395
Epoch 16 current test set accuracy :  0.942
Epoch 17 current test set accuracy :  0.9436
Epoch 18 current test set accuracy :  0.9458

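As you can see, the returns start to fall off after ~10 epochs*, but this may vary with your network and learning rate. The right number depends on how critical the result is and how much time you have, but I have found 20 to be a reasonable number.

*I have always used the word epoch to mean one full run through a data set, but I am not sure how accurate that definition is. Each epoch here is about 429 training steps with a batch size of 128.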
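The diminishing returns in the accuracy list above can be made concrete by comparing the gain over the first nine epochs with the gain over the next nine. This is only an illustrative sketch: the numbers are copied from this answer, and `total_gain` is a hypothetical helper name.

```python
# Test-set accuracies for epochs 0..18, copied from the list in this answer.
accuracies = [
    0.5399, 0.7298, 0.7987, 0.8331, 0.8544, 0.8711, 0.888, 0.8969, 0.9064,
    0.9148, 0.9203, 0.9233, 0.929, 0.9334, 0.9358, 0.9395, 0.942, 0.9436,
    0.9458,
]


def total_gain(values, start, end):
    """Return the accuracy gained between epoch `start` and epoch `end`."""
    return values[end] - values[start]


if __name__ == "__main__":
    print("epochs 0 -> 9 :", round(total_gain(accuracies, 0, 9), 4))
    print("epochs 9 -> 18:", round(total_gain(accuracies, 9, 18), 4))
```

For this particular run, the first nine epochs add about 37.5 percentage points of accuracy, while the following nine add about 3.1.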

Answer 3

You can use something like no_improve_epoch with a limit of 3. This means that if the score has not improved by more than 1% for 3 consecutive epochs, training stops.

# Assumes os, tf, saver, train, tags and the cls helpers are defined elsewhere.
no_improve_epoch = 0  # epochs since the last improvement
best_score = 0.0      # best F1 seen so far; the initial value is an assumption
with tf.Session() as sess:
    sess.run(cls.init)
    if cls.config.reload == 'True':
        print(cls.config.reload)
        cls.logger.info("Reloading the latest trained model...")
        saver.restore(sess, cls.config.model_output)
    cls.add_summary(sess)
    for epoch in range(cls.config.nepochs):
        cls.logger.info("Epoch {:} out of {:}".format(epoch + 1, cls.config.nepochs))
        dev = train  # note: this evaluates on the training data
        acc, f1 = cls.run_epoch(sess, train, dev, tags, epoch)

        # Decay the learning rate after every epoch.
        cls.config.lr *= cls.config.lr_decay

        if f1 >= best_score:
            # New best score: reset the counter and save the model.
            no_improve_epoch = 0
            if not os.path.exists(cls.config.model_output):
                os.makedirs(cls.config.model_output)
            saver.save(sess, cls.config.model_output)
            best_score = f1
            cls.logger.info("- new best score!")
        else:
            no_improve_epoch += 1
            if no_improve_epoch >= cls.config.nepoch_no_imprv:
                cls.logger.info("- early stopping {} Iterations without improvement".format(
                    no_improve_epoch))
                break

Sequence Tagging GITHUB
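The rule described in this answer (stop after 3 epochs without a gain of more than 1%) can be sketched without any framework. This is a minimal illustration, not the code from the linked repository: `early_stop_epoch` is a hypothetical helper, and reading "1%" as an absolute improvement of 0.01 in the score is an assumption.

```python
def early_stop_epoch(scores, patience=3, min_delta=0.01):
    """Return the index of the epoch at which training would stop, or None.

    An epoch counts as an improvement only if its score beats the best
    score so far by more than `min_delta` (assumed absolute, not relative).
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best + min_delta:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Made-up validation scores, for illustration only.
    print(early_stop_epoch([0.50, 0.60, 0.605, 0.608, 0.609, 0.70]))
```

With the made-up scores above, epochs 2, 3 and 4 each fail to beat 0.60 by more than 0.01, so the loop would stop at epoch 4 and never see the later jump to 0.70. That is the usual trade-off when choosing a small patience.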

Answer 4

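I found that with MNIST, training on 3,833 images per epoch (and validating on the remaining 56,167, because 60000**0.75 is just over 3,833) tends to converge well before 500 epochs. By "converge," I mean that the validation loss does not decrease for 50 consecutive epochs of training with batch size 16; see this repo for an example of early stopping with tf.keras. This mattered to me in this case because I was doing model search and did not have time to train any single model for very long.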
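The split sizes quoted in this answer follow from simple arithmetic, sketched below. Rounding down to a whole number of images is an assumption (the answer only says "just over 3,833"), and the variable names are illustrative.

```python
import math

TOTAL_IMAGES = 60_000
BATCH_SIZE = 16

# 60000 ** 0.75 is about 3833.66; rounding down is assumed here.
train_size = int(TOTAL_IMAGES ** 0.75)
val_size = TOTAL_IMAGES - train_size

# Mini-batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)

if __name__ == "__main__":
    print(train_size, val_size, steps_per_epoch)
```

Under that assumption, each epoch is 240 small steps, so even a 500-epoch ceiling is at most 120,000 steps on a training set much smaller than the full data.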

python

mnist

tensorflow