Why use steps_per_epoch when replicating a tf.dataset?

Question

I am learning TensorFlow and working through the example code here: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/tf-keras

Here is a short code snippet showing how the input to the model.fit function is built.

def input_fn(dataset, shuffle, n_epoch, s_batch):
    # Optionally shuffle, then repeat the data n_epoch times and batch it.
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(n_epoch)
    dataset = dataset.batch(s_batch)
    return dataset

n_epoch = 10
s_batch = 100
s_samples = ...  # placeholder: number of samples in the training data

training_dataset_input = input_fn(
    training_dataset,
    shuffle=True,
    n_epoch=n_epoch,
    s_batch=s_batch)

mymodel.fit(training_dataset_input, epochs=n_epoch, steps_per_epoch=int(s_samples / s_batch))

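My question is about understanding how epochs work. I thought one epoch was one complete run-through of the entire dataset. But when the steps_per_epoch argument is set, training continues where it left off on the same dataset; it does not seem to restart from the beginning. So what is the difference between: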

mymodel.fit(training_dataset_input,epochs=n_epoch,steps_per_epoch=int(s_samples/s_batch))

and exhausting the whole replicated dataset in a single epoch:

mymodel.fit(training_dataset_input)

Both of these fit calls will use the whole dataset 10 times and run the same number of training steps.
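
The step counts behind this claim can be checked with plain arithmetic. The sketch below only counts batches and does not call TensorFlow; the dataset sizes 1000 and 1050 are hypothetical, since the question does not give s_samples. Under these assumptions the two calls match exactly only when s_samples is a multiple of s_batch; otherwise the int() truncation leaves a few trailing batches unused.

```python
import math

def steps_with_steps_per_epoch(s_samples, s_batch, n_epoch):
    # fit(..., epochs=n_epoch, steps_per_epoch=int(s_samples / s_batch))
    return n_epoch * int(s_samples / s_batch)

def steps_single_pass(s_samples, s_batch, n_epoch):
    # fit(dataset) with the default epochs=1: every batch produced by
    # repeat(n_epoch).batch(s_batch), including a final partial batch.
    return math.ceil(n_epoch * s_samples / s_batch)

for s_samples in (1000, 1050):  # hypothetical dataset sizes
    a = steps_with_steps_per_epoch(s_samples, 100, 10)
    b = steps_single_pass(s_samples, 100, 10)
    print(s_samples, a, b)
```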

Answer 1

But when setting the argument steps_per_epoch the training continue where it left on the same dataset, it does not seem to restart at the beginning. So what is then the difference

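If steps_per_epoch is not set, then 1 epoch is 1 complete run through the data.

If steps_per_epoch is set, then 1 "epoch" is the number of training steps given by this value, and (as you point out) the next "epoch" starts where the last one left off.

This feature is useful if you want to run validation more frequently on a huge dataset, for example.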
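
The "starts where the last one left off" behaviour can be illustrated with a plain-Python model. This is a sketch of the behaviour described above, not TensorFlow code: repeated_batches is a hypothetical stand-in for dataset.repeat(n_epoch).batch(s_batch), and the 6-sample dataset is made up. steps_per_epoch is deliberately smaller than one full pass, as in the frequent-validation use case.

```python
from itertools import islice

def repeated_batches(samples, n_epoch, s_batch):
    # Plain-Python stand-in for dataset.repeat(n_epoch).batch(s_batch).
    stream = [x for _ in range(n_epoch) for x in samples]
    for i in range(0, len(stream), s_batch):
        yield stream[i:i + s_batch]

samples = list(range(6))   # hypothetical 6-sample dataset
steps_per_epoch = 2        # fewer steps than one full pass (3 batches)
batches = repeated_batches(samples, n_epoch=2, s_batch=2)

# One shared iterator: each "epoch" takes steps_per_epoch batches and the
# next one resumes where the previous one stopped.
epochs = [list(islice(batches, steps_per_epoch)) for _ in range(3)]
for number, epoch in enumerate(epochs, start=1):
    print("epoch", number, epoch)
```

In this model the second "epoch" begins with the batch [4, 5] rather than [0, 1], so an "epoch" here is just a fixed number of steps, not a pass over the data.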

python

tensorflow

tf.keras