tf.Dataset will not repeat without - WARNING:tensorflow:Your input ran out of data; interrupting training

Using TensorFlow's dataset generator works fine without repeat. However, when I use repeat to double my training dataset from 82,000 to 164,000 for additional augmentation, I "run out of data."

I have read that steps_per_epoch can be used to "slow cook" a model by spreading a single pass over the training data across multiple epochs. That is not my intention, but even when I pass a small steps_per_epoch (which should produce that slow-cook pattern), TF says I ran out of data.

There is one case where TF says I am close ("in this case, 120 batches"). I have tried plus/minus that value but still get the error. drop_remainder is set to True to discard anything left over.
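
For context on what drop_remainder does to the batch count: it simply trims the final partial batch, so a dataset of N samples batched by B yields N // B full batches. A tiny standalone illustration (not the pipeline from this question):

    import tensorflow as tf

    ds = tf.data.Dataset.range(82)                           # 82 samples
    print(len(list(ds.batch(20))))                           # 5 batches (the last one holds only 2)
    print(len(list(ds.batch(20, drop_remainder=True))))      # 4 batches; the 2 leftovers are dropped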

The error:

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 82,000 batches). You may need to use the repeat() function when building your dataset.

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 120 batches). You may need to use the repeat() function when building your dataset.

Parameters

| Parameter | Value |
| --- | --- |
| Train Dataset | 82,000 |
| Val Dataset | 12,000 |
| Test Dataset | 12,000 |
| epochs (early stopping usually stops about 30) | 100 |
| batch_size | 200 |

** batch_size is the same for the model mini-batches and the generator batches

| Attempt | steps_per_epoch value | Error |
| --- | --- | --- |
| steps_per_epoch==None | None | "..in this case, 82,000 batches" |
| steps_per_epoch==train_len//batch_size | 820 | "..in this case, 82,000 batches" |
| steps_per_epoch==(train_len//batch_size)-1 | 819 | Training stops halfway: "..in this case, 81,900 batches" |
| steps_per_epoch==(train_len//batch_size)+1 | 821 | Training stops halfway: "..in this case, 82,100 batches" |
| steps_per_epoch==(train_len//batch_size)//2 | 410 | Training seems complete but errors before validation: "..in this case, 120 batches" |
| steps_per_epoch==((train_len//batch_size)//2)-1 | 409 | Same as above: training seems complete but errors before validation: "..in this case, 120 batches" |
| steps_per_epoch==((train_len//batch_size)//2)+1 | 411 | Training seems complete but errors before validation: "..in this case, 41,100 batches" |
| steps_per_epoch==(train_len//batch_size)*2 | 1640 | Training stops at one quarter: "..in this case, 164,000 batches" |
| steps_per_epoch==20 (arbitrarily small number) | 20 | Very surprisingly: "..in this case, 120 batches" |
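
For reference, in most of these attempts the count quoted in the warning is simply steps_per_epoch * epochs, exactly as the message states; only the "120 batches" rows do not follow that pattern. A quick check of the arithmetic with the values above:

    epochs = 100
    for steps in (820, 819, 821, 411, 1640):
        print(f"{steps} * {epochs} = {steps * epochs}")
    # 820 * 100 = 82000, 819 * 100 = 81900, 821 * 100 = 82100,
    # 411 * 100 = 41100, 1640 * 100 = 164000 -- matching the counts quoted above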

The generator - the goal is to repeat the training set twice:

    trainDS = tf.data.Dataset.from_tensor_slices(trainPaths).repeat(2) 
    train_len = len(trainDS) #used to calc steps_per_epoch
    trainDS = (trainDS
                .shuffle(train_len)
                .map(load_images, num_parallel_calls=AUTOTUNE)
                .map(augment, num_parallel_calls=AUTOTUNE)
                .cache('train_cache')
                .batch(batch_size, drop_remainder=True)
                .prefetch(AUTOTUNE)
    )
    valDS = tf.data.Dataset.from_tensor_slices(valPaths)
    valDS = (valDS
                .map(load_images, num_parallel_calls=AUTOTUNE)
                .cache('val_cache')
                .batch(batch_size, drop_remainder=True)
                .prefetch(AUTOTUNE)
    )
    testDS = tf.data.Dataset.from_tensor_slices(testPaths)
    testDS = (testDS
                .map(load_images, num_parallel_calls=AUTOTUNE)
                .cache('test_cache')
                .batch(batch_size, drop_remainder=True)
                .prefetch(AUTOTUNE)
    )
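
A quick sanity check on what this pipeline actually produces (a sketch assuming the variables above are in scope): because repeat(2) is applied before batching, len(trainDS) measured right after repeat is 164,000, and the batched pipeline should yield 164,000 // 200 = 820 batches per pass.

    # train_len was taken right after repeat(2), so it counts both copies:
    print(train_len, train_len // batch_size)                    # 164000 820

    # Batches the finished pipeline yields per pass
    # (drop_remainder=True, so 164,000 // 200 = 820):
    print(tf.data.experimental.cardinality(trainDS).numpy())     # expected: 820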

Model.fit() - according to the docs, len(train)//batch_size is the default:

    hist= model.fit(trainDS,
                    epochs=epochs, 
                    batch_size=batch_size, 
                    validation_data=valDS,                   
                    steps_per_epoch= <see attempts table above>,
    )

EDIT: Putting repeat last in the method chain is what works. Shout out to @AloneTogether for the tip to remove batch_size from the fit function.

    trainDS = tf.data.Dataset.from_tensor_slices(trainPaths)
    trainDS = (trainDS
        .shuffle(len(trainPaths))
        .map(load_images, num_parallel_calls=AUTOTUNE)
        .map(augment, num_parallel_calls=AUTOTUNE)
        .cache('train_cache')
        .batch(batch_size, drop_remainder=True)
        .prefetch(AUTOTUNE)
        .repeat(2) # <-- put last in the list
    )
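
A sanity check on the reordered pipeline (again a sketch assuming the variables above are in scope): with repeat(2) applied after batching, each copy contributes 82,000 // 200 = 410 full batches, so one epoch should now consume 820 batches in total.

    # repeat(2) after batch(drop_remainder=True): 2 * (82,000 // 200) = 820
    print(tf.data.experimental.cardinality(trainDS).numpy())     # expected: 820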

Well, maybe you should not be explicitly defining batch_size or steps_per_epoch in model.fit(...). Regarding the batch_size parameter of model.fit(...), the docs state:

[...] Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

This seems to work:

    import tensorflow as tf

    x = tf.random.normal((1000, 1))
    y = tf.random.normal((1000, 1))

    ds = tf.data.Dataset.from_tensor_slices((x, y)).repeat(2)
    ds = ds.shuffle(2000).cache('train_cache').batch(15, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

    val_ds = tf.data.Dataset.from_tensor_slices((tf.random.normal((300, 1)), tf.random.normal((300, 1))))
    val_ds = val_ds.shuffle(300).cache('val_cache').batch(15, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

    inputs = tf.keras.layers.Input(shape=(1,))
    x = tf.keras.layers.Dense(10, activation='relu')(inputs)
    outputs = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)

    model.compile(optimizer='adam', loss='mse')
    model.fit(ds, validation_data=val_ds, epochs=5)

    Epoch 1/5
    133/133 [==============================] - 1s 4ms/step - loss: 1.0355 - val_loss: 1.1205
    Epoch 2/5
    133/133 [==============================] - 0s 3ms/step - loss: 0.9847 - val_loss: 1.1050
    Epoch 3/5
    133/133 [==============================] - 0s 3ms/step - loss: 0.9810 - val_loss: 1.0982
    Epoch 4/5
    133/133 [==============================] - 0s 3ms/step - loss: 0.9792 - val_loss: 1.0937
    Epoch 5/5
    133/133 [==============================] - 0s 3ms/step - loss: 0.9779 - val_loss: 1.0903
    <keras.callbacks.History at 0x7f3acb3e5ed0>

133 * batch_size = 133 * 15 = 1995 --> the dataset holds 2 * 1000 = 2000 samples, which gives 2000 // 15 = 133 full batches, so the remaining 5 samples are dropped (drop_remainder=True).