使用 ModelCeckpoint 保存检查点后，Keras 停止训练过程

Question

我正在用 tf.keras 训练 CNN。保存检查点后 Keras 没有开始下一个纪元

注： 1）作为一个保护程序被使用 tf.keras.callbacks.ModelCeckpoint 2）用于训练 fit_generator()

def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    np.random.shuffle(indices)

    for start_idx in np.arange(0, len(inputs) - batchsize + 1, batchsize):
        excerpt = indices[start_idx:start_idx + batchsize]
        yield load_images(inputs[excerpt], targets[excerpt])

#Model path
model_path = "C:/Users/Paperspace/Desktop/checkpoints/cp.ckpt"
#saver = tf.train.Saver(max_to_keep=3)
cp_callback = tf.keras.callbacks.ModelCheckpoint(model_path, 
                                                 verbose=1,
                                                 save_weights_only=True,
                                                period=2)

tb_callback =TensorBoard(log_dir="./Graph/{}".format(time()))

batch_size = 750
history = model.fit_generator(generator=iterate_minibatches(X_train, Y_train,batch_size),
                                  validation_data=iterate_minibatches(X_test, Y_test, batch_size),
                                  # validation_data=None,
                                  steps_per_epoch=len(X_train)//batch_size,
                                  validation_steps=len(X_test)//batch_size,
                                  verbose=1,
                                  epochs=30,
                                  callbacks=[cp_callback,tb_callback] 
                             )

实际结果它停止训练没有任何问题。下一个时代的预期结果。

**Log**

Epoch 1/30
53/53 [==============================] - 919s 17s/step - loss: 1.2445 - acc: 0.0718
426/426 [==============================] - 7058s 17s/step - loss: 1.7877 - acc: 0.0687 - val_loss: 1.2445 - val_acc: 0.0718
Epoch 2/30
WARNING:tensorflow:Your dataset iterator ran out of data.

Epoch 00002: saving model to C:/Users/Paperspace/Desktop/checkpoints/cp.ckpt
WARNING:tensorflow:This model was compiled with a Keras optimizer (<tensorflow.python.keras.optimizers.Adam object at 0x0000023A913DE470>) but is being saved in TensorFlow format with `save_weights`. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.

Consider using a TensorFlow optimizer from `tf.train`.
WARNING:tensorflow:From C:\Users\Paperspace\Anaconda3\lib\site-packages\tensorflow\python\keras\engine\network.py:1436: update_checkpoint_state (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.train.CheckpointManager to manage checkpoints rather than manually editing the Checkpoint proto.
  0/426 [..............................] - ETA: 0s - loss: 0.0000e+00 - acc: 0.0687 - val_loss: 0.0000e+00 - val_acc: 0.0000e+00

Answer 1

Keras 生成器必须在无限循环中生成批次。此更改应该有效，否则您可以按照 this 之类的教程进行操作。

def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    while True:
        indices = np.arange(len(inputs))
        np.random.shuffle(indices)

        for start_idx in np.arange(0, len(inputs) - batchsize + 1, batchsize):
            excerpt = indices[start_idx:start_idx + batchsize]
            yield load_images(inputs[excerpt], targets[excerpt])

Answer 2

乍一看，您的生成器看起来不正确。 Keras 生成器需要一个 while True: 循环。也许这对你有用

def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    np.random.shuffle(indices)

    while True:
        start = 0
        end = batchsize

        while start < len(inputs):
            excerpt = indices[start:end]
            yield load_images(inputs[excerpt], targets[excerpt])

            start += batchsize
            end += batchsize

使用 ModelCeckpoint 保存检查点后，Keras 停止训练过程

After saving checkpoint with ModelCeckpoint, Keras stopped training process

keras

tensorflow

tf.keras