How to create checkpoint filenames with epoch or batch number when using ModelCheckpoint() with save_freq as an integer?

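I have installed tensorflow 2 v.2.5.0 and am using a jupyter notebook with python 3.10.

I am practicing using the save_freq argument as an integer, following an online course (they use tensorflow 2.0.0, where the code below runs fine, but it does not work in my more recent version).

Here is the link to the relevant documentation, which has no example of using an integer for save_freq: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint

Here is my code: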

    import tensorflow as tf
    from tensorflow.keras.callbacks import ModelCheckpoint
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
    
    # Use the CIFAR-10 dataset
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    x_train = x_train / 255.0
    x_test = x_test / 255.0
    
    # using a smaller subset -- speeds things up
    x_train = x_train[:10000]
    y_train = y_train[:10000]
    x_test = x_test[:1000]
    y_test = y_test[:1000]
    
    # define a function that creates a new instance of a simple CNN.
    def create_model():
        model = Sequential([
            Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3), 
                   activation='relu', name='conv_1'),
            Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
            MaxPooling2D(pool_size=(4, 4), name='pool_1'),
            Flatten(name='flatten'),
            Dense(units=32, activation='relu', name='dense_1'),
            Dense(units=10, activation='softmax', name='dense_2')
        ])
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model
    
    
    # Create Tensorflow checkpoint object with epoch and batch details 
    
    checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:04d}'
    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                     save_weights_only = True,
                                     save_freq = 5000,
                                     verbose = 1)
    
    
    # Create and fit model with checkpoint
    
    model = create_model()
    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = 10,
              callbacks = [checkpoint_5000])

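I want to create and save checkpoint files whose names include the epoch and batch number. However, the files are not created, and I get 'File not found'. After I manually created the directory model_checkpoints_5000, still no files were added to it.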

(We can check by running '!dir -a model_checkpoints_5000' (on Windows) or 'ls -lh model_checkpoints_5000' (on Linux).)
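
A platform-independent way to run the same check from inside the notebook is to list the directory with Python itself. This is only a minimal sketch; the helper name `list_checkpoint_files` is hypothetical and not part of TensorFlow.

```python
from pathlib import Path

def list_checkpoint_files(directory):
    """Return the sorted file names in `directory`, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir() if entry.is_file())

# Directory name taken from the question; an empty list means nothing was saved.
print(list_checkpoint_files("model_checkpoints_5000"))
```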

I also tried changing the path to 'model_checkpoints_5000/cp_{epoch:02d}', but it still did not save a file for each epoch.
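
To see what names these two templates should produce, the placeholders can be expanded with plain `str.format`. This is an illustrative sketch only: it assumes the callback fills `{epoch}` and `{batch}` with 1-based integers, and the values passed in below are made up.

```python
# Filename templates from the question.
template_with_batch = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:04d}'
template_epoch_only = 'model_checkpoints_5000/cp_{epoch:02d}'

# Illustrative values: epoch 1, batch 500 (assumed to be 1-based integers).
name_with_batch = template_with_batch.format(epoch=1, batch=500)
name_epoch_only = template_epoch_only.format(epoch=1)

print(name_with_batch)  # model_checkpoints_5000/cp_01-0500
print(name_epoch_only)  # model_checkpoints_5000/cp_01
```

Both templates are valid format strings, so the naming pattern itself is unlikely to be what prevents the files from appearing.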

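Then I tried to follow the save_freq example under "Checkpoint callback options" in this tutorial, which does save files for me: https://www.tensorflow.org/tutorials/keras/save_and_load

However, my adapted version still does not save any files.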

    import os

    checkpoint_path = "model_checkpoints_5000/cp-{epoch:02d}.ckpt"
    checkpoint_dir = os.path.dirname(checkpoint_path)

    batch_size = 10

    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_path,
                                      save_weights_only = True,
                                      save_freq = 500*batch_size,
                                      verbose = 1)

    model = create_model()

    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = batch_size,
              callbacks = [checkpoint_5000])

Any suggestions for getting this to work, other than downgrading my tensorflow?

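The save_freq argument is too large. An integer save_freq is counted in batches, so with 10000 training samples and batch_size = 10 there are only 1000 batches per epoch. It needs to be less than or equal to training_samples // batch_size. Maybe try something like this: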

    batch_size = 10
    checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:1d}'
    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                      save_weights_only = True,
                                      save_freq = len(x_train) // batch_size // batch_size,
                                      verbose = 1)
    model = create_model()
    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = batch_size,
              callbacks = [checkpoint_5000])

Output:

    Epoch 1/3
      97/1000 [=>............................] - ETA: 3s - loss: 2.2801 - accuracy: 0.1536
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-100
     198/1000 [====>.........................] - ETA: 3s - loss: 2.2347 - accuracy: 0.1500
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-200
     288/1000 [=======>......................] - ETA: 3s - loss: 2.1979 - accuracy: 0.1736
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-300
     397/1000 [==========>...................] - ETA: 2s - loss: 2.1337 - accuracy: 0.2020
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-400
     497/1000 [=============>................] - ETA: 2s - loss: 2.0952 - accuracy: 0.2197
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-500
     598/1000 [================>.............] - ETA: 1s - loss: 2.0496 - accuracy: 0.2395
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-600
     698/1000 [===================>..........] - ETA: 1s - loss: 2.0122 - accuracy: 0.2520
    Epoch 00001: saving model to model_checkpoints_5000/cp_01-700
     703/1000 [====================>.........] - ETA: 1s - loss: 2.0082 - accuracy: 0.2538
    ...

In this example, a checkpoint is created every x steps within each epoch (here x = 100 batches).
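
The arithmetic can be checked without training anything. The sketch below is a simplified model of a batch counter, not Keras's actual implementation: it assumes that an integer save_freq is counted in batches, that the counter carries over between epochs, and that it resets after each save. The function name `count_saves` is hypothetical.

```python
import math

def count_saves(num_samples, batch_size, epochs, save_freq):
    """Count how often a checkpoint would be written under the assumptions above."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    saves = 0
    batches_since_last_save = 0
    for _ in range(epochs):
        for _ in range(batches_per_epoch):
            batches_since_last_save += 1
            if batches_since_last_save >= save_freq:
                saves += 1
                batches_since_last_save = 0
    return saves

# Values from the question: 10000 samples, batch_size 10, 3 epochs.
print(count_saves(10000, 10, 3, save_freq=5000))  # 0: only 3000 batches are ever seen
print(count_saves(10000, 10, 3, save_freq=100))   # 30: 10 saves in each of 3 epochs
```

Under these assumptions, save_freq = 5000 never triggers a save during the 3000 batches of the run, which would explain the empty directory.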